How do you identify the bottleneck in a stress test?

[TiDB Environment] Production / Test / PoC
[TiDB Version] 7.5.0
[Problem Encountered: Symptoms and Impact]
Cluster: mixed deployment across 3 machines, with the benchmark load balanced across all 3.

Running a TPC-C stress test with 500 threads:
tiup bench tpcc prepare --warehouses 20 --db test -H 10.5.6.200 -P 4000 -U root -p
tiup bench tpcc -H 10.5.6.200 -P 4000 --db test --warehouses 20 --threads 500 --time 10m run -U root -p

It looks like the cluster has hit a bottleneck, but I'm not sure whether it's the disks.

Test results:


Some performance monitoring details:





Whether you've hit a bottleneck comes down to a few metrics: CPU, memory, disk read/write, and network bandwidth. Check which one is at its limit.
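For example, a minimal sketch of those checks, run on each node while the benchmark is going (assumes the sysstat package is installed for iostat/sar):

top -b -n 1 | head -20   # CPU load and the busiest processes
free -h                  # memory usage
iostat -x 1 5            # per-disk utilization, await and throughput
sar -n DEV 1 5           # per-interface network bandwidth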

IO has hit its bottleneck.

Isn't this disk an SSD? How is it maxing out at just a few dozen MB/s?

It's a SATA SSD. fio test results:

root@tidb1:/tidb-data# fio --bs=64k --ioengine=libaio --iodepth=64 --direct=1 --rw=write --numjobs=32 --time_based --runtime=30   --randrepeat=0 --group_reporting --name=fio-read --size=10G --filename=/tidb-data/fiotest
fio-read: (g=0): rw=write, bs=(R) 64.0KiB-64.0KiB, (W) 64.0KiB-64.0KiB, (T) 64.0KiB-64.0KiB, ioengine=libaio, iodepth=64
...
fio-3.28
Starting 32 processes
fio-read: Laying out IO file (1 file / 10240MiB)
Jobs: 32 (f=32): [W(32)][100.0%][w=889MiB/s][w=14.2k IOPS][eta 00m:00s]
fio-read: (groupid=0, jobs=32): err= 0: pid=55609: Fri Dec 15 08:06:36 2023
  write: IOPS=14.1k, BW=878MiB/s (921MB/s)(25.8GiB/30021msec); 0 zone resets
    slat (usec): min=12, max=26231, avg=2261.86, stdev=1810.65
    clat (msec): min=5, max=240, avg=143.15, stdev=15.15
     lat (msec): min=10, max=243, avg=145.41, stdev=15.31
    clat percentiles (msec):
     |  1.00th=[  101],  5.00th=[  118], 10.00th=[  125], 20.00th=[  133],
     | 30.00th=[  138], 40.00th=[  142], 50.00th=[  144], 60.00th=[  148],
     | 70.00th=[  153], 80.00th=[  155], 90.00th=[  161], 95.00th=[  165],
     | 99.00th=[  176], 99.50th=[  180], 99.90th=[  190], 99.95th=[  197],
     | 99.99th=[  205]
   bw (  KiB/s): min=627878, max=1030656, per=99.67%, avg=896516.98, stdev=1800.37, samples=1888
   iops        : min= 9810, max=16104, avg=14008.07, stdev=28.13, samples=1888
  lat (msec)   : 10=0.01%, 20=0.02%, 50=0.10%, 100=0.84%, 250=99.04%
  cpu          : usr=0.74%, sys=5.20%, ctx=514522, majf=0, minf=417
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.2%, >=64=99.5%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,421935,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=878MiB/s (921MB/s), 878MiB/s-878MiB/s (921MB/s-921MB/s), io=25.8GiB (27.7GB), run=30021-30021msec

Disk stats (read/write):
  sdb: ios=0/408267, merge=0/14825, ticks=0/7087369, in_queue=7087370, util=100.00%

root@tidb1:/tidb-data# fio --bs=4k --ioengine=libaio --iodepth=64 --direct=1 --rw=write --numjobs=32 --time_based --runtime=30   --randrepeat=0 --group_reporting --name=fio-read --size=10G --filename=/tidb-data/fiotest
fio-read: (g=0): rw=write, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=64
...
fio-3.28
Starting 32 processes
Jobs: 32 (f=32): [W(32)][100.0%][w=642MiB/s][w=164k IOPS][eta 00m:00s]
fio-read: (groupid=0, jobs=32): err= 0: pid=55716: Fri Dec 15 08:07:35 2023
  write: IOPS=163k, BW=637MiB/s (668MB/s)(18.7GiB/30006msec); 0 zone resets
    slat (usec): min=3, max=5428, avg=193.15, stdev=195.33
    clat (usec): min=462, max=49280, avg=12355.59, stdev=1834.97
     lat (usec): min=807, max=49450, avg=12549.13, stdev=1853.74
    clat percentiles (usec):
     |  1.00th=[ 8979],  5.00th=[ 9896], 10.00th=[10421], 20.00th=[10945],
     | 30.00th=[11469], 40.00th=[11863], 50.00th=[12256], 60.00th=[12649],
     | 70.00th=[13042], 80.00th=[13566], 90.00th=[14484], 95.00th=[15270],
     | 99.00th=[17695], 99.50th=[19268], 99.90th=[27132], 99.95th=[31065],
     | 99.99th=[36963]
   bw (  KiB/s): min=402097, max=704703, per=100.00%, avg=652565.98, stdev=1394.31, samples=1888
   iops        : min=100524, max=176172, avg=163139.81, stdev=348.57, samples=1888
  lat (usec)   : 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=6.01%, 20=93.62%, 50=0.37%
  cpu          : usr=1.59%, sys=54.20%, ctx=2344372, majf=0, minf=424
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
     issued rwts: total=0,4893391,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
  WRITE: bw=637MiB/s (668MB/s), 637MiB/s-637MiB/s (668MB/s-668MB/s), io=18.7GiB (20.0GB), run=30006-30006msec

Disk stats (read/write):
  sdb: ios=0/4847188, merge=0/40007, ticks=0/7442861, in_queue=7442860, util=99.97%

This latency feels a bit high, doesn't it? :upside_down_face:

Don't worry about the latency; as long as TPS goes up, that's what matters.

Have you tried 20-warehouse TPC-C against other databases? I suspect the data set is too small and is creating hotspots. Try increasing it to 1000 warehouses and run the test again.
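For example, reusing the same connection settings as the commands above, with only the warehouse count changed:

tiup bench tpcc prepare --warehouses 1000 --db test -H 10.5.6.200 -P 4000 -U root -p
tiup bench tpcc -H 10.5.6.200 -P 4000 --db test --warehouses 1000 --threads 500 --time 10m run -U root -p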

Is the disk being used by some other service? :thinking:
Could you run the test again and use iostat to check whether the IO usage matches what this report shows?
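For example (sdb is the data disk shown in the fio output above; iostat comes from the sysstat package):

iostat -x sdb 1   # watch %util, await and throughput on the TiKV data disk during the run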

Try adjusting a few parameters (how to apply the config-file changes is sketched after the lists below):
TiDB config file:
log.level: "error"
prepared-plan-cache.enabled: true
tikv-client.max-batch-wait-time: 2000000
TiKV config file:
raftstore.apply-max-batch-size: 2048
raftstore.apply-pool-size: 3
raftstore.store-max-batch-size: 2048
raftstore.store-pool-size: 2
readpool.storage.normal-concurrency: 10
server.grpc-concurrency: 6
enable-async-apply-prewrite: true
enable-log-recycle: true

compression-per-level

  • Default compression algorithm for each level.
  • defaultcf default: ["no", "no", "lz4", "lz4", "lz4", "zstd", "zstd"]
  • writecf default: ["no", "no", "lz4", "lz4", "lz4", "zstd", "zstd"]
  • lockcf default: ["no", "no", "no", "no", "no", "no", "no"]

Change defaultcf to ["no", "no", "zstd", "zstd", "zstd", "zstd", "zstd"]

Global variables:
set global tidb_hashagg_final_concurrency=1;
set global tidb_hashagg_partial_concurrency=1;
set global tidb_enable_async_commit = 1;
set global tidb_enable_1pc = 1;
set global tidb_guarantee_linearizability = 0;
set global tidb_enable_clustered_index = 1;
set global tidb_prepared_plan_cache_size=1000;

set global tidb_enable_stmt_summary = off;
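
For reference, a rough sketch of how the config-file items above (not the SQL variables, which take effect directly via SET GLOBAL) could be pushed out on a tiup-managed cluster; the cluster name tidb-test is hypothetical:

tiup cluster edit-config tidb-test
# add the TiDB/TiKV items under server_configs, e.g.:
#   server_configs:
#     tikv:
#       rocksdb.defaultcf.compression-per-level: ["no", "no", "zstd", "zstd", "zstd", "zstd", "zstd"]
tiup cluster reload tidb-test   # reload pushes the config out and restarts the affected components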

Then try again and see what the tpmC comes out to.

These are dedicated physical test machines with no other applications deployed.

Enabling compression to trade CPU for IO should help.

I'll try adding disks and running more TiKV instances.

Multiple disks in a RAID? Or one disk per TiKV, with multiple TiKV instances deployed on one machine?

You can try raising the benchmark thread count, for example to 1000, 2000, or even higher, and see whether performance keeps improving. If it does, keep raising concurrency; eventually you'll hit an inflection point.
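For example, a sketch of such a sweep, using the same command as above with only --threads changed:

for t in 500 1000 2000; do
  tiup bench tpcc -H 10.5.6.200 -P 4000 --db test --warehouses 20 --threads $t --time 10m run -U root -p
done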

If OOM or crashes occur along the way, the current deployment's resources are the bottleneck: the machines give out before the database even reaches its peak, so try a better resource configuration.

If performance drops without OOM or crashes, i.e. QPS goes down, compare the resource metrics (memory, CPU, disk IO, bandwidth) before and after the drop, along with the key TiKV and PD read/write monitoring panels, and you can draw some comparative conclusions.

In short: either the machine resources in the current environment are the bottleneck, or the database itself is. Run a few more rounds of tests and analyze from there.


One disk per TiKV. Each machine went from 1 TiKV to 2, for 6 in total. The change is done and I'm testing now.
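For reference, a rough sketch of that kind of scale-out with tiup; the cluster name tidb-test, the mount point /tidb-data2, and the ports are assumptions:

# one additional TiKV per machine, each on its own disk
cat > scale-out.yaml <<'EOF'
tikv_servers:
  - host: 10.5.6.200
    port: 20161
    status_port: 20181
    data_dir: /tidb-data2/tikv-20161
EOF
# repeat the entry for the other two hosts, then:
tiup cluster scale-out tidb-test scale-out.yaml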

x86 or ARM? Is NUMA disabled, or are the instances bound to NUMA nodes?
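A couple of quick ways to check, in case it helps (numactl may need to be installed):

lscpu | grep -i numa   # architecture and NUMA node count
numactl --hardware     # NUMA node and memory layout (numactl package)
# the tiup topology also supports a per-instance numa_node field for CPU binding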

1. The TiDB nodes can also become a bottleneck. Adding more TiDB instances and putting a proxy in front of them can also improve throughput.

The hardware is physical servers.
CPU: Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz

Memory: 128 GB

Disks: several SATA SSDs

Nothing NUMA-related has been configured.

The tests are done; here are the conclusions:

Mixed deployment on 3 servers. Testing started with 3 PD + 3 TiDB + 3 TiKV co-located, later adjusted to 3 PD + 3 TiDB + 6 TiKV, with each TiKV on its own disk.

Test data:

It's fair to conclude that the bottleneck was always the disks, and that adding TiKV instances effectively improves performance. The new deployment also adjusted some parameters following RenlySir's suggestions above:
tidb:
log.level: "error"
prepared-plan-cache.enabled: true
tikv-client.max-batch-wait-time: 2000000
tikv:
raftstore.apply-max-batch-size: 2048
raftstore.apply-pool-size: 3
raftstore.store-max-batch-size: 2048
raftstore.store-pool-size: 2
readpool.storage.normal-concurrency: 10
server.grpc-concurrency: 6
