IO stalls for more than 10s after killing a TiKV node


【Background】
We built a distributed file system from JuiceFS + TiKV + Ceph, where the TiKV cluster stores the metadata. It consists of 3 physical nodes, each co-hosting one tikv-server and one pd-server.

【Symptoms】
We ran read/write tests against the file system with vdbench. With all services healthy, IO is normal. To simulate a failure, we killed the tikv-server on one node, and IO stalled for roughly 15s. For our workload, an interruption that long is unacceptable.

【TiDB Version】
v6.5.1

【Attachments】Relevant logs and monitoring
The Go client (client-go) catches the TiKV-unreachable error immediately.

About 3-5s later, the PD logs show the Region leaders being switched. Once a normal leader switch completes, the Regions should be able to serve again, so why is IO still at 0?

Based on the information you provided, we can tentatively say the problem lies in the Region leader switchover after the TiKV node failure. When a TiKV node goes down, PD detects the change in the node's liveness and the Region leaders on that node are moved to other nodes. Since your workload saw an IO stall, something went wrong during that switchover.

One possible cause is that when a Region leader moves to another node, the new leader may need to catch up on data from its peers, which takes time. If that catch-up takes too long, IO stalls. We therefore suggest checking the network and load of the TiKV cluster to make sure replication can keep up.

Thank you very much for the reply.

Before the failure was injected, all services had been running stably for some time, so the data gap between replicas should have been small.

Also, as we understand it, replication between Region peers is based on Raft. When one of the Raft peers fails, the peer with the most up-to-date log should become the leader, in which case it should not need to fetch data from the other followers. Could you elaborate on what "catching up on data" refers to here?

Would it be convenient to run a similar test under the same conditions with go-ycsb?

The goal is to shorten the chain and rule out JuiceFS as a factor.

https://github.com/pingcap/go-ycsb#tikv
With:
tikv.type set to txn
tikv.batchsize set to 0 (to work around https://github.com/tikv/client-go/issues/522)
A rough sketch of the code path these options exercise follows below.
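
For reference, tikv.type = txn makes go-ycsb drive TiKV through client-go's transactional (txnkv) API. Below is a minimal sketch of that code path, assuming a placeholder PD endpoint of 127.0.0.1:2379; the key and value are made up and error handling is kept deliberately crude:

package main

import (
	"context"
	"fmt"

	"github.com/tikv/client-go/v2/txnkv"
)

func main() {
	// Placeholder PD endpoint; replace with your cluster's PD addresses.
	client, err := txnkv.NewClient([]string{"127.0.0.1:2379"})
	if err != nil {
		panic(err)
	}
	defer client.Close()

	// Write in a transaction. During a leader failover, the stall shows up
	// as extra latency (or retries) around Commit.
	txn, err := client.Begin()
	if err != nil {
		panic(err)
	}
	if err := txn.Set([]byte("key1"), []byte("value1")); err != nil {
		panic(err)
	}
	if err := txn.Commit(context.Background()); err != nil {
		panic(err)
	}

	// Read the value back in a fresh transaction.
	readTxn, err := client.Begin()
	if err != nil {
		panic(err)
	}
	val, err := readTxn.Get(context.Background(), []byte("key1"))
	if err != nil {
		panic(err)
	}
	fmt.Printf("key1 = %s\n", val)
}

go-ycsb's tikv backend runs essentially this Begin/Set/Commit (and Begin/Get) pattern across threadcount worker goroutines, which is consistent with the client-go stack trace further down in this thread; an unavailable Region leader therefore shows up as the per-second READ/UPDATE counters stalling.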

Addendum: once the test finishes, please collect and upload the monitoring data for the test window.

See: 【SOP 系列 22】TiDB 集群诊断信息收集 Clinic 使用指南&资料大全

Thank you very much for the suggestion, and apologies for the delayed reply. Our test results are as follows:

With the cluster healthy, we ran go-ycsb as follows:

nohup ./go-ycsb run tikv --interval 1 -P workloads/workloada -p tikv.pd="100.73.8.11:2379,100.73.8.12:2379,100.73.8.13:2379" -p tikv.type="txn" -p tikv.batchsize=0 > run_with_down_tikv 2>&1 &

[screenshot: go-ycsb per-second output while the cluster is healthy]

Once it was running normally, we killed the tikv-server on one node. You can see that IO stalled, because the counts of the various operations stopped increasing.

It gradually recovered after roughly 15s.

Diagnostic information was collected with:

tiup diag collect tikvtest -f="2023-05-11T16:13:42+08:00" -t="2023-05-11T16:23:23+08:00"

The logs show that the TiKV kill happened at 16:14:55:

May 11, 2023 @ 16:14:55.810 [store.rs:2808] ["broadcasting unreachable"] [unreachable_store_id=1] [store_id=5]
May 11, 2023 @ 16:14:55.810 [store.rs:2808] ["broadcasting unreachable"] [unreachable_store_id=1] [store_id=4]

while Raft did not start new elections until after 16:15:05:

May 11, 2023 @ 16:15:05.557 [raft.rs:1550] ["starting a new election"] [term=11] [raft_id=123] [region_id=120]
May 11, 2023 @ 16:15:06.578 [raft.rs:1550] ["starting a new election"] [term=13] [raft_id=11] [region_id=8]
May 11, 2023 @ 16:15:06.582 [raft.rs:1550] ["starting a new election"] [term=11] [raft_id=79] [region_id=76]
May 11, 2023 @ 16:15:07.579 [raft.rs:1550] ["starting a new election"] [term=9] [raft_id=35] [region_id=32]
May 11, 2023 @ 16:15:07.580 [raft.rs:1550] ["starting a new election"] [term=10] [raft_id=43] [region_id=40]

The 10s+ gap between the two is because, under TiKV's default configuration, followers wait roughly 10s without a leader heartbeat (the Raft election timeout) before starting a new election.

If you want to shorten the interruption, reduce this timeout accordingly by adding the following to the TiKV configuration:

[raftstore]
raft-base-tick-interval = "1s"
raft-election-timeout-ticks = 3
raft-heartbeat-ticks = 1
pd-heartbeat-tick-interval = "5s"
raft-store-max-leader-lease = "2s"
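
For context on how these values map to the failover window: the Raft election timeout is raft-base-tick-interval × raft-election-timeout-ticks. Assuming the documented TiKV defaults (raft-base-tick-interval = "1s", raft-election-timeout-ticks = 10), the arithmetic is:

default election timeout ≈ 1s × 10 ticks = 10s
tuned   election timeout ≈ 1s × 3 ticks  = 3s

raft-store-max-leader-lease is generally kept shorter than the election timeout, which is why it is lowered to 2s alongside the 3s timeout above.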

The test result is shown below (note the TOTAL lines; v6.5.2, 1 PD + 3 TiKV):

Using request distribution 'uniform' a keyrange of [0 99999]
[2023/05/14 10:41:59.791 +08:00] [INFO] [client.go:392] ["[pd] create pd client with endpoints"] [pd-address="[127.0.0.1:33379]"]
[2023/05/14 10:41:59.796 +08:00] [INFO] [base_client.go:350] ["[pd] switch leader"] [new-leader=http://127.0.0.1:33379] [old-leader=]
[2023/05/14 10:41:59.796 +08:00] [INFO] [base_client.go:105] ["[pd] init cluster id"] [cluster-id=7232856776285670424]
[2023/05/14 10:41:59.797 +08:00] [INFO] [client.go:687] ["[pd] tso dispatcher created"] [dc-location=global]
***************** properties *****************
"requestdistribution"="uniform"
"operationcount"="100000000"
"updateproportion"="0.5"
"readallfields"="true"
"measurement.interval"="1"
"tikv.batchsize"="0"
"dotransactions"="true"
"command"="run"
"readproportion"="0.5"
"tikv.pd"="127.0.0.1:33379"
"tikv.type"="txn"
"workload"="core"
"insertproportion"="0"
"threadcount"="20"
"recordcount"="100000"
"scanproportion"="0"
**********************************************
...
...
READ   - Takes(s): 10.9, Count: 76030, OPS: 6946.1, Avg(us): 746, Min(us): 360, Max(us): 92671, 50th(us): 677, 90th(us): 902, 95th(us): 1020, 99th(us): 2001, 99.9th(us): 5643, 99.99th(us): 55007
TOTAL  - Takes(s): 10.9, Count: 151528, OPS: 13845.0, Avg(us): 1441, Min(us): 360, Max(us): 96319, 50th(us): 1537, 90th(us): 2305, 95th(us): 2589, 99th(us): 4227, 99.9th(us): 7951, 99.99th(us): 56575
UPDATE - Takes(s): 10.9, Count: 75498, OPS: 6909.5, Avg(us): 2141, Min(us): 1255, Max(us): 96319, 50th(us): 1993, 90th(us): 2555, 95th(us): 2915, 99th(us): 5743, 99.9th(us): 8631, 99.99th(us): 74815
UPDATE_ERROR - Takes(s): 9.9, Count: 8, OPS: 0.8, Avg(us): 2803, Min(us): 1452, Max(us): 7531, 50th(us): 2261, 90th(us): 3615, 95th(us): 7531, 99th(us): 7531, 99.9th(us): 7531, 99.99th(us): 7531
READ   - Takes(s): 11.9, Count: 83421, OPS: 6983.2, Avg(us): 741, Min(us): 360, Max(us): 92671, 50th(us): 674, 90th(us): 897, 95th(us): 1014, 99th(us): 1973, 99.9th(us): 5615, 99.99th(us): 55007
TOTAL  - Takes(s): 11.9, Count: 166370, OPS: 13928.4, Avg(us): 1432, Min(us): 360, Max(us): 96319, 50th(us): 1533, 90th(us): 2289, 95th(us): 2563, 99th(us): 4187, 99.9th(us): 7843, 99.99th(us): 56447
UPDATE - Takes(s): 11.9, Count: 82949, OPS: 6954.9, Avg(us): 2126, Min(us): 1255, Max(us): 96319, 50th(us): 1981, 90th(us): 2531, 95th(us): 2883, 99th(us): 5715, 99.9th(us): 8399, 99.99th(us): 74815
UPDATE_ERROR - Takes(s): 10.9, Count: 8, OPS: 0.7, Avg(us): 2803, Min(us): 1452, Max(us): 7531, 50th(us): 2261, 90th(us): 3615, 95th(us): 7531, 99th(us): 7531, 99.9th(us): 7531, 99.99th(us): 7531
[2023/05/14 10:42:12.335 +08:00] [INFO] [region_cache.go:2524] ["[health check] check health error"] [store=127.0.0.1:20162] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:20162: connect: connection refused\""]
[2023/05/14 10:42:12.335 +08:00] [INFO] [region_request.go:794] ["mark store's regions need be refill"] [id=2] [addr=127.0.0.1:20162] [error="rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:50720->127.0.0.1:20162: read: connection reset by peer"] [errorVerbose="rpc error: code = Unavailable desc = error reading from server: read tcp 127.0.0.1:50720->127.0.0.1:20162: read: connection reset by peer\ngithub.com/tikv/client-go/v2/tikvrpc.CallRPC\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/tikvrpc/tikvrpc.go:1064\ngithub.com/tikv/client-go/v2/internal/client.(*RPCClient).sendRequest\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/internal/client/client.go:524\ngithub.com/tikv/client-go/v2/internal/client.(*RPCClient).SendRequest\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/internal/client/client.go:533\ngithub.com/tikv/client-go/v2/internal/client.interceptedClient.SendRequest\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/internal/client/client_interceptor.go:42\ngithub.com/tikv/client-go/v2/internal/client.reqCollapse.SendRequest\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/internal/client/client_collapse.go:74\ngithub.com/tikv/client-go/v2/internal/locate.(*RegionRequestSender).sendReqToRegion\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/internal/locate/region_request.go:1184\ngithub.com/tikv/client-go/v2/internal/locate.(*RegionRequestSender).SendReqCtx\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/internal/locate/region_request.go:1017\ngithub.com/tikv/client-go/v2/txnkv/txnsnapshot.(*ClientHelper).SendReqCtx\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/txnkv/txnsnapshot/client_helper.go:146\ngithub.com/tikv/client-go/v2/txnkv/txnsnapshot.(*KVSnapshot).get\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/txnkv/txnsnapshot/snapshot.go:620\ngithub.com/tikv/client-go/v2/txnkv/txnsnapshot.(*KVSnapshot).Get\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/txnkv/txnsnapshot/snapshot.go:529\ngithub.com/tikv/client-go/v2/internal/unionstore.(*KVUnionStore).Get\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/internal/unionstore/union_store.go:102\ngithub.com/tikv/client-go/v2/txnkv/transaction.(*KVTxn).Get\n\t/disk1/home/pingyu/go/pkg/mod/github.com/tikv/client-go/v2@v2.0.1-0.20220720064224-aa9ded37d17d/txnkv/transaction/txn.go:172\ngithub.com/pingcap/go-ycsb/db/tikv.(*txnDB).Read\n\t/disk1/home/pingyu/workspace/go-ycsb/db/tikv/txn.go:103\ngithub.com/pingcap/go-ycsb/pkg/client.DbWrapper.Read\n\t/disk1/home/pingyu/workspace/go-ycsb/pkg/client/dbwrapper.go:59\ngithub.com/pingcap/go-ycsb/pkg/workload.(*core).doTransactionRead\n\t/disk1/home/pingyu/workspace/go-ycsb/pkg/workload/core.go:429\ngithub.com/pingcap/go-ycsb/pkg/workload.(*core).DoTransaction\n\t/disk1/home/pingyu/workspace/go-ycsb/pkg/workload/core.go:366\ngithub.com/pingcap/go-ycsb/pkg/client.(*worker).run\n\t/disk1/home/pingyu/workspace/go-ycsb/pkg/client/client.go:129\ngithub.com/pingcap/go-ycsb/pkg/client.(*Client).Run.func2\n\t/disk1/ho
me/pingyu/workspace/go-ycsb/pkg/client/client.go:215\nruntime.goexit\n\t/disk1/home/pingyu/opt/go-1.20.2/src/runtime/asm_amd64.s:1598"]
[2023/05/14 10:42:12.336 +08:00] [INFO] [region_cache.go:2524] ["[health check] check health error"] [store=127.0.0.1:20162] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:20162: connect: connection refused\""]
READ   - Takes(s): 12.9, Count: 85747, OPS: 6622.9, Avg(us): 740, Min(us): 360, Max(us): 92671, 50th(us): 673, 90th(us): 896, 95th(us): 1013, 99th(us): 1994, 99.9th(us): 5615, 99.99th(us): 54911
TOTAL  - Takes(s): 12.9, Count: 170985, OPS: 13207.7, Avg(us): 1430, Min(us): 360, Max(us): 96319, 50th(us): 1530, 90th(us): 2283, 95th(us): 2559, 99th(us): 4203, 99.9th(us): 7887, 99.99th(us): 56447
UPDATE - Takes(s): 12.9, Count: 85238, OPS: 6594.1, Avg(us): 2123, Min(us): 1103, Max(us): 96319, 50th(us): 1978, 90th(us): 2527, 95th(us): 2879, 99th(us): 5711, 99.9th(us): 8623, 99.99th(us): 74687
UPDATE_ERROR - Takes(s): 11.9, Count: 8, OPS: 0.7, Avg(us): 2803, Min(us): 1452, Max(us): 7531, 50th(us): 2261, 90th(us): 3615, 95th(us): 7531, 99th(us): 7531, 99.9th(us): 7531, 99.99th(us): 7531
[2023/05/14 10:42:13.336 +08:00] [INFO] [region_cache.go:2524] ["[health check] check health error"] [store=127.0.0.1:20162] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:20162: connect: connection refused\""]
READ   - Takes(s): 13.9, Count: 85747, OPS: 6148.1, Avg(us): 740, Min(us): 360, Max(us): 92671, 50th(us): 673, 90th(us): 896, 95th(us): 1013, 99th(us): 1994, 99.9th(us): 5615, 99.99th(us): 54911
TOTAL  - Takes(s): 13.9, Count: 170985, OPS: 12260.6, Avg(us): 1430, Min(us): 360, Max(us): 96319, 50th(us): 1530, 90th(us): 2283, 95th(us): 2559, 99th(us): 4203, 99.9th(us): 7887, 99.99th(us): 56447
UPDATE - Takes(s): 13.9, Count: 85238, OPS: 6120.8, Avg(us): 2123, Min(us): 1103, Max(us): 96319, 50th(us): 1978, 90th(us): 2527, 95th(us): 2879, 99th(us): 5711, 99.9th(us): 8623, 99.99th(us): 74687
UPDATE_ERROR - Takes(s): 12.9, Count: 8, OPS: 0.6, Avg(us): 2803, Min(us): 1452, Max(us): 7531, 50th(us): 2261, 90th(us): 3615, 95th(us): 7531, 99th(us): 7531, 99.9th(us): 7531, 99.99th(us): 7531
[2023/05/14 10:42:14.336 +08:00] [INFO] [region_cache.go:2524] ["[health check] check health error"] [store=127.0.0.1:20162] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:20162: connect: connection refused\""]
READ   - Takes(s): 14.9, Count: 85751, OPS: 5736.9, Avg(us): 822, Min(us): 360, Max(us): 2334719, 50th(us): 673, 90th(us): 896, 95th(us): 1014, 99th(us): 1995, 99.9th(us): 5643, 99.99th(us): 56447
TOTAL  - Takes(s): 14.9, Count: 170993, OPS: 11440.1, Avg(us): 1471, Min(us): 360, Max(us): 2334719, 50th(us): 1530, 90th(us): 2285, 95th(us): 2559, 99th(us): 4207, 99.9th(us): 7891, 99.99th(us): 74047
UPDATE - Takes(s): 14.9, Count: 85242, OPS: 5709.7, Avg(us): 2123, Min(us): 1103, Max(us): 96319, 50th(us): 1978, 90th(us): 2527, 95th(us): 2879, 99th(us): 5711, 99.9th(us): 8623, 99.99th(us): 74687
UPDATE_ERROR - Takes(s): 13.9, Count: 8, OPS: 0.6, Avg(us): 2803, Min(us): 1452, Max(us): 7531, 50th(us): 2261, 90th(us): 3615, 95th(us): 7531, 99th(us): 7531, 99.9th(us): 7531, 99.99th(us): 7531
[2023/05/14 10:42:14.928 +08:00] [WARN] [prewrite.go:328] ["1pc failed and fallbacks to normal commit procedure"] [startTS=441458919238009000]
[2023/05/14 10:42:15.336 +08:00] [INFO] [region_cache.go:2524] ["[health check] check health error"] [store=127.0.0.1:20162] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:20162: connect: connection refused\""]
READ   - Takes(s): 15.9, Count: 88471, OPS: 5547.9, Avg(us): 1204, Min(us): 360, Max(us): 3330047, 50th(us): 672, 90th(us): 896, 95th(us): 1013, 99th(us): 2018, 99.9th(us): 5795, 99.99th(us): 2807807
TOTAL  - Takes(s): 15.9, Count: 176332, OPS: 11058.2, Avg(us): 1804, Min(us): 360, Max(us): 3330047, 50th(us): 1522, 90th(us): 2279, 95th(us): 2555, 99th(us): 4231, 99.9th(us): 8175, 99.99th(us): 2334719
UPDATE - Takes(s): 15.9, Count: 87861, OPS: 5516.9, Avg(us): 2408, Min(us): 1103, Max(us): 3315711, 50th(us): 1975, 90th(us): 2521, 95th(us): 2871, 99th(us): 5715, 99.9th(us): 9039, 99.99th(us): 524031
UPDATE_ERROR - Takes(s): 14.9, Count: 9, OPS: 0.6, Avg(us): 2711, Min(us): 1452, Max(us): 7531, 50th(us): 2261, 90th(us): 3615, 95th(us): 7531, 99th(us): 7531, 99.9th(us): 7531, 99.99th(us): 7531
[2023/05/14 10:42:16.336 +08:00] [INFO] [region_cache.go:2524] ["[health check] check health error"] [store=127.0.0.1:20162] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:20162: connect: connection refused\""]
READ   - Takes(s): 16.9, Count: 96508, OPS: 5694.8, Avg(us): 1157, Min(us): 335, Max(us): 3330047, 50th(us): 665, 90th(us): 888, 95th(us): 1004, 99th(us): 2003, 99.9th(us): 5703, 99.99th(us): 2334719
TOTAL  - Takes(s): 16.9, Count: 192374, OPS: 11353.1, Avg(us): 1756, Min(us): 335, Max(us): 3330047, 50th(us): 1487, 90th(us): 2261, 95th(us): 2531, 99th(us): 4199, 99.9th(us): 8023, 99.99th(us): 2328575
UPDATE - Takes(s): 16.9, Count: 95866, OPS: 5663.4, Avg(us): 2360, Min(us): 1103, Max(us): 3315711, 50th(us): 1955, 90th(us): 2501, 95th(us): 2851, 99th(us): 5619, 99.9th(us): 8727, 99.99th(us): 523263
UPDATE_ERROR - Takes(s): 15.9, Count: 9, OPS: 0.6, Avg(us): 2711, Min(us): 1452, Max(us): 7531, 50th(us): 2261, 90th(us): 3615, 95th(us): 7531, 99th(us): 7531, 99.9th(us): 7531, 99.99th(us): 7531
[2023/05/14 10:42:17.336 +08:00] [INFO] [region_cache.go:2524] ["[health check] check health error"] [store=127.0.0.1:20162] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:20162: connect: connection refused\""]
READ   - Takes(s): 17.9, Count: 104498, OPS: 5822.7, Avg(us): 1117, Min(us): 335, Max(us): 3330047, 50th(us): 659, 90th(us): 881, 95th(us): 997, 99th(us): 1994, 99.9th(us): 5643, 99.99th(us): 2334719
TOTAL  - Takes(s): 17.9, Count: 208098, OPS: 11596.1, Avg(us): 1719, Min(us): 335, Max(us): 3330047, 50th(us): 1472, 90th(us): 2247, 95th(us): 2513, 99th(us): 4215, 99.9th(us): 7887, 99.99th(us): 529407
UPDATE - Takes(s): 17.9, Count: 103600, OPS: 5779.4, Avg(us): 2326, Min(us): 1103, Max(us): 3315711, 50th(us): 1941, 90th(us): 2485, 95th(us): 2835, 99th(us): 5607, 99.9th(us): 8519, 99.99th(us): 523263
UPDATE_ERROR - Takes(s): 16.9, Count: 9, OPS: 0.5, Avg(us): 2711, Min(us): 1452, Max(us): 7531, 50th(us): 2261, 90th(us): 3615, 95th(us): 7531, 99th(us): 7531, 99.9th(us): 7531, 99.99th(us): 7531
[2023/05/14 10:42:18.336 +08:00] [INFO] [region_cache.go:2524] ["[health check] check health error"] [store=127.0.0.1:20162] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:20162: connect: connection refused\""]
READ   - Takes(s): 18.9, Count: 112175, OPS: 5920.6, Avg(us): 1086, Min(us): 335, Max(us): 3330047, 50th(us): 655, 90th(us): 876, 95th(us): 991, 99th(us): 1995, 99.9th(us): 5611, 99.99th(us): 2334719
TOTAL  - Takes(s): 18.9, Count: 223470, OPS: 11795.4, Avg(us): 1689, Min(us): 335, Max(us): 3330047, 50th(us): 1467, 90th(us): 2233, 95th(us): 2499, 99th(us): 4227, 99.9th(us): 7831, 99.99th(us): 527359
UPDATE - Takes(s): 18.9, Count: 111295, OPS: 5880.6, Avg(us): 2298, Min(us): 1103, Max(us): 3315711, 50th(us): 1930, 90th(us): 2469, 95th(us): 2819, 99th(us): 5631, 99.9th(us): 8383, 99.99th(us): 522751
UPDATE_ERROR - Takes(s): 17.9, Count: 10, OPS: 0.6, Avg(us): 2700, Min(us): 1452, Max(us): 7531, 50th(us): 2261, 90th(us): 3615, 95th(us): 7531, 99th(us): 7531, 99.9th(us): 7531, 99.99th(us): 7531
[2023/05/14 10:42:19.336 +08:00] [INFO] [region_cache.go:2524] ["[health check] check health error"] [store=127.0.0.1:20162] [error="rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 127.0.0.1:20162: connect: connection refused\""]
READ   - Takes(s): 19.9, Count: 119907, OPS: 6011.3, Avg(us): 1058, Min(us): 335, Max(us): 3330047, 50th(us): 651, 90th(us): 871, 95th(us): 986, 99th(us): 1994, 99.9th(us): 5567, 99.99th(us): 2328575
TOTAL  - Takes(s): 19.9, Count: 238969, OPS: 11981.5, Avg(us): 1663, Min(us): 335, Max(us): 3330047, 50th(us): 1461, 90th(us): 2223, 95th(us): 2487, 99th(us): 4259, 99.9th(us): 7759, 99.99th(us): 523263
UPDATE - Takes(s): 19.9, Count: 119062, OPS: 5975.3, Avg(us): 2272, Min(us): 1103, Max(us): 3315711, 50th(us): 1920, 90th(us): 2455, 95th(us): 2809, 99th(us): 5647, 99.9th(us): 8215, 99.99th(us): 522751
UPDATE_ERROR - Takes(s): 18.9, Count: 12, OPS: 0.6, Avg(us): 2491, Min(us): 1385, Max(us): 7531, 50th(us): 1972, 90th(us): 3615, 95th(us): 3615, 99th(us): 7531, 99.9th(us): 7531, 99.99th(us): 7531

Note that after this change, Raft-related overhead increases (ticks and heartbeats fire more often), so adjust the values to fit your actual workload.

For the underlying mechanics, see the community column article: 专栏 - 高可用测试:KILL TiKV-Server,事务 TPS 掉零现象解读 | TiDB 社区

We verified the parameters and saw a very clear improvement. The ops article you linked is also very useful. Thanks a lot!

Good to learn from this thread; I'm planning to run a similar test myself soon.

TiKV uses Raft for data replication: every data change is recorded as a Raft log entry, and Raft's log replication propagates it safely and reliably to every node in the replication group. In practice, according to the Raft protocol, a write only needs to be replicated to a majority of the nodes to be considered safely committed.
https://docs.pingcap.com/zh/tidb/stable/tidb-storage#raft-协议
The new leader does not need to sync data from other nodes: if its log were behind, it would not be elected leader in the first place (see the sketch below).
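
To make that concrete, here is an illustrative sketch of Raft's election restriction (from the Raft paper, §5.4.1; not TiKV's actual source): a voter grants its vote only if the candidate's log is at least as up-to-date as its own, so a replica with a stale log cannot win an election.

// Illustrative only: the Raft election restriction, not TiKV source code.
// A vote is granted only if the candidate's log is at least as up-to-date
// as the voter's: a higher last-entry term wins; with equal terms, the longer log wins.
func candidateLogIsUpToDate(candLastTerm, candLastIndex, voterLastTerm, voterLastIndex uint64) bool {
	if candLastTerm != voterLastTerm {
		return candLastTerm > voterLastTerm
	}
	return candLastIndex >= voterLastIndex
}

In a 3-replica group with one peer down, whichever surviving peer has the most up-to-date log wins the election, and by the commit rule it already holds every committed entry, so it can start serving without first pulling data from the remaining follower.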
