tikv snap 目录不断增加

为提高效率,请提供以下信息,问题描述清晰能够更快得到解决:
【 TiDB 使用环境】
【概述】集群中其中一个tikv节点snap目录突然持续增大,使用tiup修改配置并尝试重启tikv失败
【操作】修改该节点 snap-gc-timeout 参数后重启节点失败(超时)
【业务影响】kv节点所在磁盘即将爆满
【TiDB 版本】v4.0.0-rc
【tikv 日志】

[2022/01/26 01:01:01.664 +08:00] [INFO] [snap.rs:759] [“scan snapshot”] [takes=4.036666ms] [size=9325536] [key_count=0] [snapshot=/data_db3/tikv/tikv-20160/snap/gen_76308642_9_49774_(default|lock|write).sst] [region_id=76308642]
[2022/01/26 01:01:01.677 +08:00] [WARN] [snap.rs:364] [“too many sending snapshot tasks, drop Send Snap[to: 192.168.100.181:20160, snap: region_id: 76308126 from_peer { id: 76308128 store_id: 4 } to_peer { id: 76308127 store_id: 7 } message { msg_type: MsgSnapshot to: 76308127 from: 76308128 term: 9 snapshot { data: 0A8701089EBDB124122D748000000000001CFF345F698000000000FF00000303C8728F3EFF77045EB6038B50B4FF58002B3907000000FC1A2D748000000000001CFF345F698000000000FF00000303C872A111FFF59E00B6038B50BEFF98002B3D9B000000FC220708B3B50510BF2B2A07089FBDB12410072A0708A0BDB12410042A0708A1BDB124100910D1A2EF0520022A270A090A0764656661756C740A060A046C6F636B0A120A05777269746510D1A2EF0518F7A3D8A90A metadata { conf_state { voters: 76308127 voters: 76308128 voters: 76308129 } index: 50726 term: 9 } } } region_epoch { conf_ver: 88755 version: 5567 }]”]
[2022/01/26 01:01:01.677 +08:00] [INFO] [peer.rs:563] [“report snapshot status”] [status=Failure] [to=“id: 76308127 store_id: 7”] [peer_id=76308128] [region_id=76308126]
[2022/01/26 01:01:01.677 +08:00] [WARN] [snap.rs:364] [“too many sending snapshot tasks, drop Send Snap[to: 192.168.100.181:20160, snap: region_id: 76308642 from_peer { id: 76308644 store_id: 4 } to_peer { id: 76308643 store_id: 7 } message { msg_type: MsgSnapshot to: 76308643 from: 76308644 term: 9 snapshot { data: 0A7508A2C1B1241224748000000000001CFF345F698000000000FF00000400038B50B8FF40002B3F80000000FC1A24748000000000001CFF345F698000000000FF00000400038B50BFFF30002B3A82000000FC220708F0B10510A42C2A0708A3C1B12410072A0708A4C1B12410042A0708A5C1B124100910E097B90420022A270A090A0764656661756C740A060A046C6F636B0A120A05777269746510E097B9041885AFC2B806 metadata { conf_state { voters: 76308643 voters: 76308644 voters: 76308645 } index: 49774 term: 9 } } } region_epoch { conf_ver: 88304 version: 5668 }]”]
[2022/01/26 01:01:01.677 +08:00] [INFO] [peer.rs:563] [“report snapshot status”] [status=Failure] [to=“id: 76308643 store_id: 7”] [peer_id=76308644] [region_id=76308642]
[2022/01/26 01:01:01.912 +08:00] [INFO] [peer_storage.rs:871] [“requesting snapshot”] [request_index=0] [peer_id=76095348] [region_id=5056764]
[2022/01/26 01:01:01.912 +08:00] [INFO] [peer_storage.rs:871] [“requesting snapshot”] [request_index=0] [peer_id=76095355] [region_id=53697994]
[2022/01/26 01:01:01.920 +08:00] [INFO] [snap.rs:759] [“scan snapshot”] [takes=7.688361ms] [size=16479609] [key_count=0] [snapshot=/data_db3/tikv/tikv-20160/snap/gen_53697994_8816_97550859_(default|lock|write).sst] [region_id=53697994]
[2022/01/26 01:01:01.926 +08:00] [INFO] [snap.rs:759] [“scan snapshot”] [takes=13.788957ms] [size=24544300] [key_count=0] [snapshot=/data_db3/tikv/tikv-20160/snap/gen_5056764_37318_505705038_(default|lock|write).sst] [region_id=5056764]

【重启超时日志】
2022-01-26T01:15:08.247+0800 DEBUG Still waitting for 143 store leaders to transfer…
2022-01-26T01:15:10.249+0800 DEBUG Still waitting for 141 store leaders to transfer…
2022-01-26T01:15:12.251+0800 DEBUG Still waitting for 141 store leaders to transfer…
2022-01-26T01:15:14.254+0800 DEBUG Still waitting for 141 store leaders to transfer…
2022-01-26T01:15:16.256+0800 DEBUG Still waitting for 141 store leaders to transfer…
2022-01-26T01:15:18.258+0800 DEBUG Still waitting for 141 store leaders to transfer…
2022-01-26T01:15:20.260+0800 DEBUG Still waitting for 141 store leaders to transfer…
2022-01-26T01:15:22.263+0800 DEBUG Still waitting for 141 store leaders to transfer…
2022-01-26T01:15:24.265+0800 DEBUG Still waitting for 142 store leaders to transfer…
2022-01-26T01:15:26.267+0800 DEBUG Still waitting for 142 store leaders to transfer…
2022-01-26T01:15:28.269+0800 DEBUG Still waitting for 142 store leaders to transfer…
2022-01-26T01:15:30.271+0800 DEBUG Still waitting for 142 store leaders to transfer…
2022-01-26T01:15:32.273+0800 DEBUG Still waitting for 142 store leaders to transfer…
2022-01-26T01:15:34.275+0800 DEBUG Still waitting for 142 store leaders to transfer…
2022-01-26T01:15:36.277+0800 DEBUG Still waitting for 142 store leaders to transfer…
2022-01-26T01:15:36.277+0800 DEBUG TaskFinish {“task”: “ClusterOperate: operation=UpgradeOperation, options={Roles:[] Nodes:[192.168.100.151:20160] Force:false Timeout:300}”, “error”: “failed to upgrade: failed to evict store leader 192.168.100.151: error evicting store leader from 192.168.100.151:20160, operation timed out after 5m0s”, “errorVerbose”: “error evicting store leader from 192.168.100.151:20160, operation timed out after 5m0s\ failed to evict store leader 192.168.100.151\ngithub.com/pingcap-incubator/tiup-cluster/pkg/operation.Upgrade\ \t/home/jenkins/agent/workspace/tiup-cluster-release/pkg/operation/upgrade.go:92\ github.com/pingcap-incubator/tiup-cluster/pkg/task.(*ClusterOperate).Execute\ \t/home/jenkins/agent/workspace/tiup-cluster-release/pkg/task/action.go:53\ngithub.com/pingcap-incubator/tiup-cluster/pkg/task.(*Serial).Execute\ \t/home/jenkins/agent/workspace/tiup-cluster-release/pkg/task/task.go:161\ngithub.com/pingcap-incubator/tiup-cluster/cmd.newReloadCmd.func1\ \t/home/jenkins/agent/workspace/tiup-cluster-release/cmd/reload.go:57\ github.com/spf13/cobra.(*Command).execute\ \t/go/pkg/mod/github.com/spf13/cobra@v0.0.6/command.go:840\ngithub.com/spf13/cobra.(*Command).ExecuteC\ \t/go/pkg/mod/github.com/spf13/cobra@v0.0.6/command.go:945\ngithub.com/spf13/cobra.(*Command).Execute\ \t/go/pkg/mod/github.com/spf13/cobra@v0.0.6/command.go:885\ngithub.com/pingcap-incubator/tiup-cluster/cmd.Execute\ \t/home/jenkins/agent/workspace/tiup-cluster-release/cmd/root.go:168\ main.main\ \t/home/jenkins/agent/workspace/tiup-cluster-release/main.go:19\ runtime.main\ \t/usr/local/go/src/runtime/proc.go:203\ runtime.goexit\ \t/usr/local/go/src/runtime/asm_amd64.s:1357\ failed to upgrade”}
2022-01-26T01:15:36.278+0800 INFO Execute command finished {“code”: 1, “error”: “failed to upgrade: failed to evict store leader 192.168.100.151: error evicting store leader from 192.168.100.151:20160, operation timed out after 5m0s”, “errorVerbose”: “error evicting store leader from 192.168.100.151:20160, operation timed out after 5m0s\ failed to evict store leader 192.168.100.151\ngithub.com/pingcap-incubator/tiup-cluster/pkg/operation.Upgrade\ \t/home/jenkins/agent/workspace/tiup-cluster-release/pkg/operation/upgrade.go:92\ github.com/pingcap-incubator/tiup-cluster/pkg/task.(*ClusterOperate).Execute\ \t/home/jenkins/agent/workspace/tiup-cluster-release/pkg/task/action.go:53\ngithub.com/pingcap-incubator/tiup-cluster/pkg/task.(*Serial).Execute\ \t/home/jenkins/agent/workspace/tiup-cluster-release/pkg/task/task.go:161\ngithub.com/pingcap-incubator/tiup-cluster/cmd.newReloadCmd.func1\ \t/home/jenkins/agent/workspace/tiup-cluster-release/cmd/reload.go:57\ github.com/spf13/cobra.(*Command).execute\ \t/go/pkg/mod/github.com/spf13/cobra@v0.0.6/command.go:840\ngithub.com/spf13/cobra.(*Command).ExecuteC\ \t/go/pkg/mod/github.com/spf13/cobra@v0.0.6/command.go:945\ngithub.com/spf13/cobra.(*Command).Execute\ \t/go/pkg/mod/github.com/spf13/cobra@v0.0.6/command.go:885\ngithub.com/pingcap-incubator/tiup-cluster/cmd.Execute\ \t/home/jenkins/agent/workspace/tiup-cluster-release/cmd/root.go:168\ main.main\ \t/home/jenkins/agent/workspace/tiup-cluster-release/main.go:19\ runtime.main\ \t/usr/local/go/src/runtime/proc.go:203\ runtime.goexit\ \t/usr/local/go/src/runtime/asm_amd64.s:1357\ failed to upgrade”}

1 个赞

tiup是restart或reload的tikv?怎么看着像是upgrade?

1 个赞

重启失败的 TiKV 有没有什么报错,尝试手动 systemctl 重启下,有个隐藏参数 server.snap-max-total-size 可以限制 snapshot 空间占用

2 个赞

你好,是执行了reload,有141个leader一直无法完成转移,直到超时

1 个赞

我尝试tiup stop/start 的方式重启了该节点,目前集群已启动,但是日志中有大量的选举:

[2022/01/26 10:19:39.502 +08:00] [INFO] [raft.rs:1203] [“starting a new election”] [term=25] [raft_id=64568037] [region_id=5131432]
[2022/01/26 10:19:39.502 +08:00] [INFO] [raft.rs:859] [“became pre-candidate at term 25”] [term=25] [raft_id=64568037] [region_id=5131432]
[2022/01/26 10:19:39.502 +08:00] [INFO] [raft.rs:928] [“64568037 received message from 64568037”] [term=25] [msg=MsgRequestPreVote] [from=64568037] [id=64568037] [raft_id=64568037] [region_id=5131432]
[2022/01/26 10:19:39.502 +08:00] [INFO] [raft.rs:949] [“[logterm: 24, index: 46686] sent request to 69587725”] [msg=MsgRequestPreVote] [term=25] [id=69587725] [log_index=46686] [log_term=24] [raft_id=64568037] [region_id=5131432]
[2022/01/26 10:19:39.502 +08:00] [INFO] [raft.rs:949] [“[logterm: 24, index: 46686] sent request to 75352498”] [msg=MsgRequestPreVote] [term=25] [id=75352498] [log_index=46686] [log_term=24] [raft_id=64568037] [region_id=5131432]
[2022/01/26 10:19:39.509 +08:00] [INFO] [raft.rs:1203] [“starting a new election”] [term=11] [raft_id=75370337] [region_id=67369432]
[2022/01/26 10:19:39.509 +08:00] [INFO] [raft.rs:859] [“became pre-candidate at term 11”] [term=11] [raft_id=75370337] [region_id=67369432]
[2022/01/26 10:19:39.509 +08:00] [INFO] [raft.rs:928] [“75370337 received message from 75370337”] [term=11] [msg=MsgRequestPreVote] [from=75370337] [id=75370337] [raft_id=75370337] [region_id=67369432]
[2022/01/26 10:19:39.509 +08:00] [INFO] [raft.rs:949] [“[logterm: 10, index: 68] sent request to 75944523”] [msg=MsgRequestPreVote] [term=11] [id=75944523] [log_index=68] [log_term=10] [raft_id=75370337] [region_id=67369432]
[2022/01/26 10:19:39.509 +08:00] [INFO] [raft.rs:949] [“[logterm: 10, index: 68] sent request to 67370031”] [msg=MsgRequestPreVote] [term=11] [id=67370031] [log_index=68] [log_term=10] [raft_id=75370337] [region_id=67369432]
[2022/01/26 10:19:39.520 +08:00] [INFO] [raft.rs:1203] [“starting a new election”] [term=30] [raft_id=76165622] [region_id=5232180]
[2022/01/26 10:19:39.520 +08:00] [INFO] [raft.rs:859] [“became pre-candidate at term 30”] [term=30] [raft_id=76165622] [region_id=5232180]
[2022/01/26 10:19:39.520 +08:00] [INFO] [raft.rs:928] [“76165622 received message from 76165622”] [term=30] [msg=MsgRequestPreVote] [from=76165622] [id=76165622] [raft_id=76165622] [region_id=5232180]
[2022/01/26 10:19:39.520 +08:00] [INFO] [raft.rs:949] [“[logterm: 29, index: 90] sent request to 5289705”] [msg=MsgRequestPreVote] [term=30] [id=5289705] [log_index=90] [log_term=29] [raft_id=76165622] [region_id=5232180]
[2022/01/26 10:19:39.520 +08:00] [INFO] [raft.rs:949] [“[logterm: 29, index: 90] sent request to 75638567”] [msg=MsgRequestPreVote] [term=30] [id=75638567] [log_index=90] [log_term=29] [raft_id=76165622] [region_id=5232180]
[2022/01/26 10:19:39.520 +08:00] [INFO] [raft.rs:1203] [“starting a new election”] [term=29] [raft_id=71139751] [region_id=26441416]
[2022/01/26 10:19:39.520 +08:00] [INFO] [raft.rs:859] [“became pre-candidate at term 29”] [term=29] [raft_id=71139751] [region_id=26441416]
[2022/01/26 10:19:39.520 +08:00] [INFO] [raft.rs:928] [“71139751 received message from 71139751”] [term=29] [msg=MsgRequestPreVote] [from=71139751] [id=71139751] [raft_id=71139751] [region_id=26441416]
[2022/01/26 10:19:39.520 +08:00] [INFO] [raft.rs:949] [“[logterm: 28, index: 514734] sent request to 75206171”] [msg=MsgRequestPreVote] [term=29] [id=75206171] [log_index=514734] [log_term=28] [raft_id=71139751] [region_id=26441416]
[2022/01/26 10:19:39.520 +08:00] [INFO] [raft.rs:949] [“[logterm: 28, index: 514734] sent request to 69003189”] [msg=MsgRequestPreVote] [term=29] [id=69003189] [log_index=514734] [log_term=28] [raft_id=71139751] [region_id=26441416]
[2022/01/26 10:19:39.520 +08:00] [INFO] [raft.rs:1203] [“starting a new election”] [term=30] [raft_id=76162734] [region_id=5147124]
[2022/01/26 10:19:39.520 +08:00] [INFO] [raft.rs:859] [“became pre-candidate at term 30”] [term=30] [raft_id=76162734] [region_id=5147124]
[2022/01/26 10:19:39.520 +08:00] [INFO] [raft.rs:928] [“76162734 received message from 76162734”] [term=30] [msg=MsgRequestPreVote] [from=76162734] [id=76162734] [raft_id=76162734] [region_id=5147124]
[2022/01/26 10:19:39.520 +08:00] [INFO] [raft.rs:949] [“[logterm: 29, index: 129] sent request to 69328396”] [msg=MsgRequestPreVote] [term=30] [id=69328396] [log_index=129] [log_term=29] [raft_id=76162734] [region_id=5147124]
[2022/01/26 10:19:39.520 +08:00] [INFO] [raft.rs:949] [“[logterm: 29, index: 129] sent request to 74798530”] [msg=MsgRequestPreVote] [term=30] [id=74798530] [log_index=129] [log_term=29] [raft_id=76162734] [region_id=5147124]
[2022/01/26 10:19:39.533 +08:00] [INFO] [raft.rs:1203] [“starting a new election”] [term=8] [raft_id=67676236] [region_id=67676235]
[2022/01/26 10:19:39.533 +08:00] [INFO] [raft.rs:859] [“became pre-candidate at term 8”] [term=8] [raft_id=67676236] [region_id=67676235]
[2022/01/26 10:19:39.533 +08:00] [INFO] [raft.rs:928] [“67676236 received message from 67676236”] [term=8] [msg=MsgRequestPreVote] [from=67676236] [id=67676236] [raft_id=67676236] [region_id=67676235]
[2022/01/26 10:19:39.533 +08:00] [INFO] [raft.rs:949] [“[logterm: 7, index: 37] sent request to 67676237”] [msg=MsgRequestPreVote] [term=8] [id=67676237] [log_index=37] [log_term=7] [raft_id=67676236] [region_id=67676235]
[2022/01/26 10:19:39.533 +08:00] [INFO] [raft.rs:949] [“[logterm: 7, index: 37] sent request to 76154465”] [msg=MsgRequestPreVote] [term=8] [id=76154465] [log_index=37] [log_term=7] [raft_id=67676236] [region_id=67676235]
[2022/01/26 10:19:39.534 +08:00] [INFO] [raft.rs:1203] [“starting a new election”] [term=9] [raft_id=76514547] [region_id=76514544]
[2022/01/26 10:19:39.534 +08:00] [INFO] [raft.rs:859] [“became pre-candidate at term 9”] [term=9] [raft_id=76514547] [region_id=76514544]
[2022/01/26 10:19:39.534 +08:00] [INFO] [raft.rs:928] [“76514547 received message from 76514547”] [term=9] [msg=MsgRequestPreVote] [from=76514547] [id=76514547] [raft_id=76514547] [region_id=76514544]
[2022/01/26 10:19:39.534 +08:00] [INFO] [raft.rs:949] [“[logterm: 8, index: 36] sent request to 76514545”] [msg=MsgRequestPreVote] [term=9] [id=76514545] [log_index=36] [log_term=8] [raft_id=76514547] [region_id=76514544]
[2022/01/26 10:19:39.534 +08:00] [INFO] [raft.rs:949] [“[logterm: 8, index: 36] sent request to 76514546”] [msg=MsgRequestPreVote] [term=9] [id=76514546] [log_index=36] [log_term=8] [raft_id=76514547] [region_id=76514544]
[2022/01/26 10:19:39.534 +08:00] [INFO] [raft.rs:1203] [“starting a new election”] [term=16] [raft_id=72055070] [region_id=30409794]
[2022/01/26 10:19:39.534 +08:00] [INFO] [raft.rs:859] [“became pre-candidate at term 16”] [term=16] [raft_id=72055070] [region_id=30409794]
[2022/01/26 10:19:39.534 +08:00] [INFO] [raft.rs:928] [“72055070 received message from 72055070”] [term=16] [msg=MsgRequestPreVote] [from=72055070] [id=72055070] [raft_id=72055070] [region_id=30409794]
[2022/01/26 10:19:39.534 +08:00] [INFO] [raft.rs:949] [“[logterm: 15, index: 2497] sent request to 70960910”] [msg=MsgRequestPreVote] [term=16] [id=70960910] [log_index=2497] [log_term=15] [raft_id=72055070] [region_id=30409794]
[2022/01/26 10:19:39.534 +08:00] [INFO] [raft.rs:949] [“[logterm: 15, index: 2497] sent request to 30409795”] [msg=MsgRequestPreVote] [term=16] [id=30409795] [log_index=2497] [log_term=15] [raft_id=72055070] [region_id=30409794]

1 个赞

reload 时可以调大 evict leader 超时时间 --transfer-timeout;
leader transfer 慢可能是 PD 生成 operator 慢,可以通过 PD 监控 create operator 和 operator step 耗时来确认;也可能是 TiKV 执行 transfer leader 慢,在写入压力比较大leader 和 follower 的日志差的比较多的情况下可能会拒绝 transfer leader。

2 个赞

现在集群有几个kv节点出现大量 [“starting a new election”] 日志,持续好几个小时了,集群处于查询阻塞状态

1 个赞

snap 空间释放了没有,是不是单个 kv 节点 region 数太多一直刷日志,还是有其他 ERROR 级别的报错,查询阻塞状态是什么意思,有没有超时报错之类的

1 个赞

snap已经释放了,现在就是频繁在刷 [“starting a new election”] 。 没有发现ERROR级别的报错。部分查询现在会爆 region is unavailable[try again later]。

另外通过 region check down-peer /pending-peer 发现不少down和pending的peer,不知道是否有影响

1 个赞

down peer 和 pending peer 可以在 PD 监控看到,如果一个 region 的多个副本都是这个状态,有可能报错 region is unavailable,看看 tikv store 状态是 up 还是 disconnected

1 个赞

tikv store 的状态是up,down和pending状态的region有增加Test-Cluster-PD_2022-01-26T10_57_11.978Z.json (6.7 MB)

1 个赞

这里有个残留的 evict leader 的调度,可以先删掉
https://docs.pingcap.com/zh/tidb/v4.0/pd-control

scheduler config evict-leader-scheduler // v4.0.0 起,展示该调度器具体在哪些 store 上
scheduler remove grant-leader-scheduler-1 // 把对应的调度器删掉,1 对应 store ID

此话题已在最后回复的 1 分钟后被自动关闭。不再允许新回复。