单个tikv节点反复重启

【 TiDB 使用环境】生产环境
【 TiDB 版本】 v4.0.14
【复现路径】正常线上使用
【遇到的问题:问题现象及影响】
这台物理机中有两个节点,其中集群中一个固定tikv节点在几个小时几反复重启4次。(另外一个节点没问题)日志如下:
[2024/10/16 02:15:14.152 +08:00] [INFO] [raft.rs:971] [“[logterm: 6, index: 6, vote: 247410446] ignored vote from 247410444 [logterm: 6, index: 6]: lease is not expired”] [“msg type”=MsgRequestPreVote] [“remaining ticks”=1] [term=6] [msg_index=6] [msg_term=6] [from=247410444] [vote=247410446] [log_index=6] [log_term=6] [raft_id=247410446] [region_id=247410443]
[2024/10/16 02:15:14.355 +08:00] [WARN] [endpoint.rs:537] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 236493048, leader may Some(id: 236493051 store_id: 163501693)" not_leader { region_id: 236493048 leader { id: 236493051 store_id: 163501693 } }”]
[2024/10/16 02:15:14.467 +08:00] [ERROR] [] [“ipv4:10.20.70.11:59186: Keepalive watchdog fired. Closing transport.”]
[2024/10/16 02:15:14.475 +08:00] [ERROR] [] [“ipv4:10.20.70.19:2349: Keepalive watchdog fired. Closing transport.”]
[2024/10/16 02:15:14.510 +08:00] [INFO] [raft.rs:1177] [“starting a new election”] [term=81] [raft_id=173885086] [region_id=173885084]
[2024/10/16 02:15:14.510 +08:00] [INFO] [raft.rs:833] [“became pre-candidate at term 81”] [term=81] [raft_id=173885086] [region_id=173885084]

[2024/10/16 02:11:49.688 +08:00] [INFO] [raft.rs:1192] [“[logterm: 61, index: 64, vote: 0] cast vote for 168602012 [logterm: 61, index: 64] at term 62”] [“msg type”=MsgRequestVote] [term=62] [msg_index=64] [msg_term=61] [from=168602012] [vote=0] [log_index=64] [log_term=61] [raft_id=168602010] [region_id=168602009]
[2024/10/16 02:11:53.289 +08:00] [ERROR] [] [“ipv4:10.20.70.18:20161: Keepalive watchdog fired. Closing transport.”]
[2024/10/16 02:11:53.433 +08:00] [WARN] [raft_client.rs:296] [“RPC batch_raft fail”] [err=“Some(RpcFailure(RpcStatus { status: 14-UNAVAILABLE, details: Some("keepalive watchdog timeout") }))”] [sink_err=“Some(RpcFinished(Some(RpcStatus { status: 14-UNAVAILABLE, details: Some("keepalive watchdog timeout") })))”] [to_addr=10.20.70.18:20161]
[2024/10/16 02:11:53.435 +08:00] [INFO] [store.rs:2258] [“broadcasting unreachable”] [unreachable_store_id=163499807] [store_id=163501697]
[2024/10/16 02:11:53.458 +08:00] [WARN] [raft_client.rs:199] [“send to 10.20.70.18:20161 fail, the gRPC connection could be broken”]
[2024/10/16 02:11:53.458 +08:00] [ERROR] [transport.rs:163] [“send raft msg err”] [err=“Other("[src/server/raft_client.rs:208]: RaftClient send fail")”]
[2024/10/16 02:11:53.459 +08:00] [INFO] [transport.rs:144] [“resolve store address ok”] [addr=10.20.70.18:20161] [store_id=163499807]
[2024/10/16 02:11:53.459 +08:00] [INFO] [raft_client.rs:48] [“server: new connection with tikv endpoint”] [addr=10.20.70.18:20161]

1] [raft_id=168602010] [region_id=168602009]
[2024/10/16 02:11:53.289 +08:00] [ERROR] [] [“ipv4:10.20.70.18:20161: Keepalive watchdog fired. Closing transport.”]
[2024/10/16 02:11:53.433 +08:00] [WARN] [raft_client.rs:296] [“RPC batch_raft fail”] [err=“Some(RpcFailure(RpcStatus { status: 14-UNAVAILABLE, details: Some("keepalive watchdog timeout") }))”] [sink_err=“Some(RpcFinished(Some(RpcStatus { status: 14-UNAVAILABLE, details: Some("keepalive watchdog timeout") })))”] [to_addr=10.20.70.18:20161]
[2024/10/16 02:11:53.435 +08:00] [INFO] [store.rs:2258] [“broadcasting unreachable”] [unreachable_store_id=163499807] [store_id=163501697]
[2024/10/16 02:11:53.458 +08:00] [WARN] [raft_client.rs:199] [“send to 10.20.70.18:20161 fail, the gRPC connection could be broken”]
[2024/10/16 02:11:53.458 +08:00] [ERROR] [transport.rs:163] [“send raft msg err”] [err=“Other("[src/server/raft_client.rs:208]: RaftClient send fail")”]
[2024/10/16 02:11:53.459 +08:00] [INFO] [transport.rs:144] [“resolve store address ok”] [addr=10.20.70.18:20161] [store_id=163499807]
[2024/10/16 02:11:53.459 +08:00] [INFO] [raft_client.rs:48] [“server: new connection with tikv endpoint”] [addr=10.20.70.18:20161]
[2024/10/16 02:11:53.566 +08:00] [INFO] [] [“New connected subchannel at 0x7f85d7234640 for subchannel 0x7f86dee68540”]

[2024/10/16 02:13:05.910 +08:00] [INFO] [raft.rs:923] [“[logterm: 829, index: 1929] sent request to 213317076”] [msg=MsgRequestPreVote] [term=829] [id=213317076] [log_index=1929] [log_term=829] [raft_id=192993955] [region_id=1607]
[2024/10/16 02:13:05.910 +08:00] [INFO] [raft.rs:923] [“[logterm: 829, index: 1929] sent request to 221007931”] [msg=MsgRequestPreVote] [term=829] [id=221007931] [log_index=1929] [log_term=829] [raft_id=192993955] [region_id=1607]
[2024/10/16 02:13:05.911 +08:00] [ERROR] [kv.rs:597] [“dispatch raft msg from gRPC to raftstore fail”] [err=Grpc(RpcFinished(None))]
[2024/10/16 02:13:05.911 +08:00] [ERROR] [kv.rs:603] [“KvService::batch_raft send response fail”] [err=RemoteStopped]
[2024/10/16 02:13:05.916 +08:00] [INFO] [raft.rs:1177] [“starting a new election”] [term=10] [raft_id=241247020] [region_id=241247019]
[2024/10/16 02:13:05.916 +08:00] [INFO] [raft.rs:833] [“became pre-candidate at term 10”] [term=10] [raft_id=241247020] [region_id=241247019]

[2024/10/16 02:15:18.721 +08:00] [INFO] [raft.rs:783] [“became follower at term 78”] [term=78] [raft_id=168987437] [region_id=168987436]
[2024/10/16 02:15:18.754 +08:00] [WARN] [raft.rs:1481] [“stepped down to follower since quorum is not active”] [raft_id=245775907] [region_id=245775905]
[2024/10/16 02:15:18.754 +08:00] [INFO] [raft.rs:783] [“became follower at term 10”] [term=10] [raft_id=245775907] [region_id=245775905]
[2024/10/16 02:15:18.871 +08:00] [INFO] [split_check.rs:277] [“update approximate size and keys with accurate value”] [keys=1044689] [size=201326415] [region_id=245805961]
[2024/10/16 02:15:18.871 +08:00] [INFO] [peer.rs:3331] [“on split”] [source=“split checker”] [split_keys=“key 7480000000000022FF5C5F72800000002DFF2EA8670000000000FAF9D5FDB8AC87F905”] [peer_id=245805964] [region_id=245805961]
[2024/10/16 02:15:18.872 +08:00] [INFO] [pd.rs:507] [“try to batch split region”] [task=batch_split] [region=“id: 245805961 start_key: 7480000000000022FF5C5F72800000002DFF26A2E50000000000FA end_key: 7480000000000022FF5C5F72800000002DFF36AB0F0000000000FA region_epoch { conf_ver: 4154 version: 4830 } peers { id: 245805962 store_id: 163499805 } peers { id: 245805963 store_id: 163499807 } peers { id: 245805964 store_id: 163501697 }”] [new_region_ids=“[new_region_id: 247411718 new_peer_ids: 247411719 new_peer_ids: 247411720 new_peer_ids: 247411721]”] [region_id=245805961]
[2024/10/16 02:15:19.027 +08:00] [ERROR] [] [“assertion failed: slice_buffer->length <= UINT32_MAX”]
[2024/10/16 02:16:03.531 +08:00] [INFO] [lib.rs:94] [“Welcome to TiKV”]
[2024/10/16 02:16:03.531 +08:00] [INFO] [lib.rs:96]
[2024/10/16 02:16:03.531 +08:00] [INFO] [lib.rs:96] [“Release Version: 4.0.14”]
[2024/10/16 02:16:03.531 +08:00] [INFO] [lib.rs:96] [“Edition: Community”]
[2024/10/16 02:16:03.531 +08:00] [INFO] [lib.rs:96] [“Git Commit Hash: d7dc4fff51ca71c76a928a0780a069efaaeaae70”]
[2024/10/16 02:16:03.531 +08:00] [INFO] [lib.rs:96] [“Git Commit Branch: heads/refs/tags/v4.0.14”]
[2024/10/16 02:16:03.531 +08:00] [INFO] [lib.rs:96] [“UTC Build Time: 2021-07-17 06:28:14”]
[2024/10/16 02:16:03.531 +08:00] [INFO] [lib.rs:96] [“Rust Version: rustc 1.42.0-nightly (0de96d37f 2019-12-19)”]
[2024/10/16 02:16:03.531 +08:00] [INFO] [lib.rs:96] [“Enable Features: jemalloc mem-profiling portable sse protobuf-codec”]
[2024/10/16 02:16:03.531 +08:00] [INFO] [lib.rs:96] [“Profile: dist_release”]

系统日志如下:
Oct 16 02:15:19 l70-27-p-yz abrt-hook-ccpp: Process 19941 (tikv-server) of user 1004 killed by SIGABRT - dumping core
Oct 16 02:15:36 l70-27-p-yz abrt-server: Executable ‘/data1/tidb-deploy/tikv-20160/bin/tikv-server’ doesn’t belong to any package and ProcessUnpackaged is set to ‘no’
Oct 16 02:15:36 l70-27-p-yz abrt-server: ‘post-create’ on ‘/var/spool/abrt/ccpp-2024-10-16-02:15:19-19941’ exited with 1
Oct 16 02:15:36 l70-27-p-yz abrt-server: Deleting problem directory ‘/var/spool/abrt/ccpp-2024-10-16-02:15:19-19941’
Oct 16 02:15:48 l70-27-p-yz systemd: tikv-20160.service: main process exited, code=killed, status=6/ABRT
Oct 16 02:15:48 l70-27-p-yz systemd: Unit tikv-20160.service entered failed state.
Oct 16 02:15:48 l70-27-p-yz systemd: tikv-20160.service failed.
Oct 16 02:16:03 l70-27-p-yz systemd: tikv-20160.service holdoff time over, scheduling restart.
Oct 16 02:16:03 l70-27-p-yz systemd: Stopped tikv service.
Oct 16 02:16:03 l70-27-p-yz systemd: Started tikv service.
Oct 16 02:16:03 l70-27-p-yz run_tikv.sh: sync …
Oct 16 02:16:03 l70-27-p-yz run_tikv.sh: real#0110m0.022s
Oct 16 02:16:03 l70-27-p-yz run_tikv.sh: user#0110m0.000s
Oct 16 02:16:03 l70-27-p-yz run_tikv.sh: sys#0110m0.020s
Oct 16 02:16:03 l70-27-p-yz run_tikv.sh: ok

原因不是oom.

看看系统资源使用情况;
看下操作系统日志,应该是oom之类的

主机内存资源对应时间点看一下

看看监控,混合部署要看内存占用的限制写对没,不然很容易oom

内存是256G,每个节点设置56G

版本有些老啊,有考虑过升级吗?

计划是要升级的

测试好后,尽快实施。4.0的版本已经停更几年了。

看看对应时间点的资源使用情况