tiup times out when starting TiKV

TiDB version:

v4.0.2
tiup cluster restart testcluster -R tikv

The command above timed out.

tiup cluster restart testcluster -R tikv --wait-timeout 600

The command above returns normally.

Is TiKV slow to start because there are too many regions?


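Startup does include a region validation step, but that alone should not cause a timeout. We need to look at the TiKV startup log to pin down the cause. Please provide the TiKV log and the TiKV monitoring data so we can check region distribution and scheduling.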

The full log is hard to upload, so I captured a few excerpts. This is roughly what is logged during startup:
tikv.log.txt (12.1 KB)

It ends up stuck at the not on SSD device line, then starts on its own after about 5 minutes.

Here is the detailed TiKV monitoring. The start command was run at around 17:29:
tikv-detail-monitor.pdf (28.0 MB)

During the restart, TiKV only logs the following:

[2021/12/27 10:13:08.381 +08:00] [ERROR] [kv.rs:613] ["KvService::batch_raft send response fail"] [err=RemoteStopped]
[2021/12/27 10:13:08.383 +08:00] [ERROR] [kv.rs:613] ["KvService::batch_raft send response fail"] [err=RemoteStopped]
[2021/12/27 10:13:08.415 +08:00] [ERROR] [kv.rs:613] ["KvService::batch_raft send response fail"] [err=RemoteStopped]
[2021/12/27 10:13:08.415 +08:00] [ERROR] [kv.rs:613] ["KvService::batch_raft send response fail"] [err=RemoteStopped]
[2021/12/27 10:13:08.488 +08:00] [ERROR] [peer.rs:2551] ["failed to notify pd"] [err="channel has been closed"] [peer_id=109236403] [region_id=109236400]
[2021/12/27 10:13:08.489 +08:00] [ERROR] [peer.rs:2551] ["failed to notify pd"] [err="channel has been closed"] [peer_id=109236399] [region_id=109236396]
[2021/12/27 10:13:08.506 +08:00] [ERROR] [peer.rs:2551] ["failed to notify pd"] [err="channel has been closed"] [peer_id=109239417] [region_id=109239414]
[2021/12/27 10:13:08.506 +08:00] [ERROR] [peer.rs:2551] ["failed to notify pd"] [err="channel has been closed"] [peer_id=109239413] [region_id=109239410]
[2021/12/27 10:13:08.506 +08:00] [ERROR] [peer.rs:2551] ["failed to notify pd"] [err="channel has been closed"] [peer_id=109239421] [region_id=109239418]
[2021/12/27 10:13:08.547 +08:00] [ERROR] [peer.rs:2551] ["failed to notify pd"] [err="channel has been closed"] [peer_id=109221607] [region_id=109221605]
[2021/12/27 10:13:08.658 +08:00] [ERROR] [peer.rs:2551] ["failed to notify pd"] [err="channel has been closed"] [peer_id=108569107] [region_id=13537702]
[2021/12/27 10:13:10.164 +08:00] [WARN] [raft_client.rs:296] ["RPC batch_raft fail"] [err="Some(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"\") }))"] [sink_err=None] [to_addr=10.59.111.10:20170]
[2021/12/27 10:13:10.166 +08:00] [WARN] [raft_client.rs:296] ["RPC batch_raft fail"] [err="Some(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"\") }))"] [sink_err=None] [to_addr=10.59.111.133:20160]
[2021/12/27 10:13:10.213 +08:00] [WARN] [raft_client.rs:296] ["RPC batch_raft fail"] [err="Some(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"\") }))"] [sink_err=None] [to_addr=10.59.111.224:20160]
[2021/12/27 10:13:10.249 +08:00] [WARN] [raft_client.rs:296] ["RPC batch_raft fail"] [err="Some(RpcFailure(RpcStatus { status: 2-UNKNOWN, details: Some(\"\") }))"] [sink_err=None] [to_addr=10.59.111.132:20160]
[2021/12/27 10:13:16.455 +08:00] [WARN] [lib.rs:528] ["environment variable `TZ` is missing, using `/etc/localtime`"]
[2021/12/27 10:13:16.457 +08:00] [WARN] [server.rs:827] ["check: kernel"] [err="kernel parameters net.ipv4.tcp_syncookies got 1, expect 0"]
[2021/12/27 10:13:16.705 +08:00] [WARN] [config.rs:712] ["not on SSD device"] [data_path=/data/tidb_data/tikv-20160]
[2021/12/27 10:13:16.706 +08:00] [WARN] [config.rs:712] ["not on SSD device"] [data_path=/data/tidb_data/tikv-20160/raft]
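
To see where startup actually stalls, it can help to compute the time gap between consecutive log entries instead of reading timestamps by eye. The following is a minimal sketch, assuming every log line starts with a timestamp in the format shown above; the function names `parse_ts` and `largest_gap` are made up for illustration and are not part of TiKV or TiUP.

```python
import re
from datetime import datetime

# Matches the leading timestamp of a log line such as
# [2021/12/27 10:13:16.706 +08:00] [WARN] ...
TS_RE = re.compile(
    r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3} [+-]\d{2}:\d{2})\]"
)


def parse_ts(line):
    """Return the timestamp at the start of a log line, or None if absent."""
    m = TS_RE.match(line)
    if m is None:
        return None
    # %z accepts offsets with a colon (+08:00) on Python 3.7 and later.
    return datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S.%f %z")


def largest_gap(lines):
    """Return (gap_seconds, line_before, line_after) for the largest gap
    between consecutive timestamped lines, or None if fewer than two exist."""
    best = None
    prev_ts, prev_line = None, None
    for line in lines:
        ts = parse_ts(line)
        if ts is None:
            continue  # skip lines without a leading timestamp
        if prev_ts is not None:
            gap = (ts - prev_ts).total_seconds()
            if best is None or gap > best[0]:
                best = (gap, prev_line, line)
        prev_ts, prev_line = ts, line
    return best
```

Passing the lines of the full TiKV log file to `largest_gap` returns the two entries surrounding the longest silence, which shows which message follows the roughly 5-minute pause once startup continues.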

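Judging from the errors in the log, this looks like a communication problem between the peers of a Region Group. I suggest checking the network status with tools such as telnet or ping.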
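
As a scripted alternative to running telnet by hand, here is a minimal sketch that tries a TCP connection to each peer address. The helper names `can_connect` and `check_peers` are hypothetical; the addresses you pass in would be the `to_addr` values from the log above (for example 10.59.111.10:20170). A successful connect only shows that the port is reachable, not that the TiKV service behind it is healthy.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_peers(addrs, timeout=3.0):
    """Map each 'host:port' string to whether the port is reachable."""
    results = {}
    for addr in addrs:
        host, _, port = addr.rpartition(":")
        results[addr] = can_connect(host, int(port), timeout)
    return results
```

Running `check_peers` from each TiKV node against the other nodes' addresses gives a quick picture of which direction, if any, is blocked.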

All traffic goes over the internal network, so that should not be the problem.

The nodes are all in the same data center, right? Is it back to normal now?

Yes, same data center. The problem is still the same as before :joy:

Please confirm that the firewall is not blocking the ports between the nodes and that they can communicate normally.
