All TiKV nodes report the error "failed to send extra message"

[TiDB environment] Production
[TiDB version] 7.1.0
[Reproduction path] The error is reported continuously
[Problem encountered] All TiKV nodes report the error "failed to send extra message"
[TiKV log]
[2024/01/10 23:12:53.145 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=Grpc(RemoteStopped)] [region_id=23107861] [to_addr=172.25.227.51:20170]
[2024/01/10 23:12:53.148 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=“Grpc(RpcFinished(Some(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "the number of received snapshot tasks 32 exceeded the limitation 32", details: [] })))”] [region_id=23101477] [to_addr=172.25.227.51:20170]
[2024/01/10 23:12:53.149 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=“Grpc(RpcFinished(Some(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "the number of received snapshot tasks 32 exceeded the limitation 32", details: [] })))”] [region_id=23109861] [to_addr=172.25.227.51:20170]
[2024/01/10 23:12:53.150 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=“Grpc(RpcFinished(Some(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "the number of received snapshot tasks 32 exceeded the limitation 32", details: [] })))”] [region_id=23101465] [to_addr=172.25.227.51:20170]
[2024/01/10 23:13:59.211 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=“Grpc(RpcFinished(Some(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "the number of received snapshot tasks 32 exceeded the limitation 32", details: [] })))”] [region_id=23092977] [to_addr=172.25.227.51:20170]
[2024/01/10 23:13:59.212 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=“Grpc(RpcFailure(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "the number of received snapshot tasks 32 exceeded the limitation 32", details: [] }))”] [region_id=23094445] [to_addr=172.25.227.51:20170]
[2024/01/10 23:13:59.212 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=“Grpc(RpcFinished(Some(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "the number of received snapshot tasks 32 exceeded the limitation 32", details: [] })))”] [region_id=23092865] [to_addr=172.25.227.51:20170]
[2024/01/10 23:14:07.319 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=“Grpc(RpcFinished(Some(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "the number of received snapshot tasks 32 exceeded the limitation 32", details: [] })))”] [region_id=23096025] [to_addr=172.25.227.51:20170]
[2024/01/10 23:14:08.727 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=“Grpc(RpcFinished(Some(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "the number of received snapshot tasks 32 exceeded the limitation 32", details: [] })))”] [region_id=23109213] [to_addr=172.25.227.51:20170]
[2024/01/10 23:14:42.200 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=“Grpc(RpcFinished(Some(RpcStatus { code: 8-RESOURCE_EXHAUSTED, message: "the number of received snapshot tasks 32 exceeded the limitation 32", details: [] })))”] [region_id=23092101] [to_addr=172.25.227.51:20170]
[2024/01/10 23:14:42.390 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=“Grpc(RpcFinished(Some(RpcStatus { code: 1-CANCELLED, message: "CANCELLED", details: [] })))”] [region_id=23094665] [to_addr=172.25.227.51:20170]
[2024/01/10 23:17:01.316 +08:00] [ERROR] [transport.rs:99] [“failed to send significant msg”] [msg=“RaftlogFetched(FetchedLogs { context: GetEntriesContext(SendAppend { to: 23091188, term: 7, aggressively: false }), logs: RaftlogFetchResult { ents: Ok([]), low: 6, max_size: 1048576, hit_size_limit: true, tried_cnt: 1, term: 7 } })”]
[2024/01/10 23:17:27.779 +08:00] [ERROR] [pd.rs:2393] [“send request failed”] [err=“"Disconnected(…)"”] [cmd_type=PrepareMerge] [region_id=23091037]
[2024/01/10 23:25:08.617 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 19665127 store_id: 1”] [peer_id=19665128] [region_id=19665126] [type=MsgHibernateRequest]
[2024/01/10 23:25:08.807 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 231563 store_id: 1”] [peer_id=4427582] [region_id=231561] [type=MsgHibernateRequest]
[2024/01/10 23:25:08.807 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 23091072 store_id: 4744287”] [peer_id=23091070] [region_id=23091069] [type=MsgHibernateRequest]
[2024/01/10 23:25:08.807 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 5034971 store_id: 4744287”] [peer_id=4436417] [region_id=15841] [type=MsgHibernateRequest]
[2024/01/10 23:25:08.807 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 5026146 store_id: 4744287”] [peer_id=4436414] [region_id=4427171] [type=MsgHibernateRequest]
[2024/01/10 23:25:08.807 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 5034352 store_id: 4744287”] [peer_id=4431039] [region_id=18865] [type=MsgHibernateRequest]
[2024/01/10 23:25:08.807 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 5034573 store_id: 4744287”] [peer_id=4432826] [region_id=21889] [type=MsgHibernateRequest]
[2024/01/10 23:35:19.489 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 20084015 store_id: 1”] [peer_id=20084014] [region_id=20084013] [type=MsgHibernateRequest]
[2024/01/10 23:35:19.623 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 44369 store_id: 1”] [peer_id=4434098] [region_id=44367] [type=MsgHibernateRequest]
[2024/01/10 23:35:19.624 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 23065141 store_id: 4744287”] [peer_id=23065139] [region_id=23065138] [type=MsgHibernateRequest]
[2024/01/10 23:35:19.624 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 5035593 store_id: 4744287”] [peer_id=19586103] [region_id=4424257] [type=MsgHibernateRequest]
[2024/01/10 23:35:19.624 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 23058743 store_id: 1”] [peer_id=23058742] [region_id=23058741] [type=MsgHibernateRequest]
[2024/01/10 23:35:19.624 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 19969866 store_id: 1”] [peer_id=19969865] [region_id=19969864] [type=MsgHibernateRequest]
[2024/01/10 23:35:19.624 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 9286837 store_id: 1”] [peer_id=4430861] [region_id=73773] [type=MsgHibernateRequest]
[2024/01/10 23:35:19.624 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 19897511 store_id: 4744287”] [peer_id=19897509] [region_id=19897508] [type=MsgHibernateRequest]
[2024/01/10 23:35:19.643 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 5026285 store_id: 4744287”] [peer_id=4428307] [region_id=25101] [type=MsgHibernateRequest]
[2024/01/10 23:35:19.643 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 20091724 store_id: 4744287”] [peer_id=20091722] [region_id=20091721] [type=MsgHibernateRequest]
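
To see which error dominates and which destination it points at, the log above can be tallied with a short script. This is only a rough sketch: `summarize`, `SAMPLE`, and the regular expressions are illustrative names and patterns, not part of TiKV or tiup, and they assume the bracketed field layout exactly as pasted here (including the curly quotes the forum rendered in place of straight ones).

```python
import re
from collections import Counter

# The message text is the first bracketed field wrapped in quotes. The pasted
# log shows curly quotes (forum rendering); raw TiKV logs use straight quotes,
# so both are accepted. These patterns are assumptions about the log layout.
MSG_RE = re.compile(r'\[[“"]([^”"\]]+)[”"]\]')
STORE_RE = re.compile(r'store_id: (\d+)')
ADDR_RE = re.compile(r'\[to_addr=([^\]]+)\]')

# Four lines copied from the log in this post.
SAMPLE = """\
[2024/01/10 23:12:53.145 +08:00] [ERROR] [snap.rs:546] [“failed to send snap”] [err=Grpc(RemoteStopped)] [region_id=23107861] [to_addr=172.25.227.51:20170]
[2024/01/10 23:25:08.617 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 19665127 store_id: 1”] [peer_id=19665128] [region_id=19665126] [type=MsgHibernateRequest]
[2024/01/10 23:25:08.807 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 231563 store_id: 1”] [peer_id=4427582] [region_id=231561] [type=MsgHibernateRequest]
[2024/01/10 23:25:08.807 +08:00] [ERROR] [peer.rs:5327] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 23091072 store_id: 4744287”] [peer_id=23091070] [region_id=23091069] [type=MsgHibernateRequest]
"""


def summarize(lines):
    """Return (error count per message, error count per destination)."""
    by_message = Counter()
    by_target = Counter()
    for line in lines:
        if "[ERROR]" not in line:
            continue
        msg = MSG_RE.search(line)
        if msg is None:
            continue
        by_message[msg.group(1)] += 1
        # Raft messages name a target store; snapshot sends name an address.
        store = STORE_RE.search(line)
        addr = ADDR_RE.search(line)
        if store:
            by_target["store " + store.group(1)] += 1
        elif addr:
            by_target[addr.group(1)] += 1
    return by_message, by_target


if __name__ == "__main__":
    by_message, by_target = summarize(SAMPLE.splitlines())
    for name, count in by_message.most_common():
        print(count, name)
    for name, count in by_target.most_common():
        print(count, name)
```

For a full log, the same function can be fed an open file handle instead of `SAMPLE.splitlines()`. Grouping by destination makes it easier to tell whether the failures concentrate on one store or address, which is the question raised in the replies.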

What did you do before this started?

It looks like connections were dropped, and the cluster is not in a healthy state.

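Related reading on snapshots: TiKV 源码解析系列文章(十)Snapshot 的发送和接收 | PingCAP (TiKV Source Code Analysis Series, Part 10: Sending and Receiving Snapshots)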

Can you check on this node whether there is a network problem?

ping and telnet to the ports between the cluster nodes all work normally.

Try restarting that single TiKV instance with tiup.

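There was an earlier incident: I started a second TiKV instance on the same machine, and then the original one on port 20160 started normally again. After that I took the original 20160 instance offline and have been using 20170 ever since. I don't know whether that is the cause. This TiKV node has also been restarted in the meantime.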

I did restart it, and it still reports this error.

One of the TiKV nodes was restarted, but after it came back up it kept reporting this error.

Are stores 4744287 and 1 the two stores you mentioned?

How is the cluster deployed? Please post the cluster display output.


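Yesterday I restarted node 51, and the node that previously could not be taken offline has now gone offline as well. When I checked the logs this morning, the error was gone.
It was all trial and error.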

It was congested, so messages could not be sent.

You need to describe what you actually did; only then can others diagnose the problem.