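While running a test TiDB cluster, I noticed that the TiKV log volume was excessive. Looking through the logs, I found a very large number of ERROR entries, although the cluster can still read and write normally.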

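【TiDB environment】Test environment, being prepared for production
【TiDB version】5.4.1
【Problem encountered】A large number of ERROR entries in the TiKV logs
【Reproduction path】None so far. The cluster is not in production yet; once it was set up, DM was used to replicate data into it from MySQL.
【Symptoms and impact】Unclear at the moment. The cluster is not in production and data is still being replicated, but I am worried that problems will show up once it goes to production.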

【Attachments】
Below is a tally of some of the ERROR entries in the TiKV logs
Count Log entry
3 [peer.rs:4305] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 983014 store_id: 1”] [peer_id=983016] [region_id=983013] [type=MsgHibernateRequest]
3 [peer.rs:4305] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 993014 store_id: 1”] [peer_id=993016] [region_id=993013] [type=MsgHibernateRequest]
5 [raft_client.rs:504] [“connection aborted”] [addr=172.18.3.42:20160] [receiver_err=“Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: “Connection reset by peer”, details: [] }))”] [sink_error=“Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: “Connection reset by peer”, details: [] })))”] [store_id=2]
7 [gc_manager.rs:357] [“failed to get safe point from pd”] [err_code=KV:Storage:Unknown] [err=“Error(Other(”[src/server/gc_worker/gc_worker.rs:67]: failed to get safe point from PD: Other(\"[components/pd_client/src/util.rs:384]: request retry exceeds limit\")"))"]
8 [server.rs:1071] [“failed to init io snooper”] [err_code=KV:Unknown] [err="“IO snooper is not started due to not compiling with BCC”"]
9 [raft_client.rs:748] [“resolve store address failed”] [err_code=KV:Unknown] [err=“Other(”[src/server/resolve.rs:102]: RpcFailure: 2-UNKNOWN rpc error: code = Unavailable desc = not leader")"] [store_id=1]
9 [util.rs:679] [“failed to connect to PD member”] [error=“RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: “Deadline Exceeded”, details: [] })”] [endpoints=http://172.18.3.41:2379]
10 [kv.rs:734] [“KvService::batch_raft send response fail”] [err=RemoteStopped]
10 [util.rs:679] [“failed to connect to PD member”] [error=“RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: “Deadline Exceeded”, details: [] })”] [endpoints=http://172.18.3.40:2379]
12 [raft_client.rs:504] [“connection aborted”] [addr=172.18.3.41:20160] [receiver_err=“Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: “failed to connect to all addresses”, details: [] }))”] [sink_error=“Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: “failed to connect to all addresses”, details: [] })))”] [store_id=1]
13 [util.rs:419] [“request failed, retry”] [err_code=KV:PD:gRPC] [err=“Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: “Deadline Exceeded”, details: [] }))”]
14 [raft_client.rs:504] [“connection aborted”] [addr=172.18.3.42:20160] [receiver_err=“Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: “failed to connect to all addresses”, details: [] }))”] [sink_error=“Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: “failed to connect to all addresses”, details: [] })))”] [store_id=2]
19 [pd.rs:1139] [“store heartbeat failed”] [err=“Other(”[components/pd_client/src/util.rs:384]: request retry exceeds limit")"]
21 [peer.rs:4305] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 135006 store_id: 1”] [peer_id=135008] [region_id=135005] [type=MsgHibernateRequest]
21 [peer.rs:4305] [“failed to send extra message”] [err_code=KV:Raftstore:Transport] [err=Transport(Full)] [target=“id: 454 store_id: 1”] [peer_id=456] [region_id=453] [type=MsgHibernateRequest]
25 [util.rs:469] [“reconnect failed”] [err_code=KV:PD:Unknown] [err=“Other(”[components/pd_client/src/util.rs:598]: failed to connect to [name: \“pd-2\” member_id: 4194890414609143144 peer_urls: \“http://172.18.3.41:2380\” client_urls: \“http://172.18.3.41:2379\”, name: \“pd-3\” member_id: 6804184977433165527 peer_urls: \“http://172.18.3.42:2380\” client_urls: \“http://172.18.3.42:2379\”, name: \“pd-1\” member_id: 9398170741118694543 peer_urls: \“http://172.18.3.40:2380\” client_urls: \“http://172.18.3.40:2379\”]")"]
31 [raft_client.rs:504] [“connection aborted”] [addr=172.18.3.41:20160] [receiver_err=“Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: “failed to connect to all addresses”, details: [] }))”] [sink_error=Some(RemoteStopped)] [store_id=1]
39 [raft_client.rs:504] [“connection aborted”] [addr=172.18.3.42:20160] [receiver_err=“Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: “failed to connect to all addresses”, details: [] }))”] [sink_error=Some(RemoteStopped)] [store_id=2]
47 [raft_client.rs:776] [“connection abort”] [addr=172.18.3.41:20160] [store_id=1]
60 [raft_client.rs:776] [“connection abort”] [addr=172.18.3.42:20160] [store_id=2]
69 [util.rs:469] [“reconnect failed”] [err_code=KV:PD:Unknown] [err=“Other(”[components/pd_client/src/util.rs:306]: cancel reconnection due to too small interval")"]
79 [util.rs:419] [“request failed, retry”] [err_code=KV:PD:Unknown] [err=“Other(”[components/pd_client/src/tso.rs:88]: Timestamp channel is dropped")"]
103 [util.rs:460] [“request failed”] [err_code=KV:PD:gRPC] [err=“Grpc(RpcFailure(RpcStatus { code: 2-UNKNOWN, message: “rpc error: code = Unavailable desc = not leader”, details: [] }))”]
116 [util.rs:592] [“connect failed”] [error=“Grpc(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: “failed to connect to all addresses”, details: [] }))”] [endpoints=http://172.18.3.41:2379]
119 [util.rs:419] [“request failed, retry”] [err_code=KV:PD:Unknown] [err=“Other(”[components/pd_client/src/client.rs:883]: get timestamp timeout")"]
473 [util.rs:419] [“request failed, retry”] [err_code=KV:PD:gRPC] [err=“Grpc(RpcFailure(RpcStatus { code: 2-UNKNOWN, message: “rpc error: code = Unavailable desc = not leader”, details: [] }))”]
555 [util.rs:592] [“connect failed”] [error=“Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: “Deadline Exceeded”, details: [] }))”] [endpoints=http://172.18.3.41:2379]
2636 [util.rs:419] [“request failed, retry”] [err_code=KV:PD:Unknown] [err=“Other(”[components/pd_client/src/tso.rs:85]: TimestampRequest channel is closed")"]
115884 [util.rs:419] [“request failed, retry”] [err_code=KV:PD:Unknown] [err=“Other(TrySendError { kind: Disconnected })”]
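
For anyone who wants to produce the same kind of tally, here is a minimal sketch in Python. It assumes each TiKV log line follows the `[time] [LEVEL] [file.rs:line] ["message"] [key=value]...` layout; the function name `tally_errors` and the sample timestamps are made up for illustration, while the two ERROR messages are taken from the list above.

```python
import re
from collections import Counter

# Assumed line layout: [time] [LEVEL] [file.rs:line] rest-of-entry
LINE = re.compile(r'^\[[^\]]*\] \[(\w+)\] \[([^\]]+)\] (.*)$')


def tally_errors(lines):
    """Count identical ERROR entries, ignoring the timestamp."""
    counts = Counter()
    for line in lines:
        m = LINE.match(line.rstrip("\n"))
        if m and m.group(1) == "ERROR":
            # Key on source location plus the remaining fields.
            counts[(m.group(2), m.group(3))] += 1
    return counts


# Sample lines: timestamps and the INFO line are invented for this sketch.
SAMPLE = [
    '[2022/01/01 10:00:00.000 +08:00] [ERROR] [util.rs:419] ["request failed, retry"] '
    '[err_code=KV:PD:Unknown] [err="Other(TrySendError { kind: Disconnected })"]',
    '[2022/01/01 10:00:01.000 +08:00] [ERROR] [util.rs:419] ["request failed, retry"] '
    '[err_code=KV:PD:Unknown] [err="Other(TrySendError { kind: Disconnected })"]',
    '[2022/01/01 10:00:02.000 +08:00] [INFO] [example.rs:1] ["placeholder info line"]',
    '[2022/01/01 10:00:03.000 +08:00] [ERROR] [kv.rs:734] '
    '["KvService::batch_raft send response fail"] [err=RemoteStopped]',
]

counts = tally_errors(SAMPLE)
# Print in ascending order of count, like the list in the post.
for (source, rest), n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(n, f"[{source}]", rest)
```

To run it against a real file, pass an open file object instead of `SAMPLE`, for example `tally_errors(open("tikv.log", encoding="utf-8"))`, where `tikv.log` stands for whatever path your deployment uses.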

It looks like this may be a problem with communication between nodes.

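The main point is that this is a test environment with no business traffic on it yet. Only DM is replicating data, and there are already this many ERRORs. I don't dare take it to production like this :cold_sweat: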

I have confirmed that all ports of the TiDB cluster are currently reachable.

What is the hardware configuration of the test environment?

[util.rs:469] [“reconnect failed”] [err_code=KV:PD:Unknown] [err=“Other(”[components/pd_client/src/util.rs:598]: failed to connect to [name: \“pd-2\” member_id: 4194890414609143144 peer_urls: \“http://172.18.3.41:2380\” client_urls: \“http://172.18.3.41:2379\”, name: \“pd-3\” member_id: 6804184977433165527 peer_urls: \“http://172.18.3.42:2380\” client_urls: \“http://172.18.3.42:2379\”, name: \“pd-1\” member_id: 9398170741118694543 peer_urls: \“http://172.18.3.40:2380\” client_urls: \“http://172.18.3.40:2379\”]")"]

Everything listed above is a PD port. If PD itself is unreachable, how did you test it…
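
One way to see what PD itself reports, rather than inferring it from the TiKV side, is to ask a PD endpoint for its health list. Below is a rough sketch, assuming PD's HTTP API serves `/pd/api/v1/health` on the client port (2379 here) and returns a JSON list whose items carry `name` and `health` fields. `fetch_pd_health` and `unhealthy_members` are names made up for this example.

```python
import json
import urllib.request


def fetch_pd_health(endpoint, timeout=3.0):
    """Fetch the health list from one PD endpoint (assumed API path)."""
    url = endpoint.rstrip("/") + "/pd/api/v1/health"
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def unhealthy_members(health):
    """Return names of members whose 'health' field is missing or false."""
    return [m.get("name", "?") for m in health if not m.get("health", False)]
```

Usage would look like `unhealthy_members(fetch_pd_health("http://172.18.3.40:2379"))`. If the fetch itself times out, that already says something about PD responsiveness from that host, which matches the `Deadline Exceeded` errors in the tally.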

4P8G (4 CPUs, 8 GB of memory). At the moment only DM is writing; there are no queries or anything like that.

Three machines. I tracked it down: the cause is that the server load on one of the nodes is too high.
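
For a quick look at whether a node is overloaded, comparing the load average with the CPU count is a common first check. This is a minimal sketch for Linux/Unix hosts; the threshold of 1.0 per CPU is an assumption for illustration, not a TiDB recommendation, and the function names are made up.

```python
import os


def load_per_cpu(load_1min, cpu_count):
    """Normalize the 1-minute load average by the number of CPUs."""
    return load_1min / max(cpu_count, 1)


def is_overloaded(load_1min, cpu_count, threshold=1.0):
    """True when the per-CPU load exceeds the (assumed) threshold."""
    return load_per_cpu(load_1min, cpu_count) > threshold


def current_status():
    """Read the live values from the host (Unix only)."""
    load_1min, _, _ = os.getloadavg()
    cpus = os.cpu_count() or 1
    return load_1min, cpus, is_overloaded(load_1min, cpus)
```

Calling `current_status()` on each of the three machines gives a comparable number per node. Load average alone does not show memory pressure, so it is worth checking memory usage on the same hosts as well.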

Memory reclamation is so slow...

The hardware is not sufficient; it can't keep the cluster up.

Judging from the logs, the configuration is probably the problem.


Could it be a network problem? Or is the throughput too high? These exceptions look like communication problems.

[“connection aborted”] [addr=172.18.3.41:20160] Try telnet against this IP and port to see whether it responds normally.
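
If telnet is not installed, the same check can be sketched in Python with a plain TCP connect. The addresses in `TARGETS` are the TiKV and PD endpoints that appear in the logs above; `port_open` and `report` are names made up for this example.

```python
import socket

# Endpoints taken from the log entries in this thread.
TARGETS = [
    ("172.18.3.41", 20160),
    ("172.18.3.42", 20160),
    ("172.18.3.40", 2379),
    ("172.18.3.41", 2379),
    ("172.18.3.42", 2379),
]


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def report(targets):
    """Print one line per target with the connect result."""
    for host, port in targets:
        state = "open" if port_open(host, port) else "unreachable"
        print(f"{host}:{port} {state}")
```

Run `report(TARGETS)` from each TiKV host. A successful connect only shows the port was reachable at that moment; it does not rule out intermittent drops while a node is under heavy load.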