TiDB 7.1.0: tidb-server reports TSO fetch timeouts, TiKV connection timeouts, and high CPU load

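【TiDB Environment】Production / Test / PoC
【TiDB Version】7.1.0
【Reproduction Steps】Not reproduced
【Problem: Symptoms and Impact】
The TiDB nodes are logging a large number of errors, including a panic. I would appreciate any guidance on how to troubleshoot this.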

[2023/12/25 10:41:58.622 +08:00] [WARN] [pd.go:152] ["get timestamp too slow"] ["cost time"=50.144859ms]

[2023/12/25 10:42:20.048 +08:00] [ERROR] [tso_dispatcher.go:178] ["[tso] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2023/12/25 10:42:20.049 +08:00] [ERROR] [tso_dispatcher.go:453] ["[tso] getTS error"] [dc-location=global] [stream-addr=http://127.0.1.151:2379] [error="[PD:client:ErrClientGetTSO]rpc error: code = Canceled desc = context canceled: rpc error: code = Canceled desc = context canceled"]

[2023/12/25 10:42:20.152 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020929] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.64:47920]
[2023/12/25 10:42:20.153 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020955] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.64:48056]
[2023/12/25 10:42:20.152 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020871] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.64:47632]
[2023/12/25 10:42:20.153 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020941] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.64:48034]
[2023/12/25 10:42:20.153 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020937] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.64:47954]
[2023/12/25 10:42:20.153 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020839] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.64:47438]
[2023/12/25 10:42:20.153 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020949] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.82:57506]
[2023/12/25 10:42:20.153 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020983] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.64:48164]
[2023/12/25 10:42:20.153 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020945] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.64:48044]
[2023/12/25 10:42:20.152 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020923] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.64:47898]
[2023/12/25 10:42:20.153 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020861] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.82:57118]

After that, an alert fired for a tidb-server panic.
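To see how the warnings and errors cluster in time and which clients they come from, a log excerpt like the one above can be tallied with a short script. This is only a sketch: the function name `summarize` and the two regular expressions are hypothetical helpers written against the line shape visible in this post, not part of any TiDB tooling, and they assume each log entry sits on a single line.

```python
import re
from collections import Counter

# Line shape seen in the excerpt:
# [timestamp] [LEVEL] [file.go:line] ["message"] [field=value] ...
LINE_RE = re.compile(
    r'^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<src>[^\]]+)\] \["(?P<msg>[^"]*)"\]'
)
# Optional client address field, e.g. ["remote addr"=127.0.1.64:47920]
REMOTE_RE = re.compile(r'\["remote addr"=\s*(?P<host>[\d.]+):\d+\]')


def summarize(lines):
    """Count entries per (second, level, message) and per remote host."""
    by_event = Counter()
    by_remote = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m:
            continue  # skip blank lines and anything not in the expected shape
        second = m.group("ts").split(".")[0]  # drop milliseconds and time zone
        by_event[(second, m.group("level"), m.group("msg"))] += 1
        r = REMOTE_RE.search(line)
        if r:
            by_remote[r.group("host")] += 1
    return by_event, by_remote


# Three lines copied from the excerpt in this post.
SAMPLE = [
    '[2023/12/25 10:41:58.622 +08:00] [WARN] [pd.go:152] ["get timestamp too slow"] ["cost time"=50.144859ms]',
    '[2023/12/25 10:42:20.152 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020929] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.64:47920]',
    '[2023/12/25 10:42:20.153 +08:00] [WARN] [server.go:644] ["Server.onConn handshake"] [conn=1338767557106020949] [error="[server:8052]invalid sequence 0 != 1"] ["remote addr"=127.0.1.82:57506]',
]

events, remotes = summarize(SAMPLE)
for key, count in sorted(events.items()):
    print(count, key)
for host, count in remotes.most_common():
    print(count, host)
```

Running the same function over the full tidb.log around 10:42:20 would show whether the handshake warnings start in the same second as the TSO timeout, which is the kind of context the panic investigation needs.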

Could you provide a bit more context?

We are seeing a panic, timeouts when connecting to PD, and related errors, and we would like to find the root cause.

Network ping latency during the incident is shown in the figure; ping latency to the PD node was about 4 ms:

Check whether PD is healthy, and check the network.
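One way to check PD health from the tidb-server host is to query PD's HTTP API. The sketch below assumes the `/pd/api/v1/health` and `/pd/api/v1/leader` endpoints and the `name` and `health` response fields behave as in the PD versions I am aware of; please verify against your own cluster. The helper names (`fetch_json`, `unhealthy_members`, `check_pd`) are hypothetical, and the address in the comment is simply the stream-addr from the log above.

```python
import json
import urllib.request


def fetch_json(url, timeout=3.0):
    """GET a URL and decode the JSON body."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def unhealthy_members(health):
    """Return the names of members whose 'health' flag is not true.

    Assumes the health endpoint returns a list of objects that each
    carry 'name' and 'health' keys.
    """
    return [m.get("name", "?") for m in health if not m.get("health", False)]


def check_pd(pd_addr, timeout=3.0):
    """Report the current leader name and any unhealthy PD members."""
    health = fetch_json(f"{pd_addr}/pd/api/v1/health", timeout)
    leader = fetch_json(f"{pd_addr}/pd/api/v1/leader", timeout)
    return {"leader": leader.get("name"), "unhealthy": unhealthy_members(health)}


# Example call (not executed here because it needs a reachable PD):
# print(check_pd("http://127.0.1.151:2379"))
```

If the request itself times out from the TiDB host while PD reports healthy locally, that points toward the network path rather than PD.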

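1. Check network connectivity and latency.
2. Check the PD nodes for performance problems and resource usage.
3. Check whether the PD node logs show anything abnormal.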
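For the first item, ICMP ping alone can look fine while TCP connections to the PD client port stall, so it may help to time the TCP handshake as well. The following is a minimal sketch using only the Python standard library; `tcp_connect_ms` and `probe` are hypothetical helper names, and the host and port in the comment are taken from the stream-addr in the log above.

```python
import socket
import statistics
import time


def tcp_connect_ms(host, port, timeout=1.0):
    """Time one TCP handshake in milliseconds; return None if it fails."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.perf_counter() - start) * 1000.0
    except OSError:  # covers refused connections and timeouts
        return None


def probe(host, port, count=20, interval=0.2, timeout=1.0):
    """Run several handshakes and summarize successes, failures, and latency."""
    samples = []
    failures = 0
    for _ in range(count):
        ms = tcp_connect_ms(host, port, timeout)
        if ms is None:
            failures += 1
        else:
            samples.append(ms)
        time.sleep(interval)
    return {
        "ok": len(samples),
        "failed": failures,
        "median_ms": statistics.median(samples) if samples else None,
        "max_ms": max(samples) if samples else None,
    }


# Example call from the tidb-server host (not executed here):
# print(probe("127.0.1.151", 2379))
```

Leaving this running from the tidb-server host across the time window where the problem recurs would show whether failures or latency spikes line up with the `get TSO timeout` errors.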

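The logs in the first image show that the connection to PD was dropped partway through a TSO transfer. Start at the network layer and work out whether the network fault has a physical cause or a software cause.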

Take a look at resource usage on the PD leader.

The CPU was fairly idle during the incident.