tidb日志显示TIME_COP_PROCESS tidb,tikv各节点的CPU使用率异常暴增

为提高效率,请提供以下信息,问题描述清晰能够更快得到解决:

【TiDB 版本】
V4.0.9
【问题描述】
系统运行时,tidb,tikv各节点的CPU使用率异常暴增,查看日志如下:


请问是什么原因造成的?有大神解疑一下吗?

tidb.log

[2021/02/18 14:59:56.680 +08:00] [INFO] [region_cache.go:839] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=112293] [currIdx=1] [leaderStoreID=7]
[2021/02/18 14:59:56.680 +08:00] [INFO] [region_cache.go:839] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=118341] [currIdx=1] [leaderStoreID=7]
[2021/02/18 15:00:01.580 +08:00] [INFO] [coprocessor.go:1034] ["[TIME_COP_PROCESS] resp_time:470.164512ms txnStartTS:423003842436661260 region_id:100565 store_addr:172.19.xxx.112:20160 kv_process_ms:468 scan_total_write:365237 scan_processed_write:363935 scan_total_data:364868 scan_processed_data:363751 scan_total_lock:1 scan_processed_lock:0"] [conn=23009167]
[2021/02/18 15:00:01.900 +08:00] [INFO] [region_cache.go:839] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=113765] [currIdx=1] [leaderStoreID=7]
[2021/02/18 15:00:02.287 +08:00] [INFO] [coprocessor.go:1034] ["[TIME_COP_PROCESS] resp_time:405.226373ms txnStartTS:423003842633269259 region_id:100565 store_addr:172.19.xxx.112:20160 kv_process_ms:404 scan_total_write:365237 scan_processed_write:363935 scan_total_data:364868 scan_processed_data:363751 scan_total_lock:1 scan_processed_lock:0"] [conn=23009183]
[2021/02/18 15:00:03.044 +08:00] [INFO] [region_cache.go:839] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=113749] [currIdx=1] [leaderStoreID=1]
[2021/02/18 15:00:06.280 +08:00] [INFO] [region_cache.go:839] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=100485] [currIdx=1] [leaderStoreID=7]
[2021/02/18 15:00:07.302 +08:00] [INFO] [region_cache.go:839] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=120789] [currIdx=1] [leaderStoreID=7]
[2021/02/18 15:00:08.050 +08:00] [INFO] [region_cache.go:839] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=113941] [currIdx=1] [leaderStoreID=7]
[2021/02/18 15:00:08.457 +08:00] [INFO] [region_cache.go:839] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=99777] [currIdx=2] [leaderStoreID=2]
[2021/02/18 15:00:08.541 +08:00] [INFO] [region_cache.go:839] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=100229] [currIdx=1] [leaderStoreID=1]
[2021/02/18 15:00:09.307 +08:00] [INFO] [region_cache.go:839] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=125109] [currIdx=1] [leaderStoreID=7]
[2021/02/18 15:00:10.272 +08:00] [INFO] [region_cache.go:839] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=113909] [currIdx=1] [leaderStoreID=7]
[2021/02/18 15:00:11.113 +08:00] [INFO] [region_cache.go:839] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=117365] [currIdx=1] [leaderStoreID=7]
[2021/02/18 15:00:13.130 +08:00] [INFO] [region_cache.go:839] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=125105] [currIdx=1] [leaderStoreID=7]
[2021/02/18 15:00:14.233 +08:00] [INFO] [region_cache.go:839] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=100565] [currIdx=1] [leaderStoreID=7]
[2021/02/18 15:00:15.102 +08:00] [INFO] [region_cache.go:839] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=113369] [currIdx=2] [leaderStoreID=2]
[2021/02/18 15:00:18.934 +08:00] [INFO] [region_cache.go:839] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=102061] [currIdx=1] [leaderStoreID=1]
[2021/02/18 15:00:21.214 +08:00] [INFO] [coprocessor.go:1034] ["[TIME_COP_PROCESS] resp_time:429.167608ms txnStartTS:423003847587790880 region_id:100565 store_addr:172.19.xxx.111:20160 kv_process_ms:416 scan_total_write:365257 scan_processed_write:363938 scan_total_data:364884 scan_processed_data:363752 scan_total_lock:1 scan_processed_lock:0"] [conn=23009512]
[2021/02/18 15:00:24.355 +08:00] [INFO] [region_cache.go:839] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=99721] [currIdx=1] [leaderStoreID=1]
[2021/02/18 15:00:26.037 +08:00] [INFO] [coprocessor.go:1034] ["[TIME_COP_PROCESS] resp_time:302.735185ms txnStartTS:423003848885403663 region_id:113781 store_addr:172.19.xxx.111:20160 kv_process_ms:304 scan_total_write:431790 scan_processed_write:431789 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] [conn=46610]
[2021/02/18 15:00:26.354 +08:00] [INFO] [coprocessor.go:1034] ["[TIME_COP_PROCESS] resp_time:303.293387ms txnStartTS:423003848977154052 region_id:113781 store_addr:172.19.xxx.111:20160 kv_process_ms:304 scan_total_write:431790 scan_processed_write:431789 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] [conn=46610]
[2021/02/18 15:00:26.359 +08:00] [INFO] [coprocessor.go:1034] ["[TIME_COP_PROCESS] resp_time:308.222458ms txnStartTS:423003848977154052 region_id:111653 store_addr:172.19.xxx.111:20160 kv_process_ms:308 scan_total_write:374492 scan_processed_write:374333 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] [conn=46610]
[2021/02/18 15:00:26.687 +08:00] [INFO] [coprocessor.go:1034] ["[TIME_COP_PROCESS] resp_time:300.309561ms txnStartTS:423003849055797252 region_id:113781 store_addr:172.19.xxx.111:20160 kv_process_ms:300 scan_total_write:431790 scan_processed_write:431789 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] [conn=46610]
[2021/02/18 15:00:26.687 +08:00] [INFO] [coprocessor.go:1034] ["[TIME_COP_PROCESS] resp_time:300.32278ms txnStartTS:423003849055797252 region_id:116733 store_addr:172.19.xxx.112:20160 kv_process_ms:300 scan_total_write:426809 scan_processed_write:426808 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] [conn=46610]
[2021/02/18 15:00:26.689 +08:00] [INFO] [coprocessor.go:1034] ["[TIME_COP_PROCESS] resp_time:302.198388ms txnStartTS:423003849055797252 region_id:111653 store_addr:172.19.xxx.111:20160 kv_process_ms:300 scan_total_write:374492 scan_processed_write:374333 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] [conn=46610]
[2021/02/18 15:00:27.033 +08:00] [INFO] [coprocessor.go:1034] ["[TIME_COP_PROCESS] resp_time:307.448396ms txnStartTS:423003849147547653 region_id:113781 store_addr:172.19.xxx.111:20160 kv_process_ms:304 scan_total_write:431790 scan_processed_write:431789 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] [conn=46610]
[2021/02/18 15:00:27.051 +08:00] [INFO] [coprocessor.go:1034] ["[TIME_COP_PROCESS] resp_time:325.67523ms txnStartTS:423003849147547653 region_id:111653 store_addr:172.19.xxx.111:20160 kv_process_ms:320 scan_total_write:374492 scan_processed_write:374333 scan_total_data:0 scan_processed_data:0 scan_total_lock:1 scan_processed_lock:0"] [conn=46610

tikv.log

[2021/02/18 14:59:56.680 +08:00] [WARN] [endpoint.rs:530] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 118341, leader may Some(id: 118344 store_id: 7)\" not_leader { region_id: 118341 leader { id: 118344 store_id: 7 } }"]
[2021/02/18 15:00:01.731 +08:00] [INFO] [pd.rs:793] ["try to transfer leader"] [to_peer="id: 113768 store_id: 7"] [from_peer="id: 113767 store_id: 2"] [region_id=113765]
[2021/02/18 15:00:01.732 +08:00] [INFO] [peer.rs:2110] ["transfer leader"] [peer="id: 113768 store_id: 7"] [peer_id=113767] [region_id=113765]
[2021/02/18 15:00:01.732 +08:00] [INFO] [raft.rs:1376] ["[term 12810] starts to transfer leadership to 113768"] [lead_transferee=113768] [term=12810] [raft_id=113767] [region_id=113765]
[2021/02/18 15:00:01.732 +08:00] [INFO] [raft.rs:1389] ["sends MsgTimeoutNow to 113768 immediately as 113768 already has up-to-date log"] [lead_transferee=113768] [raft_id=113767] [region_id=113765]
[2021/02/18 15:00:01.732 +08:00] [INFO] [raft.rs:1003] ["received a message with higher term from 113768"] ["msg type"=MsgRequestVote] [message_term=12811] [term=12810] [from=113768] [raft_id=113767] [region_id=113765]
[2021/02/18 15:00:01.732 +08:00] [INFO] [raft.rs:783] ["became follower at term 12811"] [term=12811] [raft_id=113767] [region_id=113765]
[2021/02/18 15:00:01.733 +08:00] [INFO] [raft.rs:1192] ["[logterm: 12810, index: 567992, vote: 0] cast vote for 113768 [logterm: 12810, index: 567992] at term 12811"] ["msg type"=MsgRequestVote] [term=12811] [msg_index=567992] [msg_term=12810] [from=113768] [vote=0] [log_index=567992] [log_term=12810] [raft_id=113767] [region_id=113765]
[2021/02/18 15:00:01.882 +08:00] [WARN] [endpoint.rs:530] [error-response] [err="Region error (will back off and retry) message: \"peer is not leader for region 113765, leader may Some(id: 113768 store_id: 7)\" not_leader { region_id: 113765 leader { id: 113768 store_id: 7 } }"]
[2021/02/18 15:00:03.031 +08:00] [INFO] [pd.rs:793] ["try to transfer leader"] [to_peer="id: 113750 store_id: 1"] [from_peer="id: 113751 store_id: 2"] [region_id=113749]
[2021/02/18 15:00:03.032 +08:00] [INFO] [peer.rs:2110] ["transfer leader"] [peer="id: 113750 store_id: 1"] [peer_id=113751] [region_id=113749]
[2021/02/18 15:00:03.032 +08:00] [INFO] [raft.rs:1376] ["[term 27857] starts to transfer leadership to 113750"] [lead_transferee=113750] [term=27857] [raft_id=113751] [region_id=113749]
[2021/02/18 15:00:03.033 +08:00] [INFO] [raft.rs:1389] ["sends MsgTimeoutNow to 113750 immediately as 113750 already has up-to-date log"] [lead_transferee=113750] [raft_id=113751] [region_id=113749]
[2021/02/18 15:00:03.033 +08:00] [INFO] [raft.rs:1003] ["received a message with higher term from 113750"] ["msg type"=MsgRequestVote] [message_term=27858] [term=27857] [from=113750] [raft_id=113751] [region_id=113749]
[2021/02/18 15:00:03.034 +08:00] [INFO] [raft.rs:783] ["became follower at term 27858"] [term=27858] [raft_id=113751] [region_id=113749]
[2021/02/18 15:00:03.034 +08:00] [INFO] [raft.rs:1192] ["[logterm: 27857, index: 294846, vote: 0] cast vote for 113750 [logterm: 27857, index: 294846] at term 27858"] ["msg type"=MsgRequestVote] [term=27858] [msg_index=294846] [msg_term=27857] [from=113750] [vote=0] [log_index=294846] [log_term=27857] [raft_id=113751] [region_id=113749]
[2021/02/18 15:00:04.032 +08:00] [INFO] [pd.rs:793] ["try to transfer leader"] [to_peer="id: 102062 store_id: 1"] [from_peer="id: 102063 store_id: 2"] [region_id=102061]
[2021/02/18 15:00:04.032 +08:00] [INFO] [peer.rs:2110] ["transfer leader"] [peer="id: 102062 store_id: 1"] [peer_id=102063] [region_id=102061]
[2021/02/18 15:00:04.032 +08:00] [INFO] [raft.rs:1376] ["[term 13742] starts to transfer leadership to 102062"] [lead_transferee=102062] [term=13742] [raft_id=102063] [region_id=102061]
[2021/02/18 15:00:04.032 +08:00] [INFO] [raft.rs:1389] ["sends MsgTimeoutNow to 102062 immediately as 102062 already has up-to-date log"] [lead_transferee=102062] [raft_id=102063] [region_id=102061]
[2021/02/18 15:00:04.033 +08:00] [INFO] [raft.rs:1003] ["received a message with higher term from 102062"] ["msg type"=MsgRequestVote] [message_term=13743] [term=13742] [from=102062] [raft_id=102063] [region_id=102061]
[2021/02/18 15:00:04.033 +08:00] [INFO] [raft.rs:783] ["became follower at term 13743"] [term=13743] [raft_id=102063] [region_id=102061]
[2021/02/18 15:00:04.033 +08:00] [INFO] [raft.rs:1192] ["[logterm: 13742, index: 76072, vote: 0] cast vote for 102062 [logterm: 13742, index: 76072] at term 13743"] ["msg type"=MsgRequestVote] [term=13743] [msg_index=76072] [msg_term=13742] [from=102062] [vote=0] [log_index=76072] [log_term=13742] [raft_id=102063] [region_id=102061]
[2021/02/18 15:00:05.031 +08:00] [INFO] [pd.rs:793] ["try to transfer leader"] [to_peer="id: 100230 store_id: 1"] [from_peer="id: 100231 store_id: 2"] [region_id=100229]
[2021/02/18 15:00:05.032 +08:00] [INFO] [peer.rs:2110] ["transfer leader"] [peer="id: 100230 store_id: 1"] [peer_id=100231] [region_id=100229]
[2021/02/18 15:00:05.032 +08:00] [INFO] [raft.rs:1376] ["[term 3357] starts to transfer leadership to 100230"] [lead_transferee=100230] [term=3357] [raft_id=100231] [region_id=100229]
[2021/02/18 15:00:05.032 +08:00] [INFO] [raft.rs:1389] ["sends MsgTimeoutNow to 100230 immediately as 100230 already has up-to-date log"] [lead_transferee=100230] [raft_id=100231] [region_id=100229]
[2021/02/18 15:00:05.032 +08:00] [INFO] [raft.rs:1003] ["received a message with higher term from 100230"] ["msg type"=MsgRequestVote] [message_term=3358] [term=3357] [from=100230] [raft_id=100231] [region_id=100229]

2021-02-22补充:
CPU异常升高图:

监控中 PD TSO 的 wait duration 异常升高:

监控 Grafana → disk performancelatencyload正常。

监控 Grafana → PDetcdround trip

监控 Grafana → blackbox_exporterping latency

监控中 KV Cmd Duration:

参与官方文档CPU 占用过多导致读写延迟增加进行自我排查的结果如上图所示。

请问是否可以确定 TiDB 到 PD leader 的网络原因导致的CPU压力升高?


若提问为性能优化、故障排查类问题,请下载脚本运行。终端输出的打印结果,请务必全选并复制粘贴上传。

可以通过 dashboard 或者 tidb 慢日志看下对应的时间段是否有慢 SQL

有慢查询,最慢的1秒左右。

那建议先优化下慢 SQL ,然后再看下效果。

请问楼主,最终是否有确定该问题以及如何解决?

还没有解决,今天又出现

此话题已在最后回复的 1 分钟后被自动关闭。不再允许新回复。