Poor write and query performance after cluster nodes repeatedly hung and restarted due to CPU exhaustion

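【TiDB environment】PoC
【TiDB version】v7.5.0
【Reproduction path】Business load on the stg cluster grew until resources ran short; with the CPU saturated, nodes restarted one after another.
【Problem: symptoms and impact】Business load on the stg cluster grew until resources ran short, and with the CPU saturated, nodes restarted one after another. After the VM specs were upgraded directly, CPU usage returned to normal, but write and query performance is still poor.
【Resource configuration】Go to TiDB Dashboard - Cluster Info - Hosts and attach a screenshot of that page
【Attachments: screenshots/logs/monitoring】
TiKV logs: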
[2024/05/31 11:58:22.077 +08:00] [WARN] [endpoint.rs:858] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 159437901, leader may Some(id: 159437903 store_id: 159381618)" not_leader { region_id: 159437901 leader { id: 159437903 store_id: 159381618 } }”] [thread_id=0x5]
[2024/05/31 11:58:22.219 +08:00] [WARN] [endpoint.rs:858] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 159612101, leader may Some(id: 159612103 store_id: 159381618)" not_leader { region_id: 159612101 leader { id: 159612103 store_id: 159381618 } }”] [thread_id=0x5]
[2024/05/31 11:58:22.219 +08:00] [WARN] [endpoint.rs:858] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 159612101, leader may Some(id: 159612103 store_id: 159381618)" not_leader { region_id: 159612101 leader { id: 159612103 store_id: 159381618 } }”] [thread_id=0x5]
[2024/05/31 11:58:22.267 +08:00] [WARN] [endpoint.rs:858] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 159651305, leader may Some(id: 159651307 store_id: 159381618)" not_leader { region_id: 159651305 leader { id: 159651307 store_id: 159381618 } }”] [thread_id=0x5]
[2024/05/31 11:58:22.515 +08:00] [WARN] [endpoint.rs:858] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 159448001, leader may Some(id: 159448004 store_id: 159381618)" not_leader { region_id: 159448001 leader { id: 159448004 store_id: 159381618 } }”] [thread_id=0x5]
[2024/05/31 11:58:25.816 +08:00] [WARN] [endpoint.rs:858] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 159680625, leader may Some(id: 159680627 store_id: 159381618)" not_leader { region_id: 159680625 leader { id: 159680627 store_id: 159381618 } }”] [thread_id=0x5]
[2024/05/31 11:58:25.816 +08:00] [WARN] [endpoint.rs:858] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 159680625, leader may Some(id: 159680627 store_id: 159381618)" not_leader { region_id: 159680625 leader { id: 159680627 store_id: 159381618 } }”] [thread_id=0x5]
[2024/05/31 11:58:27.329 +08:00] [WARN] [endpoint.rs:858] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 159566845, leader may Some(id: 159566848 store_id: 159381618)" not_leader { region_id: 159566845 leader { id: 159566848 store_id: 159381618 } }”] [thread_id=0x5]
[2024/05/31 11:58:27.510 +08:00] [WARN] [endpoint.rs:858] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 159680625, leader may Some(id: 159680627 store_id: 159381618)" not_leader { region_id: 159680625 leader { id: 159680627 store_id: 159381618 } }”] [thread_id=0x5]
[2024/05/31 11:58:31.438 +08:00] [WARN] [endpoint.rs:858] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 9063, leader may Some(id: 159382653 store_id: 159381618)" not_leader { region_id: 9063 leader { id: 159382653 store_id: 159381618 } }”] [thread_id=0x5]

[2024/05/31 11:58:31.951 +08:00] [WARN] [endpoint.rs:858] [error-response] [err=“Region error (will back off and retry) message: "peer is not leader for region 159409205, leader may Some(id: 159409208 store_id: 159381618)" not_leader { region_id: 159409205 leader { id: 159409208 store_id: 159381618 } }”] [thread_id=0x5]

[2024/05/31 11:55:03.590 +08:00] [ERROR] [kv.rs:1115] [“KvService response batch commands fail”] [err=“"SendError(…)"”] [thread_id=0x5]
[2024/05/31 11:55:03.590 +08:00] [ERROR] [kv.rs:1115] [“KvService response batch commands fail”] [err=“"SendError(…)"”] [thread_id=0x5]
[2024/05/31 11:55:03.606 +08:00] [ERROR] [kv.rs:1115] [“KvService response batch commands fail”] [err=“"SendError(…)"”] [thread_id=0x5]
[2024/05/31 11:55:03.914 +08:00] [ERROR] [kv.rs:1115] [“KvService response batch commands fail”] [err=“"SendError(…)"”] [thread_id=0x5]
[2024/05/31 11:55:04.374 +08:00] [ERROR] [kv.rs:1115] [“KvService response batch commands fail”] [err=“"SendError(…)"”] [thread_id=0x5]
[2024/05/31 11:55:05.671 +08:00] [ERROR] [kv.rs:1115] [“KvService response batch commands fail”] [err=“"SendError(…)"”] [thread_id=0x5]
[2024/05/31 11:55:06.087 +08:00] [ERROR] [kv.rs:1115] [“KvService response batch commands fail”] [err=“"SendError(…)"”] [thread_id=0x5]
[2024/05/31 11:55:16.396 +08:00] [ERROR] [kv.rs:753] [“KvService::batch_raft send response fail”] [err=RemoteStopped] [thread_id=0x5]
[2024/05/31 11:55:16.396 +08:00] [ERROR] [raft_client.rs:584] [“connection aborted”] [addr=10.131.236.160:20160] [receiver_err=“Some(RpcFailure(RpcStatus { code: 14-UNAVAILABLE, message: "Connection reset by peer", details: }))”] [sink_error=“Some(RpcFinished(Some(RpcStatus { code: 14-UNAVAILABLE, message: "Connection reset by peer", details: })))”] [store_id=159381618] [thread_id=0x5]
[2024/05/31 11:55:16.396 +08:00] [ERROR] [raft_client.rs:885] [“connection abort”] [addr=10.131.236.160:20160] [store_id=159381618] [thread_id=0x5]
[2024/05/31 11:55:21.397 +08:00] [ERROR] [raft_client.rs:853] [“wait connect timeout”] [addr=10.131.236.160:20160] [store_id=159381618] [thread_id=0x5]
[2024/05/31 11:55:26.401 +08:00] [ERROR] [raft_client.rs:853] [“wait connect timeout”] [addr=10.131.236.160:20160] [store_id=159381618] [thread_id=0x5]
[2024/05/31 11:55:31.410 +08:00] [ERROR] [raft_client.rs:853] [“wait connect timeout”] [addr=10.131.236.160:20160] [store_id=159381618] [thread_id=0x5]
[2024/05/31 11:55:36.416 +08:00] [ERROR] [raft_client.rs:853] [“wait connect timeout”] [addr=10.131.236.160:20160] [store_id=159381618] [thread_id=0x5]
[2024/05/31 11:55:46.759 +08:00] [ERROR] [kv.rs:927] [“batch_commands error”] [err=“RpcFinished(Some(RpcStatus { code: 0-OK, message: "", details: }))”] [thread_id=0x5]
[2024/05/31 11:55:46.759 +08:00] [ERROR] [kv.rs:1115] [“KvService response batch commands fail”] [err=“"SendError(…)"”] [thread_id=0x5]
[2024/05/31 11:55:46.760 +08:00] [ERROR] [raft_client.rs:853] [“wait connect timeout”] [addr=10.131.236.160:20160] [store_id=159381618] [thread_id=0x5]
[2024/05/31 11:55:47.415 +08:00] [ERROR] [kv.rs:1115] [“KvService response batch commands fail”] [err=“"SendError(…)"”] [thread_id=0x5]
[2024/05/31 11:55:47.467 +08:00] [ERROR] [kv.rs:1115] [“KvService response batch commands fail”] [err=“"SendError(…)"”] [thread_id=0x5]
[2024/05/31 11:55:48.009 +08:00] [ERROR] [kv.rs:1115] [“KvService response batch commands fail”] [err=“"SendError(…)"”] [thread_id=0x5]
[2024/05/31 11:55:49.090 +08:00] [ERROR] [kv.rs:1115] [“KvService response batch commands fail”] [err=“"SendError(…)"”] [thread_id=0x5]
[2024/05/31 11:55:51.771 +08:00] [ERROR] [raft_client.rs:853] [“wait connect timeout”] [addr=10.131.236.160:20160] [store_id=159381618] [thread_id=0x5]
[2024/05/31 11:55:51.775 +08:00] [ERROR] [transport.rs:113] [“failed to send significant msg”] [msg=“Unreachable { region_id: 159685805, to_peer_id: 159685807 }”] [thread_id=0x5]
[2024/05/31 11:55:51.782 +08:00] [ERROR] [transport.rs:113] [“failed to send significant msg”] [msg=“Unreachable { region_id: 159685805, to_peer_id: 159685807 }”] [thread_id=0x5]
[2024/05/31 11:55:56.791 +08:00] [ERROR] [raft_client.rs:853] [“wait connect timeout”] [addr=10.131.236.160:20160] [store_id=159381618] [thread_id=0x5]
[2024/05/31 11:57:17.182 +08:00] [ERROR] [raft_client.rs:853] [“wait connect timeout”] [addr=10.131.233.78:20160] [store_id=1] [thread_id=0x5]
[2024/05/31 11:57:22.794 +08:00] [ERROR] [kv.rs:1115] [“KvService response batch commands fail”] [err=“"SendError(…)"”] [thread_id=0x5]
[2024/05/31 11:57:30.128 +08:00] [ERROR] [time.rs:373] [“monotonic time jumped back, 4343.172 → 4280.369”] [thread_id=0x5]
[2024/05/31 11:57:30.128 +08:00] [ERROR] [time.rs:373] [“monotonic time jumped back, 4343.172 → 4280.369”] [thread_id=0x5]
[2024/05/31 12:00:43.746 +08:00] [ERROR] [kv.rs:1115] [“KvService response batch commands fail”] [err=“"SendError(…)"”] [thread_id=0x5]
[2024/05/31 12:00:43.755 +08:00] [ERROR] [kv.rs:1115] [“KvService response batch commands fail”] [err=“"SendError(…)"”] [thread_id=0x5]
[2024/05/31 12:00:43.756 +08:00] [ERROR] [kv.rs:1115] [“KvService response batch commands fail”] [err=“"SendError(…)"”] [thread_id=0x5]
[2024/05/31 12:00:43.759 +08:00] [ERROR] [kv.rs:1115] [“KvService response batch commands fail”] [err=“"SendError(…)"”] [thread_id=0x5]
[2024/05/31 12:00:43.791 +08:00] [ERROR] [kv.rs:1115] [“KvService response batch commands fail”] [err=“"SendError(…)"”] [thread_id=0x5]
[2024/05/31 12:00:43.791 +08:00] [ERROR] [kv.rs:1115] [“KvService response batch commands fail”] [err=“"SendError(…)"”] [thread_id=0x5]
[2024/05/31 12:00:43.828 +08:00] [ERROR] [kv.rs:1115] [“KvService response batch commands fail”] [err=“"SendError(…)"”] [thread_id=0x5]
[2024/05/31 12:00:43.878 +08:00] [ERROR] [kv.rs:1115] [“KvService response batch commands fail”] [err=“"SendError(…)"”] [thread_id=0x5]
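
To make an excerpt like the one above easier to read, here is a minimal Python sketch that tallies the lines by level and source file, by the peer address in `[addr=...]`, and by the store named as leader in `not_leader` responses. The function name `summarize` and the regular expressions are my own assumptions based only on the line layout in this post; this is not an official TiKV tool.

```python
import re
from collections import Counter

# Leading fields of a log line as pasted above: [timestamp] [LEVEL] [file.rs:line]
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<src>[^\]:]+):\d+\]"
)
# Peer address reported by raft_client.rs connection errors
ADDR_RE = re.compile(r"\[addr=(?P<addr>[^\]]+)\]")
# Store named as the current leader inside a not_leader response
LEADER_STORE_RE = re.compile(r"leader \{ id: \d+ store_id: (?P<store>\d+) \}")


def summarize(lines):
    """Count log lines by (level, source file), peer address, and leader store."""
    by_source = Counter()
    by_addr = Counter()
    by_leader_store = Counter()
    for raw in lines:
        line = raw.strip()
        head = LINE_RE.match(line)
        if head is None:
            continue  # skip blank lines and anything that is not a log line
        by_source[(head["level"], head["src"])] += 1
        addr = ADDR_RE.search(line)
        if addr:
            by_addr[addr["addr"]] += 1
        leader = LEADER_STORE_RE.search(line)
        if leader:
            by_leader_store[leader["store"]] += 1
    return by_source, by_addr, by_leader_store
```

Calling `summarize(open(path, encoding="utf-8"))` on a saved copy of the log (the path is whatever file you saved it to) returns three counters. In the excerpt above, nearly all connection errors point at 10.131.236.160:20160 (store 159381618), with one at 10.131.233.78:20160 (store 1), and every `not_leader` response names store 159381618 as the leader. That suggests the excerpt mostly reflects one store being unreachable while it restarted, though the monitoring graphs would be needed to confirm it.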

Hmm, this looks like the case where slow SQL saturated the TiKV CPU and the cloud servers kept restarting. I think this ticket can be closed.

Optimize the SQL; that is the real fix.

This is caused by the query SQL. The most effective fix is to optimize the SQL; adding servers and the like just wastes money.

Optimize the SQL.
