tidb 突然导致服务器CPU 内存 占用 100%

【 TiDB 使用环境】生产环境
【 TiDB 版本】7.5.0
【复现路径】有一些慢sql内存占用比较大

【遇到的问题:问题现象及影响】
tidb 突然导致服务器CPU 内存 占用 100% ,希望各位大佬能帮我分析分析看看

服务器配置是 16G 4核cpu

我是tidb的忠实粉丝,可是苦于技术有限,对性能分析调优实在是没点头绪。只好求助各位大佬了

【资源配置】进入到 TiDB Dashboard -集群信息 (Cluster Info) -主机(Hosts) 截图此页面
慢SQL截图

storage.block-cache.capacity 大小
image

【附件:截图/日志/监控】
这个是阿里云截图


这个是一些关键日志(文件中是全的,文件在最底下),如需其他日志和截图只要我有都可以提供

2024-01-13 09:07:50 (UTC+08:00)TiDB 10.0.0.76:4000[2pc.go:1783] ["schemaLeaseChecker is not set for this transaction"] [conn=2175271786] [session_alias=] [sessionID=2175271786] [startTS=446983849933537288] [checkTS=446983849933537289]
2024-01-13 09:08:15 (UTC+08:00)TiDB 10.0.0.76:4000[2pc.go:1783] ["schemaLeaseChecker is not set for this transaction"] [conn=2175271770] [session_alias=] [sessionID=2175271770] [startTS=446983856474292227] [checkTS=446983856474292231]
2024-01-13 09:09:02 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=181.679616ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/pd/6963105917181191707/config\" "] [response="range_response_count:1 size:4921"] []
2024-01-13 09:09:02 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[1823224743] range"] [detail="{range_begin:/pd/6963105917181191707/config; range_end:; response_count:1; response_revision:192676103; }"] [duration=261.92867ms] [start=2024/01/13 09:09:02.672 +08:00] [end=2024/01/13 09:09:02.934 +08:00] [steps="[\"trace[1823224743] 'range keys from in-memory index tree'  (duration: 107.11239ms)\",\"trace[1823224743] 'range keys from bolt db'  (duration: 71.409954ms)\"]"]
2024-01-13 09:09:03 (UTC+08:00)TiDB 10.0.0.76:4000[pd.go:152] ["get timestamp too slow"] ["cost time"=54.310922ms]
2024-01-13 09:09:04 (UTC+08:00)PD 10.0.0.76:2379[tso.go:334] ["clock offset"] [] [jet-lag=378.484356ms] [prev-physical=2024/01/13 09:09:03.820 +08:00] [now=2024/01/13 09:09:04.198 +08:00] [update-physical-interval=50ms]
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=299.641743ms] [expected-duration=100ms] [prefix=] [request="header:<ID:13188664713108092049 > txn:<compare:<target:VALUE key:\"/pd/6963105917181191707/timestamp\" value_size:8 > success:<request_put:<key:\"/pd/6963105917181191707/timestamp\" value_size:8 >> failure:<>>"] [response=size:20] []
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[2018233445] linearizableReadLoop"] [detail="{readStateIndex:192720602; appliedIndex:192720601; }"] [duration=1.369837438s] [start=2024/01/13 09:09:04.336 +08:00] [end=2024/01/13 09:09:05.706 +08:00] [steps="[\"trace[2018233445] 'read index received'  (duration: 103.214821ms)\",\"trace[2018233445] 'applied index is now lower than readState.Index'  (duration: 1.266621552s)\"]"]
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=1.370092931s] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"health\" "] [response="range_response_count:0 size:7"] []
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[171133557] range"] [detail="{range_begin:health; range_end:; response_count:0; response_revision:192676104; }"] [duration=1.370133565s] [start=2024/01/13 09:09:04.336 +08:00] [end=2024/01/13 09:09:05.707 +08:00] [steps="[\"trace[171133557] 'agreement among raft nodes before linearized reading'  (duration: 1.370042492s)\"]"]
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=1.370375971s] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"health\" "] [response="range_response_count:0 size:7"] []
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[1220459767] range"] [detail="{range_begin:health; range_end:; response_count:0; response_revision:192676104; }"] [duration=1.370461967s] [start=2024/01/13 09:09:04.337 +08:00] [end=2024/01/13 09:09:05.707 +08:00] [steps="[\"trace[1220459767] 'agreement among raft nodes before linearized reading'  (duration: 1.370341353s)\"]"]
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=997.964842ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"health\" "] [response="range_response_count:0 size:7"] []
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[1408137589] range"] [detail="{range_begin:health; range_end:; response_count:0; response_revision:192676104; }"] [duration=997.988085ms] [start=2024/01/13 09:09:04.709 +08:00] [end=2024/01/13 09:09:05.707 +08:00] [steps="[\"trace[1408137589] 'agreement among raft nodes before linearized reading'  (duration: 997.941582ms)\"]"]
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=559.380158ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/pd/6963105917181191707/gc/safe_point\" "] [response="range_response_count:1 size:79"] []
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[1513864565] range"] [detail="{range_begin:/pd/6963105917181191707/gc/safe_point; range_end:; response_count:1; response_revision:192676104; }"] [duration=559.454168ms] [start=2024/01/13 09:09:05.195 +08:00] [end=2024/01/13 09:09:05.755 +08:00] [steps="[\"trace[1513864565] 'agreement among raft nodes before linearized reading'  (duration: 511.642472ms)\",\"trace[1513864565] 'range keys from bolt db'  (duration: 47.677874ms)\"]"]
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=1.077325466s] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/topology/tidb/\" range_end:\"/topology/tidb0\" "] [response="range_response_count:6 size:1002"] []
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[1651274640] range"] [detail="{range_begin:/topology/tidb/; range_end:/topology/tidb0; response_count:6; response_revision:192676104; }"] [duration=1.077543358s] [start=2024/01/13 09:09:04.677 +08:00] [end=2024/01/13 09:09:05.755 +08:00] [steps="[\"trace[1651274640] 'agreement among raft nodes before linearized reading'  (duration: 1.029783981s)\",\"trace[1651274640] 'range keys from bolt db'  (duration: 47.497177ms)\"]"]
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=198.724618ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/pd/6963105917181191707/config\" "] [response="range_response_count:1 size:4921"] []
2024-01-13 09:09:05 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[1889995612] range"] [detail="{range_begin:/pd/6963105917181191707/config; range_end:; response_count:1; response_revision:192676104; }"] [duration=198.813564ms] [start=2024/01/13 09:09:05.649 +08:00] [end=2024/01/13 09:09:05.847 +08:00] [steps="[\"trace[1889995612] 'agreement among raft nodes before linearized reading'  (duration: 58.764203ms)\",\"trace[1889995612] 'range keys from bolt db'  (duration: 139.893607ms)\"]"]
2024-01-13 09:09:06 (UTC+08:00)TiDB 10.0.0.76:4000[pd.go:152] ["get timestamp too slow"] ["cost time"=1.03219361s]
2024-01-13 09:09:06 (UTC+08:00)TiDB 10.0.0.76:4000[coprocessor.go:1330] ["[TIME_COP_WAIT] resp_time:489.624714ms txnStartTS:446983869764730881 region_id:228401 store_addr:10.0.0.74:20160 kv_process_ms:0 kv_wait_ms:0 kv_read_ms:0 processed_versions:0 total_versions:2 rocksdb_delete_skipped_count:0 rocksdb_key_skipped_count:0 rocksdb_cache_hit_count:8 rocksdb_read_count:0 rocksdb_read_byte:0"]
2024-01-13 09:09:09 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=123.162515ms] [expected-duration=100ms] [prefix=] [request="header:<ID:9606895634472091609 > put:<key:\"/tidb/server/minstartts/a69cb687-c1af-4963-aad7-5c4eb877835f\" value_size:18 lease:383523597613409156 >"] [response=size:7] []
2024-01-13 09:09:09 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=381.602581ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/pd/6963105917181191707/config\" "] [response="range_response_count:1 size:4921"] []
2024-01-13 09:09:09 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[1628403812] range"] [detail="{range_begin:/pd/6963105917181191707/config; range_end:; response_count:1; response_revision:192676109; }"] [duration=381.669999ms] [start=2024/01/13 09:09:09.345 +08:00] [end=2024/01/13 09:09:09.727 +08:00] [steps="[\"trace[1628403812] 'agreement among raft nodes before linearized reading'  (duration: 381.529783ms)\"]"]
2024-01-13 09:09:09 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[407178525] linearizableReadLoop"] [detail="{readStateIndex:192720606; appliedIndex:192720605; }"] [duration=381.396625ms] [start=2024/01/13 09:09:09.345 +08:00] [end=2024/01/13 09:09:09.727 +08:00] [steps="[\"trace[407178525] 'read index received'  (duration: 51.278827ms)\",\"trace[407178525] 'applied index is now lower than readState.Index'  (duration: 330.117183ms)\"]"]
2024-01-13 09:09:10 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[291352191] linearizableReadLoop"] [detail="{readStateIndex:192720608; appliedIndex:192720608; }"] [duration=229.134147ms] [start=2024/01/13 09:09:10.718 +08:00] [end=2024/01/13 09:09:10.947 +08:00] [steps="[\"trace[291352191] 'read index received'  (duration: 229.130966ms)\",\"trace[291352191] 'applied index is now lower than readState.Index'  (duration: 2.645µs)\"]"]
2024-01-13 09:09:11 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=428.201051ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"health\" "] [response="range_response_count:0 size:7"] []
2024-01-13 09:09:11 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[1408481354] range"] [detail="{range_begin:health; range_end:; response_count:0; response_revision:192676110; }"] [duration=428.286883ms] [start=2024/01/13 09:09:10.718 +08:00] [end=2024/01/13 09:09:11.146 +08:00] [steps="[\"trace[1408481354] 'agreement among raft nodes before linearized reading'  (duration: 428.135031ms)\"]"]
2024-01-13 09:09:11 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[991936988] linearizableReadLoop"] [detail="{readStateIndex:192720608; appliedIndex:192720608; }"] [duration=331.010827ms] [start=2024/01/13 09:09:10.947 +08:00] [end=2024/01/13 09:09:11.278 +08:00] [steps="[\"trace[991936988] 'read index received'  (duration: 331.007877ms)\",\"trace[991936988] 'applied index is now lower than readState.Index'  (duration: 2.458µs)\"]"]
2024-01-13 09:09:11 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=560.265805ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/pd/6963105917181191707/config\" "] [response="range_response_count:1 size:4921"] []
2024-01-13 09:09:11 (UTC+08:00)PD 10.0.0.76:2379[tso.go:334] ["clock offset"] [] [jet-lag=567.045524ms] [prev-physical=2024/01/13 09:09:10.718 +08:00] [now=2024/01/13 09:09:11.285 +08:00] [update-physical-interval=50ms]
2024-01-13 09:09:11 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=348.498137ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"health\" "] [response="range_response_count:0 size:7"] []
2024-01-13 09:09:11 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[1299312368] range"] [detail="{range_begin:/pd/6963105917181191707/config; range_end:; response_count:1; response_revision:192676110; }"] [duration=700.343264ms] [start=2024/01/13 09:09:10.718 +08:00] [end=2024/01/13 09:09:11.418 +08:00] [steps="[\"trace[1299312368] 'agreement among raft nodes before linearized reading'  (duration: 560.226944ms)\"]"]
2024-01-13 09:09:11 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=139.644317ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"health\" "] [response="range_response_count:0 size:7"] []
2024-01-13 09:09:11 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[948346012] range"] [detail="{range_begin:health; range_end:; response_count:0; response_revision:192676110; }"] [duration=139.693796ms] [start=2024/01/13 09:09:11.279 +08:00] [end=2024/01/13 09:09:11.419 +08:00] [steps="[\"trace[948346012] 'agreement among raft nodes before linearized reading'  (duration: 139.59201ms)\"]"]
2024-01-13 09:09:11 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=242.119141ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"health\" "] [response="range_response_count:0 size:7"] []
2024-01-13 09:09:11 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[1884724294] range"] [detail="{range_begin:health; range_end:; response_count:0; response_revision:192676110; }"] [duration=242.190365ms] [start=2024/01/13 09:09:11.286 +08:00] [end=2024/01/13 09:09:11.528 +08:00] [steps="[\"trace[1884724294] 'agreement among raft nodes before linearized reading'  (duration: 133.314842ms)\",\"trace[1884724294] 'range keys from in-memory index tree'  (duration: 108.792242ms)\"]"]
2024-01-13 09:09:11 (UTC+08:00)PD 10.0.0.76:2379[tso.go:334] ["clock offset"] [] [jet-lag=281.145198ms] [prev-physical=2024/01/13 09:09:11.285 +08:00] [now=2024/01/13 09:09:11.566 +08:00] [update-physical-interval=50ms]
2024-01-13 09:09:11 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[42751983] range"] [detail="{range_begin:health; range_end:; response_count:0; response_revision:192676110; }"] [duration=619.909245ms] [start=2024/01/13 09:09:10.947 +08:00] [end=2024/01/13 09:09:11.566 +08:00] [steps="[\"trace[42751983] 'agreement among raft nodes before linearized reading'  (duration: 348.457441ms)\"]"]
2024-01-13 09:09:12 (UTC+08:00)TiDB 10.0.0.76:4000[pd.go:152] ["get timestamp too slow"] ["cost time"=400.197296ms]
2024-01-13 09:09:12 (UTC+08:00)PD 10.0.0.76:2379[tso.go:334] ["clock offset"] [] [jet-lag=158.820388ms] [prev-physical=2024/01/13 09:09:12.270 +08:00] [now=2024/01/13 09:09:12.428 +08:00] [update-physical-interval=50ms]
2024-01-13 09:09:12 (UTC+08:00)TiKV 10.0.0.76:20160[trend.rs:291] ["history window flipping: enter"] [increasing_rate=0] [flip_margin_error=1500] [delta=2424.961157796468] [name=L1] [thread_id=0x5]
2024-01-13 09:09:12 (UTC+08:00)TiDB 10.0.0.76:4000[coprocessor.go:1330] ["[TIME_COP_WAIT] resp_time:628.703669ms txnStartTS:446983871298273286 region_id:228401 store_addr:10.0.0.74:20160 kv_process_ms:0 kv_wait_ms:0 kv_read_ms:0 processed_versions:0 total_versions:1 rocksdb_delete_skipped_count:0 rocksdb_key_skipped_count:0 rocksdb_cache_hit_count:4 rocksdb_read_count:0 rocksdb_read_byte:0"]
2024-01-13 09:09:12 (UTC+08:00)TiDB 10.0.0.76:4000[coprocessor.go:1330] ["[TIME_COP_WAIT] resp_time:512.053606ms txnStartTS:446983871298273282 region_id:228401 store_addr:10.0.0.74:20160 kv_process_ms:0 kv_wait_ms:0 kv_read_ms:0 processed_versions:0 total_versions:1 rocksdb_delete_skipped_count:0 rocksdb_key_skipped_count:0 rocksdb_cache_hit_count:4 rocksdb_read_count:0 rocksdb_read_byte:0"]
2024-01-13 09:09:13 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=628.551044ms] [expected-duration=100ms] [prefix=] [request="header:<ID:13188664713108092070 > txn:<compare:<> success:<request_put:<key:\"resource_group/states/default\" value_size:106 >> failure:<>>"] [response=size:20] []
2024-01-13 09:09:13 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[1453964403] linearizableReadLoop"] [detail="{readStateIndex:192720609; appliedIndex:192720608; }"] [duration=634.220812ms] [start=2024/01/13 09:09:12.812 +08:00] [end=2024/01/13 09:09:13.446 +08:00] [steps="[\"trace[1453964403] 'read index received'  (duration: 5.910935ms)\",\"trace[1453964403] 'applied index is now lower than readState.Index'  (duration: 628.308115ms)\"]"]
2024-01-13 09:09:13 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=673.130977ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/tidb/store/gcworker/saved_safe_point\" "] [response="range_response_count:1 size:82"] []
2024-01-13 09:09:13 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[990622929] range"] [detail="{range_begin:/tidb/store/gcworker/saved_safe_point; range_end:; response_count:1; response_revision:192676111; }"] [duration=673.278959ms] [start=2024/01/13 09:09:12.812 +08:00] [end=2024/01/13 09:09:13.485 +08:00] [steps="[\"trace[990622929] 'agreement among raft nodes before linearized reading'  (duration: 634.331195ms)\",\"trace[990622929] 'range keys from bolt db'  (duration: 38.716397ms)\"]"]
2024-01-13 09:09:13 (UTC+08:00)PD 10.0.0.76:2379[util.go:163] ["apply request took too long"] [took=664.792599ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:\"/pd/6963105917181191707/timestamp\" "] [response="range_response_count:1 size:68"] []
2024-01-13 09:09:13 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[1592898840] range"] [detail="{range_begin:/pd/6963105917181191707/timestamp; range_end:; response_count:1; response_revision:192676111; }"] [duration=664.838394ms] [start=2024/01/13 09:09:12.821 +08:00] [end=2024/01/13 09:09:13.486 +08:00] [steps="[\"trace[1592898840] 'agreement among raft nodes before linearized reading'  (duration: 664.73312ms)\"]"]
2024-01-13 09:09:13 (UTC+08:00)PD 10.0.0.76:2379[tso.go:334] ["clock offset"] [] [jet-lag=697.965522ms] [prev-physical=2024/01/13 09:09:12.821 +08:00] [now=2024/01/13 09:09:13.518 +08:00] [update-physical-interval=50ms]
2024-01-13 09:09:13 (UTC+08:00)TiDB 10.0.0.76:4000[coprocessor.go:1330] ["[TIME_COP_WAIT] resp_time:360.316267ms txnStartTS:446983871298273281 region_id:228325 store_addr:10.0.0.80:20160 kv_process_ms:0 kv_wait_ms:0 kv_read_ms:0 processed_versions:1 total_versions:2 rocksdb_delete_skipped_count:1 rocksdb_key_skipped_count:2 rocksdb_cache_hit_count:8 rocksdb_read_count:0 rocksdb_read_byte:0"]
2024-01-13 09:09:13 (UTC+08:00)TiDB 10.0.0.76:4000[pd.go:152] ["get timestamp too slow"] ["cost time"=58.900843ms]
2024-01-13 09:09:13 (UTC+08:00)TiDB 10.0.0.76:4000[pd.go:152] ["get timestamp too slow"] ["cost time"=80.821323ms]
2024-01-13 09:09:13 (UTC+08:00)TiKV 10.0.0.76:20160[trend.rs:272] ["history window flipping: end"] [flipping_duration=1] [increasing_rate=0] [time_based_multiple=0.00029993703107215343] [flip_margin_error=500] [delta=2415.8989729224995] [name=L1] [thread_id=0x5]
2024-01-13 09:09:13 (UTC+08:00)PD 10.0.0.76:2379[trace.go:152] ["trace[1297175575] put"] [detail="{key:/topology/tidb/10.0.0.76:4000/info; req_size:234; response_revision:192676113; }"] [duration=126.009427ms] [start=2024/01/13 09:09:13.719 +08:00] [end=2024/01/13 09:09:13.845 +08:00] [steps="[\"trace[1297175575] 'process raft request'  (duration: 125.899649ms)\"]"]
2024-01-13 09:09:13 (UTC+08:00)TiDB 10.0.0.76:4000[pd.go:152] ["get timestamp too slow"] [conn=2175271812] [session_alias=] ["cost time"=51.866301ms]
2024-01-13 09:09:13 (UTC+08:00)TiDB 10.0.0.76:4000[pd.go:152] ["get timestamp too slow"] [conn=2175271788] [session_alias=] ["cost time"=31.441626ms]
2024-01-13 09:09:14 (UTC+08:00)PD 10.0.0.76:2379[tso.go:334] ["clock offset"] [] [jet-lag=158.025296ms] [prev-physical=2024/01/13 09:09:14.291 +08:00] [now=2024/01/13 09:09:14.449 +08:00] [update-physical-interval=50ms]
2024-01-13 09:09:19 (UTC+08:00)TiKV 10.0.0.76:20160[write.rs:855] ["[store 5] async write too slow, write_kv: 0s, write_raft: 1.04097722s, send: 0.049922504s, callback: 0s thread: sync-writer"] [takes=1040] [thread_id=0x5]
2024-01-13 09:09:35 (UTC+08:00)TiKV 10.0.0.76:20160[store.rs:1125] ["[store 5] handle 9 pending peers include 9 ready, 0 entries, 8 messages and 0 snapshots"] [takes=3392] [thread_id=0x5]
2024-01-13 09:09:42 (UTC+08:00)TiKV 10.0.0.76:20160[endpoint.rs:1151] ["cdc resolve region leadership"] [takes=1221] [thread_id=0x5]
2024-01-13 09:09:45 (UTC+08:00)TiKV 10.0.0.76:20160[write.rs:855] ["[store 5] async write too slow, write_kv: 0s, write_raft: 0.870993804s, send: 0.560334296s, callback: 0s thread: sync-writer"] [takes=1689] [thread_id=0x5]
2024-01-13 09:09:45 (UTC+08:00)TiKV 10.0.0.76:20160[write.rs:855] ["[store 5] async write too slow, write_kv: 0s, write_raft: 0.929442005s, send: 0.500874594s, callback: 0s thread: sync-writer"] [takes=1689] [thread_id=0x5]
2024-01-13 09:09:45 (UTC+08:00)TiKV 10.0.0.76:20160[endpoint.rs:597] ["the max gap of follower safe-ts is large"] [oldest_candidate=None] [latest_candidate=None] [applied_index=19] [duration_to_last_consume_leader=29925ms] [resolved_ts=446983861310586881] [safe_ts=446983861310586881] [gap=51236ms] [region_id=223073] [thread_id=0x5]
2024-01-13 09:09:45 (UTC+08:00)TiKV 10.0.0.76:20160[endpoint.rs:618] ["the max gap of follower resolved-ts is large; it's the same region that has the min safe-ts"] [thread_id=0x5]
2024-01-13 09:09:45 (UTC+08:00)TiKV 10.0.0.76:20160[pd.rs:1448] ["store heartbeat failed"] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [thread_id=0x5]
2024-01-13 09:09:45 (UTC+08:00)TiKV 10.0.0.76:20160[store.rs:1125] ["[store 5] handle 4 pending peers include 3 ready, 1 entries, 2 messages and 0 snapshots"] [takes=7756] [thread_id=0x5]
2024-01-13 09:09:45 (UTC+08:00)TiKV 10.0.0.76:20160[store.rs:1125] ["[store 5] handle 22 pending peers include 19 ready, 0 entries, 0 messages and 0 snapshots"] [takes=8660] [thread_id=0x5]
2024-01-13 09:09:45 (UTC+08:00)TiKV 10.0.0.76:20160[kv.rs:753] ["KvService::batch_raft send response fail"] [err=RemoteStopped] [thread_id=0x5]
2024-01-13 09:09:45 (UTC+08:00)TiKV 10.0.0.76:20160[util.rs:496] ["request failed, retry"] [err_code=KV:Pd:Grpc] [err="Grpc(RpcFailure(RpcStatus { code: 4-DEADLINE_EXCEEDED, message: \"Deadline Exceeded\", details: [] }))"] [thread_id=0x5]
2024-01-13 09:09:45 (UTC+08:00)TiKV 10.0.0.76:20160[<unknown>] ["ipv4:10.0.0.76:12182: Keepalive watchdog fired. Closing transport."] [thread_id=0x5]
2024-01-13 09:09:45 (UTC+08:00)TiKV 10.0.0.76:20160[<unknown>] ["ipv4:10.0.0.76:12186: Keepalive watchdog fired. Closing transport."] [thread_id=0x5]
2024-01-13 09:09:45 (UTC+08:00)TiKV 10.0.0.76:20160[<unknown>] ["ipv4:10.0.0.76:12188: Keepalive watchdog fired. Closing transport."] [thread_id=0x5]
2024-01-13 09:09:45 (UTC+08:00)TiKV 10.0.0.76:20160[trend.rs:291] ["history window flipping: enter"] [increasing_rate=0] [flip_margin_error=1000] [delta=2048.1577481840213] [name=L2] [thread_id=0x5]
2024-01-13 09:09:45 (UTC+08:00)TiKV 10.0.0.76:20160[<unknown>] ["ipv4:10.0.0.74:48640: Keepalive watchdog fired. Closing transport."] [thread_id=0x5]
2024-01-13 09:09:45 (UTC+08:00)TiKV 10.0.0.76:20160[<unknown>] ["ipv4:10.0.0.76:12184: Keepalive watchdog fired. Closing transport."] [thread_id=0x5]
2024-01-13 09:09:45 (UTC+08:00)TiKV 10.0.0.76:20160[<unknown>] ["ipv4:10.0.0.76:2379: Keepalive watchdog fired. Closing transport."] [thread_id=0x5]
2024-01-13 09:09:45 (UTC+08:00)TiKV 10.0.0.76:20160[<unknown>] ["ipv4:10.0.0.74:20160: Keepalive watchdog fired. Closing transport."] [thread_id=0x5]

这个是事发时候的完整日志
pd_10.0.0.76_2379.log (60.5 KB)
tidb_10.0.0.76_4000.log (86.3 KB)
tikv_10.0.0.76_20160.log (158.1 KB)

配置太低了吧,cpu 4核肯定不够

看到截图中tso_wait 2.02s
再看pd日志
[2024/01/13 09:09:13.519 +08:00] [Warn] [tso.go:334] [“clock offset”] [] [jet-lag=697.965522ms] [prev-physical=2024/01/13 09:09:12.821 +08:00] [now=2024/01/13 09:09:13.518 +08:00] [update-physical-interval=50ms]
[2024/01/13 09:09:14.449 +08:00] [Warn] [tso.go:334] [“clock offset”] [] [jet-lag=158.025296ms] [prev-physical=2024/01/13 09:09:14.291 +08:00] [now=2024/01/13 09:09:14.449 +08:00] [update-physical-interval=50ms]
[2024/01/13 09:17:28.917 +08:00] [Warn] [grpclog.go:60] [“grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"”]
[2024/01/13 09:17:28.923 +08:00] [Warn] [grpclog.go:60] [“grpc: Server.processUnaryRPC failed to write status connection error: desc = "transport is closing"”]

9点09直接跳到9点17,tidb日志也是这样的。10.0.0.76 这个机器是不是pd tikv tidb混合部署的?
猜测是某个组件或者SQL使用的内存增加,导致内存使用100%后导致pd和tikv 挂起了。
建议查询内存配置
TiDB 内存控制文档 | PingCAP 文档中心
TiKV 内存参数性能调优 | PingCAP 文档中心

2 个赞

看下TiDB服务器下的日志目录,有个OOM日志记录的文件,看看是否那个时间段有出现OOM的SQL语句

1 个赞


绿色蓝色这两条SQL展开来看下呢。

1 个赞

还有“集群信息”里面的“主机信息”也截图下

1 个赞

学习了。

1 个赞

你cpu是共享的还是独立的?

1 个赞

查看SQL或者大事务

1 个赞

可以按你的思路排查一下top sql的执行计划,发出来看一下是否有优化的空间。
image

2 个赞

4核很容易跑满

1 个赞

配置太低了,稍微来几个慢查询资源就用满了。

1 个赞

感觉你这个是纯粹并发太高,慢日志多,cpu不够

1 个赞

你自己top观察load average是不是很高了

1 个赞

你的拓扑图发一下,应该是pd和tidb-server和tikv-server混合部署了,然后tidb-server直接将物理机内存占满导致pd直接卡死,连接不上了,等于整个集群从9点9分到9点17卡死了,如果资源只有这么点的话,建议把tidb的内存限制也配上,oom超时的sql选择直接杀掉把。。。tikv-server的内存block-cache.capacity%限制是5G,也可以再调小一点,设个3G吧。。。

2 个赞

2s的tsowait是不可接受的。

这意味着pd基本没有cpu时间来提供tso。tidb的整个系统都特别依赖pd的tso服务。这个tso服务一慢,所有的服务都慢。

起码要把pd和tikv分开,tidb虽然也占cpu,但是没有tikv那么离谱。
最好是pd单独部署。

1 个赞

看看执行计划:排序是否可以利用上索引的有序性的特性的。如果没有,可以考虑创建组合索引。
SELECT
record.license,
record_type,
record.park_name,
image,
in_time,
out_time,
record.money ‘recordMoney’,
record.create_time ‘createTime’,
record.update_time ‘updateTime’,
record_status,
record.park_id AS ‘parkId’,
record.member_id,
record.order_id,
record.NO AS ‘no’,
record.remark AS ‘remark’,
record.index,
record.id,
pay_type,
pay_no,
pay_time,
payment,
porder.money,
voucher_id,
integral_id,
invoice_id,
discount_money,
discount_voucher,
discount_card,
discount_integral,
ppd.device_sn AS ‘roadsSn’,
ppd.ip AS ‘roadsIp’,
ppd.type AS ‘roadsType’,
ppd.name AS ‘roadsName’,
ps.code,
ps.device_sn AS ‘deviceSn’,
(
SELECT
group_concat( pi.url, ‘#’, pi.type SEPARATOR ‘,’ )
FROM
parking_image AS pi
WHERE
pi.group_id = record.id
AND pi.is_deleted = 0
) AS img_url
FROM
parking_record record
LEFT JOIN parking_pay_order porder ON record.order_id = porder.id
AND porder.status = 1
LEFT JOIN parking_plc_device ppd ON record.plc_id = ppd.id
LEFT JOIN parking_seats ps ON record.seats_id = ps.id
WHERE
record.is_deleted = 0
AND record.record_status = ‘1’ \ n
AND record.park_id = 1706920397874491394
AND record.tenant_id = ‘928703’
AND record.plc_id = 1706920676269858818
AND record.seats_id = 1743170094616018945
ORDER BY
record.in_time DESC
LIMIT 20

1 个赞

配置感觉确实不是很够的

1 个赞

cpu不够吧

1 个赞

从tidb dashboard入手,配合监控找出慢sql优化

1 个赞