【 TiDB 使用环境】生产环
【 TiDB 版本】5.0.1
【遇到的问题:问题现象及影响】
- tikv-server. cpu 上升、发现业务放量。但是qps整体不高。
告警详情: tikv-02 CPU使用大于90%(目前使用:92.60833334177732%)
告警时间: 2023-03-10 14:07:52
-
检查流量情况:
发现qps不太高
-
集群tikv的负载开始上升
告警详情: tikv-04 CPU使用大于80%(目前使用:86.6583333350718%)
告警时间: 2023-03-10 14:12:52
查看tikv信息,发现tikv hung住。
- 登陆故障tikv机器,server 状态正常 (running)
[root@tikv-02 log]# systemctl -l|grep tikv
tikv-20160.service loaded active running tikv service
systemctl status tikv-20160.service
- 检查集群状态
5.1 查看集群状态未:disconnected (截图未保留)
5.2 尝试telnettelnet tikv-02 20160
卡住
5.3 尝试启动tikv,查看集群状态还是down
tiup cluster start cluster_1 -N tikv-02
5.4 tiup 重启tikv节点,重启timeout 卡住
tiup cluster restart cluster_1 -N tikv-02
-
systemctl restart tikv-20160.service
后恢复。
-
手续收集日志发现:
- tikv.log
[2023/03/10 14:10:13.187 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=7479] [perf_stats.internal_key_skipped_count=43869] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=126282] [scan.range.first="Some(start: 7480000000000001135F72800000006D92D512 end: 7480000000000001135F72800000006D92D513)"] [scan.ranges=6442] [scan.total=27874] [scan.processed=13921] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989950792859665] [total_suspend_time=3.247s] [total_process_time=483ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=2ms] [wait_time=2ms] [total_lifetime=3.733s] [remote_host=ipv4: 192.168.8.183:5668] [region_id=43066374]
[2023/03/10 14:10:13.211 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=3730] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=46639] [scan.range.first="Some(start: 7480000000000001135F72800000006C12E805 end: 7480000000000001135F72800000006C12E806)"] [scan.ranges=2546] [scan.total=4822] [scan.processed=4000] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989960098971676] [total_suspend_time=3.218s] [total_process_time=164ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=2ms] [wait_time=2ms] [total_lifetime=3.385s] [remote_host=ipv4: 192.168.8.120:61611] [region_id=42864517]
[2023/03/10 14:10:13.247 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=9141] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=62716] [scan.range.first="Some(start: 7480000000000001135F72800000006F32B37C end: 7480000000000001135F72800000006F32B381)"] [scan.ranges=3170] [scan.total=8565] [scan.processed=6916] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989935169077294] [total_suspend_time=3.223s] [total_process_time=292ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=1ms] [wait_time=1ms] [total_lifetime=3.517s] [remote_host=ipv4: 192.168.8.120:61607] [region_id=43326515]
[2023/03/10 14:10:13.247 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=4046] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=26988] [scan.range.first="Some(start: 7480000000000001135F7280000000721C3FA0 end: 7480000000000001135F7280000000721C3FA3)"] [scan.ranges=1565] [scan.total=3756] [scan.processed=3420] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989943190683668] [total_suspend_time=3.25s] [total_process_time=98ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=3ms] [wait_time=3ms] [total_lifetime=3.352s] [remote_host=ipv4: 192.168.8.120:61609] [region_id=43579051]
[2023/03/10 14:10:13.256 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=18283] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=163988] [scan.range.first="Some(start: 7480000000000001135F728000000071FCB7D9 end: 7480000000000001135F728000000071FCB7DA)"] [scan.ranges=9455] [scan.total=19658] [scan.processed=17535] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989981581672464] [total_suspend_time=3.257s] [total_process_time=611ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=12ms] [wait_time=12ms] [total_lifetime=3.881s] [remote_host=ipv4: 192.168.8.120:61613] [region_id=43573861]
[2023/03/10 14:10:13.256 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=10332] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=98] [scan.range.first="Some(start: 7480000000000001135F72800000006C3D5883 end: 7480000000000001135F72800000006C3D6147)"] [scan.ranges=2] [scan.total=5169] [scan.processed=5167] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989960098971676] [total_suspend_time=3.171s] [total_process_time=9ms] [handler_build_time=0ns] [wait_time.snapshot=0ns] [wait_time.schedule=56ms] [wait_time=56ms] [total_lifetime=3.236s] [remote_host=ipv4: 192.168.8.120:61611] [region_id=42880285]
[2023/03/10 14:10:13.256 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=40] [perf_stats.internal_key_skipped_count=864] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=13914] [scan.range.first="Some(start: 7480000000000001135F72800000006E94BDEF end: 7480000000000001135F72800000006E94BDF0)"] [scan.ranges=804] [scan.total=1306] [scan.processed=1126] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989940870709252] [total_suspend_time=3.168s] [total_process_time=50ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=34ms] [wait_time=34ms] [total_lifetime=3.253s] [remote_host=ipv4: 192.168.8.120:61609] [region_id=43206949]
....
...
[2023/03/10 14:11:54.630 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:11:54.797 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:11:55.091 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:03.627 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.402 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:5666: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.402 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:5667: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.404 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:5668: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.407 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:61609: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.408 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:61611: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.409 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:61607: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.410 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:61613: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.426 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:21566: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.428 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:52567: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.428 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:1897: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.430 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:30194: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.430 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:33917: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.431 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:5665: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.433 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:22966: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.434 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:49120: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:08.165 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:08.165 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:13.165 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:23.165 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:2379: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:13:16.490 +08:00] [INFO] [util.rs:536] ["connecting to PD endpoint"] [endpoints=http://172.16.25.179:2379]
- tidb_err.log
<jemalloc>: Malformed conf string
<jemalloc>: Malformed conf string
<jemalloc>: Malformed conf string
<jemalloc>: Malformed conf string
考虑到jemalloc和内存有关,发现tikv在14:10左右内存占用有少量增长,但是没有oom。
问题:
- 希望大佬帮解答下,tikv 因为什么
hung
住,如何排查。
【资源配置】
【附件:截图/日志/监控】