5.0.1 tikv-server hung住,节点先状态从 disconnect -> down

【 TiDB 使用环境】生产环
【 TiDB 版本】5.0.1
【遇到的问题:问题现象及影响】

  1. tikv-server. cpu 上升、发现业务放量。但是qps整体不高。
告警详情: tikv-02 CPU使用大于90%(目前使用:92.60833334177732%)
告警时间: 2023-03-10 14:07:52
  1. 检查流量情况:
    发现qps不太高

  2. 集群tikv的负载开始上升

告警详情: tikv-04 CPU使用大于80%(目前使用:86.6583333350718%)
告警时间: 2023-03-10 14:12:52

查看tikv信息,发现tikv hung住。

  1. 登陆故障tikv机器,server 状态正常 (running)
[root@tikv-02 log]# systemctl -l|grep tikv
tikv-20160.service                                                             loaded active     running      tikv service

systemctl status tikv-20160.service

  1. 检查集群状态
    5.1 查看集群状态未:disconnected (截图未保留)

    5.2 尝试telnet telnet tikv-02 20160 卡住

5.3 尝试启动tikv,查看集群状态还是down

tiup cluster start cluster_1 -N tikv-02

5.4 tiup 重启tikv节点,重启timeout 卡住

tiup cluster restart  cluster_1 -N tikv-02

  1. systemctl restart tikv-20160.service后恢复。

  2. 手续收集日志发现:

  • tikv.log
[2023/03/10 14:10:13.187 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=7479] [perf_stats.internal_key_skipped_count=43869] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=126282] [scan.range.first="Some(start: 7480000000000001135F72800000006D92D512 end: 7480000000000001135F72800000006D92D513)"] [scan.ranges=6442] [scan.total=27874] [scan.processed=13921] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989950792859665] [total_suspend_time=3.247s] [total_process_time=483ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=2ms] [wait_time=2ms] [total_lifetime=3.733s] [remote_host=ipv4: 192.168.8.183:5668] [region_id=43066374]
[2023/03/10 14:10:13.211 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=3730] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=46639] [scan.range.first="Some(start: 7480000000000001135F72800000006C12E805 end: 7480000000000001135F72800000006C12E806)"] [scan.ranges=2546] [scan.total=4822] [scan.processed=4000] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989960098971676] [total_suspend_time=3.218s] [total_process_time=164ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=2ms] [wait_time=2ms] [total_lifetime=3.385s] [remote_host=ipv4: 192.168.8.120:61611] [region_id=42864517]
[2023/03/10 14:10:13.247 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=9141] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=62716] [scan.range.first="Some(start: 7480000000000001135F72800000006F32B37C end: 7480000000000001135F72800000006F32B381)"] [scan.ranges=3170] [scan.total=8565] [scan.processed=6916] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989935169077294] [total_suspend_time=3.223s] [total_process_time=292ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=1ms] [wait_time=1ms] [total_lifetime=3.517s] [remote_host=ipv4: 192.168.8.120:61607] [region_id=43326515]
[2023/03/10 14:10:13.247 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=4046] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=26988] [scan.range.first="Some(start: 7480000000000001135F7280000000721C3FA0 end: 7480000000000001135F7280000000721C3FA3)"] [scan.ranges=1565] [scan.total=3756] [scan.processed=3420] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989943190683668] [total_suspend_time=3.25s] [total_process_time=98ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=3ms] [wait_time=3ms] [total_lifetime=3.352s] [remote_host=ipv4: 192.168.8.120:61609] [region_id=43579051]
[2023/03/10 14:10:13.256 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=18283] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=163988] [scan.range.first="Some(start: 7480000000000001135F728000000071FCB7D9 end: 7480000000000001135F728000000071FCB7DA)"] [scan.ranges=9455] [scan.total=19658] [scan.processed=17535] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989981581672464] [total_suspend_time=3.257s] [total_process_time=611ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=12ms] [wait_time=12ms] [total_lifetime=3.881s] [remote_host=ipv4: 192.168.8.120:61613] [region_id=43573861]
[2023/03/10 14:10:13.256 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=0] [perf_stats.internal_key_skipped_count=10332] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=98] [scan.range.first="Some(start: 7480000000000001135F72800000006C3D5883 end: 7480000000000001135F72800000006C3D6147)"] [scan.ranges=2] [scan.total=5169] [scan.processed=5167] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989960098971676] [total_suspend_time=3.171s] [total_process_time=9ms] [handler_build_time=0ns] [wait_time.snapshot=0ns] [wait_time.schedule=56ms] [wait_time=56ms] [total_lifetime=3.236s] [remote_host=ipv4: 192.168.8.120:61611] [region_id=42880285]
[2023/03/10 14:10:13.256 +08:00] [INFO] [tracker.rs:248] [slow-query] [perf_stats.internal_delete_skipped_count=40] [perf_stats.internal_key_skipped_count=864] [perf_stats.block_read_byte=0] [perf_stats.block_read_count=0] [perf_stats.block_cache_hit_count=13914] [scan.range.first="Some(start: 7480000000000001135F72800000006E94BDEF end: 7480000000000001135F72800000006E94BDF0)"] [scan.ranges=804] [scan.total=1306] [scan.processed=1126] [scan.is_desc=false] [tag=select] [table_id=275] [txn_start_ts=439989940870709252] [total_suspend_time=3.168s] [total_process_time=50ms] [handler_build_time=1ms] [wait_time.snapshot=0ns] [wait_time.schedule=34ms] [wait_time=34ms] [total_lifetime=3.253s] [remote_host=ipv4: 192.168.8.120:61609] [region_id=43206949]
....
...
[2023/03/10 14:11:54.630 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:11:54.797 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:11:55.091 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:03.627 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.402 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:5666: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.402 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:5667: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.404 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:5668: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.407 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:61609: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.408 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:61611: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.409 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:61607: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.410 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:61613: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.426 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:21566: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.428 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:52567: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.428 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:1897: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.430 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:30194: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.430 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:33917: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.431 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:5665: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.433 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:22966: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:04.434 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:49120: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:08.165 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:08.165 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:13.165 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:20160: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:12:23.165 +08:00] [INFO] [<unknown>] ["ipv4:tikv-servers:2379: Keepalive watchdog fired. Closing transport."]
[2023/03/10 14:13:16.490 +08:00] [INFO] [util.rs:536] ["connecting to PD endpoint"] [endpoints=http://172.16.25.179:2379]
  • tidb_err.log
<jemalloc>: Malformed conf string
<jemalloc>: Malformed conf string
<jemalloc>: Malformed conf string
<jemalloc>: Malformed conf string

考虑到jemalloc和内存有关,发现tikv在14:10左右内存占用有少量增长,但是没有oom。

问题:

  1. 希望大佬帮解答下,tikv 因为什么hung 住,如何排查。

【资源配置】
【附件:截图/日志/监控】

  1. 是集群整体的负载一起增加了,但是每个节点都是均衡的,还是某个节点的负载会更高?
  2. 针对业务的操作上有没有什么变更会导致处理性能下降?

集群恢复工作后,有没有重现这个故障?

看了下更新的发布版本记录,5.0.5 和 5.0.6 都修复了大量的bug,不知道是否是已知的问题
https://docs.pingcap.com/zh/tidb/stable/release-5.0.5
https://docs.pingcap.com/zh/tidb/stable/release-5.0.6

信息纠正:集群版本为 5.0.1,集群版本写错,抱歉。

  1. 节点负载比较均衡。整个集群负载先上升后,然后tikv节点hung住。

  1. 业务有放量,整体集群qps 不高,故障期间没有明显流量上涨情况。
    沟通业务:业务存在放量,流量增长不明显,存在消息积压情况,初步判断消息积压因为tikv故障导致的集群性能下降导致的积压。

  2. 恢复后未发现出现类似故障。

  3. bug 未对比出相似的,我再查查。

在多观察下,看看 tikv 有没有 IO 瓶颈…

确实存在磁盘 io util 徒增情况,看起来没占满IO资源(没到达瓶颈)

CPU出现较大瓶颈, 机器8C:


建议参考 @xfworld 老师建议,升级一下版本

hi 老师,能帮看下这个oom的问题么?
近期出现多次了,已经怀疑人生了

@xfworld 老师 @Lucien-卢西恩 老师,辛苦有时间帮看下。

老师,版本升级, 如何测sql兼容性呢?
除了使用https://github.com/innerr/workload-sim.tidb.ticat 工具收集业务流量,还能通过其它方式来验证业务兼容性么?

tikv挂起了为啥不帖tikv的日志咧?

已补充在1楼帖子

OOM 的改善,只能优化,没其他的好办法

目前只有 6.5.X( LTS ) 对 OOM 的优化是最有效的

1 个赞

此话题已在最后回复的 60 天后被自动关闭。不再允许新回复。