tidb-server restart troubleshooting

To help us work efficiently, please provide the following information; a clearly described problem gets resolved faster:

[TiDB Version]
Cluster type: tidb
Cluster version: v4.0.7
SSH type: builtin
v1.3.2 tiup
Go Version: go1.13
Git Branch: release-1.3
GitHash: 2d88460
[Problem Description]
Where can I download the diagnostic collection script info_gathering.py? Through the link I could only find info_collecting.py, and running that script reports that the inventory.ini file is missing.

Issue description: tidb-server restarted automatically; the tidb_stderr.log file is attached below.


For performance-tuning or troubleshooting questions, please download and run the script. Be sure to select all of the terminal output, then copy, paste, and upload it.

Please upload the complete tidb_stderr.log and tidb.log files from the node that reported the error.

tidb_stderr.log (800.5 KB) logs-tidb_192_168_157_45_4000.zip (4.2 MB)

It restarted automatically in the early morning hours.
[2021/02/26 00:55:02.308 +08:00] [Info] [printer.go:33] ["Welcome to TiDB."] ["Release Version"=v4.0.7-2-gac0d45662] [Edition=Community] ["Git Commit Hash"=ac0d4566229a4e15d6d1773cc399a0b921d290a9] ["Git Branch"=HEAD] ["UTC Build Time"="2020-11-06 03:19:50"] [GoVersion=go1.13] ["Race Enabled"=false] ["Check Table Before Drop"=false] ["TiKV Min Version"=v3.0.0-60965b006877ca7234adaced7890d7b029ed1306]

Were there heavy data imports or queries running in the early morning? Run dmesg|grep -i kill to check for OOM kill records; I suspect an OOM.

Yes, batch jobs run in the early morning. At first I also thought it was an OOM, but according to the monitoring graphs the machine was only using about 70 GB out of 512 GB of total memory.

[root@tiflash1 log]# dmesg|grep -i kill
[4175844.027699] snapshot-worker invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[4175844.027774] [] oom_kill_process+0x25e/0x3f0
[4175844.028607] Out of memory: Kill process 56545 (tidb-server) score 968 or sacrifice child
[4175844.028698] Killed process 56545 (tidb-server), UID 1000, total-vm:528557244kB, anon-rss:514268888kB, file-rss:0kB, shmem-rss:0kB
[4211403.979596] BackgrProcPool invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[4211403.979672] [] oom_kill_process+0x25e/0x3f0
[4211403.980584] Out of memory: Kill process 153857 (tidb-server) score 967 or sacrifice child
[4211403.980674] Killed process 153857 (tidb-server), UID 1000, total-vm:530463276kB, anon-rss:513874916kB, file-rss:0kB, shmem-rss:0kB
[4219568.627511] time-monitor invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[4219568.627587] [] oom_kill_process+0x25e/0x3f0
[4219568.628485] Out of memory: Kill process 230999 (tidb-server) score 968 or sacrifice child
[4219568.628589] Killed process 230999 (tidb-server), UID 1000, total-vm:525300312kB, anon-rss:513984316kB, file-rss:0kB, shmem-rss:0kB
[4219568.687818] systemd-journal invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[4219568.687887] [] oom_kill_process+0x25e/0x3f0
[4219568.688542] Out of memory: Kill process 231005 (tidb-server) score 968 or sacrifice child
[4219568.688615] Killed process 231005 (tidb-server), UID 1000, total-vm:525300312kB, anon-rss:513992548kB, file-rss:80kB, shmem-rss:0kB
[9992550.384948] BackgrProcPool invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[9992550.385013] [] oom_kill_process+0x25e/0x3f0
[9992550.385922] Out of memory: Kill process 20910 (tidb-server) score 968 or sacrifice child
[9992550.386017] Killed process 20910 (tidb-server), UID 1000, total-vm:522304772kB, anon-rss:514323468kB, file-rss:0kB, shmem-rss:0kB
[11321240.099748] tidb-server invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[11321240.099888] [] oom_kill_process+0x25e/0x3f0
[11321240.100848] Out of memory: Kill process 161585 (tidb-server) score 968 or sacrifice child
[11321240.100940] Killed process 161585 (tidb-server), UID 1000, total-vm:522148404kB, anon-rss:514073392kB, file-rss:0kB, shmem-rss:0kB
[11487907.596993] tidb-server invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[11487907.597113] [] oom_kill_process+0x25e/0x3f0
[11487907.598247] Out of memory: Kill process 136959 (tidb-server) score 967 or sacrifice child
[11487907.598352] Killed process 136861 (tidb-server), UID 1000, total-vm:524263392kB, anon-rss:513819244kB, file-rss:0kB, shmem-rss:0kB
[root@tiflash1 log]# dmesg -T |grep -i "out of memory"
[三 11月 11 19:11:46 2020] Out of memory: Kill process 56545 (tidb-server) score 968 or sacrifice child
[四 11月 12 05:04:25 2020] Out of memory: Kill process 153857 (tidb-server) score 967 or sacrifice child
[四 11月 12 07:20:30 2020] Out of memory: Kill process 230999 (tidb-server) score 968 or sacrifice child
[四 11月 12 07:20:30 2020] Out of memory: Kill process 231005 (tidb-server) score 968 or sacrifice child
[一 1月 18 02:56:52 2021] Out of memory: Kill process 20910 (tidb-server) score 968 or sacrifice child
[二 2月 2 12:01:42 2021] Out of memory: Kill process 161585 (tidb-server) score 968 or sacrifice child
[四 2月 4 10:19:29 2021] Out of memory: Kill process 136959 (tidb-server) score 967 or sacrifice child

How many tidb-server instances are there in total? What does the batch job logic look like?

There are 2 tidb-servers. The batch jobs do LOAD DATA plus detail-level aggregation and summary statistics.

Please provide the monitoring information from the TiDB dashboard.

  1. The dmesg and tidb.log you provided don't contain much information.
  2. Please check when tidb_stderr.log was last written to; the last thing I see in it is a goroutine stack trace.
    ls -lrt tidb_stderr.log
  3. Please provide the /var/log/messages entries covering the period around [2021/02/26 00:55:02.308 +08:00] (a rough command sketch follows after this list).
    On other operating systems the path may differ; in that case please send the OS's complete messages log. Thanks.
  4. Also, is this the only tidb-server that restarted, or were there others?
  5. Please run tiup display to show your topology. Thanks.
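
For reference, a minimal sketch of the commands for items 2 and 3, assuming the default CentOS/RHEL syslog path and timestamp format (adjust the path and date pattern to your environment):

    # last write time of tidb_stderr.log (run from the TiDB deploy log directory)
    ls -lrt tidb_stderr.log
    # syslog entries in the ten minutes around the 00:55 restart on Feb 26
    grep 'Feb 26 00:5' /var/log/messages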

1. The full contents of tidb_stderr.log have already been uploaded; the uploaded tidb.log covers 0:00-6:00 in the early morning. Last write time of tidb_stderr.log:
808K -rw-r--r--. 1 tidb tidb 801K 2月 26 00:54 tidb_stderr.log
2. The OS's complete messages log:
Feb 26 00:50:01 tiflash1 systemd: Started Session 27033 of user root.
Feb 26 00:54:46 tiflash1 systemd: tidb-4000.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Feb 26 00:54:46 tiflash1 systemd: Unit tidb-4000.service entered failed state.
Feb 26 00:54:46 tiflash1 systemd: tidb-4000.service failed.
Feb 26 00:55:02 tiflash1 systemd: tidb-4000.service holdoff time over, scheduling restart.
Feb 26 00:55:02 tiflash1 systemd: Stopped tidb service.
Feb 26 00:55:02 tiflash1 systemd: Started tidb service.
Feb 26 01:00:01 tiflash1 systemd: Started Session 27034 of user root.
3. Only this one tidb-server restarted.
4. By "tiup display topology" do you mean tiup cluster display?

[root@monitor home]# tiup cluster display -TiDB
Starting component cluster: /root/.tiup/components/cluster/v1.3.2/tiup-cluster display
Cluster type: tidb
Cluster name: -TiDB
Cluster version: v4.0.7
SSH type: builtin
Dashboard URL: http://192.168.157.42:2379/dashboard
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir


192.168.157.49:9093 alertmanager 192.168.157.49 9093/9094 linux/x86_64 Up /data/tidb-data/alertmanager-9093 /data/tidb-deploy/alertmanager-9093
192.168.157.49:3000 grafana 192.168.157.49 3000 linux/x86_64 Up - /data/tidb-deploy/grafana-3000
192.168.157.41:2379 pd 192.168.157.41 2379/2380 linux/x86_64 Up|L /data/tidb-data/pd-2379 /data/tidb-deploy/pd-2379
192.168.157.42:2379 pd 192.168.157.42 2379/2380 linux/x86_64 Up|UI /data/tidb-data/pd-2379 /data/tidb-deploy/pd-2379
192.168.157.43:2379 pd 192.168.157.43 2379/2380 linux/x86_64 Up /data/tidb-data/pd-2379 /data/tidb-deploy/pd-2379
192.168.157.49:9090 prometheus 192.168.157.49 9090 linux/x86_64 Up /data/tidb-data/prometheus-8249 /data/tidb-deploy/prometheus-8249
192.168.157.45:4000 tidb 192.168.157.45 4000/10080 linux/x86_64 Up - /data1/tidb-deploy/tidb-4000
192.168.157.46:4001 tidb 192.168.157.46 4001/10081 linux/x86_64 Up - /data/tidb-deploy/tidb-4001
192.168.157.47:4000 tidb 192.168.157.47 4000/10080 linux/x86_64 Up - /data/tidb-deploy/tidb-4000
192.168.157.48:4000 tidb 192.168.157.48 4000/10080 linux/x86_64 Up - /data/tidb-deploy/tidb-4000
192.168.157.45:9000 tiflash 192.168.157.45 9000/8123/3930/20170/20292/8234 linux/x86_64 N/A /data1/tiflash-9000 /data/tidb-deploy/tiflash-9000
192.168.157.41:20160 tikv 192.168.157.41 20160/20180 linux/x86_64 Up /data1/tidb-data/tikv-20160 /data1/tidb-deploy/tikv-20160
192.168.157.41:20161 tikv 192.168.157.41 20161/20181 linux/x86_64 Up /data2/tidb-data/tikv-20161 /data2/tidb-deploy/tikv-20161
192.168.157.41:20162 tikv 192.168.157.41 20162/20182 linux/x86_64 Up /data3/tidb-data/tikv-20162 /data3/tidb-deploy/tikv-20162
192.168.157.41:20163 tikv 192.168.157.41 20163/20183 linux/x86_64 Up /data4/tidb-data/tikv-20163 /data4/tidb-deploy/tikv-20163
192.168.157.42:20160 tikv 192.168.157.42 20160/20180 linux/x86_64 Up /data1/tidb-data/tikv-20160 /data1/tidb-deploy/tikv-20160
192.168.157.42:20161 tikv 192.168.157.42 20161/20181 linux/x86_64 Up /data2/tidb-data/tikv-20161 /data2/tidb-deploy/tikv-20161
192.168.157.42:20162 tikv 192.168.157.42 20162/20182 linux/x86_64 Up /data3/tidb-data/tikv-20162 /data3/tidb-deploy/tikv-20162
192.168.157.42:20163 tikv 192.168.157.42 20163/20183 linux/x86_64 Up /data4/tidb-data/tikv-20163 /data4/tidb-deploy/tikv-20163
192.168.157.43:20160 tikv 192.168.157.43 20160/20180 linux/x86_64 Up /data1/tidb-data/tikv-20160 /data1/tidb-deploy/tikv-20160
192.168.157.43:20161 tikv 192.168.157.43 20161/20181 linux/x86_64 Up /data2/tidb-data/tikv-20161 /data2/tidb-deploy/tikv-20161
192.168.157.43:20162 tikv 192.168.157.43 20162/20182 linux/x86_64 Up /data3/tidb-data/tikv-20162 /data3/tidb-deploy/tikv-20162
192.168.157.43:20163 tikv 192.168.157.43 20163/20183 linux/x86_64 Up /data4/tidb-data/tikv-20163 /data4/tidb-deploy/tikv-20163
192.168.157.44:20160 tikv 192.168.157.44 20160/20180 linux/x86_64 Up /data1/tidb-data/tikv-20160 /data1/tidb-deploy/tikv-20160
192.168.157.44:20161 tikv 192.168.157.44 20161/20181 linux/x86_64 Up /data2/tidb-data/tikv-20161 /data2/tidb-deploy/tikv-20161
192.168.157.44:20162 tikv 192.168.157.44 20162/20182 linux/x86_64 Up /data3/tidb-data/tikv-20162 /data3/tidb-deploy/tikv-20162
192.168.157.44:20163 tikv 192.168.157.44 20163/20183 linux/x86_64 Up /data4/tidb-data/tikv-20163 /data4/tidb-deploy/tikv-20163
Total nodes: 27

  1. From the logs and monitoring, duration rose for a period before the tidb-server restart. Were there a lot of queries or analytical jobs running?
  2. It looks like tiflash is co-deployed on the same host as the tidb-server that restarted? Could you check whether the tiflash logs from that time look normal?
  3. Please also upload the tiflash monitoring information, along with the System Info monitoring. Thanks.

1. Yes, batch jobs were running for a period before the tidb-server restart.
2. tiflash was scaled in earlier and is no longer in use; its status is N/A. The System Info monitoring is already in the screenshots above.
3. tiflash_error.log:
2021.02.26 00:05:02.250436 [ 42 ] pingcap.tikv: region {4351075,5633,23} find error: peer is not leader for region 4351075, leader may Some(id: 26621732 store_id: 7)
2021.02.26 00:05:43.447803 [ 35 ] pingcap.tikv: region {4351075,5639,23} find error: region 4351075 is missing
2021.02.26 00:06:04.429915 [ 48 ] pingcap.tikv: region {843673,18752,3} find error: peer is not leader for region 843673, leader may Some(id: 26621758 store_id: 4)
2021.02.26 00:07:15.339731 [ 44 ] pingcap.tikv: region {4351075,5642,23} find error: region 4351075 is missing
2021.02.26 00:07:25.416708 [ 43 ] pingcap.tikv: region {843673,18752,3} find error: peer is not leader for region 843673, leader may Some(id: 26720592 store_id: 9)
2021.02.26 00:34:08.142928 [ 50 ] pingcap.tikv: region {843673,18752,3} find error: peer is not leader for region 843673, leader may Some(id: 26764439 store_id: 13)
2021.02.26 00:34:18.234033 [ 35 ] pingcap.tikv: region {4351075,5645,23} find error: peer is not leader for region 4351075, leader may Some(id: 26762864 store_id: 13)
2021.02.26 00:34:48.424024 [ 50 ] pingcap.tikv: region {843673,18758,3} find error: region 843673 is missing
2021.02.26 00:53:40.170886 [ 50 ] pingcap.tikv: CheckTxnStatus Failed4: Deadline Exceeded
2021.02.26 05:15:36.814464 [ 36 ] pingcap.tikv: region {843673,18761,3} find error: region 843673 is missing
2021.02.26 05:15:47.556725 [ 43 ] pingcap.tikv: region {843673,18764,3} find error: peer is not leader for region 843673, leader may Some(id: 26725687 store_id: 16)

Thanks, and sorry for the late reply. We are analyzing the panic and will share updates as soon as there is progress. Thanks.

https://github.com/pingcap/tidb/pull/20359 Hello, this should be the issue; upgrading to v4.0.8 or later fixes it.

OK, thanks. We will try upgrading.

:handshake:

tiup cluster upgrade <cluster-name> <version>

Will this rolling upgrade affect cluster functionality? This is already a production environment.

  1. If you have multiple instances of each component: when a tidb-server is upgraded, a load balancer in front (if you have one) will direct traffic to the other tidb-servers; when a tikv-server is upgraded, its leaders must first be transferred to other TiKV nodes, so there will be some retries during the leader-transfer phase; during the PD upgrade there is a brief impact of a few seconds when the PD leader switches.

  2. So it is still recommended to pick a time window during off-peak business hours for the upgrade (a rough command sketch follows below).
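
For reference, a rough sketch of the upgrade flow during an off-peak window. The cluster name is taken from your tiup cluster display output; the target version is only an example, any release at or above v4.0.8 that contains the fix from the PR above will do:

    # confirm all nodes are Up before starting
    tiup cluster display -TiDB
    # rolling upgrade to a release containing the fix (v4.0.9 shown only as an example)
    tiup cluster upgrade -TiDB v4.0.9
    # verify the version and node status afterwards
    tiup cluster display -TiDB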