One TiKV node is down: how should I handle it?

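To speed things up, please provide the following information when asking a question. Clearly described problems are answered first.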

  • [TiDB version]: 2.1.6
  • [Problem description]: One of the TiKV nodes is down. How should I debug and fix it?

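If your question is about performance tuning or troubleshooting, please download the script and run it. Be sure to select all of the terminal output, then copy, paste, and upload it.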

Here is the system log from the host running this TiKV, in case it helps:

Feb 13 16:12:16 Tikv01 systemd: Started tikv-20160 service.
Feb 13 16:12:16 Tikv01 run_tikv.sh: sync …
Feb 13 16:12:16 Tikv01 run_tikv.sh: real#0110m0.013s
Feb 13 16:12:16 Tikv01 run_tikv.sh: user#0110m0.001s
Feb 13 16:12:16 Tikv01 run_tikv.sh: sys#0110m0.012s
Feb 13 16:12:16 Tikv01 systemd: tikv-20160.service: main process exited, code=killed, status=11/SEGV
Feb 13 16:12:16 Tikv01 run_tikv.sh: ok
Feb 13 16:12:16 Tikv01 systemd: Unit tikv-20160.service entered failed state.
Feb 13 16:12:16 Tikv01 systemd: tikv-20160.service failed.
Feb 13 16:12:31 Tikv01 systemd: tikv-20160.service holdoff time over, scheduling restart.
Feb 13 16:12:31 Tikv01 systemd: Stopped tikv-20160 service.
Feb 13 16:12:31 Tikv01 systemd: Started tikv-20160 service.
Feb 13 16:12:31 Tikv01 run_tikv.sh: sync …
Feb 13 16:12:31 Tikv01 run_tikv.sh: real#0110m0.013s
Feb 13 16:12:31 Tikv01 run_tikv.sh: user#0110m0.001s
Feb 13 16:12:31 Tikv01 run_tikv.sh: sys#0110m0.012s
Feb 13 16:12:31 Tikv01 systemd: tikv-20160.service: main process exited, code=killed, status=11/SEGV
Feb 13 16:12:31 Tikv01 run_tikv.sh: ok
Feb 13 16:12:31 Tikv01 systemd: Unit tikv-20160.service entered failed state.
Feb 13 16:12:31 Tikv01 systemd: tikv-20160.service failed.
Feb 13 16:12:47 Tikv01 systemd: tikv-20160.service holdoff time over, scheduling restart.
Feb 13 16:12:47 Tikv01 systemd: Stopped tikv-20160 service.
Feb 13 16:12:47 Tikv01 systemd: Started tikv-20160 service.
Feb 13 16:12:47 Tikv01 run_tikv.sh: sync …
Feb 13 16:12:47 Tikv01 run_tikv.sh: real#0110m0.011s
Feb 13 16:12:47 Tikv01 run_tikv.sh: user#0110m0.000s
Feb 13 16:12:47 Tikv01 run_tikv.sh: sys#0110m0.010s
Feb 13 16:12:47 Tikv01 systemd: tikv-20160.service: main process exited, code=killed, status=11/SEGV
Feb 13 16:12:47 Tikv01 run_tikv.sh: ok
Feb 13 16:12:47 Tikv01 systemd: Unit tikv-20160.service entered failed state.
Feb 13 16:12:47 Tikv01 systemd: tikv-20160.service failed.
Feb 13 16:13:02 Tikv01 systemd: tikv-20160.service holdoff time over, scheduling restart.
Feb 13 16:13:02 Tikv01 systemd: Stopped tikv-20160 service.
Feb 13 16:13:02 Tikv01 systemd: Started tikv-20160 service.
Feb 13 16:13:02 Tikv01 run_tikv.sh: sync …
Feb 13 16:13:02 Tikv01 run_tikv.sh: real#0110m0.011s
Feb 13 16:13:02 Tikv01 run_tikv.sh: user#0110m0.000s
Feb 13 16:13:02 Tikv01 run_tikv.sh: sys#0110m0.010s
Feb 13 16:13:02 Tikv01 systemd: tikv-20160.service: main process exited, code=killed, status=11/SEGV
Feb 13 16:13:02 Tikv01 run_tikv.sh: ok
Feb 13 16:13:02 Tikv01 systemd: Unit tikv-20160.service entered failed state.
Feb 13 16:13:02 Tikv01 systemd: tikv-20160.service failed.
Feb 13 16:13:17 Tikv01 systemd: tikv-20160.service holdoff time over, scheduling restart.
Feb 13 16:13:17 Tikv01 systemd: Stopped tikv-20160 service.
Feb 13 16:13:17 Tikv01 systemd: Started tikv-20160 service.
Feb 13 16:13:17 Tikv01 run_tikv.sh: sync …
Feb 13 16:13:17 Tikv01 run_tikv.sh: real#0110m0.013s
Feb 13 16:13:17 Tikv01 run_tikv.sh: user#0110m0.002s
Feb 13 16:13:17 Tikv01 run_tikv.sh: sys#0110m0.010s
Feb 13 16:13:17 Tikv01 run_tikv.sh: ok
Feb 13 16:13:17 Tikv01 systemd: tikv-20160.service: main process exited, code=killed, status=11/SEGV
Feb 13 16:13:17 Tikv01 systemd: Unit tikv-20160.service entered failed state.
Feb 13 16:13:17 Tikv01 systemd: tikv-20160.service failed.
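The log above shows a restart loop: systemd starts the service, the main process is killed by signal 11 (SIGSEGV) within the same second, and a restart is scheduled about 15 seconds later. As a rough sketch of how to confirm that pattern over a longer log, the Python below extracts the "main process exited" lines and the gaps between them. The function name `summarize_exits`, the `SAMPLE` list, and the default year 2020 are assumptions made for illustration (syslog timestamps carry no year); none of this is part of TiKV or tidb-ansible.

```python
import re
from datetime import datetime

# Matches systemd "main process exited" syslog lines such as:
# Feb 13 16:12:16 Tikv01 systemd: tikv-20160.service: main process exited, code=killed, status=11/SEGV
EXIT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+systemd: "
    r"(?P<unit>\S+): main process exited, code=(?P<code>\w+), status=(?P<status>\S+)"
)


def summarize_exits(lines, year=2020):
    """Return (exits, gaps) for a list of syslog lines.

    exits: list of (timestamp, unit, status) for each main-process exit.
    gaps:  seconds between consecutive exits.
    The year is an assumption, because syslog timestamps omit it.
    """
    exits = []
    for line in lines:
        m = EXIT_RE.match(line.strip())
        if m:
            ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
            exits.append((ts, m.group("unit"), m.group("status")))
    gaps = [int((b[0] - a[0]).total_seconds()) for a, b in zip(exits, exits[1:])]
    return exits, gaps


# A few lines copied from the log in this post.
SAMPLE = [
    "Feb 13 16:12:16 Tikv01 systemd: tikv-20160.service: main process exited, code=killed, status=11/SEGV",
    "Feb 13 16:12:16 Tikv01 systemd: tikv-20160.service failed.",
    "Feb 13 16:12:31 Tikv01 systemd: tikv-20160.service: main process exited, code=killed, status=11/SEGV",
]

exits, gaps = summarize_exits(SAMPLE)
for ts, unit, status in exits:
    print(ts.strftime("%H:%M:%S"), unit, status)
print("seconds between exits:", gaps)
```

A constant gap with the same status on every attempt suggests the process crashes the same way each time it starts, rather than failing intermittently under load.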

Attached are the last few lines of tikv_stderr.log:

E0211 18:43:27.718481758 32640 error.c:285] Error 0x7f7e230c2000 is full, dropping error 0x7f7dfa54a620 = {"created":"@1581417807.718474624","description":"OS Error","errno":32,"file":"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.2.3/grpc/src/core/lib/iomgr/tcp_posix.c","file_line":582,"grpc_status":14,"os_error":"Broken pipe","syscall":"sendmsg"}

The last few lines of tikv.log:

I ran ansible-playbook start.yml

It failed with: fatal: [50.16.170.111]: FAILED! => {"changed": false, "elapsed": 300, "msg": "the TiKV port 20160 is not up"}
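The playbook error means that nothing accepted connections on port 20160 within 300 seconds. To re-check this by hand without rerunning the whole playbook, a minimal sketch of a TCP reachability test is shown here. The helper name `port_is_up` and the 3-second timeout are assumptions for illustration; this is not how tidb-ansible itself performs the check.

```python
import socket


def port_is_up(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


# Example call, using the host and port from the error message above:
# port_is_up("50.16.170.111", 20160)
```

If this returns False while systemd reports the service as started, the process is most likely exiting before it binds the port, which matches the SEGV restart loop in the system log.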

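The first screenshot looks like an OS-level error, so I suggest investigating at the operating-system level. The second screenshot only shows an error about a region missing its leader, which is unrelated to TiKV going down.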