TiFlash is in Down state: how to start it and find the cause

郑旭东石家庄 · May 20, 2024, 09:02

May 19 09:50:02 host5024 systemd: Started Session 18470 of user root.
May 19 10:00:01 host5024 systemd: Started Session 18471 of user root.
May 19 10:01:01 host5024 systemd: Started Session 18472 of user root.
May 19 10:10:01 host5024 systemd: Started Session 18473 of user root.
May 19 10:20:01 host5024 systemd: Started Session 18474 of user root.
May 19 10:30:01 host5024 systemd: Started Session 18475 of user root.
May 19 10:40:01 host5024 systemd: Started Session 18476 of user root.
May 19 10:50:01 host5024 systemd: Started Session 18477 of user root.
May 19 11:00:01 host5024 systemd: Started Session 18478 of user root.
May 19 11:01:01 host5024 systemd: Started Session 18479 of user root.
May 19 11:10:01 host5024 systemd: Started Session 18480 of user root.
May 19 11:20:01 host5024 systemd: Started Session 18481 of user root.
May 19 11:26:40 host5024 kernel: PDLeaderLoop[23438]: segfault at 30 ip 00000000091555c8 sp 00007f0b91dadeb0 error 4
May 19 11:26:40 host5024 abrt-hook-ccpp: Process 23325 (tiflash) of user 1000 killed by SIGSEGV - dumping core
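
To see how often the process has crashed, the abrt records can be pulled out of the system log. The following is a minimal Python sketch, assuming the log lines have the same layout as the excerpt above; the function name `find_crashes` and the regular expression are illustrative and not part of any TiDB tool.

```python
import re

# Matches abrt crash records in the layout shown in the excerpt above.
ABRT_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s[\d:]{8})\s(?P<host>\S+)\sabrt-hook-ccpp:\s"
    r"Process (?P<pid>\d+) \((?P<name>[^)]+)\) of user \d+ "
    r"killed by (?P<sig>SIG\w+)"
)


def find_crashes(lines, process="tiflash"):
    """Return (timestamp, pid, signal) for each abrt crash record of `process`."""
    crashes = []
    for line in lines:
        m = ABRT_RE.match(line)
        if m and m.group("name") == process:
            crashes.append((m.group("ts"), int(m.group("pid")), m.group("sig")))
    return crashes


sample = [
    "May 19 11:20:01 host5024 systemd: Started Session 18481 of user root.",
    "May 19 11:26:40 host5024 kernel: PDLeaderLoop[23438]: segfault at 30 "
    "ip 00000000091555c8 sp 00007f0b91dadeb0 error 4",
    "May 19 11:26:40 host5024 abrt-hook-ccpp: Process 23325 (tiflash) of "
    "user 1000 killed by SIGSEGV - dumping core",
]
print(find_crashes(sample))
```

In practice the lines would come from the system log file (often /var/log/messages on this kind of host, which is an assumption here) rather than from the `sample` list.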

郑旭东石家庄 · May 20, 2024, 09:02

Could you tell me which command I should use to bring TiFlash back up?

我是人间不清醒 · May 20, 2024, 09:06

Keep the TiFlash instance on 55.25, scale in (with tiup) the one on 55.24 that is co-located with TiKV, and change the TiFlash replica count to 1.
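
Changing the replica count is done per table with `ALTER TABLE ... SET TIFLASH REPLICA`, and the tables that currently have TiFlash replicas can be listed from `information_schema.tiflash_replica` (columns `TABLE_SCHEMA`, `TABLE_NAME`, `REPLICA_COUNT`). A small Python sketch that builds the statements from such a list follows; the helper name and the example table names are hypothetical, and it assumes the names contain no backticks.

```python
def replica_statements(tables, count=1):
    """Build one ALTER TABLE ... SET TIFLASH REPLICA statement per (schema, table)."""
    stmts = []
    for schema, table in tables:
        stmts.append(
            f"ALTER TABLE `{schema}`.`{table}` SET TIFLASH REPLICA {count};"
        )
    return stmts


# Hypothetical (schema, table) pairs, e.g. the result of:
#   SELECT TABLE_SCHEMA, TABLE_NAME FROM information_schema.tiflash_replica
#   WHERE REPLICA_COUNT > 1;
for stmt in replica_statements([("app", "orders"), ("app", "users")]):
    print(stmt)
```

The printed statements can be reviewed and then run through any MySQL client connected to TiDB.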

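郑旭东石家庄 · May 20, 2024, 09:21

I have another question. On the server that is reported as faulty, the flash service shows as running, the process is there, and I can telnet to it. What is going on here?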

郑旭东石家庄 · May 20, 2024, 09:24

Replica count: 1

郑旭东石家庄 · May 20, 2024, 09:39

The file /flash/tidb-data/tiflash-9000/metadata/db_161619/t_124444.sql does not exist
Messages like this one also appear in the error log of the healthy TiFlash node.

胡杨树旁 · May 20, 2024, 09:44

This TiFlash instance has its own startup script: under its scripts directory there is a run_**.sh script. Run that script and see what logs it prints.

wzf0072 · May 20, 2024, 09:51

tiup cluster start tidb-yunfengkong --node 192.168.55.24:9000

wzf0072 · May 20, 2024, 09:53

The system has already been trying to restart TiFlash automatically, so starting it directly probably will not work.

郑旭东石家庄 · May 20, 2024, 09:56

After running /flash/tidb-deploy/tiflash-9000/scripts/run_tiflash.sh, the system log shows:
May 20 17:45:37 host5024 systemd: Stopped tiflash service.
May 20 17:45:37 host5024 systemd: Started tiflash service.
May 20 17:45:38 host5024 bash: sync …
May 20 17:45:38 host5024 bash: real#0110m0.216s
May 20 17:45:38 host5024 bash: user#0110m0.000s
May 20 17:45:38 host5024 bash: sys#0110m0.014s
May 20 17:45:38 host5024 bash: ok
May 20 17:45:38 host5024 bash: arg matches is ArgMatches { args: {"engine-version": MatchedArg { occurs: 1, indices: [4], vals: ["v7.5.1"] }, "engine-git-hash": MatchedArg { occurs: 1, indices: [6], vals: ["9970e492dfdf2b4bee487fef4b27fee66131531f"] }, "config": MatchedArg { occurs: 1, indices: [2], vals: ["/flash/tidb-deploy/tiflash-9000/conf/tiflash-learner.toml"] }, "pd-endpoints": MatchedArg { occurs: 1, indices: [8, 9, 10], vals: ["192.168.55.22:2379", "192.168.55.23:2379", "192.168.55.25:2379"] }, "engine-label": MatchedArg { occurs: 1, indices: [12], vals: ["tiflash"] }, "engine-addr": MatchedArg { occurs: 1, indices: [14], vals: ["192.168.55.24:3930"] }}, subcommand: None, usage: Some("USAGE:\n TiFlash Proxy [FLAGS] [OPTIONS] --engine-git-hash --engine-label --engine-version ") }
May 20 17:46:58 host5024 abrt-hook-ccpp: Process 39329 (tiflash) of user 1000 killed by SIGSEGV - dumping core
May 20 17:47:13 host5024 abrt-server: Executable '/flash/tidb-deploy/tiflash-9000/bin/tiflash/tiflash' doesn't belong to any package and ProcessUnpackaged is set to 'no'
May 20 17:47:13 host5024 abrt-server: 'post-create' on '/var/spool/abrt/ccpp-2024-05-20-17:46:58-39329' exited with 1
May 20 17:47:13 host5024 abrt-server: Deleting problem directory '/var/spool/abrt/ccpp-2024-05-20-17:46:58-39329'
May 20 17:47:14 host5024 systemd-logind: New session 18753 of user root.

The output above keeps repeating.
The attachment contains the output from tiflash_error.log.

error1.log (1.8 MB)

郑旭东石家庄 · May 20, 2024, 09:57

Right, I tried that, and it will not come up.

郑旭东石家庄 · May 20, 2024, 09:58

I tried it, and it reports:
Starting instance 192.168.55.24:9000

Error: failed to start tiflash: failed to start: 192.168.55.24 tiflash-9000.service, please check the instance's log(/flash/tidb-deploy/tiflash-9000/log) for more detail.: timed out waiting for tiflash 192.168.55.24:9000 to be ready after 120s: tiflash store status is 'Ready', not fully running yet

Verbose debug logs has been written to /home/tidb/.tiup/logs/tiup-cluster-debug-2024-05-20-17-46-38.log.
tiup-cluster-debug-2024-05-20-17-46-38.log (121.5 KB)

Kongdom · May 20, 2024, 11:27

Have you looked at this?

tidb狂热爱好者 · May 20, 2024, 11:28

Remove all the tables from TiFlash first, then take this TiFlash node offline. After that, add a new TiFlash node and configure the tables again.

Billdi表弟 · May 20, 2024, 11:29

If nothing else works, take the TiFlash node offline, and add a node again later if you need one.

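Kongdom · May 20, 2024, 11:31

Seeing this reminds me of a case I ran into before: the server was short on resources, so the service started slowly. The error was reported at 120s, but the service was actually still starting, and when I checked the cluster status 5 minutes later it was up. I only noticed because something else happened to pull me away right after I started it, which is how I found out that it had started despite the error. Before that I had kept restarting it over and over, which made me think the service was broken and could not start.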
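
The idea of waiting past the 120s timeout instead of restarting repeatedly can be sketched as a simple polling loop. This is a minimal Python sketch under stated assumptions: `check_status` is a hypothetical callable supplied by the caller (for example, something that reads the Status column for 192.168.55.24:9000 from `tiup cluster display tidb-yunfengkong`) and is assumed to return a string such as "Up" or "Down"; the timeout and interval values are arbitrary.

```python
import time


def wait_until_up(check_status, timeout_s=600, interval_s=15,
                  sleep=time.sleep, clock=time.monotonic):
    """Poll check_status() until it returns "Up" or timeout_s has elapsed.

    Returns the list of statuses observed, in order, so the caller can
    see whether the node eventually came up.
    """
    seen = []
    deadline = clock() + timeout_s
    while True:
        status = check_status()
        seen.append(status)
        if status == "Up" or clock() >= deadline:
            return seen
        sleep(interval_s)


# Demonstration with a fake status source instead of a real cluster.
fake_statuses = iter(["Down", "Down", "Up"])
print(wait_until_up(lambda: next(fake_statuses), sleep=lambda s: None))
```

The important design point is that the loop only observes; it never restarts the service, so a slow startup is not interrupted.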

Billdi表弟 · May 20, 2024, 11:40

It is probably still a resource problem.

Kongdom · May 20, 2024, 11:41

Yes, that time it was solved by adding resources.

wzf0072 · May 21, 2024, 00:38

TiKV and TiFlash are deployed together on this node. Check the CPU, memory, and I/O usage in TiKV-Summary and TiFlash-Summary.

wzf0072 · May 21, 2024, 00:52

tiup cluster stop tidb-yunfengkong --node 192.168.55.24:9000
First stop TiFlash on 55.24. Once memory usage on this node has dropped back to 40%-50%, try starting TiFlash again:
tiup cluster start tidb-yunfengkong --node 192.168.55.24:9000
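
The "wait until memory usage is back to 40%-50%" step can be checked on the host itself. Here is a rough Python sketch that computes the used-memory percentage from the text of /proc/meminfo, assuming the kernel exposes the `MemAvailable` field; the function names and the 50% threshold default are illustrative choices.

```python
def mem_used_percent(meminfo_text):
    """Used memory as 100 * (1 - MemAvailable / MemTotal), from /proc/meminfo text."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields:
            values[key.strip()] = int(fields[0])  # first field is the number (kB)
    return 100.0 * (1 - values["MemAvailable"] / values["MemTotal"])


def ok_to_start(meminfo_text, threshold=50.0):
    """True when used memory is at or below the threshold percentage."""
    return mem_used_percent(meminfo_text) <= threshold


# Made-up sample values, not taken from the host in this thread.
sample_meminfo = (
    "MemTotal:       65536000 kB\n"
    "MemFree:         1000000 kB\n"
    "MemAvailable:   39321600 kB\n"
)
print(round(mem_used_percent(sample_meminfo), 1), ok_to_start(sample_meminfo))
```

On the real host the text would come from `open("/proc/meminfo").read()` instead of the sample string.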

Column - Troubleshooting high memory usage on TiKV nodes in a TiDB cluster | TiDB Community