TiFlash 同步卡住

【TiDB 集群信息】

  • 集群版本: v5.1.1
  • TiKV 节点个数: 5
  • TiFlash 节点个数: 1

【遇到的问题】
扩容 Tiflash 节点后,给某个表设置副本后,还没完成同步进度就卡住了,而且从添加到现在已经好几天了进度都一动不动的!

【排查步骤】
我是根据官网文档 https://docs.pingcap.com/zh/tidb/stable/troubleshoot-tiflash#tiflash-数据同步卡住 一步步去排查的:
1、low-space-ratio 是默认的 0.8,虽然 TiFlash 跟 TiKV 放在同一台服务器(不同挂载目录),但是还是没有达到它的阈值,所以应该不是空间问题
2、查看 flash_cluster_manager.log 日志,flash_region_count 没有更新
3、cpu 使用率 region-worker 并不是一条只限,证明是有在工作?

【疑问】
1、我尝试把副本删除之后,重新添加还是不行!
2、有什么办法可以临时解决吗,因为现在 Tiflash 已经没法使用了,把它缩容从新扩容是否可行呢?

缩容后重新扩容是可行的,但是需要先解决 相关表的 tiflash 副本的设置后在缩容

tiflash 有什么错误信息么?

删除tifash副本,再缩容

tiflash的日志发一下。

摘取的部分错误日志,麻烦帮忙看下

[2022/06/03 13:18:25.801 +08:00] [ERROR] [<unknown>] ["pingcap.pd: get safe point failed: 2: rpc error: code = Unavailable desc = not leader"] [thread_id=41]
[2022/06/03 13:18:25.802 +08:00] [ERROR] [<unknown>] ["pingcap.pd: failed to get cluster id by :http://192.168.1.x:2379"] [thread_id=2]
[2022/06/03 13:18:25.802 +08:00] [ERROR] [<unknown>] ["pingcap.pd: failed to get cluster id by :http://192.168.1.x:2379"] [thread_id=2]
[2022/06/03 13:18:25.803 +08:00] [ERROR] [<unknown>] ["pingcap.pd: failed to get cluster id by :http://192.168.1.x:2379"] [thread_id=2]
[2022/06/03 13:18:25.803 +08:00] [ERROR] [<unknown>] ["pingcap.pd: Exception: failed to update leader"] [thread_id=2]
[2022/06/03 19:33:20.919 +08:00] [ERROR] [<unknown>] ["pingcap.pd: get safe point failed: 2: rpc error: code = Unavailable desc = not leader"] [thread_id=15]
[2022/06/03 19:33:20.920 +08:00] [ERROR] [<unknown>] ["pingcap.pd: failed to get cluster id by :http://192.168.1.x:2379"] [thread_id=2]
[2022/06/03 19:33:20.920 +08:00] [ERROR] [<unknown>] ["pingcap.pd: failed to get cluster id by :http://192.168.1.x:2379"] [thread_id=2]
[2022/06/03 19:33:20.921 +08:00] [ERROR] [<unknown>] ["pingcap.pd: failed to get cluster id by :http://192.168.1.x:2379"] [thread_id=2]
[2022/06/03 19:33:20.921 +08:00] [ERROR] [<unknown>] ["pingcap.pd: Exception: failed to update leader"] [thread_id=2]
[2022/06/03 22:18:48.615 +08:00] [ERROR] [<unknown>] ["pingcap.pd: get safe point failed: 4: Deadline Exceeded"] [thread_id=30]
[2022/06/03 22:18:49.099 +08:00] [ERROR] [<unknown>] ["pingcap.pd: get safe point failed: 2: rpc error: code = Unavailable desc = not leader"] [thread_id=30]
[2022/06/03 22:18:49.100 +08:00] [ERROR] [<unknown>] ["pingcap.pd: failed to get cluster id by :http://192.168.1.x:2379"] [thread_id=2]
[2022/06/03 22:18:49.101 +08:00] [ERROR] [<unknown>] ["pingcap.pd: failed to get cluster id by :http://192.168.1.x:2379"] [thread_id=2]
[2022/06/03 22:18:49.101 +08:00] [ERROR] [<unknown>] ["pingcap.pd: failed to get cluster id by :http://192.168.1.x:2379"] [thread_id=2]
[2022/06/03 22:18:49.101 +08:00] [ERROR] [<unknown>] ["pingcap.pd: Exception: failed to update leader"] [thread_id=2]
[2022/06/04 05:32:03.309 +08:00] [ERROR] [<unknown>] ["pingcap.tikv: Get Failed4: Deadline Exceeded"] [thread_id=32]
[2022/06/04 05:32:08.409 +08:00] [ERROR] [<unknown>] ["pingcap.tikv: Get Failed4: Deadline Exceeded"] [thread_id=32]
[2022/06/04 05:32:13.543 +08:00] [ERROR] [<unknown>] ["pingcap.tikv: Get Failed4: Deadline Exceeded"] [thread_id=32]
[2022/06/04 05:32:18.914 +08:00] [ERROR] [<unknown>] ["pingcap.tikv: Get Failed4: Deadline Exceeded"] [thread_id=32]
[2022/06/04 05:32:38.582 +08:00] [ERROR] [<unknown>] ["pingcap.tikv: Get Failed4: Deadline Exceeded"] [thread_id=38]
[2022/06/04 05:32:46.901 +08:00] [ERROR] [<unknown>] ["pingcap.tikv: Get Failed4: Deadline Exceeded"] [thread_id=38]
[2022/06/04 05:33:05.713 +08:00] [ERROR] [<unknown>] ["pingcap.tikv: Get Failed4: Deadline Exceeded"] [thread_id=33]
[2022/06/04 05:33:10.819 +08:00] [ERROR] [<unknown>] ["pingcap.tikv: Get Failed4: Deadline Exceeded"] [thread_id=33]
[2022/06/04 16:18:47.868 +08:00] [ERROR] [<unknown>] ["pingcap.pd: get safe point failed: 4: Deadline Exceeded"] [thread_id=31]
[2022/06/05 21:18:42.370 +08:00] [ERROR] [<unknown>] ["pingcap.pd: write tso failed\
 "] [thread_id=113]
[2022/06/07 04:38:19.802 +08:00] [ERROR] [<unknown>] ["pingcap.pd: get safe point failed: 4: Deadline Exceeded"] [thread_id=36]

1.请检查对应的pd状态
2. gc时间是多少
3. tiflash到pd的网络是否有问题

1、检查 pd 的监控图是ok的


错误日志也不多

[2022/06/06 23:18:40.299 +08:00] [ERROR] [client.go:179] ["region sync with leader meet error"] [error="[PD:grpc:ErrGRPCRecv]rpc error: code = Canceled desc = context canceled"]
[2022/06/07 01:02:04.346 +08:00] [ERROR] [client.go:179] ["region sync with leader meet error"] [error="[PD:grpc:ErrGRPCRecv]rpc error: code = Canceled desc = context canceled"]

2、tidb_gc_life_time 默认 10min

3、不排除之前是网络波动的影响,但是最近这两天的错误日志也没有刷 failed to get cluster id by ... 的日志,即使网络偶尔波动也不至于导致进度没有变化吧

可以尝试循环重启pd 试一下

轮流重启会影响服务吗

相当于滚动升级 影响小

1、已经重启了,但是并没有用!

2、我发现目前集群中有几个 down-peer,请问有影响吗?

有的 如果tiflash需要同步的副本down掉了 无法继续,检查下对应的region,是否可以修复。

我使用 pd-ctl region check-down-peer 查出 down peer 的 id,但是它关联的表并不是我在 tiflash 添加的表,所以应该没关系吧

不行喔,还是会出现进度突然卡着不动的情况

  1. 把副本删掉,然后 log 改 DEBUG 级别,再加副本后卡住,把 log 文件贴到群里,谢谢。
  2. 如果监控信息能提供,clinic 也提供下相关面板。

1、我看错误日志里面有这些报错,指的是 1000464025 这个 region 有问题吗

./tiflash_error.log:2022.06.10 07:10:29.092088 [ 6 ] <Warning> pingcap.tikv: region {1000464025,2404,16} find error: peer is not leader for region 1000464025, leader may Some(id: 1320844893 store_id: 1006624325)
./tiflash_error.log:2022.06.10 07:13:58.107618 [ 34 ] <Warning> pingcap.tikv: region {1000464025,2404,16} find error: peer is not leader for region 1000464025, leader may Some(id: 1316086596 store_id: 1005346961)
./tiflash_error.log:2022.06.10 08:13:39.429593 [ 9 ] <Warning> pingcap.tikv: region {1000464025,2404,16} find error: peer is not leader for region 1000464025, leader may Some(id: 1320844893 store_id: 1006624325)
./tiflash_error.log:2022.06.10 08:18:55.258946 [ 37 ] <Warning> pingcap.tikv: region {1000464025,2404,16} find error: peer is not leader for region 1000464025, leader may Some(id: 1316086596 store_id: 1005346961)
./tiflash_error.log:2022.06.10 09:45:21.527625 [ 48 ] <Warning> pingcap.tikv: region {1000464025,2404,16} find error: peer is not leader for region 1000464025, leader may Some(id: 1320844893 store_id: 1006624325)
./tiflash_error.log:2022.06.10 09:53:44.463513 [ 34 ] <Warning> pingcap.tikv: region {1000464025,2404,16} find error: peer is not leader for region 1000464025, leader may Some(id: 1316086596 store_id: 1005346961)

2、我查看这个同步卡住的表的 region 信息 show table xxx regions ,并没有发现 1000464025 这个 region 喔

我在 tiflash 日志发现一些日志,说是没找到 region,然后我查了一下这些 region id 的 leader 都是指向某一个 store ,是不是证明这个 store 有问题

15:10 分开始添加副本的,只有一个 region,却一直卡着,下面是 15:10 - 15:20 的日志,不同日志我都放在一个文件里面了
tiflash.log (29.5 KB)

可以从这个几个方面看看

  1. 看看 PD 监控,是否有region 副本处于异常状态。pd会优先处理down peer 的 region 副本。
  2. 看 tiflash_tikv.log 日志,是否有报错异常

1、有 down region ,这个是会自动修复吗?

2、日志中找到这几个可疑的错误日志

[root@ip-192-168-1-23 log]# grep -i error tiflash_tikv.log
[2022/06/10 15:07:15.594 +08:00] [ERROR] [server.rs:799] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]
[2022/06/10 15:07:17.105 +08:00] [WARN] [store.rs:1353] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: \"Permission denied\" }"]
[2022/06/10 15:07:17.107 +08:00] [INFO] [<unknown>] ["Failed to add :: listener, the environment may not support IPv6: {\"created\":\"@1654844837.107368099\",\"description\":\"Address family not supported by protocol\",\"errno\":97,\"file\":\"/root/.cargo/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.8.1/grpc/src/core/lib/iomgr/socket_utils_common_posix.cc\",\"file_line\":443,\"os_error\":\"Address family not supported by protocol\",\"syscall\":\"socket\",\"target_address\":\"[::]:20170\"}"]