After scaling TiFlash back out following a scale-in, the new nodes start successfully but never sync tables from TiKV; available and progress in tiflash_replica stay at 0 with no movement

【TiDB Environment】PoC
【TiDB Version】6.5.3
【Reproduction Steps】1. In a mixed-deployment setup, queries against some tables' TiFlash replicas kept failing with error 9012 "TiFlash server timeout".
2. To work around problem 1, all TiFlash nodes were scaled in, then redeployed (scaled out) on different ports.
3. Some tables were set back to TIFLASH REPLICA 1.
4. After a whole night, roughly 12 hours, TiFlash had not synced a single region.
5. The old nodes have been offline for nearly 20 hours, but pd-ctl store still shows the Offline TiFlash stores.

Yesterday I accidentally set TIFLASH REPLICA 1 on all of the data, but it still made no difference. This morning I started setting it back to 0 at the database level, and that is going very slowly.
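For reference, the replica changes were made with statements along these lines (mydb and mytable are placeholders, not the actual schema names):

ALTER TABLE mydb.mytable SET TIFLASH REPLICA 1;
ALTER DATABASE mydb SET TIFLASH REPLICA 0;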
【Problem: symptoms and impact】TiFlash does not sync any data
【Resource Configuration】4 × 96-core Kunpeng servers with 512 GB RAM and NVMe disks; 3 TiFlash nodes.
【Attachments: screenshots/logs/monitoring】

Log information:

What does tiup cluster show? Did the TiFlash scale-in actually complete?

Note:

If the tables being synced to TiFlash were not all un-replicated before every TiFlash node in the cluster stopped, you need to manually clean up the sync rules in PD; otherwise the TiFlash nodes cannot finish going offline.

The steps for manually cleaning up the sync rules in PD are as follows:

  1. Query all TiFlash-related data sync rules in the current PD instance.
curl http://10.3.x.xxx:2379/pd/api/v1/config/rules/group/tiflash
  2. Delete all TiFlash-related data sync rules. Taking the rule with id table-25873-r as an example, it can be deleted with the following command.
curl -v -X DELETE http://10.3.xx.xxx:2379/pd/api/v1/config/rule/tiflash/table-25873-r
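If there are a lot of leftover rules, a small loop can clear them in one go. This is only a sketch: it assumes jq is installed and that <pd-addr> is replaced with a real PD address.

for id in $(curl -s http://<pd-addr>:2379/pd/api/v1/config/rules/group/tiflash | jq -r '.[].id'); do
  # each id looks like table-25873-r; delete that rule from the tiflash group
  curl -X DELETE "http://<pd-addr>:2379/pd/api/v1/config/rule/tiflash/${id}"
done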

This is a big mess.

I can't sort this out, I feel like I'm going to have to open another thread…

Did the scale-in succeed?

  1. select STORE_ID,ADDRESS,STORE_STATE,STORE_STATE_NAME,VERSION,label from information_schema.TIKV_STORE_STATUS;
  2. Check the store information with pd-ctl

Use the two methods above to see whether any TiFlash nodes that should have been scaled in are still hanging around.
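pd-ctl can be run through tiup, for example (the version tag and PD address are placeholders for your own):

tiup ctl:v6.5.3 pd -u http://<pd-addr>:2379 store
tiup ctl:v6.5.3 pd -u http://<pd-addr>:2379 store --jq=".stores[].store | {id, address, state_name}"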

The nodes have not actually been removed.

pd-ctl shows a lot of stores in Offline state

{
  "count": 19,
  "stores": [
    {
      "store": {
        "id": 91,
        "address": "192.168.255.119:3931",
        "labels": [
          {
            "key": "engine",
            "value": "tiflash"
          }
        ],
        "version": "v6.5.3",
        "peer_address": "192.168.255.119:20171",
        "status_address": "192.168.255.119:20293",
        "git_hash": "e63e24991079fff1e5afe03e859f743cbb6cf4a7",
        "start_timestamp": 1694990902,
        "deploy_path": "/deploy/tidb/tiflash-9001/bin/tiflash",
        "last_heartbeat": 1695016186000543038,
        "state_name": "Offline"
      },
      "status": {
        "capacity": "1.718TiB",
        "available": "1.07TiB",
        "used_size": "55.47GiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 5088,
        "region_weight": 1,
        "region_score": 1402721.0927894693,
        "region_size": 1152681,
        "learner_count": 5088,
        "slow_score": 1,
        "start_ts": "2023-09-18T06:48:22+08:00",
        "last_heartbeat_ts": "2023-09-18T13:49:46.000543038+08:00",
        "uptime": "7h1m24.000543038s"
      }
    },
    {
      "store": {
        "id": 92,
        "address": "192.168.255.121:3931",
        "labels": [
          {
            "key": "engine",
            "value": "tiflash"
          }
        ],
        "version": "v6.5.3",
        "peer_address": "192.168.255.121:20171",
        "status_address": "192.168.255.121:20293",
        "git_hash": "e63e24991079fff1e5afe03e859f743cbb6cf4a7",
        "start_timestamp": 1694990955,
        "deploy_path": "/deploy/tidb/tiflash-9001/bin/tiflash",
        "last_heartbeat": 1695016328659292069,
        "state_name": "Offline"
      },
      "status": {
        "capacity": "1.718TiB",
        "available": "821.3GiB",
        "used_size": "50.54GiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 4361,
        "region_weight": 1,
        "region_score": 1402379.9771483315,
        "region_size": 1111245,
        "learner_count": 4361,
        "slow_score": 1,
        "start_ts": "2023-09-18T06:49:15+08:00",
        "last_heartbeat_ts": "2023-09-18T13:52:08.659292069+08:00",
        "uptime": "7h2m53.659292069s"
      }

The scale-in has completed, but the problem is still there: TiFlash still won't sync the table data. How should I proceed from here?

"used_size": "50.54GiB",
"region_count": 4361,

How come this scaled-out TiFlash node already has 50 GB of data and 4,361 regions on it?

select * from INFORMATION_SCHEMA.TIFLASH_SEGMENTS limit 10;

Is there anything in this table?

No, it's empty.

The Offline stores are the old nodes that were scaled in earlier.

Try this: curl -X DELETE http://0.0.0.0:2379/pd/api/v1/store/91?force=true
91 is the store id of the Offline store you want to get rid of.
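After the force delete, you can check the store's state again, for example (the PD address is a placeholder):

curl http://<pd-addr>:2379/pd/api/v1/store/91
tiup ctl:v6.5.3 pd -u http://<pd-addr>:2379 store 91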

Please post the full output of information_schema.tikv_store_status.

Yep, they're gone now. Thanks.

258496 192.168.255.122:40160 0 Up null 6.5.3 1.718TiB 953.9GiB 3049 1 3049 411046 3049 1 508486.6818209742 411046 2023-09-14 16:10:14 2023-09-19 15:38:52 119h28m38.062766264s
1 192.168.255.119:20160 0 Up [{key: zone, value: z1}] 6.5.3 1.718TiB 924.3GiB 2876 1 2876 414082 2876 1 514286.71857759176 414082 2023-09-14 14:37:03 2023-09-19 15:38:54 121h1m51.762853588s
226277 192.168.255.120:20161 0 Up null 6.5.3 1.718TiB 1.163TiB 3216 1 3216 426869 3216 1 514733.2659775779 426869 2023-09-14 15:08:04 2023-09-19 15:38:47 120h30m43.766527257s
258382 192.168.255.122:30161 0 Up null 6.5.3 1.718TiB 698.8GiB 5662 1 5662 393236 5662 1 508833.9184591647 393236 2023-09-14 15:49:22 2023-09-19 15:38:56 119h49m34.107963094s
258385 192.168.255.122:30162 0 Up null 6.5.3 1.718TiB 741.7GiB 4393 1 4393 403566 4393 1 517134.05442057207 403566 2023-09-14 15:59:58 2023-09-19 15:38:52 119h38m54.87364112s
258383 192.168.255.122:30160 0 Up null 6.5.3 1.718TiB 953.9GiB 4437 1 4437 416965 4437 1 515808.81324896513 416965 2023-09-14 15:39:02 2023-09-19 15:38:53 119h59m51.378808445s
258498 192.168.255.122:40161 0 Up null 6.5.3 1.718TiB 698.8GiB 4075 1 4075 398519 4075 1 515669.93963886664 398519 2023-09-14 16:20:29 2023-09-19 15:38:48 119h18m19.243259526s
258899 192.168.255.120:40161 0 Up null 6.5.3 1.718TiB 1.163TiB 3805 1 3805 427866 3805 1 515935.48269961274 427866 2023-09-14 17:01:41 2023-09-19 15:38:55 118h37m14.704738017s
2499975916 192.168.255.120:3932 0 Up [{key: engine, value: tiflash}] v6.5.3 3.437TiB 1.935TiB 0 1 0 0 0 1 0 0 2023-09-19 14:12:59 2023-09-19 15:38:49 1h25m50.974001582s
2 192.168.255.121:20160 0 Up [{key: zone, value: z2}] 6.5.3 1.718TiB 678.2GiB 3883 1 3883 390430 3883 1 507812.4106585259 390430 2023-09-14 14:47:27 2023-09-19 15:38:55 120h51m28.629992545s
226275 192.168.255.119:20161 0 Up null 6.5.3 1.718TiB 1.15TiB 4903 1 4903 422419 4903 1 509971.6392815095 422419 2023-09-14 14:57:50 2023-09-19 15:38:46 120h40m56.298890796s
226276 192.168.255.121:20161 0 Up null 6.5.3 1.718TiB 899.9GiB 8924 1 8924 407427 8924 1 507779.48573316267 407427 2023-09-14 15:28:33 2023-09-19 15:38:52 120h10m19.30820353s
226278 192.168.255.120:20160 0 Up null 6.5.3 1.718TiB 1.114TiB 3923 1 3923 421709 3923 1 510886.2425984309 421709 2023-09-14 15:18:20 2023-09-19 15:38:50 120h20m30.979116085s
258495 192.168.255.122:40162 0 Up null 6.5.3 1.718TiB 741.7GiB 4074 1 4074 396014 4074 1 507456.8358859969 396014 2023-09-14 16:30:47 2023-09-19 15:38:50 119h8m3.140887692s
258898 192.168.255.120:40160 0 Up null 6.5.3 1.718TiB 1.114TiB 2958 1 2958 420769 2958 1 509747.4473477929 420769 2023-09-14 16:51:18 2023-09-19 15:38:46 118h47m28.615459575s
2569081859 192.168.255.119:3932 0 Up [{key: engine, value: tiflash}] v6.5.3 3.437TiB 2.596TiB 0 1 0 0 0 1 0 0 2023-09-19 15:06:07 2023-09-19 15:38:47 32m40.300532353s
2375021399 192.168.255.121:3932 0 Up [{key: engine, value: tiflash}] v6.5.3 3.437TiB 1.437TiB 0 1 0 0 0 1 0 0 2023-09-19 14:12:43 2023-09-19 15:38:54 1h26m11.293674582s
417651984 192.168.255.119:50160 0 Up null 6.5.3 3.437TiB 3.121TiB 9794 1 9794 477407 9794 1 518462.47008923115 477407 2023-09-17 22:30:53 2023-09-19 15:38:53 41h8m0.26053394s

Wait for the Offline stores to turn into Tombstone, prune them away, and then see whether the new TiFlash nodes start syncing normally.
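Roughly like this (the PD address and cluster name are placeholders):

tiup ctl:v6.5.3 pd -u http://<pd-addr>:2379 store remove-tombstone
tiup cluster prune <cluster-name>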

https://docs.pingcap.com/zh/tidb/stable/troubleshoot-tiflash#tiflash-数据不同步
Go through the checks on this page.
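Two quick checks from that page, roughly (the PD address is a placeholder):

curl http://<pd-addr>:2379/pd/api/v1/config/rules/group/tiflash
SELECT TABLE_SCHEMA, TABLE_NAME, PROGRESS, AVAILABLE FROM information_schema.tiflash_replica;

The first should list a table-<table_id>-r rule for every table that has a TiFlash replica set; if the group is still empty after the replicas were re-added, PD never received the placement rules, which would explain PROGRESS staying at 0. The second shows whether PROGRESS is moving at all.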