3-node TiKV test environment: after rm-ing two nodes and recovering, one node cannot rejoin the cluster


  • [TiDB version]: v3.0.6
  • [Problem description]:
    In a 3-node TiKV test environment, I rm-ed two of the nodes and then tried to recover. After running unsafe-recover remove-fail-stores against the two failed TiKV stores (a sketch of that recovery command follows the symptoms below) and starting the cluster, one of the deleted TiKV nodes came up normally after a re-deploy, but the other TiKV node will not start. Its deploy directory contains nothing but an empty data folder, and starting it through ansible reports the error shown in the screenshot below.

    [Screenshot: the deploy directory contains no data]
    pd-ctl shows no information for the third TiKV store.

The log directory on the third TiKV node is also empty.
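For reference, the unsafe-recover step described above is normally run with tikv-ctl while the surviving TiKV process is stopped. A minimal sketch, where the data path and store IDs are placeholders rather than this cluster's actual values, and where --all-regions may need to be replaced by explicit -r region IDs depending on the tikv-ctl version:

    # Run on the surviving TiKV node with its tikv-server process stopped.
    # <deploy_dir> and the failed store IDs are placeholders.
    tikv-ctl --db <deploy_dir>/data/db unsafe-recover remove-fail-stores -s <failed_store_id_1>,<failed_store_id_2> --all-regions

After that the cluster is started again, which matches the sequence described above.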

I also tried wiping and re-initializing the third TiKV node and re-adding it exactly as a normal TiKV scale-out (roughly the procedure sketched below), but it still fails with the same error.
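For reference, this is a rough sketch of the standard TiDB Ansible scale-out steps that the attempt followed; <new_tikv_ip> is a placeholder for the third node's address, not the real value:

    # Add the new node's IP under [tikv_servers] in inventory.ini first.
    ansible-playbook bootstrap.yml -l <new_tikv_ip>
    ansible-playbook deploy.yml -l <new_tikv_ip>
    ansible-playbook start.yml -l <new_tikv_ip>
    # Optionally refresh monitoring targets afterwards:
    ansible-playbook rolling_update_monitor.yml --tags=prometheus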


Use pd-ctl to run the following checks:

  1. pd-ctl -u <endpoint> -d region --jq '.regions[]|select(has("leader")|not)|{id: .id, peer_stores: [.peers[].store_id]}'
  2. pd-ctl -u <endpoint> region check down-peer

» region check down-peer
{ "count": 0, "regions": [] }

[tidb@tidb01 bin]$ ./pd-ctl -u 10.16.160.57:2379 -i
» region --jq=".regions[]|select(has("leader")|not)|{id: .id, peer_stores: [.peers[].store_id]}"
exec: "jq": executable file not found in $PATH
» write |1: file already closed

The first command failed to execute.
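The --jq option of pd-ctl shells out to a local jq binary, and the error shows jq is not in $PATH on that machine. A possible workaround, assuming jq can be installed on the control machine, is to pipe the plain region output through jq directly:

    # Fetch all regions in detach mode and filter for regions without a leader.
    # Requires a jq binary installed where pd-ctl is run.
    ./pd-ctl -u 10.16.160.57:2379 -d region | jq '.regions[] | select(has("leader") | not) | {id: .id, peer_stores: [.peers[].store_id]}'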

» region check miss-peer
{
  "count": 22,
  "regions": [
    { "id": 20, "start_key": "7480000000000000FF1100000000000000F8", "end_key": "7480000000000000FF1300000000000000F8", "epoch": { "conf_ver": 15, "version": 9 }, "peers": [ { "id": 21, "store_id": 1 }, { "id": 5009, "store_id": 5001 } ], "leader": { "id": 5009, "store_id": 5001 }, "approximate_size": 1 },
    { "id": 30, "start_key": "7480000000000000FF1B00000000000000F8", "end_key": "7480000000000000FF1D00000000000000F8", "epoch": { "conf_ver": 15, "version": 14 }, "peers": [ { "id": 31, "store_id": 1 }, { "id": 5014, "store_id": 5001 } ], "leader": { "id": 5014, "store_id": 5001 }, "approximate_size": 1 },
    { "id": 1009, "start_key": "7480000000000000FF2900000000000000F8", "end_key": "7480000000000000FF2B00000000000000F8", "epoch": { "conf_ver": 15, "version": 21 }, "peers": [ { "id": 1010, "store_id": 1 }, { "id": 5021, "store_id": 5001 } ], "leader": { "id": 5021, "store_id": 5001 }, "approximate_size": 1 },
    { "id": 34, "start_key": "7480000000000000FF1F00000000000000F8", "end_key": "7480000000000000FF2100000000000000F8", "epoch": { "conf_ver": 15, "version": 16 }, "peers": [ { "id": 35, "store_id": 1 }, { "id": 5016, "store_id": 5001 } ], "leader": { "id": 5016, "store_id": 5001 }, "approximate_size": 1 },
    { "id": 1001, "start_key": "7480000000000000FF2500000000000000F8", "end_key": "7480000000000000FF2700000000000000F8", "epoch": { "conf_ver": 15, "version": 19 }, "peers": [ { "id": 1002, "store_id": 1 }, { "id": 5019, "store_id": 5001 } ], "leader": { "id": 5019, "store_id": 5001 }, "approximate_size": 1, "approximate_keys": 2 },
    { "id": 26, "start_key": "7480000000000000FF1700000000000000F8", "end_key": "7480000000000000FF1900000000000000F8", "epoch": { "conf_ver": 15, "version": 12 }, "peers": [ { "id": 27, "store_id": 1 }, { "id": 5012, "store_id": 5001 } ], "leader": { "id": 5012, "store_id": 5001 }, "approximate_size": 1, "approximate_keys": 8 },
    { "id": 8, "start_key": "7480000000000000FF0500000000000000F8", "end_key": "7480000000000000FF0700000000000000F8", "epoch": { "conf_ver": 15, "version": 3 }, "peers": [ { "id": 9, "store_id": 1 }, { "id": 5003, "store_id": 5001 } ], "leader": { "id": 5003, "store_id": 5001 }, "approximate_size": 1, "approximate_keys": 2 },
    { "id": 4045, "start_key": "7480000000000000FF2B00000000000000F8", "end_key": "7480000000000000FF2F00000000000000F8", "epoch": { "conf_ver": 15, "version": 22 }, "peers": [ { "id": 4046, "store_id": 1 }, { "id": 5022, "store_id": 5001 } ], "leader": { "id": 5022, "store_id": 5001 }, "approximate_size": 1, "approximate_keys": 7 },
    { "id": 40, "start_key": "7480000000000000FF2300000000000000F8", "end_key": "7480000000000000FF2500000000000000F8", "epoch": { "conf_ver": 15, "version": 18 }, "peers": [ { "id": 41, "store_id": 1 }, { "id": 5018, "store_id": 5001 } ], "leader": { "id": 5018, "store_id": 5001 }, "approximate_size": 1 },
    { "id": 1005, "start_key": "7480000000000000FF2700000000000000F8", "end_key": "7480000000000000FF2900000000000000F8", "epoch": { "conf_ver": 15, "version": 20 }, "peers": [ { "id": 1006, "store_id": 1 }, { "id": 5020, "store_id": 5001 } ], "leader": { "id": 5020, "store_id": 5001 }, "approximate_size": 1 },
    { "id": 12, "start_key": "7480000000000000FF0900000000000000F8", "end_key": "7480000000000000FF0B00000000000000F8", "epoch": { "conf_ver": 15, "version": 5 }, "peers": [ { "id": 13, "store_id": 1 }, { "id": 5005, "store_id": 5001 } ], "leader": { "id": 5005, "store_id": 5001 }, "approximate_size": 1 },
    { "id": 32, "start_key": "7480000000000000FF1D00000000000000F8", "end_key": "7480000000000000FF1F00000000000000F8", "epoch": { "conf_ver": 15, "version": 15 }, "peers": [ { "id": 33, "store_id": 1 }, { "id": 5015, "store_id": 5001 } ], "leader": { "id": 5015, "store_id": 5001 }, "approximate_size": 1 },
    { "id": 24, "start_key": "7480000000000000FF1500000000000000F8", "end_key": "7480000000000000FF1700000000000000F8", "epoch": { "conf_ver": 15, "version": 11 }, "peers": [ { "id": 25, "store_id": 1 }, { "id": 5011, "store_id": 5001 } ], "leader": { "id": 5011, "store_id": 5001 }, "approximate_size": 1, "approximate_keys": 32 },
    { "id": 16, "start_key": "7480000000000000FF0D00000000000000F8", "end_key": "7480000000000000FF0F00000000000000F8", "epoch": { "conf_ver": 15, "version": 7 }, "peers": [ { "id": 17, "store_id": 1 }, { "id": 5007, "store_id": 5001 } ], "leader": { "id": 5007, "store_id": 5001 }, "approximate_size": 1, "approximate_keys": 1040 },
    { "id": 6, "start_key": "", "end_key": "7480000000000000FF0500000000000000F8", "epoch": { "conf_ver": 15, "version": 2 }, "peers": [ { "id": 7, "store_id": 1 }, { "id": 5002, "store_id": 5001 } ], "leader": { "id": 5002, "store_id": 5001 }, "written_bytes": 37, "read_bytes": 268, "approximate_size": 1, "approximate_keys": 161 },
    { "id": 10, "start_key": "7480000000000000FF0700000000000000F8", "end_key": "7480000000000000FF0900000000000000F8", "epoch": { "conf_ver": 15, "version": 4 }, "peers": [ { "id": 11, "store_id": 1 }, { "id": 5004, "store_id": 5001 } ], "leader": { "id": 5004, "store_id": 5001 }, "approximate_size": 1 },
    { "id": 14, "start_key": "7480000000000000FF0B00000000000000F8", "end_key": "7480000000000000FF0D00000000000000F8", "epoch": { "conf_ver": 15, "version": 6 }, "peers": [ { "id": 15, "store_id": 1 }, { "id": 5006, "store_id": 5001 } ], "leader": { "id": 5006, "store_id": 5001 }, "approximate_size": 1 },
    { "id": 22, "start_key": "7480000000000000FF1300000000000000F8", "end_key": "7480000000000000FF1500000000000000F8", "epoch": { "conf_ver": 15, "version": 10 }, "peers": [ { "id": 23, "store_id": 1 }, { "id": 5010, "store_id": 5001 } ], "leader": { "id": 5010, "store_id": 5001 }, "approximate_size": 1, "approximate_keys": 30 },
    { "id": 2, "start_key": "7480000000000000FF2F00000000000000F8", "end_key": "", "epoch": { "conf_ver": 15, "version": 22 }, "peers": [ { "id": 3, "store_id": 1 }, { "id": 5023, "store_id": 5001 } ], "leader": { "id": 3, "store_id": 1 }, "approximate_size": 42, "approximate_keys": 162350 },
    { "id": 28, "start_key": "7480000000000000FF1900000000000000F8", "end_key": "7480000000000000FF1B00000000000000F8", "epoch": { "conf_ver": 15, "version": 13 }, "peers": [ { "id": 29, "store_id": 1 }, { "id": 5013, "store_id": 5001 } ], "leader": { "id": 5013, "store_id": 5001 }, "approximate_size": 1 },
    { "id": 37, "start_key": "7480000000000000FF2100000000000000F8", "end_key": "7480000000000000FF2300000000000000F8", "epoch": { "conf_ver": 15, "version": 17 }, "peers": [ { "id": 38, "store_id": 1 }, { "id": 5017, "store_id": 5001 } ], "leader": { "id": 5017, "store_id": 5001 }, "approximate_size": 1 },
    { "id": 18, "start_key": "7480000000000000FF0F00000000000000F8", "end_key": "7480000000000000FF1100000000000000F8", "epoch": { "conf_ver": 15, "version": 8 }, "peers": [ { "id": 19, "store_id": 1 }, { "id": 5008, "store_id": 5001 } ], "leader": { "id": 5008, "store_id": 5001 }, "written_bytes": 371, "read_bytes": 309, "approximate_size": 1, "approximate_keys": 120 }
  ]
}

With 3 nodes, after rm-ing two of them the data on those two nodes is lost, and the single remaining node cannot serve requests on its own because the Raft majority is gone. If it is a case of two nodes being down, you can follow this post to recover: 周五的暴击:TiKV 节点宕机无法正常启动之后
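Before attempting any recovery it also helps to confirm which stores PD still considers alive. A minimal check, with the endpoint as a placeholder:

    # Lists every store with its state_name (Up / Disconnected / Down / Tombstone).
    pd-ctl -u <endpoint> -d store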

Or take a look at this post: 三个tikv节点,rm删除其中两个tikv的deploy目录后,tidb依然正常工作,重新添加节点时失败

This is a bit embarrassing: I just noticed the cluster has recovered on its own, and the tikv3 node now has data. I checked the tikv3 node's logs and saw nothing abnormal; everything is back to normal, and I genuinely did not do anything, I was off studying something else. Earlier both the log directory and the data directory under tikv3 were empty, and pd-ctl showed only two stores. After reading the post you linked I went to try it, and found the latest store list now shows three stores.


I just pulled the logs from the PD node; the log entries from that same time window are shown in the screenshot above.

What could have caused this?

Hmm, from your description above, the tikv3 scale-out should have failed at first, and then the cluster recovered by itself after a while? If the scale-out had failed, tikv3 should not have started successfully, right?

Right, there was no trace of it starting earlier. The logs show TiKV started at 11:58, but I was not touching the cluster at that time. Before that, pd-ctl had always shown only two stores in the Up state. I am wondering whether this was caused by my test environment being under-powered; it really is very, very slow.

Hmm, that is possible. For testing we still recommend deploying on the officially recommended hardware.

:rofl::rofl::rofl::rofl:

For functional testing you can run it with Docker; for performance testing, the recommended hardware is still advised: https://pingcap.com/docs-cn/stable/how-to/deploy/hardware-recommendations/
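For a functional-test cluster with Docker, a minimal sketch using the pingcap/tidb-docker-compose repository (assuming Docker and Docker Compose are already installed):

    # Spin up a local PD/TiKV/TiDB test cluster; for functional testing only.
    git clone https://github.com/pingcap/tidb-docker-compose.git
    cd tidb-docker-compose
    docker-compose pull
    docker-compose up -d
    # Connect once TiDB is up:
    mysql -h 127.0.0.1 -P 4000 -u root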

This topic was automatically closed 1 minute after the last reply. New replies are no longer allowed.