v7.5.0: PD scale-in is not thorough, causing data from two clusters to get mixed together

【TiDB Environment】Production
【TiDB Version】v7.5.0
【Reproduction Path】After a PD node is scaled in successfully, the cluster keeps reconnecting to that PD node.
【Problem: symptoms and impact】If the removed node is later scaled out into a new PD cluster, the two clusters merge and the data gets mixed up.

10.25.248.131:2380 (VMS584328) previously belonged to the tikv-oversea cluster. At 2024/04/08 10:19:26 we scaled in 10.25.248.131:2380; tiup cluster display tikv-oversea then showed that 10.25.248.131:2380 had been removed, and we subsequently decommissioned server VMS584328. However, pd.log shows that tikv-oversea kept trying to connect to 10.25.248.131:2380 and reporting it as unreachable, and these errors continued until 2024/04/10.
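For reference, the scale-in and the check described above would typically be run with commands along these lines (a sketch only; tiup identifies PD nodes by their client port, so the node ID uses 2379 even though the post refers to the 2380 peer URL):

```shell
# Remove the PD node from the tikv-oversea cluster
tiup cluster scale-in tikv-oversea --node 10.25.248.131:2379

# Verify that the node no longer appears in the topology
tiup cluster display tikv-oversea
```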

On 2024/04/10 a new server, VMS602679, was brought online and happened to reuse the IP 10.25.248.131. At 2024/04/10 13:47 we scaled 10.25.248.131:2380 (VMS602679) out into the tikv-dal-test cluster, which turned tikv-dal-test into a 3+1 setup. At the same moment the 6 nodes of tikv-oversea reconnected to 10.25.248.131:2380, turning it into a 6+1 setup. The 3+1+6 then linked up: all 10 PD nodes joined into a single 10-node PD cluster, and at that point the data became corrupted.
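The gap between what tiup cluster display reports and what PD/etcd still remembers can be checked against PD directly. A minimal sketch, using one of the tikv-oversea endpoints listed below and pd-ctl's `member` subcommand, which prints the etcd members PD currently knows about:

```shell
# List etcd members as PD itself sees them; a properly removed node should not
# appear here, and a 10-member list would confirm the 3+1+6 merge described above
tiup ctl:v7.5.0 pd -u http://10.109.220.10:2379 member
```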

tikv-oversea
10.109.220.10:2379
10.109.220.9:2379
10.25.248.208:2379
10.25.248.246:2379
10.58.228.76:2379
10.58.228.86:2379

tikv-dal-test
10.58.228.37
10.109.216.124
10.25.248.212

tikv-oversea pd log:
[2024/04/07 18:37:25.977 +08:00] [INFO] [etcdutil.go:309] ["update endpoints"] [num-change=7->8] [last-endpoints="[http://10.58.228.76:2379,http://10.58.228.86:2379,http://10.109.220.9:2379,http://10.109.220.10:2379,http://10.25.248.246:2379,http://10.25.248.131:2379,http://10.25.249.164:2379]"] [endpoints="[http://10.58.228.76:2379,http://10.58.228.86:2379,http://10.109.220.10:2379,http://10.25.248.246:2379,http://10.109.220.9:2379,http://10.25.248.131:2379,http://10.25.249.164:2379,http://10.25.248.208:2379]"]
[2024/04/08 10:19:26.254 +08:00] [INFO] [cluster.go:422] ["removed member"] [cluster-id=468758231b5b0393] [local-member-id=edff54aa33575887] [removed-remote-peer-id=f67c161a4e9b9cb8] [removed-remote-peer-urls="[http://10.25.248.131:2380]"]
[2024/04/08 10:19:27.958 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {http://10.25.248.131:2379 0 }. Err :connection error: desc = \"transport: Error while dialing dial tcp 10.25.248.131:2379: connect: connection refused\". Reconnecting..."]
[2024/04/08 10:19:27.958 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {http://10.25.248.131:2379 0 }. Err :connection error: desc = \"transport: Error while dialing dial tcp 10.25.248.131:2379: connect: connection refused\". Reconnecting..."]

[2024/04/09 14:46:33.395 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {http://10.25.248.131:2379 0 }. Err :connection error: desc = \"transport: Error while dialing dial tcp 10.25.248.131:2379: connect: connection timed out\". Reconnecting..."]
[2024/04/09 14:49:25.265 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {http://10.25.248.131:2379 0 }. Err :connection error: desc = \"transport: Error while dialing dial tcp 10.25.248.131:2379: i/o timeout\". Reconnecting..."]

[2024/04/10 13:44:05.323 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {http://10.25.248.131:2379 0 }. Err :connection error: desc = \"transport: Error while dialing dial tcp 10.25.248.131:2379: connect: connection refused\". Reconnecting..."]
[2024/04/10 13:45:57.545 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {http://10.25.248.131:2379 0 }. Err :connection error: desc = \"transport: Error while dialing dial tcp 10.25.248.131:2379: connect: connection refused\". Reconnecting..."]
[2024/04/10 13:46:21.890 +08:00] [WARN] [grpclog.go:60] ["grpc: addrConn.createTransport failed to connect to {http://10.25.248.131:2379 0 }. Err :connection error: desc = \"transport: Error while dialing dial tcp 10.25.248.131:2379: connect: connection refused\". Reconnecting..."]
[2024/04/10 13:47:58.088 +08:00] [INFO] [etcdutil.go:309] ["update endpoints"] [num-change=6->7] [last-endpoints="[http://10.25.248.246:2379,http://10.58.228.76:2379,http://10.25.248.208:2379,http://10.58.228.86:2379,http://10.109.220.10:2379,http://10.109.220.9:2379]"] [endpoints="[http://10.58.228.76:2379,http://10.58.228.86:2379,http://10.109.220.10:2379,http://10.109.220.9:2379,http://10.25.248.208:2379,http://10.25.248.246:2379,http://10.25.248.131:2379]"]
[2024/04/10 13:48:08.085 +08:00] [INFO] [etcdutil.go:309] ["update endpoints"] [num-change=6->7] [last-endpoints="[http://10.58.228.76:2379,http://10.58.228.86:2379,http://10.109.220.9:2379,http://10.109.220.10:2379,http://10.25.248.246:2379,http://10.25.248.208:2379]"] [endpoints="[http://10.58.228.86:2379,http://10.109.220.10:2379,http://10.25.248.208:2379,http://10.58.228.76:2379,http://10.109.220.9:2379,http://10.25.248.246:2379,http://10.25.248.131:2379]"]
[2024/04/10 13:48:18.090 +08:00] [INFO] [etcdutil.go:309] ["update endpoints"] [num-change=7->10] [last-endpoints="[http://10.58.228.76:2379,http://10.58.228.86:2379,http://10.109.220.10:2379,http://10.109.220.9:2379,http://10.25.248.208:2379,http://10.25.248.246:2379,http://10.25.248.131:2379]"] [endpoints="[http://10.109.220.10:2379,http://10.58.228.76:2379,http://10.109.220.9:2379,http://10.58.228.86:2379,http://10.109.216.124:2379,http://10.25.248.212:2379,http://10.25.248.208:2379,http://10.58.228.37:2379,http://10.25.248.246:2379,http://10.25.248.131:2379]"]

In subsequent testing, after scaling in a PD node, the remaining PD nodes keep trying to reconnect to the removed PD node and print WARN errors.
Transferring the PD leader does not stop the errors; only after a reload do they disappear (commands sketched after the list below).
Two requests:

  1. tiup cluster display shows the PD node as deleted, with no tombstone state, which misleads operators into believing the PD node has been taken offline completely.
  2. PD should not keep reconnecting to a removed PD node; this should also be achievable once the updated membership metadata has been pushed out.
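For context, the leader transfer and reload mentioned above can be done roughly as follows (member and cluster names are illustrative, following tiup's default pd-<host>-<port> naming):

```shell
# Transfer the PD leader to another member; in our case the WARN spam continued after this
tiup ctl:v7.5.0 pd -u http://10.109.220.10:2379 member leader transfer pd-10.109.220.9-2379

# Reload the PD components; after this the reconnect errors stopped
tiup cluster reload tikv-oversea -R pd
```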

They cannot actually merge: different clusters have different cluster_id values, so at most you would just see errors.
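If you want to verify this, the cluster ID each PD reports can be compared directly; a sketch, using one endpoint from each of the two clusters above and assuming the default 2379 client port:

```shell
# Print the cluster info (including the cluster ID) that each PD reports;
# different IDs mean the members do not belong to one logical cluster
tiup ctl:v7.5.0 pd -u http://10.109.220.10:2379 cluster
curl http://10.58.228.37:2379/pd/api/v1/cluster
```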


Is the output from the scale-in process still available?

The official docs are missing the corresponding explanation and procedure: Update scale-tidb-using-tiup.md by easonn7 · Pull Request #16743 · pingcap/docs-cn
Just wait for that PR to be merged.

This issue is partially fixed in tiup 1.15 by "update prometheus config when scale in by Yujie-Xie · Pull Request #2387 · pingcap/tiup · GitHub", namely refreshing the Prometheus configuration on scale-in/scale-out.
Today I carefully reviewed the related logic in tiup and found that when scaling in a PD node we do refresh the configuration information, but we do not refresh the run scripts in the cluster. We will improve the official docs for this part, and in the long term fix it in tiup itself.
When there are related issues and PRs, they will be posted here.
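Until that fix lands, the stale member list baked into the run scripts can be inspected (and regenerated with a reload) on a surviving PD node; a sketch, assuming tiup's default deploy path, which may differ in your environment:

```shell
# The member list written into run_pd.sh at deploy/scale-out time is not
# refreshed by a PD scale-in; check what it still contains
grep -E -- '--initial-cluster|--join' /tidb-deploy/pd-2379/scripts/run_pd.sh
```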


💯 💯 💯

Impressive 👍👍👍

Learned something new.


Learned something new.

This feels like a bug.

Learned something new.

Anyone running clusters on the same internal network could make this mistake.

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.