TiDB 3.0.2 PD scale-out: [error] rafthttp: [request sent was ignored (cluster ID mismatch: remote[

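Scenario

I am adding a PD node to an existing cluster. Existing node addresses: 192.168.180.46, 192.168.180.47, 192.168.180.48. New node IP address: 192.168.181.18

What I did

Scaled out PD, following the documentation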




  1. [tidb@test1 scripts]$ ansible-playbook bootstrap.yml -l 192.168.181.18 --extra-vars "dev_mode=True"

  2. [tidb@test1 scripts]$ ansible-playbook deploy.yml -l 192.168.181.18

  3. Log in to node test4 and edit the node's configuration

[tidb@test4 scripts]$ vim run_pd.sh
#!/bin/bash
set -e
ulimit -n 1000000

# WARNING: This file was auto-generated. Do not edit!
#          All your edit might be overwritten!
DEPLOY_DIR=/home/tidb/deploy

cd "${DEPLOY_DIR}" || exit 1



exec bin/pd-server \
    --name="pd_test4" \
    --client-urls="http://192.168.181.18:2379" \
    --advertise-client-urls="http://192.168.181.18:2379" \
    --peer-urls="http://192.168.181.18:2380" \
    --advertise-peer-urls="http://192.168.181.18:2380" \
    --data-dir="/home/tidb/deploy/data.pd" \
    --join="http://192.168.180.48:2379" \
    --config=conf/pd.toml \
    --log-file="/home/tidb/deploy/log/pd.log" 2>> "/home/tidb/deploy/log/pd_stderr.log"

  4. Start PD
[tidb@test4 scripts]$ ./start_pd.sh
[tidb@test4 scripts]$
  5. Check pd.log
[tidb@test4 log]$ tail -1000f pd.log
......
[2019/08/27 17:08:16.247 +08:00] [WARN] [stream.go:681] ["request sent was ignored by remote peer due to cluster ID mismatch"] [remote-peer-id=fb02ae473bb0b305] [remote-peer-cluster-id=21d02ab059590cc5] [local-member-id=2029038589c20d50] [local-member-cluster-id=c7561f62a4c47cc7] [error="cluster ID mismatch"]
[2019/08/27 17:08:16.247 +08:00] [WARN] [stream.go:681] ["request sent was ignored by remote peer due to cluster ID mismatch"] [remote-peer-id=fb02ae473bb0b305] [remote-peer-cluster-id=21d02ab059590cc5] [local-member-id=2029038589c20d50] [local-member-cluster-id=c7561f62a4c47cc7] [error="cluster ID mismatch"]
[2019/08/27 17:08:16.320 +08:00] [INFO] [raft.go:922] ["2029038589c20d50 is starting a new election at term 1"]
[2019/08/27 17:08:16.320 +08:00] [INFO] [raft.go:741] ["2029038589c20d50 became pre-candidate at term 1"]
[2019/08/27 17:08:16.320 +08:00] [INFO] [raft.go:820] ["2029038589c20d50 received MsgPreVoteResp from 2029038589c20d50 at term 1"]
[2019/08/27 17:08:16.320 +08:00] [INFO] [raft.go:807] ["2029038589c20d50 [logterm: 1, index: 4] sent MsgPreVote request to fb02ae473bb0b305 at term 1"]
[2019/08/27 17:08:16.320 +08:00] [INFO] [raft.go:807] ["2029038589c20d50 [logterm: 1, index: 4] sent MsgPreVote request to 474a6e7996dd50a6 at term 1"]
[2019/08/27 17:08:16.320 +08:00] [INFO] [raft.go:807] ["2029038589c20d50 [logterm: 1, index: 4] sent MsgPreVote request to 50003c586d7591bc at term 1"]
2019/08/27 17:08:16.321 log.go:84: [error] rafthttp: [request sent was ignored (cluster ID mismatch: remote[fb02ae473bb0b305]=21d02ab059590cc5, local=c7561f62a4c47cc7)]
2019/08/27 17:08:16.321 log.go:84: [error] rafthttp: [request sent was ignored (cluster ID mismatch: remote[50003c586d7591bc]=21d02ab059590cc5, local=c7561f62a4c47cc7)]
2019/08/27 17:08:16.321 log.go:84: [error] rafthttp: [request sent was ignored (cluster ID mismatch: remote[474a6e7996dd50a6]=21d02ab059590cc5, local=c7561f62a4c47cc7)]
[2019/08/27 17:08:16.342 +08:00] [WARN] [stream.go:681] ["request sent was ignored by remote peer due to cluster ID mismatch"] [remote-peer-id=474a6e7996dd50a6] [remote-peer-cluster-id=21d02ab059590cc5] [local-member-id=2029038589c20d50] [local-member-cluster-id=c7561f62a4c47cc7] [error="cluster ID mismatch"]
[2019/08/27 17:08:16.342 +08:00] [WARN] [stream.go:681] ["request sent was ignored by remote peer due to cluster ID mismatch"] [remote-peer-id=474a6e7996dd50a6] [remote-peer-cluster-id=21d02ab059590cc5] [local-member-id=2029038589c20d50] [local-member-cluster-id=c7561f62a4c47cc7] [error="cluster ID mismatch"]
......
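
To see at a glance which cluster IDs are involved in a log like the one above, the distinct IDs can be pulled out with grep. This is only a sketch: the helper name `cluster_ids` is made up for this post, and it assumes the `remote-peer-cluster-id=` / `local-member-cluster-id=` field names shown in the WARN lines and a grep that supports `-o`.

```shell
# Print each distinct cluster ID found in PD log text read from stdin,
# labelled by the field it came from (remote peer vs. local member).
# "cluster_ids" is a hypothetical helper name, not part of PD or tidb-ansible.
cluster_ids() {
    grep -oE '(remote-peer|local-member)-cluster-id=[0-9a-f]+' | sort -u
}

# Usage (path taken from run_pd.sh above):
#   cluster_ids < /home/tidb/deploy/log/pd.log
```

For the excerpt above this yields one local ID (c7561f62a4c47cc7) and one remote ID (21d02ab059590cc5), i.e. the new node believes it belongs to a different cluster than the three existing members.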
I tried redeploying PD, and a new error appeared
1. Stop all PD instances
[tidb@test1 tidb-ansible]$ ansible-playbook stop.yml --tags=pd

2. Clean up the PD data
[tidb@test1 tidb-ansible]$ ansible-playbook unsafe_cleanup_data.yml --tags=pd

3. Redeploy
[tidb@test1 tidb-ansible]$ ansible-playbook deploy.yml --tags=pd

4. Start the cluster
[tidb@test1 tidb-ansible]$ ansible-playbook start.yml --tags=pd

5. Log in to the node and check the PD log

[2019/08/27 17:53:09.839 +08:00] [WARN] [stream.go:681] ["request sent was ignored by remote peer due to cluster ID mismatch"] [remote-peer-id=50003c586d7591bc] [remote-peer-cluster-id=21d02ab059590cc5] [local-member-id=2029038589c20d50] [local-member-cluster-id=c7561f62a4c47cc7] [error="cluster ID mismatch"]
[2019/08/27 17:53:09.839 +08:00] [FATAL] [main.go:111] ["run server failed"] [error="Etcd cluster ID mismatch, expect 14363702570076372167, got 2436494335309057221"] [stack="github.com/pingcap/log.Fatal
	/home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/github.com/pingcap/log@v0.0.0-20190715063458-479153f07ebd/global.go:59
main.main
	/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:111
runtime.main
	/usr/local/go/src/runtime/proc.go:200"]
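
A note on reading the FATAL line: it prints the two cluster IDs in decimal, while the WARN lines print them in hex. Converting them shows they are the same pair of IDs. A minimal sketch, assuming a shell whose `printf` accepts unsigned 64-bit values (bash does); `dec_to_hex` is just an illustrative name:

```shell
# Convert a decimal cluster ID (as printed in the FATAL line) to the
# hex form used in the WARN lines. Hypothetical helper for this post.
dec_to_hex() {
    printf '%x\n' "$1"
}

dec_to_hex 14363702570076372167   # "expect" → c7561f62a4c47cc7 (local-member-cluster-id)
dec_to_hex 2436494335309057221    # "got"    → 21d02ab059590cc5 (remote-peer-cluster-id)
```

So the ID the new node expects is the one stored locally on it, and the ID it got is the one the existing members report.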

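See: https://pingcap.com/docs-cn/v3.0/faq/tidb/#3-2-2-pd-启动报错-etcd-cluster-id-mismatch

Hello, I have already read the link you posted. The thing is, the node reporting the error is the new node I am trying to add. The other nodes are all fine.

A PD cluster cannot be cleaned up and then redeployed. If the previous cluster data is gone, it has to be restored with the pd-recover tool.

So the problem with scaling out the new PD node is also caused by missing data (equivalent to a node in the cluster having lost its data), right?

The new PD node may not have actually joined the existing cluster. Use pd-ctl to check whether the new node was added successfully: /home/tidb/tidb-ansible/resources/bin/pd-ctl -u "http://172.16.10.1:2379" -d member If it was not, check whether the new node logged any errors that prevented it from joining.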
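
The pd-ctl member check suggested above can be turned into a yes/no test. This is a rough sketch rather than an official tool: `has_member` is a made-up helper, and it assumes that `pd-ctl ... member` prints JSON in which each member's client URL appears as a quoted string.

```shell
# Succeed if the given client URL appears as a quoted string in the
# JSON read from stdin. Hypothetical helper; the JSON shape is an assumption.
has_member() {
    grep -qF "\"$1\""
}

# Usage, with the pd-ctl command from the reply above:
#   /home/tidb/tidb-ansible/resources/bin/pd-ctl -u "http://172.16.10.1:2379" -d member \
#       | has_member "http://192.168.181.18:2379" && echo joined || echo "not joined"
```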

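I followed the official documentation (I added the screenshot to the original question above). The error occurred as soon as the new node tried to join.

Thanks for your help. Here is the solution that finally worked. It mostly follows the official documentation, with just one extra step.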




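Scaling out a PD node

Goal: add the new node 172.160.180.18

  1. Edit the inventory.ini file and add the node information

  2. Clean up the PD data (skip this step if PD was never installed on the node)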

[tidb@test1 tidb-ansible]$ ansible-playbook unsafe_cleanup.yml --tags=pd -l 172.160.180.18
  3. Redeploy
[tidb@test1 tidb-ansible]$ ansible-playbook bootstrap.yml -l 172.160.180.18 --extra-vars "dev_mode=True"

Congrats! All goes well. :-)
[tidb@test1 tidb-ansible]$
[tidb@test1 tidb-ansible]$ ansible-playbook deploy.yml --tags=pd -l 172.160.180.18

Congrats! All goes well. :-)
[tidb@test1 tidb-ansible]$
  4. Log in to node test4 and edit the node's configuration
# Example (the trailing "#" notes are annotations for this post, not part of the file)
......
exec bin/pd-server \
    --name="pd_test4" \
    --client-urls="http://172.160.180.18:2379" \
    --advertise-client-urls="http://172.160.180.18:2379" \
    --peer-urls="http://172.160.180.18:2380" \
    --advertise-peer-urls="http://172.160.180.18:2380" \
    --data-dir="/home/tidb/deploy/data.pd" \
    --initial-cluster="pd_test1=http://172.160.180.46:2380,pd_test2=http://172.160.180.47:2380,pd_test3=http://172.160.180.48:2380,pd_test4=http://172.160.180.18:2380" \  # Delete this line entirely; commenting it out is not enough
    --join="http://172.160.180.48:2379" \   # Add this line in its place; the IP can be any member of the existing PD cluster
    --config=conf/pd.toml \
    --log-file="/home/tidb/deploy/log/pd.log" 2>> "/home/tidb/deploy/log/pd_stderr.log"
......
[tidb@test4 ~]$ vim /home/tidb/deploy/scripts/run_pd.sh
#!/bin/bash
set -e
ulimit -n 1000000

# WARNING: This file was auto-generated. Do not edit!
#          All your edit might be overwritten!
DEPLOY_DIR=/home/tidb/deploy

cd "${DEPLOY_DIR}" || exit 1



exec bin/pd-server \
    --name="pd_test4" \
    --client-urls="http://172.160.180.18:2379" \
    --advertise-client-urls="http://172.160.180.18:2379" \
    --peer-urls="http://172.160.180.18:2380" \
    --advertise-peer-urls="http://172.160.180.18:2380" \
    --data-dir="/home/tidb/deploy/data.pd" \
    --join="http://172.160.180.48:2379" \
    --config=conf/pd.toml \
    --log-file="/home/tidb/deploy/log/pd.log" 2>> "/home/tidb/deploy/log/pd_stderr.log"
  5. Delete the old PD data on the node (skip this step if PD was never installed on the node)
# Delete the PD data directory
[tidb@test4 ~]$ rm -rf /home/tidb/deploy/data.pd/
# Delete the old PD logs
[tidb@test4 ~]$ rm -rf /home/tidb/deploy/log/pd*
  6. Start PD
[tidb@test4 ~]$ /home/tidb/deploy/scripts/start_pd.sh
  7. On the control machine, check whether the PD node joined successfully
[tidb@test1 tidb-ansible]$ /home/tidb/tidb-ansible/resources/bin/pd-ctl -u "http://172.160.180.46:2379" -d member
  8. Rolling-update the whole cluster
[tidb@test1 tidb-ansible]$ ansible-playbook rolling_update.yml
  9. Start PD
[tidb@test1 tidb-ansible]$ ansible-playbook start.yml --tags=pd -l 172.160.180.18
  10. Update monitoring
[tidb@test1 tidb-ansible]$ ansible-playbook rolling_update_monitor.yml --tags=prometheus
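
The one extra edit in the steps above (swapping `--initial-cluster` for `--join` in run_pd.sh) could also be done with sed instead of by hand. This is only a sketch under a few assumptions: `switch_to_join` is a made-up helper name, the argument lines in run_pd.sh are assumed to end with a backslash continuation, and the result is written to stdout so nothing is changed until you review it.

```shell
# Read a run_pd.sh-style script on stdin and replace the whole
# --initial-cluster=... argument line with a --join=... line,
# keeping the indentation and the trailing backslash continuation.
# $1: client URL of any member of the existing PD cluster.
# Hypothetical helper; not part of tidb-ansible.
switch_to_join() {
    sed 's|^\([[:space:]]*\)--initial-cluster=.*|\1--join="'"$1"'" \\|'
}

# Usage (review run_pd.sh.new before putting it in place):
#   cd /home/tidb/deploy/scripts
#   switch_to_join "http://172.160.180.48:2379" < run_pd.sh > run_pd.sh.new
```

Keep in mind the warning in the script header: run_pd.sh is auto-generated, so the edit may be overwritten the next time the node is deployed.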