Adding a pump node hangs at "wait until the pump port is up" and the pump fails to start

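To speed things up, please provide as much background as possible when asking a question; clearly described problems get priority. Please try to include the following: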

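  • OS version & kernel version:
  • TiDB version: 2.1
  • Disk model:
  • Cluster node layout:
  • Data volume & number of regions & number of replicas:
  • Cluster QPS, .999-Duration, read/write ratio:
  • Problem description (what I did): Adding a pump node hangs at "wait until the pump port is up" and the pump cannot start. What is going on? I previously added pump and drainer nodes, but they became unusable after a disk failure. Do I need to take those pump and drainer nodes offline first, and if so, how?

  1. For the newly added node, check whether the pump process actually started and whether the pump log shows any errors.
  2. For operating the old nodes, see the documentation: https://pingcap.com/docs-cn/v3.0/how-to/maintain/tidb-binlog/#pumpdrainer-的启动退出流程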

Documentation I followed:
https://pingcap.com/docs-cn/v2.1/how-to/maintain/tidb-binlog/

Query the pump status:
[tidb@hk-tidb-01 bin]$ ./binlogctl -pd-urls=http://192.168.10.73:2379 -cmd pumps
[2019/09/27 08:40:46.198 +08:00] [INFO] [nodes.go:47] [“query node”] [type=pump] [node=“{NodeID: hk-tipump-101:8250, Addr: 192.168.10.83:8250, State: paused, MaxCommitTS: 0, UpdateTime: 2019-09-27 08:40:37 +0800 CST}”]

Take the pump offline:
[tidb@hk-tidb-01 bin]$ ./binlogctl -pd-urls=http://192.168.10.73:2379 -cmd offline-pump -hk-tipump-101:8250 ip-192.168.10.83:8250

Error:
flag provided but not defined: -hk-tipump-101:8250

[2019/09/27 08:44:08.610 +08:00] [ERROR] [main.go:38] [“parse cmd flags”] [error=“flag provided but not defined: -hk-tipump-101:8250”] [errorVerbose=“flag provided but not defined: -hk-tipump-101:8250
github.com/pingcap/errors.AddStack
/home/jenkins/workspace/release_tidb_2.1-ga/go/pkg/mod/github.com/pingcap/errors@v0.11.1/errors.go:174
github.com/pingcap/errors.Trace
/home/jenkins/workspace/release_tidb_2.1-ga/go/pkg/mod/github.com/pingcap/errors@v0.11.1/juju_adaptor.go:15
main.(*Config).Parse
/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb-tools/tidb-binlog/binlogctl/config.go:83
main.main
/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb-tools/tidb-binlog/binlogctl/main.go:32
runtime.main
/usr/local/go/src/runtime/proc.go:200
runtime.goexit
/usr/local/go/src/runtime/asm_amd64.s:1337”] [stack=“github.com/pingcap/log.Error
/home/jenkins/workspace/release_tidb_2.1-ga/go/pkg/mod/github.com/pingcap/log@v0.0.0-20190307075452-bd41d9273596/global.go:42
main.main
/home/jenkins/workspace/release_tidb_2.1-ga/go/src/github.com/pingcap/tidb-tools/tidb-binlog/binlogctl/main.go:38
runtime.main
/usr/local/go/src/runtime/proc.go:200”]

For example, to first take a drainer that is in the online state offline, I use this template:

bin/binlogctl -pd-urls=http://127.0.0.1:2379 -cmd pause-drainer -node-id ip-127-0-0-1:8249

How should the -node-id and the IP mentioned here be entered?

I ran this command:

[tidb@hk-tidb-01 bin]$ ./binlogctl -pd-urls=http://192.168.10.73:2379 -cmd drainers
[2019/09/27 09:03:06.782 +08:00] [INFO] [nodes.go:47] ["query node"] [type=drainer] [node="{NodeID: hk-tidrainer-301:8249, Addr: 192.168.10.86:8249, State: online, MaxCommitTS: 411383221469315073, UpdateTime: 2019-09-24 13:21:01 +0800 CST}"]

After getting this information, I ran:

[tidb@hk-tidb-01 bin]$ ./binlogctl -pd-urls=http://192.168.10.73:2379 -cmd pause-drainer 192.168.10.86:8249

Why does it report an error?

[2019/09/27 09:03:53.150 +08:00] [ERROR] [main.go:38] [“parse cmd flags”] [error="‘192.168.10.86:8249’ is not a valid flag"]

According to the output of -cmd pumps, the node ID is hk-tipump-101:8250,
so the offline command should be:
./binlogctl -pd-urls=http://10.0.1.19:2379 -cmd offline-pump -node-id hk-tipump-101:8250
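
The value passed to -node-id is the NodeID field from the -cmd pumps or -cmd drainers output, not the Addr field, and it must follow the -node-id flag rather than stand alone. As a minimal sketch, assuming the output layout shown earlier in this thread, the helper below (the name extract_node_id is hypothetical, not part of binlogctl) pulls that field out of one output line and prints the resulting command for review instead of running it:

```shell
#!/bin/sh
# Hypothetical helper: print the NodeID field from one line of
# `binlogctl -cmd pumps` / `-cmd drainers` output.
# Assumes the "{NodeID: <id>, Addr: ..." layout seen in this thread.
extract_node_id() {
    printf '%s\n' "$1" | sed -n 's/.*NodeID: \([^,]*\),.*/\1/p'
}

# Line copied from the -cmd pumps output above.
line='[2019/09/27 08:40:46.198 +08:00] [INFO] [nodes.go:47] ["query node"] [type=pump] [node="{NodeID: hk-tipump-101:8250, Addr: 192.168.10.83:8250, State: paused, MaxCommitTS: 0, UpdateTime: 2019-09-27 08:40:37 +0800 CST}"]'
node_id=$(extract_node_id "$line")

# Print the command rather than executing it, so it can be checked first.
echo "./binlogctl -pd-urls=http://192.168.10.73:2379 -cmd offline-pump -node-id $node_id"
```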

After changing the pump node's IP, it still hangs here: TASK [wait until the pump port is up]

tidb.log on the control machine prints:
[2019/09/27 11:12:51.591 +08:00] [INFO] [gc_worker.go:304] ["[gc worker] gc interval haven't past since last run, no need to gc"] ["leaderTick on"=5b5c22073c80005] [interval=10m0s] ["last run"=2019/09/27 11:05:51.000 +08:00]

The pump node's log file prints:
2019/09/27 11:14:25 server.go:313: [info] register success, this pump's node id is hk-tipump-101:8250
2019/09/27 11:14:25 node.go:147: [info] start try to notify drainer: 192.168.10.86:8249
2019/09/27 11:14:28 main.go:59: [error] pump server error, fail to notify all living drainer: notify drainer(192.168.10.86:8249); but return error(rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 192.168.10.86:8249: connect: no route to host")

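Does the drainer node in the cluster need to be running? Adding a pump is supposed to come after the drainer is deployed, so should I remove the drainer node from the cluster first?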

pump server error, fail to notify all living drainer
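
The pump log suggests that the pump exits because it cannot reach a drainer that PD still lists as online (192.168.10.86:8249, "no route to host"). A rough sketch for confirming this from the pump host follows; the helper names are hypothetical, and the probe assumes bash with /dev/tcp support and the coreutils timeout command are available. It extracts the drainer address from the error text and only attempts a connection when called with --probe:

```shell
#!/bin/sh
# Hypothetical helper: extract "host:port" from the pump error text
# "notify drainer(<host:port>)". Assumes the message layout seen above.
drainer_addr_from_error() {
    printf '%s\n' "$1" | sed -n 's/.*notify drainer(\([^)]*\)).*/\1/p'
}

# Hypothetical helper: succeed if a TCP connection to host:port opens
# within 3 seconds. Relies on bash's /dev/tcp and coreutils `timeout`.
port_reachable() {
    timeout 3 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Start of the error line from the pump log above (tail omitted).
err='pump server error, fail to notify all living drainer: notify drainer(192.168.10.86:8249); but return error('
addr=$(drainer_addr_from_error "$err")
host=${addr%:*}    # part before the last colon
port=${addr##*:}   # part after the last colon
echo "drainer address: host=$host port=$port"

# Probe only when asked, e.g.: sh check_drainer.sh --probe
if [ "$1" = "--probe" ]; then
    if port_reachable "$host" "$port"; then
        echo "reachable"
    else
        echo "unreachable"
    fi
fi
```

If the address is unreachable and the drainer is known to be dead, its stale online state in PD is the likely blocker for the pump.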

Final solution:

Force the drainer node offline:
resources/bin/binlogctl -pd-urls=http://192.168.10.73:2379 -cmd update-drainer -node-id hk-tidrainer-301:8249 -state offline

Start the pump node again:
ansible-playbook start.yml --tags=pump

The pump node started successfully!!! Thanks to the TiDB team, and thanks to 戚铮 for the expert help!!!
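
Before restarting the pump, it may be worth confirming that PD no longer reports the dead drainer as online. The sketch below assumes the -cmd drainers output layout shown earlier in this thread; node_state is a hypothetical helper name. It is fed the line captured before the fix, so it reports the state that was blocking the pump:

```shell
#!/bin/sh
# Hypothetical helper: print the State field from one line of
# `binlogctl -cmd drainers` (or `-cmd pumps`) output.
node_state() {
    printf '%s\n' "$1" | sed -n 's/.*State: \([^,]*\),.*/\1/p'
}

# Line copied from the -cmd drainers output earlier in this thread,
# captured before update-drainer was run.
line='[2019/09/27 09:03:06.782 +08:00] [INFO] [nodes.go:47] ["query node"] [type=drainer] [node="{NodeID: hk-tidrainer-301:8249, Addr: 192.168.10.86:8249, State: online, MaxCommitTS: 411383221469315073, UpdateTime: 2019-09-24 13:21:01 +0800 CST}"]'
state=$(node_state "$line")

if [ "$state" = "offline" ]; then
    echo "drainer is offline; start the pump: ansible-playbook start.yml --tags=pump"
else
    echo "drainer state is '$state'; run update-drainer -state offline first"
fi
```

In practice the line would come from a fresh run of ./binlogctl -cmd drainers after the update-drainer command.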

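If the monitoring shows the drainer node in red after startup:

Log in to the drainer node and switch to the tidb user.

Run:
[tidb@hk-tidrainer-301 log]$ /ext4/deploy/scripts/start_drainer.sh

That is, systemctl start drainer-8249.service

After that, the drainer node is no longer red in the monitoring.

After running for a while, the drainer-8249 service may go down. If that happens, you can enable drainer-8249 to start at boot.

As the root user:

systemctl enable drainer-8249.service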


After modifying inventory.ini, a rolling update is required:
ansible-playbook rolling_update.yml
