Fixing Run Command Timeout / SSH Timeout errors when upgrading a cluster with TiUP

(1) Symptom: while upgrading with TiUP, stopping a TiKV node times out with ERROR Run Command Timeout, yet logging in to 192.168.1.43 shows that TiKV has in fact already stopped.
2020-06-29T05:21:18.289+0800 INFO Stopping instance 192.168.1.43
2020-06-29T05:22:58.364+0800 INFO SSHCommand {"host": "192.168.1.43", "port": "22", "cmd": "export LANG=C; PATH=$PATH:/usr/bin:/usr/sbin sudo -H -u root bash -c "systemctl daemon-reload && systemctl stop tikv-20160.service"", "stdout": "", "stderr": "Run Command Timeout!\n"}
2020-06-29T05:22:58.364+0800 ERROR Run Command Timeout!

2020-06-29T05:22:58.364+0800 INFO Execute command finished {“code”: 1, “error”: “failed to upgrade: failed to stop 192.168.1.43: failed to stop: tikv 192.168.1.43:20160: executor.ssh.execute_timedout: Execute command over SSH timedout for ‘tidb@192.168.1.43:22’ {ssh_stderr: Run Command Timeout!\n, ssh_stdout: , ssh_command: export LANG=C; PATH=$PATH:/usr/bin:/usr/sbin sudo -H -u root bash -c “systemctl daemon-reload && systemctl stop tikv-20160.service”}”, “errorVerbose”: “executor.ssh.execute_timedout: Execute command over SSH timedout for ‘tidb@192.168.1.43:22’ {ssh_stderr: Run Command Timeout!\n, ssh_stdout: , ssh_command: export LANG=C; PATH=$PATH:/usr/bin:/usr/sbin sudo -H -u root bash -c “systemctl daemon-reload && systemctl stop tikv-20160.service”}\n at github.com/pingcap/tiup/pkg/cluster/executor.(*SSHExecutor).Execute()\n\tgithub.com/pingcap/tiup@/pkg/cluster/executor/ssh.go:172\n at github.com/pingcap/tiup/pkg/cluster/module.(*SystemdModule).Execute()\n\tgithub.com/pingcap/tiup@/pkg/cluster/module/systemd.go:89\n at github.com/pingcap/tiup/pkg/cluster/operation.stopInstance()\n\tgithub.com/pingcap/tiup@/pkg/cluster/operation/action.go:574\n at github.com/pingcap/tiup/pkg/cluster/operation.Upgrade()\n\tgithub.com/pingcap/tiup@/pkg/cluster/operation/upgrade.go:99\n at github.com/pingcap/tiup/pkg/cluster/task.(*ClusterOperate).Execute()\n\tgithub.com/pingcap/tiup@/pkg/cluster/task/action.go:53\n at github.com/pingcap/tiup/pkg/cluster/task.(*Serial).Execute()\n\tgithub.com/pingcap/tiup@/pkg/cluster/task/task.go:189\n at github.com/pingcap/tiup/components/cluster/command.upgrade()\n\tgithub.com/pingcap/tiup@/components/cluster/command/upgrade.go:174\n at github.com/pingcap/tiup/components/cluster/command.newUpgradeCmd.func1()\n\tgithub.com/pingcap/tiup@/components/cluster/command/upgrade.go:50\n at github.com/spf13/cobra.(*Command).execute()\n\tgithub.com/spf13/cobra@v1.0.0/command.go:842\n at 
github.com/spf13/cobra.(*Command).ExecuteC()\n\tgithub.com/spf13/cobra@v1.0.0/command.go:950\n at github.com/spf13/cobra.(*Command).Execute()\n\tgithub.com/spf13/cobra@v1.0.0/command.go:887\n at github.com/pingcap/tiup/components/cluster/command.Execute()\n\tgithub.com/pingcap/tiup@/components/cluster/command/root.go:220\n at main.main()\n\tgithub.com/pingcap/tiup@/components/cluster/main.go:19\n at runtime.main()\n\truntime/proc.go:203\n at runtime.goexit()\n\truntime/asm_amd64.s:1357\nfailed to stop: tikv 192.168.1.43:20160\ngithub.com/pingcap/tiup/pkg/cluster/operation.stopInstance\n\tgithub.com/pingcap/tiup@/pkg/cluster/operation/action.go:593\ngithub.com/pingcap/tiup/pkg/cluster/operation.Upgrade\n\tgithub.com/pingcap/tiup@/pkg/cluster/operation/upgrade.go:99\ngithub.com/pingcap/tiup/pkg/cluster/task.(*ClusterOperate).Execute\n\tgithub.com/pingcap/tiup@/pkg/cluster/task/action.go:53\ngithub.com/pingcap/tiup/pkg/cluster/task.(*Serial).Execute\n\tgithub.com/pingcap/tiup@/pkg/cluster/task/task.go:189\ngithub.com/pingcap/tiup/components/cluster/command.upgrade\n\tgithub.com/pingcap/tiup@/components/cluster/command/upgrade.go:174\ngithub.com/pingcap/tiup/components/cluster/command.newUpgradeCmd.func1\n\tgithub.com/pingcap/tiup@/components/cluster/command/upgrade.go:50\ngithub.com/spf13/cobra.(*Command).execute\n\tgithub.com/spf13/cobra@v1.0.0/command.go:842\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\tgithub.com/spf13/cobra@v1.0.0/command.go:950\ngithub.com/spf13/cobra.(*Command).Execute\n\tgithub.com/spf13/cobra@v1.0.0/command.go:887\ngithub.com/pingcap/tiup/components/cluster/command.Execute\n\tgithub.com/pingcap/tiup@/components/cluster/command/root.go:220\nmain.main\n\tgithub.com/pingcap/tiup@/components/cluster/main.go:19\nruntime.main\n\truntime/proc.go:203\nruntime.goexit\n\truntime/asm_amd64.s:1357\nfailed to stop 192.168.1.43\nfailed to upgrade”}

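(2) Solution:
1. Upgrade TiUP to the latest version: tiup update --self && tiup update --all updates TiUP itself and its components.
Why upgrade? So that you can use the following two flags, which the latest TiUP provides: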
tiup cluster --help
Flags:
-h, --help help for tiup
--ssh-timeout int Timeout in seconds to connect host via SSH, ignored for operations that don't need an SSH connection. (default 5)
-v, --version version for tiup
--wait-timeout int Timeout in seconds to wait for an operation to complete, ignored for operations that don't fit. (default 60)

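If the error is about ssh-timeout: --ssh-timeout is how long the control machine waits to establish an SSH connection to a TiKV/PD/TiDB host. If the network is unreliable, increase it.
If the error is ERROR Run Command Timeout: --wait-timeout is how long the control machine waits for a command to finish on a TiKV/PD/TiDB host. If the command is slow to run, increase it.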
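
To make the two flags concrete, here is a minimal sketch that builds an upgrade command with both timeouts raised. The cluster name, the target version, and the values 30 and 300 are placeholders chosen for illustration and do not come from this thread. By default the script only prints the command.

```shell
#!/usr/bin/env bash
# Sketch: build a `tiup cluster upgrade` command with longer timeouts.
# CLUSTER_NAME and TARGET_VERSION are hypothetical placeholders.
CLUSTER_NAME="${CLUSTER_NAME:-my-cluster}"
TARGET_VERSION="${TARGET_VERSION:-vX.Y.Z}"

# Print the upgrade command for the given SSH and wait timeouts (in seconds).
upgrade_cmd() {
  local ssh_timeout="$1" wait_timeout="$2"
  echo "tiup cluster upgrade ${CLUSTER_NAME} ${TARGET_VERSION}" \
       "--ssh-timeout ${ssh_timeout} --wait-timeout ${wait_timeout}"
}

# The defaults are 5 and 60; 30 and 300 are arbitrary larger values.
cmd="$(upgrade_cmd 30 300)"
echo "$cmd"

# Set RUN=1 to actually execute the command on the control machine.
if [ "${RUN:-0}" = "1" ]; then
  $cmd
fi
```

Printing first and running only with RUN=1 lets you check the exact command line before touching the cluster.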

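2. If you have raised the timeouts and the upgrade still fails after several attempts, use the last resort: --force

A rolling upgrade upgrades all components one by one. While upgrading TiKV, it moves all leaders off each TiKV instance before stopping that instance. The default timeout is 5 minutes; once it is exceeded, the instance is stopped directly.

If you do not want to evict leaders and want to upgrade immediately, add --force to the command above. This causes performance jitter (run it during the early-morning low-traffic period to minimize the impact) but does not cause data loss.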
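
As a rough illustration of the forced upgrade, the sketch below prints the command with --force and warns when the current hour is outside a quiet window. The cluster name, the target version, and the 01:00-06:00 window are assumptions made for this example; pick the window that matches your own traffic.

```shell
#!/usr/bin/env bash
# Sketch: print a forced upgrade command, with a reminder about off-peak hours.
# CLUSTER_NAME, TARGET_VERSION and the 01:00-06:00 window are hypothetical.
CLUSTER_NAME="${CLUSTER_NAME:-my-cluster}"
TARGET_VERSION="${TARGET_VERSION:-vX.Y.Z}"

# Succeed when the given hour (00-23) falls in the assumed quiet window.
in_quiet_window() {
  local hour=$((10#$1))   # force base 10 so "08" and "09" are not read as octal
  [ "$hour" -ge 1 ] && [ "$hour" -lt 6 ]
}

# Print the upgrade command with --force, which skips leader eviction.
force_upgrade_cmd() {
  echo "tiup cluster upgrade ${CLUSTER_NAME} ${TARGET_VERSION} --force"
}

if ! in_quiet_window "$(date +%H)"; then
  echo "warning: outside the assumed quiet window (01:00-06:00)" >&2
fi
force_upgrade_cmd
```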


Have you tried simply restarting the nodes that have not been upgraded yet?

See section 2.2 of the document above, on an upgrade that is interrupted when using TiUP Cluster.

The version shown by display will be wrong.
Edit the metadata file:
~/.tiup/storage/cluster/clusters/{cluster_name}/meta.yaml
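
Before editing the metadata by hand, it may be worth taking a copy. The sketch below backs up meta.yaml and lists the lines that mention a version. The cluster name is a placeholder, and the grep pattern is deliberately loose because the exact field name is not given in this thread.

```shell
#!/usr/bin/env bash
# Sketch: back up the cluster metadata file before a manual edit.
# CLUSTER_NAME is a hypothetical placeholder.
CLUSTER_NAME="${CLUSTER_NAME:-my-cluster}"
META="${META:-$HOME/.tiup/storage/cluster/clusters/${CLUSTER_NAME}/meta.yaml}"

# Copy the file to a timestamped backup next to it and print the backup path.
backup_meta() {
  local src="$1"
  local dst="${src}.bak.$(date +%Y%m%d%H%M%S)"
  cp -p "$src" "$dst" && echo "$dst"
}

if [ -f "$META" ]; then
  backup="$(backup_meta "$META")"
  echo "backup written to $backup"
  # Show lines mentioning a version, to locate what needs changing.
  grep -n -i 'version' "$META"
else
  echo "no metadata file at $META"
fi
```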

On the machine that timed out, does this command exit normally when you run it by hand?

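Yes. Running the reported command by hand on the control machine works fine. I tried twice: on the second upgrade, the TiKV on 192.168.1.43 that failed the first time restarted normally. The procedure in the post above was run twice in total, and during the second upgrade the same problem appeared on the TiKV of another host.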

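It is not a display problem. Logging in to the server that reported the run timeout shows that TiKV has in fact stopped, so it looks like TiUP has a problem collecting the result.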

Don't run it on the control machine; run it on the machine that timed out:
sudo -H -u root bash -c "systemctl daemon-reload && systemctl stop tikv-20160.service"
and see whether it hangs somewhere.
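
One way to confirm what the unit is doing on that host is to ask systemd directly with systemctl is-active. The sketch below is only an illustration: the mapping from systemd states to "stopped", "still stopping", and "running" is this example's own simplification, and the unit name is the one from the log above.

```shell
#!/usr/bin/env bash
# Sketch: interpret `systemctl is-active` output for the TiKV unit.
UNIT="tikv-20160.service"   # unit name taken from the log in this thread

# Map a systemd state string to a short verdict (simplified for illustration).
describe_state() {
  case "$1" in
    inactive|failed)             echo "stopped" ;;
    deactivating)                echo "still stopping" ;;
    active|activating|reloading) echo "running" ;;
    *)                           echo "unknown" ;;
  esac
}

# Run this on the TiKV host that timed out, not on the control machine.
if command -v systemctl >/dev/null 2>&1; then
  state="$(systemctl is-active "$UNIT" 2>/dev/null || true)"
else
  state=""
fi
echo "$UNIT is $(describe_state "$state")"
```

If the unit stays in the deactivating state for a long time, the stop command itself is slow; if it is already inactive while TiUP still reports a timeout, the delay is more likely on the control-machine side.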

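It does not hang. The command runs fine on the machine that timed out, and as the post already says, after the control machine reports the timeout ERROR, logging in to that TiKV host shows that tikv-server has already been stopped. So the hang you describe does not exist. I suspect the problem is in how the control machine receives the "TiKV has stopped" feedback from the TiKV node.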