Cluster unavailable while TiUP rolling upgrade restarts TiKV

  • 【TiDB version】: v3.0.9
  • 【Problem description】: the cluster becomes unavailable while a TiUP rolling upgrade restarts TiKV

Upgrade command:
tiup cluster upgrade v4.0.5 --transfer-timeout 100000000

Log:
Still waitting for 47 store leaders to transfer…
Still waitting for 47 store leaders to transfer…
Still waitting for 47 store leaders to transfer…
Restarting instance 172.16.188.132

At this point in the log the cluster became inaccessible. What could cause this, and can it be avoided?

The cluster serves traffic normally the rest of the time; during the upgrade it is unavailable for roughly 3 minutes.

When rolling-restarting a TiKV node, TiUP first adds an evict-leader-scheduler on that node to transfer its leaders away. Once the leaders are gone, it removes the evict-leader-scheduler and restarts the node. This avoids, as much as possible, the unavailability that shutting the node down directly would cause.
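The per-store restart logic described above can be sketched roughly as follows. This is an illustrative Python sketch of the evict-wait-restart sequence, not tiup's actual code; all function names and callback signatures are hypothetical:

```python
import time

def rolling_restart_tikv(store, evict_leaders, count_leaders, restart,
                         remove_scheduler, transfer_timeout=300, poll=5):
    """Illustrative sketch of tiup's per-store rolling restart:
    evict leaders, wait until the store holds none (or the timeout
    expires), then restart. Callbacks are hypothetical stand-ins."""
    evict_leaders(store)                    # add evict-leader-scheduler
    deadline = time.monotonic() + transfer_timeout
    while count_leaders(store) > 0:
        if time.monotonic() >= deadline:    # timed out: restart anyway;
            break                           # remaining leaders go down with the store
        time.sleep(poll)                    # "Still waitting for N store leaders..."
    restart(store)                          # safe only if no leaders remain
    remove_scheduler(store)                 # let leaders balance back afterwards
```

The key point for this thread is the timeout branch: if eviction stalls, the store is restarted while still holding leaders, and those Regions lose their leader until re-election.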

Here it looks like, while waiting for the leader transfer, some Regions never managed to move their leaders; once the timeout was reached the TiKV instance was shut down anyway, so those Regions became unavailable because their leaders were gone. As for why the transfer failed, you would need to inspect the remaining Regions on that store at the time.

Can this problem be reproduced reliably? Also, have you tried v4.0.6 to see whether it hits the same issue?

The leaders had all been transferred successfully before "Restarting instance" appeared. It is that step that makes the service unavailable; once the restart finishes, service returns to normal.

v3.0.9 -> v4.0.5
v4.0.5 -> v4.0.6: both upgrades hit this problem.

Full upgrade log:

$ tiup cluster upgrade dev-cluster v4.0.6 --transfer-timeout 100000000

Starting component cluster: /home/tidb/.tiup/components/cluster/v1.1.2/tiup-cluster upgrade dev-cluster v4.0.6 --transfer-timeout 100000000

  • [ Serial ] - SSHKeySet: privateKey=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/ssh/id_rsa, publicKey=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/ssh/id_rsa.pub
  • [Parallel] - UserSSH: user=tidb, host=172.16.188.106
  • [Parallel] - UserSSH: user=tidb, host=172.16.188.162
  • [Parallel] - UserSSH: user=tidb, host=172.16.188.109
  • [Parallel] - UserSSH: user=tidb, host=172.16.188.106
  • [Parallel] - UserSSH: user=tidb, host=172.16.188.133
  • [Parallel] - UserSSH: user=tidb, host=172.16.188.132
  • [Parallel] - UserSSH: user=tidb, host=172.16.188.113
  • [Parallel] - UserSSH: user=tidb, host=172.16.188.106
  • [Parallel] - UserSSH: user=tidb, host=172.16.188.144
  • [Parallel] - UserSSH: user=tidb, host=172.16.188.150
  • [Parallel] - UserSSH: user=tidb, host=172.16.188.107
  • [Parallel] - UserSSH: user=tidb, host=172.16.188.106
  • [Parallel] - UserSSH: user=tidb, host=172.16.188.133
  • [Parallel] - UserSSH: user=tidb, host=172.16.188.120
  • [Parallel] - UserSSH: user=tidb, host=172.16.188.123
  • [ Serial ] - Download: component=pd, version=v4.0.6, os=linux, arch=amd64
  • [ Serial ] - Download: component=grafana, version=v4.0.6, os=linux, arch=amd64
  • [ Serial ] - Download: component=tikv, version=v4.0.6, os=linux, arch=amd64
  • [ Serial ] - Download: component=pump, version=v4.0.6, os=linux, arch=amd64
  • [ Serial ] - Download: component=tidb, version=v4.0.6, os=linux, arch=amd64
  • [ Serial ] - Download: component=prometheus, version=v4.0.6, os=linux, arch=amd64
  • [ Serial ] - CopyComponent: component=grafana, version=v4.0.6, remote=172.16.188.106:/data/tidata os=linux, arch=amd64
  • [ Serial ] - BackupComponent: component=tikv, currentVersion=v4.0.5, remote=172.16.188.132:/data/tidata
  • [ Serial ] - BackupComponent: component=pd, currentVersion=v4.0.5, remote=172.16.188.113:/data/tidata
  • [ Serial ] - BackupComponent: component=pd, currentVersion=v4.0.5, remote=172.16.188.120:/data/tidata
  • [ Serial ] - BackupComponent: component=pd, currentVersion=v4.0.5, remote=172.16.188.144:/data/tidata
  • [ Serial ] - BackupComponent: component=tikv, currentVersion=v4.0.5, remote=172.16.188.107:/data/tidata
  • [ Serial ] - BackupComponent: component=tikv, currentVersion=v4.0.5, remote=172.16.188.123:/data/tidata
  • [ Serial ] - BackupComponent: component=tikv, currentVersion=v4.0.5, remote=172.16.188.150:/data/tidata
  • [ Serial ] - BackupComponent: component=pump, currentVersion=v4.0.5, remote=172.16.188.133:/data/pump
  • [ Serial ] - BackupComponent: component=tikv, currentVersion=v4.0.5, remote=172.16.188.109:/data/tidata
  • [ Serial ] - BackupComponent: component=pump, currentVersion=v4.0.5, remote=172.16.188.106:/data/pump
  • [ Serial ] - BackupComponent: component=tidb, currentVersion=v4.0.5, remote=172.16.188.133:/data/tidata
  • [ Serial ] - BackupComponent: component=tikv, currentVersion=v4.0.5, remote=172.16.188.162:/data/tidata
  • [ Serial ] - BackupComponent: component=tidb, currentVersion=v4.0.5, remote=172.16.188.106:/data/tidata
  • [ Serial ] - CopyComponent: component=prometheus, version=v4.0.6, remote=172.16.188.106:/data/tidata os=linux, arch=amd64
  • [ Serial ] - CopyComponent: component=pump, version=v4.0.6, remote=172.16.188.133:/data/pump os=linux, arch=amd64
  • [ Serial ] - CopyComponent: component=pd, version=v4.0.6, remote=172.16.188.120:/data/tidata os=linux, arch=amd64
  • [ Serial ] - CopyComponent: component=pd, version=v4.0.6, remote=172.16.188.113:/data/tidata os=linux, arch=amd64
  • [ Serial ] - CopyComponent: component=tikv, version=v4.0.6, remote=172.16.188.150:/data/tidata os=linux, arch=amd64
  • [ Serial ] - CopyComponent: component=tikv, version=v4.0.6, remote=172.16.188.162:/data/tidata os=linux, arch=amd64
  • [ Serial ] - CopyComponent: component=tikv, version=v4.0.6, remote=172.16.188.132:/data/tidata os=linux, arch=amd64
  • [ Serial ] - CopyComponent: component=tidb, version=v4.0.6, remote=172.16.188.133:/data/tidata os=linux, arch=amd64
  • [ Serial ] - CopyComponent: component=pd, version=v4.0.6, remote=172.16.188.144:/data/tidata os=linux, arch=amd64
  • [ Serial ] - CopyComponent: component=tikv, version=v4.0.6, remote=172.16.188.107:/data/tidata os=linux, arch=amd64
  • [ Serial ] - CopyComponent: component=tikv, version=v4.0.6, remote=172.16.188.109:/data/tidata os=linux, arch=amd64
  • [ Serial ] - CopyComponent: component=tikv, version=v4.0.6, remote=172.16.188.123:/data/tidata os=linux, arch=amd64
  • [ Serial ] - InitConfig: cluster=dev-cluster, user=tidb, host=172.16.188.133, path=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache/pump-8250.service, deploy_dir=/data/pump, data_dir=[/data/pump/data.pump], log_dir=/data/pump/log, cache_dir=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache
  • [ Serial ] - InitConfig: cluster=dev-cluster, user=tidb, host=172.16.188.120, path=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache/pd-2379.service, deploy_dir=/data/tidata, data_dir=[/data/tidata/data.pd], log_dir=/data/tidata/log, cache_dir=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache
  • [ Serial ] - InitConfig: cluster=dev-cluster, user=tidb, host=172.16.188.113, path=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache/pd-2379.service, deploy_dir=/data/tidata, data_dir=[/data/tidata/data.pd], log_dir=/data/tidata/log, cache_dir=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache
  • [ Serial ] - InitConfig: cluster=dev-cluster, user=tidb, host=172.16.188.133, path=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache/tidb-4000.service, deploy_dir=/data/tidata, data_dir=[], log_dir=/data/tidata/log, cache_dir=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache
  • [ Serial ] - InitConfig: cluster=dev-cluster, user=tidb, host=172.16.188.144, path=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache/pd-2379.service, deploy_dir=/data/tidata, data_dir=[/data/tidata/data.pd], log_dir=/data/tidata/log, cache_dir=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache
  • [ Serial ] - BackupComponent: component=prometheus, currentVersion=v4.0.5, remote=172.16.188.106:/data/tidata
  • [ Serial ] - InitConfig: cluster=dev-cluster, user=tidb, host=172.16.188.162, path=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache/tikv-20160.service, deploy_dir=/data/tidata, data_dir=[/data/tidata/data], log_dir=/data/tidata/log, cache_dir=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache
  • [ Serial ] - InitConfig: cluster=dev-cluster, user=tidb, host=172.16.188.107, path=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache/tikv-20160.service, deploy_dir=/data/tidata, data_dir=[/data/tidata/data], log_dir=/data/tidata/log, cache_dir=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache
  • [ Serial ] - CopyComponent: component=prometheus, version=v4.0.6, remote=172.16.188.106:/data/tidata os=linux, arch=amd64
  • [ Serial ] - InitConfig: cluster=dev-cluster, user=tidb, host=172.16.188.109, path=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache/tikv-20160.service, deploy_dir=/data/tidata, data_dir=[/data/tidata/data], log_dir=/data/tidata/log, cache_dir=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache
  • [ Serial ] - InitConfig: cluster=dev-cluster, user=tidb, host=172.16.188.150, path=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache/tikv-20160.service, deploy_dir=/data/tidata, data_dir=[/data/tidata/data], log_dir=/data/tidata/log, cache_dir=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache
  • [ Serial ] - InitConfig: cluster=dev-cluster, user=tidb, host=172.16.188.123, path=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache/tikv-20160.service, deploy_dir=/data/tidata, data_dir=[/data/tidata/data], log_dir=/data/tidata/log, cache_dir=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache
  • [ Serial ] - InitConfig: cluster=dev-cluster, user=tidb, host=172.16.188.132, path=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache/tikv-20160.service, deploy_dir=/data/tidata, data_dir=[/data/tidata/data], log_dir=/data/tidata/log, cache_dir=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache
  • [ Serial ] - CopyComponent: component=pump, version=v4.0.6, remote=172.16.188.106:/data/pump os=linux, arch=amd64
  • [ Serial ] - InitConfig: cluster=dev-cluster, user=tidb, host=172.16.188.106, path=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache/prometheus-9090.service, deploy_dir=/data/tidata, data_dir=[/data/tidata/prometheus2.0.0.data.metrics], log_dir=/data/tidata/log, cache_dir=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache
  • [ Serial ] - InitConfig: cluster=dev-cluster, user=tidb, host=172.16.188.106, path=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache/pump-8250.service, deploy_dir=/data/pump, data_dir=[/data/pump/data.pump], log_dir=/data/pump/log, cache_dir=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache
  • [ Serial ] - BackupComponent: component=grafana, currentVersion=v4.0.5, remote=172.16.188.106:/data/tidata
  • [ Serial ] - CopyComponent: component=grafana, version=v4.0.6, remote=172.16.188.106:/data/tidata os=linux, arch=amd64
  • [ Serial ] - CopyComponent: component=tidb, version=v4.0.6, remote=172.16.188.106:/data/tidata os=linux, arch=amd64
  • [ Serial ] - InitConfig: cluster=dev-cluster, user=tidb, host=172.16.188.106, path=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache/tidb-4000.service, deploy_dir=/data/tidata, data_dir=[], log_dir=/data/tidata/log, cache_dir=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache
  • [ Serial ] - InitConfig: cluster=dev-cluster, user=tidb, host=172.16.188.106, path=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache/grafana-3000.service, deploy_dir=/data/tidata, data_dir=[], log_dir=/data/tidata/log, cache_dir=/home/tidb/.tiup/storage/cluster/clusters/dev-cluster/config-cache
  • [ Serial ] - UpgradeCluster
    Restarting component pd
    Restarting instance 172.16.188.113
    Restart 172.16.188.113 success
    Restarting instance 172.16.188.120
    Restart 172.16.188.120 success
    Restarting instance 172.16.188.144
    Restart 172.16.188.144 success
    Restarting component tikv
    Evicting 1927 leaders from store 172.16.188.107:20160…
    Still waitting for 1927 store leaders to transfer…
    Still waitting for 1927 store leaders to transfer…
    Still waitting for 1927 store leaders to transfer…
    Still waitting for 1927 store leaders to transfer…
    Still waitting for 1927 store leaders to transfer…
    Still waitting for 1927 store leaders to transfer…
    Still waitting for 1927 store leaders to transfer…
    Still waitting for 1927 store leaders to transfer…
    Still waitting for 1927 store leaders to transfer…
    Still waitting for 1927 store leaders to transfer…
    Still waitting for 1925 store leaders to transfer…
    Still waitting for 1925 store leaders to transfer…
    Still waitting for 1925 store leaders to transfer…
    Still waitting for 1925 store leaders to transfer…
    Still waitting for 1925 store leaders to transfer…
    Still waitting for 1912 store leaders to transfer…
    Still waitting for 1912 store leaders to transfer…
    Still waitting for 1912 store leaders to transfer…
    Still waitting for 1912 store leaders to transfer…
    Still waitting for 1912 store leaders to transfer…
    Still waitting for 1911 store leaders to transfer…
    Still waitting for 1911 store leaders to transfer…
    Still waitting for 1911 store leaders to transfer…
    Still waitting for 1911 store leaders to transfer…
    Still waitting for 1911 store leaders to transfer…
    Still waitting for 1612 store leaders to transfer…
    Still waitting for 1612 store leaders to transfer…
    Still waitting for 1612 store leaders to transfer…
    Still waitting for 1612 store leaders to transfer…
    Still waitting for 1467 store leaders to transfer…
    Still waitting for 1467 store leaders to transfer…
    Still waitting for 1467 store leaders to transfer…
    Still waitting for 1467 store leaders to transfer…
    Still waitting for 1467 store leaders to transfer…
    Still waitting for 886 store leaders to transfer…
    Still waitting for 886 store leaders to transfer…
    Still waitting for 886 store leaders to transfer…
    Still waitting for 886 store leaders to transfer…
    Still waitting for 886 store leaders to transfer…
    Still waitting for 540 store leaders to transfer…
    Still waitting for 540 store leaders to transfer…
    Still waitting for 540 store leaders to transfer…
    Still waitting for 540 store leaders to transfer…
    Still waitting for 540 store leaders to transfer…
    Still waitting for 70 store leaders to transfer…
    Still waitting for 70 store leaders to transfer…
    Still waitting for 70 store leaders to transfer…
    Still waitting for 70 store leaders to transfer…
    Still waitting for 70 store leaders to transfer…
    Restarting instance 172.16.188.107
    Restart 172.16.188.107 success
    Delete leader evicting scheduler of store 5 success
    Removed store leader evicting scheduler from 172.16.188.107:20160.
    Evicting 2316 leaders from store 172.16.188.123:20160…
    Still waitting for 2316 store leaders to transfer…
    Still waitting for 2250 store leaders to transfer…
    Still waitting for 2250 store leaders to transfer…
    Still waitting for 2250 store leaders to transfer…
    Still waitting for 2250 store leaders to transfer…
    Still waitting for 2250 store leaders to transfer…
    Still waitting for 1850 store leaders to transfer…
    Still waitting for 1850 store leaders to transfer…
    Still waitting for 1850 store leaders to transfer…
    Still waitting for 1850 store leaders to transfer…
    Still waitting for 1850 store leaders to transfer…
    Still waitting for 1336 store leaders to transfer…
    Still waitting for 1336 store leaders to transfer…
    Still waitting for 1336 store leaders to transfer…
    Still waitting for 1336 store leaders to transfer…
    Still waitting for 1336 store leaders to transfer…
    Still waitting for 752 store leaders to transfer…
    Still waitting for 752 store leaders to transfer…
    Still waitting for 752 store leaders to transfer…
    Still waitting for 752 store leaders to transfer…
    Still waitting for 317 store leaders to transfer…
    Still waitting for 317 store leaders to transfer…
    Still waitting for 317 store leaders to transfer…
    Still waitting for 317 store leaders to transfer…
    Still waitting for 317 store leaders to transfer…
    Restarting instance 172.16.188.123
    Restart 172.16.188.123 success
    Delete leader evicting scheduler of store 4 success
    Removed store leader evicting scheduler from 172.16.188.123:20160.
    Evicting 2313 leaders from store 172.16.188.150:20160…
    Still waitting for 2313 store leaders to transfer…
    Still waitting for 2313 store leaders to transfer…

Still waitting for 2313 store leaders to transfer…
Still waitting for 2104 store leaders to transfer…
Still waitting for 2104 store leaders to transfer…
Still waitting for 2104 store leaders to transfer…
Still waitting for 2104 store leaders to transfer…
Still waitting for 2104 store leaders to transfer…
Still waitting for 1841 store leaders to transfer…
Still waitting for 1841 store leaders to transfer…
Still waitting for 1841 store leaders to transfer…
Still waitting for 1841 store leaders to transfer…
Still waitting for 1087 store leaders to transfer…
Still waitting for 1087 store leaders to transfer…
Still waitting for 1087 store leaders to transfer…
Still waitting for 1087 store leaders to transfer…
Still waitting for 1087 store leaders to transfer…
Still waitting for 257 store leaders to transfer…
Still waitting for 257 store leaders to transfer…
Still waitting for 257 store leaders to transfer…
Still waitting for 257 store leaders to transfer…
Still waitting for 257 store leaders to transfer…
Restarting instance 172.16.188.150
Restart 172.16.188.150 success
Delete leader evicting scheduler of store 21258 success
Removed store leader evicting scheduler from 172.16.188.150:20160.
Evicting 2314 leaders from store 172.16.188.132:20160…
Still waitting for 2314 store leaders to transfer…
Still waitting for 2314 store leaders to transfer…
Still waitting for 2314 store leaders to transfer…
Still waitting for 2314 store leaders to transfer…
Still waitting for 1756 store leaders to transfer…
Still waitting for 1756 store leaders to transfer…
Still waitting for 1756 store leaders to transfer…
Still waitting for 1756 store leaders to transfer…
Still waitting for 1756 store leaders to transfer…
Still waitting for 802 store leaders to transfer…
Still waitting for 802 store leaders to transfer…
Still waitting for 802 store leaders to transfer…
Still waitting for 802 store leaders to transfer…
Still waitting for 802 store leaders to transfer…
Still waitting for 47 store leaders to transfer…
Still waitting for 47 store leaders to transfer…
Still waitting for 47 store leaders to transfer…
Still waitting for 47 store leaders to transfer…
Restarting instance 172.16.188.132
Restart 172.16.188.132 success
Delete leader evicting scheduler of store 21256 success
Removed store leader evicting scheduler from 172.16.188.132:20160.
Evicting 2311 leaders from store 172.16.188.162:20160…
Still waitting for 2311 store leaders to transfer…
Still waitting for 2311 store leaders to transfer…
Still waitting for 2311 store leaders to transfer…
Still waitting for 1777 store leaders to transfer…
Still waitting for 1777 store leaders to transfer…
Still waitting for 1777 store leaders to transfer…
Still waitting for 1777 store leaders to transfer…
Still waitting for 898 store leaders to transfer…
Still waitting for 898 store leaders to transfer…
Still waitting for 898 store leaders to transfer…
Still waitting for 898 store leaders to transfer…
Still waitting for 898 store leaders to transfer…
Still waitting for 651 store leaders to transfer…
Still waitting for 651 store leaders to transfer…
Still waitting for 651 store leaders to transfer…
Still waitting for 651 store leaders to transfer…
Still waitting for 651 store leaders to transfer…
Still waitting for 12 store leaders to transfer…
Still waitting for 12 store leaders to transfer…
Still waitting for 12 store leaders to transfer…
Still waitting for 12 store leaders to transfer…
Still waitting for 12 store leaders to transfer…
Restarting instance 172.16.188.162
Restart 172.16.188.162 success
Delete leader evicting scheduler of store 21257 success
Removed store leader evicting scheduler from 172.16.188.162:20160.
Evicting 2316 leaders from store 172.16.188.109:20160…
Still waitting for 2316 store leaders to transfer…
Still waitting for 2316 store leaders to transfer…
Still waitting for 2316 store leaders to transfer…
Still waitting for 2316 store leaders to transfer…
Still waitting for 2316 store leaders to transfer…
Still waitting for 1890 store leaders to transfer…
Still waitting for 1890 store leaders to transfer…
Still waitting for 1890 store leaders to transfer…
Still waitting for 1890 store leaders to transfer…
Still waitting for 1890 store leaders to transfer…
Still waitting for 1517 store leaders to transfer…
Still waitting for 1517 store leaders to transfer…
Still waitting for 1517 store leaders to transfer…
Still waitting for 1517 store leaders to transfer…
Still waitting for 1115 store leaders to transfer…
Still waitting for 1115 store leaders to transfer…
Still waitting for 1115 store leaders to transfer…
Still waitting for 1115 store leaders to transfer…
Still waitting for 1115 store leaders to transfer…
Still waitting for 254 store leaders to transfer…
Still waitting for 254 store leaders to transfer…
Still waitting for 254 store leaders to transfer…
Still waitting for 254 store leaders to transfer…
Still waitting for 254 store leaders to transfer…
Restarting instance 172.16.188.109
Restart 172.16.188.109 success
Delete leader evicting scheduler of store 1 success
Removed store leader evicting scheduler from 172.16.188.109:20160.
Restarting component pump
Restarting instance 172.16.188.106
Restart 172.16.188.106 success
Restarting instance 172.16.188.133
Restart 172.16.188.133 success
Restarting component tidb
Restarting instance 172.16.188.133
Restart 172.16.188.133 success
Restarting instance 172.16.188.106
Restart 172.16.188.106 success
Restarting component prometheus
Restarting instance 172.16.188.106
Restart 172.16.188.106 success
Restarting component grafana
Restarting instance 172.16.188.106
Restart 172.16.188.106 success
Upgraded cluster dev-cluster successfully

Try to run the upgrade during off-peak hours: that reduces the impact on the business, and with less load on the system the leader transfers finish faster. Thanks.

Can you confirm this is a real issue, or am I doing something wrong?

It looks like the timeout was reached before the leader transfer completed, so the TiKV instance was shut down directly, making part of the service unavailable. Consider setting the --transfer-timeout parameter on tiup cluster upgrade. The default is 300 seconds (5 minutes): if the leaders have not finished transferring within 5 minutes, the TiKV instance is shut down anyway. If an instance holds many leaders or transfers them slowly, increase this parameter so the transfer can complete.
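From the log above, each store held roughly 1,900 to 2,300 leaders. A back-of-the-envelope check of whether a given --transfer-timeout is long enough, assuming an eviction rate observed from the log (the function, the rate, and the safety factor are all illustrative, not anything tiup computes):

```python
def suggested_transfer_timeout(leader_count, leaders_per_sec, safety=2.0):
    """Rough lower bound for --transfer-timeout, in seconds: the time
    to evict every leader at the observed rate, padded by a safety
    factor. Purely illustrative arithmetic, not a tiup feature."""
    return int(leader_count / leaders_per_sec * safety)

# e.g. ~2316 leaders evicting at ~5 leaders/s would need well over
# the 300 s default before it is safe to restart the store
suggested_transfer_timeout(2316, 5)
```

This is why a store with thousands of leaders can blow past the 5-minute default even when eviction is proceeding normally.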

Upgrade command:
tiup cluster upgrade v4.0.5 --transfer-timeout 100000000

I already set that parameter, and the leaders had all finished transferring.

Do you mean that restarting the TiKV instance still makes the cluster unavailable even after all the leaders have been transferred?
What error does the cluster report when it is unavailable?

Every time a TiKV reaches the "Restarting instance" step, the whole cluster is inaccessible. No error is reported.

How did you determine that it was inaccessible?

  1. Running query load through sysbench
  2. Client tools cannot connect

  1. When the client tool cannot connect, does it hang, or does it return an error?
  2. How many replicas is the cluster configured with?

  1. The client connection hangs, and after a while it errors out saying TiKV is unreachable.
  2. The replica count is 2.

With 2 replicas, it is expected that the cluster cannot serve requests while one TiKV instance restarts: when one instance is down, each Region loses one of its two replicas, and the Region cannot serve requests because the Raft majority requirement cannot be met. To avoid this, set the replica count to at least 3. Then, with one TiKV instance stopped, a majority of each Region's replicas is still alive and the cluster keeps serving.
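The majority rule above can be checked with a one-line computation (a minimal sketch of the Raft quorum condition, not PD's actual code):

```python
def region_available(replicas, failed):
    """A Raft group can serve requests only while a strict majority
    of its replicas is alive."""
    alive = replicas - failed
    return alive > replicas // 2

# 2 replicas, 1 TiKV down: 1 of 2 alive, no majority -> unavailable
# 3 replicas, 1 TiKV down: 2 of 3 alive, majority    -> available
```

This is exactly why the 2-replica cluster in this thread went dark on every "Restarting instance" step even though leader eviction had completed: eviction moves leaders, but with only 2 replicas the restarted store still takes one of the two copies of every Region offline.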


OK, I will adjust the replica count and test again.

Sure. If the problem persists, feel free to report back.

After setting the replica count to 3, the upgrade went through without any problem. :call_me_hand::call_me_hand::call_me_hand:

Thanks for the feedback.