Upgrade from 7.6.0 to 8.1.0 failed

突破边界 · July 11, 2024, 15:10

[TiDB environment] Test
[TiDB version] v7.6.0
[Steps to reproduce] Run the upgrade command:

tiup cluster upgrade tidb-test v8.1.0

The upgrade eventually failed while restarting TiFlash:

Error: failed to restart: 192.168.0.150 tiflash-9000.service, please check the instance's log(/mnt/filemanage/data1/tidb-deploy/tiflash-9000/log) for more detail.: timed out waiting for port 3930 to be started after 2m0s

[Problem: symptoms and impact]

Symptom 1: during the upgrade, tiflash.log contained the following error line:

[2024/07/11 22:48:39.009 +08:00] [ERROR] [LocalAdmissionController.cpp:445] ["watch resource group event failed: read watch stream failed, CANCELLED"] [source=LocalAdmissionController] [thread_id=804]

Symptom 2: during the TiFlash restart step of the upgrade, the last entries in tiflash.log show it stuck in WaitCheckRegionReady:

[2024/07/11 22:49:22.648 +08:00] [INFO] [ReadIndex.cpp:207] ["1 regions need to fetch latest commit-index in next round, sleep for 20.000s"] [source=WaitCheckRegionReady] [thread_id=1]
[2024/07/11 22:49:42.648 +08:00] [INFO] [ReadIndex.cpp:207] ["1 regions need to fetch latest commit-index in next round, sleep for 20.000s"] [source=WaitCheckRegionReady] [thread_id=1]
[2024/07/11 22:50:02.649 +08:00] [INFO] [ReadIndex.cpp:207] ["1 regions need to fetch latest commit-index in next round, sleep for 20.000s"] [source=WaitCheckRegionReady] [thread_id=1]
[2024/07/11 22:50:22.649 +08:00] [INFO] [ReadIndex.cpp:207] ["1 regions need to fetch latest commit-index in next round, sleep for 20.000s"] [source=WaitCheckRegionReady] [thread_id=1]
[2024/07/11 22:50:42.650 +08:00] [INFO] [ReadIndex.cpp:207] ["1 regions need to fetch latest commit-index in next round, sleep for 20.000s"] [source=WaitCheckRegionReady] [thread_id=1]
[2024/07/11 22:51:02.650 +08:00] [INFO] [ReadIndex.cpp:207] ["1 regions need to fetch latest commit-index in next round, sleep for 20.000s"] [source=WaitCheckRegionReady] [thread_id=1]
[2024/07/11 22:51:22.651 +08:00] [INFO] [ReadIndex.cpp:207] ["1 regions need to fetch latest commit-index in next round, sleep for 20.000s"] [source=WaitCheckRegionReady] [thread_id=1]
[2024/07/11 22:51:42.652 +08:00] [INFO] [ReadIndex.cpp:207] ["1 regions need to fetch latest commit-index in next round, sleep for 20.000s"] [source=WaitCheckRegionReady] [thread_id=1]
[2024/07/11 22:52:02.652 +08:00] [INFO] [ReadIndex.cpp:207] ["1 regions need to fetch latest commit-index in next round, sleep for 20.000s"] [source=WaitCheckRegionReady] [thread_id=1]
[2024/07/11 22:52:22.653 +08:00] [INFO] [ReadIndex.cpp:207] ["1 regions need to fetch latest commit-index in next round, sleep for 20.000s"] [source=WaitCheckRegionReady] [thread_id=1]
[2024/07/11 22:52:42.654 +08:00] [INFO] [ReadIndex.cpp:207] ["1 regions need to fetch latest commit-index in next round, sleep for 20.000s"] [source=WaitCheckRegionReady] [thread_id=1]
[2024/07/11 22:53:02.655 +08:00] [WARN] [ReadIndex.cpp:224] ["1 regions CANNOT fetch latest commit-index from TiKV, (region-id): 60200"] [source=WaitCheckRegionReady] [thread_id=1]

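Symptom 3: after the upgrade failed, the database could still be used normally.
One more piece of information: this afternoon I scaled out a TiFlash node on v7.6.0. The deployment succeeded, but operators do not seem to be pushed down to it; see the thread "算子没有下推到TiFlash" (Operators are not pushed down to TiFlash). Both issues involve TiFlash, so I am not sure whether they are related.
[Resource configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of that page
[Attachments: screenshots/logs/monitoring]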

Kongdom · July 12, 2024, 01:09

Run display and check the status and version of the TiFlash nodes in the cluster.

zhaokede · July 12, 2024, 01:56

Was the failed upgrade rolled back automatically?

forever · July 12, 2024, 02:06

What versions are all the components in the cluster on right now?

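xfworld · July 12, 2024, 02:13

You could scale in TiFlash, run the upgrade, and then scale it back out; that works too.

Does WaitCheckRegionReady mean that Region replica synchronization is configured, but the Region data has not yet been properly replicated to TiFlash? That needs checking.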
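
A minimal sketch of one way to start that check, assuming the WARN line format shown in the original post. `stuck_regions` is a hypothetical helper name, and how multiple region IDs are separated is an assumption, since the log above only shows a single ID.

```shell
# Hypothetical helper: read TiFlash log text on stdin and print the region IDs
# that TiFlash reports it cannot fetch the latest commit-index for.
stuck_regions() {
  grep 'CANNOT fetch latest commit-index' \
    | sed -n 's/.*(region-id): \([^"]*\)".*/\1/p' \
    | grep -oE '[0-9]+' \
    | sort -un
}

# Usage (log directory taken from the error message; run on the TiFlash host):
#   stuck_regions < /mnt/filemanage/data1/tidb-deploy/tiflash-9000/log/tiflash.log
#
# Each ID can then be inspected with pd-ctl; <pd-host> is a placeholder:
#   tiup ctl:v7.6.0 pd -u http://<pd-host>:2379 region 60200
```

For the log in the original post this should print only `60200`, which narrows the problem down to a single Region.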

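有猫万事足 · July 12, 2024, 02:53

My impression is that the TiFlash step simply took longer than 2 minutes during the upgrade. I would first increase that timeout and see whether the upgrade completes.

https://docs.pingcap.com/zh/tidb/stable/tiup-component-cluster#--wait-timeoutuint默认-120

Try setting --wait-timeout to 600 and see whether it finishes.

For the upgrade that was interrupted earlier, you can use tiup cluster replay:

When you upgrade or restart a cluster, the operation may occasionally fail because of environmental issues. If you then run the operation again, every step has to be executed from the beginning, which can take a long time on a large cluster. In that case, you can use the tiup cluster replay command to retry the command that just failed and skip the steps that already succeeded.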
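
The two options above can be sketched as follows. The cluster name and target version come from the original post; the `run` dry-run wrapper and the function names are hypothetical, added only so the commands can be previewed before anything touches the cluster.

```shell
# Hypothetical wrapper: print the command unless DRY_RUN=0 is set.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

CLUSTER=tidb-test
TARGET=v8.1.0

# Option 1: run the upgrade with a longer wait timeout (seconds, default 120).
upgrade_with_timeout() {
  run tiup cluster upgrade "$CLUSTER" "$TARGET" --wait-timeout "${1:-600}"
}

# Option 2: retry the failed operation and skip the steps that already succeeded.
# The audit ID is the first column of `tiup cluster audit` for the failed run.
replay_failed() {
  run tiup cluster replay "$1"
}
```

Calling `upgrade_with_timeout` with no argument previews the command with a 600-second timeout; setting `DRY_RUN=0` actually executes it.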

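呢莫不爱吃鱼 · July 12, 2024, 03:08

I have seen quite a few upgrades fail because TiFlash timed out, so here is a hacky idea... What about taking TiFlash offline first and scaling it back out after the upgrade succeeds?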
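
A rough sketch of that workaround, under several assumptions: the TiFlash node ID `192.168.0.150:9000` is inferred from the `tiflash-9000.service` error message, `scale-out.yml` is a hypothetical topology file, and `<tidb-host>` is a placeholder. Before TiFlash is scaled in, tables need their TiFlash replica count reduced, so the hypothetical helper below generates those statements.

```shell
# Hypothetical helper: read "schema<TAB>table" pairs on stdin and print the
# statements that remove the TiFlash replicas of those tables.
gen_drop_replica_sql() {
  while IFS="$(printf '\t')" read -r schema table; do
    [ -n "$schema" ] && [ -n "$table" ] || continue
    printf 'ALTER TABLE `%s`.`%s` SET TIFLASH REPLICA 0;\n' "$schema" "$table"
  done
}

# Possible order of operations (commented out; review before running):
#   mysql -h <tidb-host> -P 4000 -u root -p -N -B -e \
#     "SELECT TABLE_SCHEMA, TABLE_NAME FROM information_schema.tiflash_replica" \
#     | gen_drop_replica_sql          # review the output, then execute it
#   tiup cluster scale-in tidb-test --node 192.168.0.150:9000
#   tiup cluster upgrade tidb-test v8.1.0
#   tiup cluster scale-out tidb-test scale-out.yml
#   # finally, restore the replicas with ALTER TABLE ... SET TIFLASH REPLICA <n>
```

While the replicas are removed, queries that relied on TiFlash run on TiKV instead, so this is probably best kept for a maintenance window.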

这里介绍不了我 · July 12, 2024, 03:12

Isn't this just the upgrade timing out? Either take TiFlash offline before upgrading, or add a timeout: tiup cluster upgrade tidb-test v8.1.0 --wait-timeout 600

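有猫万事足 · July 12, 2024, 03:41

It really does not need to be that complicated. Increasing the upgrade timeout solves most of these problems.

Taking TiFlash offline is not necessarily safe, because some SQL statements will fall back to TiKV when TiFlash is unavailable and may run very inefficiently, which can also make the TP workload unstable.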

T02iDBer_7S8XqKfl · July 12, 2024, 03:45

for more detail.: timed out waiting for port 3930 to be started after 2m0s

Judging from the log, this looks like it was caused by a timeout. You could extend the wait time.

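TiDB社区小助手 · July 12, 2024, 04:53

You can join the upgrade campaign. If you run into problems, you can ask directly in the campaign group and get official technical support!

One-click sign-up: PingCAP Account

Event details: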

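突破边界 · July 12, 2024, 05:32

I tried several approaches last night and none of them worked. Most of the errors said something about 7.6.0 and 8.1.0 not matching, probably because the upgrade had failed halfway. Later I even ran into an I/O error.

Then I rebooted the server and the upgrade succeeded, which is strange.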

突破边界 · July 12, 2024, 05:33

tiup cluster replay is a good command. Until now I had simply been rerunning the upgrade from scratch.

Kongdom · July 13, 2024, 15:11

It is best to run display and check the version of each node to confirm that the upgrade really succeeded.

T02iDBer_7S8XqKfl · July 17, 2024, 02:44

You could adjust the wait timeout for the upgrade.

TiDBer_rvITcue9 · July 17, 2024, 03:12

Good idea.

Kongdom · July 17, 2024, 03:15

A restart solves 99.99% of problems.

这里介绍不了我 · July 17, 2024, 03:24

Rather strange.

tony5413 · July 17, 2024, 03:44

So in the end a restart fixed it.

awakening · July 17, 2024, 05:55

The almighty restart.