Upgrade from v4.0.0-rc to v4.0.0 fails: failed to upgrade: failed to restart tiflash

Running the upgrade reports the following error:

tiup cluster upgrade tidb-test v4.0.0

Restarting component tiflash
        Restarting instance 172.18.181.6
retry error: operation timed out after 1m0s
        172.18.181.6 failed to restart: timed out waiting for port 9000 to be started after 1m0s

Error: failed to upgrade: failed to restart tiflash:    192.168.181.6 failed to restart: timed out waiting for port 9000 to be started after 1m0s: timed out waiting for port 9000 to be started after 1m0s

Verbose debug logs has been written to /home/tidb/logs/tiup-cluster-debug-2020-06-29-20-26-05.log.
Error: run `/home/tidb/.tiup/components/cluster/v1.0.7/tiup-cluster` (wd:/home/tidb/.tiup/data/S3IqRWN) failed: exit status 1
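To reproduce the "timed out waiting for port 9000" check by hand, something like the following could be run from the control machine. This is only a sketch: the function names `port_open` and `wait_for_port` are made up for illustration, it relies on bash's `/dev/tcp` pseudo-device, and it is not how tiup itself implements the check.

```shell
#!/usr/bin/env bash
# Sketch only: poll a TCP port until it accepts connections or a deadline passes.
# Function names are hypothetical; this is not tiup's own implementation.

port_open() {
    # Usage: port_open HOST PORT ; returns 0 if a TCP connection succeeds.
    local host="$1" port="$2"
    # bash's /dev/tcp pseudo-device opens a TCP connection; the subshell
    # closes the descriptor again as soon as it exits.
    (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

wait_for_port() {
    # Usage: wait_for_port HOST PORT TIMEOUT_SECONDS
    local host="$1" port="$2" timeout="$3" waited=0
    while [ "$waited" -lt "$timeout" ]; do
        if port_open "$host" "$port"; then
            echo "port ${port} on ${host} is up after ${waited}s"
            return 0
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "timed out waiting for port ${port} on ${host} after ${timeout}s"
    return 1
}
```

For this cluster the call would be `wait_for_port 172.18.181.6 9000 60`. If it also times out, the TiFlash process most likely never started listening, and the cause has to be looked for in the TiFlash logs rather than in tiup.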

The logs on the TiFlash host show the following errors:

[tidb@dev7 tidb-deploy]$ cd tiflash-9000/
[tidb@dev7 tiflash-9000]$ ll
total 4
drwxr-xr-x 3 tidb tidb   21 Jun 29 20:24 bin
drwxr-xr-x 3 tidb tidb   21 Jun 29 20:24 bin.old.v4.0.0-rc
drwxr-xr-x 2 tidb tidb   87 Jun 29 21:31 conf
drwxr-xr-x 2 tidb tidb 4096 Jun 29 20:24 log
drwxr-xr-x 2 tidb tidb   28 Jun 23 03:56 scripts
[tidb@dev7 tiflash-9000]$ pwd
/tidb-deploy/tiflash-9000
[tidb@dev7 tiflash-9000]$ ls /log
tiflash_cluster_manager.log           tiflash.log.11                        tiflash.log.2                         tiflash.log.8
tiflash_error.log                     tiflash.log.12                        tiflash.log.3                         tiflash.log.9
tiflash.log                           tiflash.log.13                        tiflash.log.4                         tiflash_tikv.log
tiflash.log.0                         tiflash.log.14                        tiflash.log.5                         tiflash_tikv.log.2020-06-29-12:24:56
tiflash.log.1                         tiflash.log.15                        tiflash.log.6
tiflash.log.10                        tiflash.log.16                        tiflash.log.7
[tidb@dev7 tiflash-9000]$ tail -1000f tiflash_error.log
tail: cannot open 'tiflash_error.log' for reading: No such file or directory
tail: no files remaining
[tidb@dev7 tiflash-9000]$ tail -1000f log/tiflash_error.log
2020.06.23 04:03:38.491313 [ 7 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 77 store_id: 5)
2020.06.23 04:04:11.867121 [ 17 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 7 store_id: 1)
2020.06.23 04:04:40.480288 [ 17 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 77 store_id: 5)
2020.06.23 04:06:57.043714 [ 17 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 7 store_id: 1)
2020.06.23 04:17:02.641782 [ 5 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 77 store_id: 5)
2020.06.23 04:17:18.717236 [ 5 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 7 store_id: 1)
2020.06.23 04:17:36.378698 [ 5 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 77 store_id: 5)
2020.06.23 04:19:51.718644 [ 5 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 7 store_id: 1)
2020.06.23 04:23:20.400907 [ 5 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 77 store_id: 5)
2020.06.23 04:35:59.248611 [ 11 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 7 store_id: 1)
2020.06.23 04:41:09.146777 [ 17 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 46 store_id: 4)
2020.06.23 06:31:10.635586 [ 5 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 7 store_id: 1)
2020.06.23 08:12:14.881873 [ 20 ] <Warning> pingcap.tikv: region {6,5,2} find error: EpochNotMatch current epoch of region 6 is conf_ver: 5 version: 3, but you sent conf_ver: 5 version: 2
2020.06.24 14:27:05.928733 [ 9 ] <Error> pingcap.tikv: Get Failed4: Deadline Exceeded
2020.06.24 14:27:11.022742 [ 9 ] <Error> pingcap.tikv: Get Failed4: Deadline Exceeded
2020.06.24 14:27:14.024087 [ 9 ] <Error> pingcap.tikv: Get Failed14: Connection reset by peer
2020.06.24 14:27:14.332150 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:15.332264 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:16.756154 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:19.813190 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:24.022110 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:25.869636 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:27.580239 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:29.662999 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:31.983739 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:33.821525 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:36.055393 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:37.711725 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:40.205996 [ 9 ] <Warning> SchemaSyncService: Schema sync failed by Exception: Get Failed14: failed to connect to all addresses
2020.06.24 14:29:45.775965 [ 4 ] <Warning> pingcap.tikv: region {99561,5,3} find error: peer is not leader for region 99561, leader may Some(id: 99564 store_id: 5)

How should this situation be resolved? It is fairly urgent. Thanks in advance!

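Hi,

Please provide tiflash_cluster_manager.log and tiflash_tikv.log as text files so we can take a look. For a quick fix, you can scale in the TiFlash node first and scale it back out after the upgrade succeeds.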
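The quick workaround described above could look roughly like the sequence printed by this sketch. The helper `tiflash_workaround_plan` is hypothetical and only prints commands so they can be reviewed before running; `scale-out.yaml` is an assumed file name for a topology file that describes the TiFlash node to add back.

```shell
#!/usr/bin/env bash
# Sketch only: print (do not run) the scale-in / upgrade / scale-out sequence.
# The function name and the topology file name are illustrative assumptions.

tiflash_workaround_plan() {
    # Usage: tiflash_workaround_plan CLUSTER NODE VERSION SCALE_OUT_TOPOLOGY
    local cluster="$1" node="$2" version="$3" topo="$4"
    echo "tiup cluster scale-in ${cluster} --node ${node}"
    echo "tiup cluster upgrade ${cluster} ${version}"
    echo "tiup cluster scale-out ${cluster} ${topo}"
}
```

For example, `tiflash_workaround_plan tidb-test 172.18.181.6:9000 v4.0.0 scale-out.yaml` prints the three commands in order. Use the cluster name that `tiup cluster list` actually shows for your deployment.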

Thanks a lot. If I remove TiFlash, will all the table replicas created on TiFlash disappear?
tiflash_cluster_manager.log (388 bytes) tiflash_tikv.log (21.6 KB)

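Logs received.

What is meant here is scaling in TiFlash. Yes, the replicas will be gone, and you will need to add them again with ALTER afterwards.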
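Since the replicas have to be added again by hand, it may help to save the current replica settings before scaling in. The sketch below assumes the rows come from `information_schema.tiflash_replica` (columns `TABLE_SCHEMA`, `TABLE_NAME`, `REPLICA_COUNT`) exported as tab-separated text; the function name `gen_replica_sql` and the file names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: turn saved rows "schema<TAB>table<TAB>replica_count" into
# ALTER TABLE ... SET TIFLASH REPLICA statements. Names are hypothetical.

gen_replica_sql() {
    # Reads tab-separated rows on stdin.
    # Optional $1 overrides the replica count (for example 0 to drop replicas).
    local override="${1:-}"
    while IFS="$(printf '\t')" read -r schema table count; do
        [ -n "$schema" ] || continue   # skip empty lines
        printf 'ALTER TABLE `%s`.`%s` SET TIFLASH REPLICA %s;\n' \
            "$schema" "$table" "${override:-$count}"
    done
}

# Assumed usage (TiDB address taken from the display output; user is an assumption):
#   mysql -h 172.18.180.59 -P 4000 -u root -N -B -e \
#     "SELECT TABLE_SCHEMA, TABLE_NAME, REPLICA_COUNT FROM information_schema.tiflash_replica" \
#     > replicas.tsv
#   gen_replica_sql 0 < replicas.tsv > drop_replicas.sql    # before scale-in
#   gen_replica_sql   < replicas.tsv > readd_replicas.sql   # after scale-out
```

Review the generated SQL before applying it; the statements are plain text and nothing is executed by the helper itself.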

  1. Please provide the debug log of the upgrade, so we can confirm the tiup version.
  2. Run the following command and post the result: tiup ctl pd -u http://172.16.4.107:12379 config show | grep enable-placement-rules

1. This is the debug log of the upgrade: tiup-cluster-debug-2020-06-29-20-26-05.log (145.5 KB)
2. Running tiup ctl pd -u http://172.16.4.107:12379 config show | grep enable-placement-rules produced no response.

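:rofl: Change the address and port to those of your own PD.

  1. Post the output of display so we can see the current cluster status.
  2. Please also upload the tiflash.log file; it was missed last time.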

:sweat_smile: The result is "enable-placement-rules": "true",
It was already set to true in topology.yaml at deployment time.
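For anyone repeating this check, a small wrapper like the one below could make the result unambiguous. It is a sketch: `placement_rules_enabled` is a hypothetical helper that only inspects text on stdin, and the PD address in the usage comment is the leader from the display output in this thread.

```shell
#!/usr/bin/env bash
# Sketch only: succeed if `config show` output has enable-placement-rules set
# to true. Accepts both the quoted form seen in this thread ("true") and an
# unquoted boolean (true). The function name is hypothetical.

placement_rules_enabled() {
    grep -Eq '"enable-placement-rules":[[:space:]]*"?true"?'
}

# Assumed usage against this cluster's PD:
#   tiup ctl pd -u http://172.18.180.58:2379 config show \
#     | placement_rules_enabled && echo "placement rules enabled"
```

An empty result, as happened with the example address 172.16.4.107:12379, usually just means the command could not reach a PD at that address.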

OK. Please go through the items above and report back.

display output:

Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.0.7/tiup-cluster display tidb-dev
TiDB Cluster: tidb-dev
TiDB Version: v4.0.0-rc
ID                    Role          Host            Ports                            OS/Arch       Status  Data Dir                      Deploy Dir
--                    ----          ----            -----                            -------       ------  --------                      ----------
172.18.180.58:9093   alertmanager  172.18.180.58  9093/9094                        linux/x86_64  Up      /tidb-data/alertmanager-9093  /tidb-deploy/alertmanager-9093
172.18.180.58:3000   grafana       172.18.180.58  3000                             linux/x86_64  Up      -                             /tidb-deploy/grafana-3000
172.18.180.58:2379   pd            172.18.180.58  2379/2380                        linux/x86_64  Up|L    /tidb-data/pd-2379            /tidb-deploy/pd-2379
172.18.180.59:2379   pd            172.18.180.59  2379/2380                        linux/x86_64  Up      /tidb-data/pd-2379            /tidb-deploy/pd-2379
172.18.181.57:2379   pd            172.18.181.57  2379/2380                        linux/x86_64  Up|UI   /tidb-data/pd-2379            /tidb-deploy/pd-2379
172.18.180.58:9090   prometheus    172.18.180.58  9090                             linux/x86_64  Up      /tidb-data/prometheus-9090    /tidb-deploy/prometheus-9090
172.18.180.59:4000   tidb          172.18.180.59  4000/10080                       linux/x86_64  Up      -                             /tidb-deploy/tidb-4000
172.18.181.57:4000   tidb          172.18.181.57  4000/10080                       linux/x86_64  Up      -                             /tidb-deploy/tidb-4000
172.18.181.6:9000    tiflash       172.18.181.6   9000/8123/3930/20170/20292/8234  linux/x86_64  Up      /tidb-data/tiflash-9000       /tidb-deploy/tiflash-9000
172.18.172.34:20160  tikv          172.18.172.34  20160/20180                      linux/x86_64  Up      /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
172.18.172.35:20160  tikv          172.18.172.35  20160/20180                      linux/x86_64  Up      /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
172.18.172.36:20160  tikv          172.18.172.36  20160/20180                      linux/x86_64  Up      /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
[tidb@test4 ~]$

TiFlash error log:

[tidb@dev7 log]$ tail -1000f tiflash_error.log
2020.06.23 04:03:38.491313 [ 7 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 77 store_id: 5)
2020.06.23 04:04:11.867121 [ 17 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 7 store_id: 1)
2020.06.23 04:04:40.480288 [ 17 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 77 store_id: 5)
2020.06.23 04:06:57.043714 [ 17 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 7 store_id: 1)
2020.06.23 04:17:02.641782 [ 5 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 77 store_id: 5)
2020.06.23 04:17:18.717236 [ 5 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 7 store_id: 1)
2020.06.23 04:17:36.378698 [ 5 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 77 store_id: 5)
2020.06.23 04:19:51.718644 [ 5 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 7 store_id: 1)
2020.06.23 04:23:20.400907 [ 5 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 77 store_id: 5)
2020.06.23 04:35:59.248611 [ 11 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 7 store_id: 1)
2020.06.23 04:41:09.146777 [ 17 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 46 store_id: 4)
2020.06.23 06:31:10.635586 [ 5 ] <Warning> pingcap.tikv: region {6,5,2} find error: peer is not leader for region 6, leader may Some(id: 7 store_id: 1)
2020.06.23 08:12:14.881873 [ 20 ] <Warning> pingcap.tikv: region {6,5,2} find error: EpochNotMatch current epoch of region 6 is conf_ver: 5 version: 3, but you sent conf_ver: 5 version: 2
2020.06.24 14:27:05.928733 [ 9 ] <Error> pingcap.tikv: Get Failed4: Deadline Exceeded
2020.06.24 14:27:11.022742 [ 9 ] <Error> pingcap.tikv: Get Failed4: Deadline Exceeded
2020.06.24 14:27:14.024087 [ 9 ] <Error> pingcap.tikv: Get Failed14: Connection reset by peer
2020.06.24 14:27:14.332150 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:15.332264 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:16.756154 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:19.813190 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:24.022110 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:25.869636 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:27.580239 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:29.662999 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:31.983739 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:33.821525 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:36.055393 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:37.711725 [ 9 ] <Error> pingcap.tikv: Get Failed14: failed to connect to all addresses
2020.06.24 14:27:40.205996 [ 9 ] <Warning> SchemaSyncService: Schema sync failed by Exception: Get Failed14: failed to connect to all addresses
2020.06.24 14:29:45.775965 [ 4 ] <Warning> pingcap.tikv: region {99561,5,3} find error: peer is not leader for region 99561, leader may Some(id: 99564 store_id: 5)
^C
[tidb@dev7 log]$

Thanks for your help :smiley:

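Thanks for providing the information. We still need to look at tiflash.log itself (for example with more tiflash.log); please upload it as a text file, which makes troubleshooting easier. The error log has already been posted above, so what we need to check now is whether anything useful was logged while TiFlash was being stopped and started.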
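To narrow tiflash.log down to the restart window before uploading, a filter along these lines might help. It is a sketch: `log_errors_on` is a hypothetical helper, it assumes the same line format as the error log above (timestamp first, level in angle brackets), and matching `<Fatal>` in addition to `<Error>` is an assumption about which levels appear.

```shell
#!/usr/bin/env bash
# Sketch only: print <Error>/<Fatal> lines whose timestamp starts with a date.
# Assumes lines look like: "2020.06.24 14:27:05.928733 [ 9 ] <Error> ..."

log_errors_on() {
    # Usage: log_errors_on YYYY.MM.DD < tiflash.log
    local day="$1"
    # index(...) == 1 means the line begins with the given date string.
    awk -v day="$day" 'index($0, day) == 1 && /<(Error|Fatal)>/'
}

# Assumed usage for the day of the upgrade (date taken from the debug log name):
#   log_errors_on 2020.06.29 < /tidb-deploy/tiflash-9000/log/tiflash.log
```

Note that the last entry in the tiflash_error.log posted above is dated 2020.06.24, several days before the upgrade attempt on 2020.06.29, which is why the error log alone does not explain the failed restart.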