Upgrade to TiDB 4.0.3 hangs while restarting the tidb component

tiup version: 1.0.8
TiDB version: 4.0.2

tiup reports the following error:
Still waitting for 16 store leaders to transfer…
Stopping instance xxx.xxx.xxx.188
Stop tikv xxx.xxx.xxx.188:20160 success
Starting instance tikv xxx.xxx.xxx.188:20160
Start tikv xxx.xxx.xxx.188:20160 success
Delete leader evicting scheduler of store 1 success
Removed store leader evicting scheduler from xxx.xxx.xxx.188:20160.
Restarting component tidb
Restarting instance xxx.xxx.xxx.188
retry error: operation timed out after 2m0s
xxx.xxx.xxx.188 failed to restart: timed out waiting for port 4000 to be started after 2m0s

Error: failed to upgrade: failed to restart tidb: xxx.xxx.xxx.188 failed to restart: timed out waiting for port 4000 to be started after 2m0s: timed out waiting for port 4000 to be started after 2m0s

Verbose debug logs has been written to /home/tidb/logs/tiup-cluster-debug-2020-07-31-14-59-02.log.
Error: run `` (wd:/home/tidb/.tiup/data/S6Ib9tH) failed: exit status 1
[tidb@xxh_zentao1 ~]$ vi /home/tidb/logs/tiup-cluster-debug-2020-07-31-14-59-02.log

On inspection, tidb.log contains the following error:
[2020/07/31 14:58:18.610 +08:00] [INFO] [printer.go:33] ["Welcome to TiDB."] ["Release Version"=v4.0.3] [Edition=Community] ["Git Commit Hash"=0529b1b493e46aae71bbe34cbe24515a2eb1b47c] ["Git Branch"=heads/refs/tags/v4.0.3] ["UTC Build Time"="2020-07-24 12:06:35"] [GoVersion=go1.13] ["Race Enabled"=false] ["Check Table Before Drop"=false] ["TiKV Min Version"=v3.0.0-60965b006877ca7234adaced7890d7b029ed1306]
[2020/07/31 14:58:18.611 +08:00] [INFO] [printer.go:47] [“loaded config”] [config="{“host”:“0.0.0.0”,“advertise-address”:“134.200.45.188”,“port”:4000,“cors”:"",“store”:“tikv”,“path”:“xxx.xxx.xxx.188:2379,xxx.xxx.xxx.189:2379,xxx.xxx.xxx.190:2379”,“socket”:"",“lease”:“45s”,“run-ddl”:true,“split-table”:true,“token-limit”:1000,“oom-use-tmp-storage”:true,“tmp-storage-path”:"/tmp/1001_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage",“oom-action”:“log”,“mem-quota-query”:1073741824,“tmp-storage-quota”:-1,“enable-streaming”:false,“enable-batch-dml”:false,“lower-case-table-names”:2,“server-version”:"",“log”:{“level”:“error”,“format”:“text”,“disable-timestamp”:false,“enable-timestamp”:null,“disable-error-stack”:null,“enable-error-stack”:null,“file”:{“filename”:"/data/tidb/deploy/log/tidb.log",“max-size”:300,“max-days”:0,“max-backups”:0},“enable-slow-log”:true,“slow-query-file”:“log/tidb_slow_query.log”,“slow-threshold”:300,“expensive-threshold”:10000,“query-log-max-len”:2048,“record-plan-in-slow-log”:1},“security”:{“skip-grant-table”:false,“ssl-ca”:"",“ssl-cert”:"",“ssl-key”:"",“require-secure-transport”:false,“cluster-ssl-ca”:"",“cluster-ssl-cert”:"",“cluster-ssl-key”:"",“cluster-verify-cn”:null},“status”:{“status-host”:“0.0.0.0”,“metrics-addr”:"",“status-port”:10080,“metrics-interval”:15,“report-status”:true,“record-db-qps”:false},“performance”:{“max-procs”:0,“max-memory”:0,“stats-lease”:“3s”,“stmt-count-limit”:5000,“feedback-probability”:0.05,“query-feedback-limit”:1024,“pseudo-estimate-ratio”:0.8,“force-priority”:“NO_PRIORITY”,“bind-info-lease”:“3s”,“txn-total-size-limit”:104857600,“tcp-keep-alive”:true,“cross-join”:true,“run-auto-analyze”:true,“agg-push-down-join”:false,“committer-concurrency”:16,“max-txn-ttl”:600000},“prepared-plan-cache”:{“enabled”:false,“capacity”:100,“memory-guard-ratio”:0.1},“opentracing”:{“enable”:false,“rpc-metrics”:false,“sampler”:{“type”:“const”,“param”:1,“sampling-server-url”:"",“max-operations”:0,“sampling-refresh-interval”:0},“reporter”
:{“queue-size”:0,“buffer-flush-interval”:0,“log-spans”:false,“local-agent-host-port”:""}},“proxy-protocol”:{“networks”:"",“header-timeout”:5},“tikv-client”:{“grpc-connection-count”:16,“grpc-keepalive-time”:10,“grpc-keepalive-timeout”:3,“commit-timeout”:“41s”,“max-batch-size”:128,“overload-threshold”:200,“max-batch-wait-time”:0,“batch-wait-size”:8,“enable-chunk-rpc”:true,“region-cache-ttl”:600,“store-limit”:0,“store-liveness-timeout”:“5s”,“copr-cache”:{“enable”:false,“capacity-mb”:1000,“admission-max-result-mb”:10,“admission-min-process-ms”:5}},“binlog”:{“enable”:false,“ignore-error”:false,“write-timeout”:“15s”,“binlog-socket”:"",“strategy”:“range”},“compatible-kill-query”:false,“plugin”:{“dir”:"",“load”:""},“pessimistic-txn”:{“enable”:true,“max-retry-count”:256},“check-mb4-value-in-utf8”:true,“max-index-length”:3072,“alter-primary-key”:false,“treat-old-version-utf8-as-utf8mb4”:true,“enable-table-lock”:false,“delay-clean-table-lock”:0,“split-region-max-num”:1000,“stmt-summary”:{“enable”:true,“enable-internal-query”:false,“max-stmt-count”:200,“max-sql-length”:4096,“refresh-interval”:1800,“history-size”:24},“repair-mode”:false,“repair-table-list”:[],“isolation-read”:{“engines”:[“tikv”,“tiflash”,“tidb”]},“max-server-connections”:0,“new_collations_enabled_on_first_bootstrap”:false,“experimental”:{“allow-expression-index”:false},“enable-collect-execution-info”:true,“skip-register-to-dashboard”:false,“enable-telemetry”:true}"]
[2020/07/31 14:58:18.926 +08:00] [FATAL] [terror.go:348] ["unexpected error"] [error="[variable:1231]Variable 'tidb_index_lookup_join_concurrency' can't be set to the value of '-1'"] [stack="github.com/pingcap/log.Fatal\n\t/home/jenkins/agent/workspace/tidb_v4.0.3/go/pkg/mod/github.com/pingcap/log@v0.0.0-20200511115504-543df19646ad/global.go:59\ngithub.com/pingcap/parser/terror.MustNil\n\t/home/jenkins/agent/workspace/tidb_v4.0.3/go/pkg/mod/github.com/pingcap/parser@v0.0.0-20200623164729-3a18f1e5dceb/terror/terror.go:348\nmain.createStoreAndDomain\n\t/home/jenkins/agent/workspace/tidb_v4.0.3/go/src/github.com/pingcap/tidb/tidb-server/main.go:296\nmain.main\n\t/home/jenkins/agent/workspace/tidb_v4.0.3/go/src/github.com/pingcap/tidb/tidb-server/main.go:181\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"]
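
The FATAL entry in the log names both the system variable and the value that TiDB refused at startup. As a rough aid for spotting this in a large tidb.log, here is a minimal Python sketch that pulls those two fields out of such lines. The function name find_rejected_variables and the regular expression are illustrative assumptions based only on the message text shown in this thread, not part of TiDB or tiup. The pattern accepts straight and typographic quotes, since pasted logs often have them rewritten.

```python
import re

# Straight or typographic single quote (forum software often rewrites quotes).
Q = "['\u2018\u2019]"

# Assumption: the message format is exactly the one shown in this thread.
FATAL_VARIABLE = re.compile(
    r"\[FATAL\].*?\[variable:(\d+)\]Variable " + Q + r"(\w+)" + Q
    + r" can" + Q + r"t be set to the value of " + Q + r"(-?\w+)" + Q
)


def find_rejected_variables(log_lines):
    """Return (error_code, variable_name, value) for each matching FATAL line."""
    found = []
    for line in log_lines:
        match = FATAL_VARIABLE.search(line)
        if match:
            found.append((int(match.group(1)), match.group(2), match.group(3)))
    return found


# Shortened copy of the FATAL line from this thread (stack trace omitted).
sample = (
    "[2020/07/31 14:58:18.926 +08:00] [FATAL] [terror.go:348] "
    "[\"unexpected error\"] [error=\"[variable:1231]Variable "
    "'tidb_index_lookup_join_concurrency' can't be set to the value of '-1'\"]"
)

print(find_rejected_variables([sample]))
```

To scan a real file, the same function can be given an open file object, for example find_rejected_variables(open("/data/tidb/deploy/log/tidb.log", encoding="utf-8")).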

Check the configuration file for parameters that are set to invalid values, adjust them, and then try starting again.

How do I check that? I could not find this parameter. The cluster was upgraded along the path 3.0.9 (installed with ansible) -> 4.0.0 (tiup) -> ... -> 4.0.3 (tiup).

On a TiDB instance that has not been upgraded yet, run show global variables like 'tidb_index_lookup_join_concurrency'. If the value is -1, change it to the default value given in the documentation.

show variables like '%tidb_index_lookup_concurrency%';

set @@session.tidb_index_lookup_concurrency=4;
set @@global.tidb_index_lookup_concurrency=4;

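I have already set it to the default value on a TiDB instance that has not been upgraded, but re-running tiup cluster upgrade test-cluster v4.0.3 still fails with the same error. The output of tiup cluster display test-cluster is: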

[tidb@xxh_zentao1 ~]$ tiup cluster display test-cluster
Starting component cluster: display test-cluster
TiDB Cluster: test-cluster
TiDB Version: v4.0.2
ID                     Role          Host             Ports        OS/Arch       Status  Data Dir                                        Deploy Dir
xxx.xxx.xxx.190:9093   alertmanager  xxx.xxx.xxx.190  9093/9094    linux/x86_64  Up      /data/tidb/deploy/data.alertmanager             /data/tidb/deploy
xxx.xxx.xxx.190:3000   grafana       xxx.xxx.xxx.190  3000         linux/x86_64  Up      -                                               /data/tidb/deploy
xxx.xxx.xxx.188:2379   pd            xxx.xxx.xxx.188  2379/2380    linux/x86_64  Up      /data/tidb/deploy/data.pd                       /data/tidb/deploy
xxx.xxx.xxx.189:2379   pd            xxx.xxx.xxx.189  2379/2380    linux/x86_64  Up|L    /data/tidb/deploy/data.pd                       /data/tidb/deploy
xxx.xxx.xxx.190:2379   pd            xxx.xxx.xxx.190  2379/2380    linux/x86_64  Up      /data/tidb/deploy/data.pd                       /data/tidb/deploy
xxx.xxx.xxx.190:9090   prometheus    xxx.xxx.xxx.190  9090         linux/x86_64  Up      /data/tidb/deploy/prometheus2.0.0.data.metrics  /data/tidb/deploy
xxx.xxx.xxx.188:4000   tidb          xxx.xxx.xxx.188  4000/10080   linux/x86_64  Down    -                                               /data/tidb/deploy
xxx.xxx.xxx.189:4000   tidb          xxx.xxx.xxx.189  4000/10080   linux/x86_64  Up      -                                               /data/tidb/deploy
xxx.xxx.xxx.190:4000   tidb          xxx.xxx.xxx.190  4000/10080   linux/x86_64  Up      -                                               /data/tidb/deploy
xxx.xxx.xxx.188:20160  tikv          xxx.xxx.xxx.188  20160/20180  linux/x86_64  Up      /data/tidb/deploy/data                          /data/tidb/deploy
xxx.xxx.xxx.189:20160  tikv          xxx.xxx.xxx.189  20160/20180  linux/x86_64  Up      /data/tidb/deploy/data                          /data/tidb/deploy
xxx.xxx.xxx.190:20160  tikv          xxx.xxx.xxx.190  20160/20180  linux/x86_64  Up      /data/tidb/deploy/data                          /data/tidb/deploy

First try stop -N on the failing node, then start -N to bring it back up.

If it still fails, please post the failing node's log and its conf configuration file.

I ran tiup cluster stop test-cluster -N xxx.xxx.xxx.188:4000 and then tiup cluster start test-cluster -N xxx.xxx.xxx.188:4000. tidb.log still shows the following error:

[2020/07/31 15:32:22.268 +08:00] [INFO] [printer.go:33] ["Welcome to TiDB."] ["Release Version"=v4.0.3] [Edition=Community] ["Git Commit Hash"=0529b1b493e46aae71bbe34cbe24515a2eb1b47c] ["Git Branch"=heads/refs/tags/v4.0.3] ["UTC Build Time"="2020-07-24 12:06:35"] [GoVersion=go1.13] ["Race Enabled"=false] ["Check Table Before Drop"=false] ["TiKV Min Version"=v3.0.0-60965b006877ca7234adaced7890d7b029ed1306]
[2020/07/31 15:32:22.269 +08:00] [INFO] [printer.go:47] [“loaded config”] [config="{“host”:“0.0.0.0”,“advertise-address”:“xxx.xxx.xxx.188”,“port”:4000,“cors”:"",“store”:“tikv”,“path”:“xxx.xxx.xxx.188:2379,xxx.xxx.xxx.189:2379,xxx.xxx.xxx.190:2379”,“socket”:"",“lease”:“45s”,“run-ddl”:true,“split-table”:true,“token-limit”:1000,“oom-use-tmp-storage”:true,“tmp-storage-path”:"/tmp/1001_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage",“oom-action”:“log”,“mem-quota-query”:1073741824,“tmp-storage-quota”:-1,“enable-streaming”:false,“enable-batch-dml”:false,“lower-case-table-names”:2,“server-version”:"",“log”:{“level”:“error”,“format”:“text”,“disable-timestamp”:false,“enable-timestamp”:null,“disable-error-stack”:null,“enable-error-stack”:null,“file”:{“filename”:"/data/tidb/deploy/log/tidb.log",“max-size”:300,“max-days”:0,“max-backups”:0},“enable-slow-log”:true,“slow-query-file”:“log/tidb_slow_query.log”,“slow-threshold”:300,“expensive-threshold”:10000,“query-log-max-len”:2048,“record-plan-in-slow-log”:1},“security”:{“skip-grant-table”:false,“ssl-ca”:"",“ssl-cert”:"",“ssl-key”:"",“require-secure-transport”:false,“cluster-ssl-ca”:"",“cluster-ssl-cert”:"",“cluster-ssl-key”:"",“cluster-verify-cn”:null},“status”:{“status-host”:“0.0.0.0”,“metrics-addr”:"",“status-port”:10080,“metrics-interval”:15,“report-status”:true,“record-db-qps”:false},“performance”:{“max-procs”:0,“max-memory”:0,“stats-lease”:“3s”,“stmt-count-limit”:5000,“feedback-probability”:0.05,“query-feedback-limit”:1024,“pseudo-estimate-ratio”:0.8,“force-priority”:“NO_PRIORITY”,“bind-info-lease”:“3s”,“txn-total-size-limit”:104857600,“tcp-keep-alive”:true,“cross-join”:true,“run-auto-analyze”:true,“agg-push-down-join”:false,“committer-concurrency”:16,“max-txn-ttl”:600000},“prepared-plan-cache”:{“enabled”:false,“capacity”:100,“memory-guard-ratio”:0.1},“opentracing”:{“enable”:false,“rpc-metrics”:false,“sampler”:{“type”:“const”,“param”:1,“sampling-server-url”:"",“max-operations”:0,“sampling-refresh-interval”:0},“reporter
”:{“queue-size”:0,“buffer-flush-interval”:0,“log-spans”:false,“local-agent-host-port”:""}},“proxy-protocol”:{“networks”:"",“header-timeout”:5},“tikv-client”:{“grpc-connection-count”:16,“grpc-keepalive-time”:10,“grpc-keepalive-timeout”:3,“commit-timeout”:“41s”,“max-batch-size”:128,“overload-threshold”:200,“max-batch-wait-time”:0,“batch-wait-size”:8,“enable-chunk-rpc”:true,“region-cache-ttl”:600,“store-limit”:0,“store-liveness-timeout”:“5s”,“copr-cache”:{“enable”:false,“capacity-mb”:1000,“admission-max-result-mb”:10,“admission-min-process-ms”:5}},“binlog”:{“enable”:false,“ignore-error”:false,“write-timeout”:“15s”,“binlog-socket”:"",“strategy”:“range”},“compatible-kill-query”:false,“plugin”:{“dir”:"",“load”:""},“pessimistic-txn”:{“enable”:true,“max-retry-count”:256},“check-mb4-value-in-utf8”:true,“max-index-length”:3072,“alter-primary-key”:false,“treat-old-version-utf8-as-utf8mb4”:true,“enable-table-lock”:false,“delay-clean-table-lock”:0,“split-region-max-num”:1000,“stmt-summary”:{“enable”:true,“enable-internal-query”:false,“max-stmt-count”:200,“max-sql-length”:4096,“refresh-interval”:1800,“history-size”:24},“repair-mode”:false,“repair-table-list”:[],“isolation-read”:{“engines”:[“tikv”,“tiflash”,“tidb”]},“max-server-connections”:0,“new_collations_enabled_on_first_bootstrap”:false,“experimental”:{“allow-expression-index”:false},“enable-collect-execution-info”:true,“skip-register-to-dashboard”:false,“enable-telemetry”:true}"]
[2020/07/31 15:32:22.617 +08:00] [FATAL] [terror.go:348] ["unexpected error"] [error="[variable:1231]Variable 'tidb_index_lookup_join_concurrency' can't be set to the value of '-1'"] [stack="github.com/pingcap/log.Fatal\n\t/home/jenkins/agent/workspace/tidb_v4.0.3/go/pkg/mod/github.com/pingcap/log@v0.0.0-20200511115504-543df19646ad/global.go:59\ngithub.com/pingcap/parser/terror.MustNil\n\t/home/jenkins/agent/workspace/tidb_v4.0.3/go/pkg/mod/github.com/pingcap/parser@v0.0.0-20200623164729-3a18f1e5dceb/terror/terror.go:348\nmain.createStoreAndDomain\n\t/home/jenkins/agent/workspace/tidb_v4.0.3/go/src/github.com/pingcap/tidb/tidb-server/main.go:296\nmain.main\n\t/home/jenkins/agent/workspace/tidb_v4.0.3/go/src/github.com/pingcap/tidb/tidb-server/main.go:181\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"]

The tidb configuration file from the conf directory is attached: tidb (2).toml (1.7 KB)

Could you share the tiup command you ran for the upgrade?

tiup cluster upgrade test-cluster v4.0.3

What command did you use when upgrading to 4.0.0?

tiup cluster upgrade test-cluster v4.0.0

SELECT HIGH_PRIORITY VARIABLE_VALUE FROM mysql.tidb WHERE VARIABLE_NAME="tidb_server_version";

What does this query return?

The query returns: 48

Was the cluster ever upgraded to master?

After the earlier upgrade to 4.0.2, I applied hot patches to tidb and tiflash (the TiFlash join build) in order to test TiFlash's join feature. The tidb patch version is v4.0.0-beta.2-842-ge5e8cdd89

Use update to change this value to 47, then try starting again.

What is the correct statement for that change?

update mysql.tidb set VARIABLE_VALUE=47 where VARIABLE_NAME="tidb_server_version";

After changing it to 47, I ran tiup cluster stop test-cluster -N xxx.xxx.xxx.188:4000 and then tiup cluster start test-cluster -N xxx.xxx.xxx.188:4000. tidb.log still shows the same error as before, and the instance does not start.

tiup cluster display test-cluster shows the node in the Down state.

  1. Change the value to 47.

  2. From a machine that is still running normally, execute:
    set global tidb_index_lookup_concurrency=4;
    set global tidb_index_lookup_join_concurrency=4;
    set global tidb_hashagg_final_concurrency=4;
    set global tidb_hashagg_partial_concurrency=4;
    set global tidb_window_concurrency=4;
    set global tidb_projection_concurrency=4;
    set global tidb_hash_join_concurrency=4;

  3. Start the failed instance again.
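
Step 2 resets every concurrency variable regardless of its current value. If you would rather touch only the ones that actually hold -1, the following Python sketch generates the statements from a snapshot of the current values. The variable names and the target value 4 are copied from the steps above. The function build_reset_statements and the snapshot dictionary are hypothetical, and you would fill the dictionary from show global variables on an instance that is still up.

```python
# Variable names and the target value 4 are copied from the steps above.
CONCURRENCY_VARIABLES = (
    "tidb_index_lookup_concurrency",
    "tidb_index_lookup_join_concurrency",
    "tidb_hashagg_final_concurrency",
    "tidb_hashagg_partial_concurrency",
    "tidb_window_concurrency",
    "tidb_projection_concurrency",
    "tidb_hash_join_concurrency",
)


def build_reset_statements(current_values, target=4):
    """Return 'set global' statements for listed variables whose value is -1.

    current_values maps a variable name to its value as a string, the way
    'show global variables' reports it. Variables missing from the mapping
    are left alone.
    """
    statements = []
    for name in CONCURRENCY_VARIABLES:
        if current_values.get(name, "").strip() == "-1":
            statements.append(f"set global {name}={target};")
    return statements


# Hypothetical snapshot, for illustration only (not taken from this cluster).
snapshot = {
    "tidb_index_lookup_concurrency": "4",
    "tidb_index_lookup_join_concurrency": "-1",
    "tidb_window_concurrency": "-1",
}

for statement in build_reset_statements(snapshot):
    print(statement)
```

The printed statements can then be run from a MySQL client connected to a healthy tidb instance before restarting the failed one.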