tiup: error adding a new tidb instance

【TiDB Environment】Production / Test / PoC
【TiDB Version】
【Reproduction Path】Operations performed before the issue appeared

  1. tiup cluster scale-out xx tidb_flash_scaleout.yaml
  2. YAML content:
[root@xx install_dir]# cat tidb_flash_scaleout.yaml 
tidb_servers:
  - host: 10.29.0.20
tiflash_servers:
  - host: 10.29.0.20

【Problem Encountered: Symptoms and Impact】
tiup cluster display xx

10.29.0.19:9000    tiflash       10.29.0.19   9000/8123/3930/20170/20292/8234  linux/x86_64  Tombstone  /data/tidb-data/tiflash-9000       /data/tidb-deploy/tiflash-9000
10.29.0.20:9000    tiflash       10.29.0.20   9000/8123/3930/20170/20292/8234  linux/x86_64  N/A        /data/tidb-data/tiflash-9000       /data/tidb-deploy/tiflash-9000

Error: failed to start tidb: failed to start: 10.29.0.20 tidb-4000.service, please check the instance's log(/data/tidb-deploy/tidb-4000/log) for more detail.: timed out waiting for port 4000 to be started after 2m0s

Verbose debug logs has been written to /root/.tiup/logs/tiup-cluster-debug-2022-11-15-17-36-57.log.


[2022/11/15 17:35:58.383 +08:00] [INFO] [client.go:687] ["[pd] tso dispatcher created"] [dc-location=global]
[2022/11/15 17:35:58.383 +08:00] [INFO] [store.go:80] ["new store with retry success"]
[2022/11/15 17:35:58.384 +08:00] [FATAL] [session.go:3052] ["check bootstrapped failed"] [error="failed to decode region range key, key: \"6D426F6F7473747261FF704B657900000000FB0000000000000073\", err: invalid marker byte, group bytes \"9645__spl\""] [stack="github.com/pingcap/tidb/session.getStoreBootstrapVersion\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/session/session.go:3052\ngithub.com/pingcap/tidb/session.BootstrapSession\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/session/session.go:2827\nmain.createStoreAndDomain\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/tidb-server/main.go:296\nmain.main\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/tidb-server/main.go:202\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"]

【Resource Configuration】
【Attachments: Screenshots / Logs / Monitoring】

Which version?

V6.1.0

Try filling out the full scale-out topology configuration and retrying:
https://docs.pingcap.com/zh/tidb/stable/scale-tidb-using-tiup#1-编写扩容拓扑配置
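
For example, a fuller scale-out topology might look like this; the ports and directories below are taken from your display output and error message, while ssh_port is an assumption, so adjust to your environment:

tidb_servers:
  - host: 10.29.0.20
    ssh_port: 22
    port: 4000
    status_port: 10080
    deploy_dir: /data/tidb-deploy/tidb-4000
    log_dir: /data/tidb-deploy/tidb-4000/log
tiflash_servers:
  - host: 10.29.0.20
    ssh_port: 22
    tcp_port: 9000
    http_port: 8123
    flash_service_port: 3930
    flash_proxy_port: 20170
    flash_proxy_status_port: 20292
    metrics_port: 8234
    deploy_dir: /data/tidb-deploy/tiflash-9000
    data_dir: /data/tidb-data/tiflash-9000
    log_dir: /data/tidb-deploy/tiflash-9000/log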

Same error even after completing the configuration.

+ Initialize target host environments
+ Deploy TiDB instance
  - Deploy instance tidb -> 10.29.0.20:4000 ... Done
  - Deploy instance tiflash -> 10.29.0.20:9000 ... Done
+ Copy certificate to remote host
+ Generate scale-out config
  - Generate scale-out config tidb -> 10.29.0.20:4000 ... Done
  - Generate scale-out config tiflash -> 10.29.0.20:9000 ... Done
+ Init monitor config
+ Check status
Enabling component tidb
        Enabling instance 10.29.0.20:4000
        Enable instance 10.29.0.20:4000 success
Enabling component tiflash
        Enabling instance 10.29.0.20:9000
        Enable instance 10.29.0.20:9000 success
Enabling component node_exporter
        Enabling instance 10.29.0.20
        Enable 10.29.0.20 success
Enabling component blackbox_exporter
        Enabling instance 10.29.0.20
        Enable 10.29.0.20 success
+ [ Serial ] - Save meta
+ [ Serial ] - Start new instances
Starting component tidb
        Starting instance 10.29.0.20:4000

Error: failed to start tidb: failed to start: 10.29.0.20 tidb-4000.service, please check the instance's log(/data/tidb-deploy/tidb-4000/log) for more detail.: timed out waiting for port 4000 to be started after 2m0s

LISTEN     0      32768     [::]:12020                 [::]:*                  \n", "stderr": "", "__hash__": "1a4714d7146fa85240a1ff4ef7451df719e0b4f0", "__func__": "github.com/pingcap/tiup/pkg/cluster/executor.(*CheckPointExecutor).Execute", "hit": false}
2022-11-16T08:55:58.852+0800    DEBUG   retry error     {"error": "operation timed out after 2m0s"}
2022-11-16T08:55:58.852+0800    DEBUG   TaskFinish      {"task": "Start new instances", "error": "failed to start tidb: failed to start: 10.29.0.20 tidb-4000.service, please check the instance's log(/data/tidb-deploy/tidb-4000/log) for more detail.: timed out waiting for port 4000 to be started after 2m0s", "errorVerbose": "timed out waiting for port 4000 to be started after 2m0s\ngithub.com/pingcap/tiup/pkg/cluster/module.(*WaitFor).Execute\n\tgithub.com/pingcap/tiup/pkg/cluster/module/wait_for.go:91\ngithub.com/pingcap/tiup/pkg/cluster/spec.PortStarted\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:119\ngithub.com/pingcap/tiup/pkg/cluster/spec.(*BaseInstance).Ready\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:151\ngithub.com/pingcap/tiup/pkg/cluster/operation.startInstance\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:405\ngithub.com/pingcap/tiup/pkg/cluster/operation.StartComponent.func1\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:534\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20220819030929-7fc1605a5dde/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1594\nfailed to start: 10.29.0.20 tidb-4000.service, please check the instance's log(/data/tidb-deploy/tidb-4000/log) for more detail.\nfailed to start tidb"}
2022-11-16T08:55:58.852+0800    INFO    Execute command finished        {"code": 1, "error": "failed to start tidb: failed to start: 10.29.0.20 tidb-4000.service, please check the instance's log(/data/tidb-deploy/tidb-4000/log) for more detail.: timed out waiting for port 4000 to be started after 2m0s", "errorVerbose": "timed out waiting for port 4000 to be started after 2m0s\ngithub.com/pingcap/tiup/pkg/cluster/module.(*WaitFor).Execute\n\tgithub.com/pingcap/tiup/pkg/cluster/module/wait_for.go:91\ngithub.com/pingcap/tiup/pkg/cluster/spec.PortStarted\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:119\ngithub.com/pingcap/tiup/pkg/cluster/spec.(*BaseInstance).Ready\n\tgithub.com/pingcap/tiup/pkg/cluster/spec/instance.go:151\ngithub.com/pingcap/tiup/pkg/cluster/operation.startInstance\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:405\ngithub.com/pingcap/tiup/pkg/cluster/operation.StartComponent.func1\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:534\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.0.0-20220819030929-7fc1605a5dde/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1594\nfailed to start: 10.29.0.20 tidb-4000.service, please check the instance's log(/data/tidb-deploy/tidb-4000/log) for more detail.\nfailed to start tidb"}

Please post the logs under this directory:
/data/tidb-deploy/tidb-4000/log
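
For example, something like this pulls the most recent errors (assuming the default tidb.log file name in that directory):

grep -E 'ERROR|FATAL' /data/tidb-deploy/tidb-4000/log/tidb.log | tail -n 50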

That's what I posted earlier:

[2022/11/15 17:35:58.383 +08:00] [INFO] [client.go:687] ["[pd] tso dispatcher created"] [dc-location=global]
[2022/11/15 17:35:58.383 +08:00] [INFO] [store.go:80] ["new store with retry success"]
[2022/11/15 17:35:58.384 +08:00] [FATAL] [session.go:3052] ["check bootstrapped failed"] [error="failed to decode region range key, key: \"6D426F6F7473747261FF704B657900000000FB0000000000000073\", err: invalid marker byte, group bytes \"9645__spl\""] [stack="github.com/pingcap/tidb/session.getStoreBootstrapVersion\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/session/session.go:3052\ngithub.com/pingcap/tidb/session.BootstrapSession\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/session/session.go:2827\nmain.createStoreAndDomain\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/tidb-server/main.go:296\nmain.main\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/tidb-server/main.go:202\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:250"]

This looks to me like TiKV region corruption.

Before scaling out, did you check the cluster for existing risks and fix them?
Or try scaling out tiflash and tidb separately?
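
For instance, split the topology into two files (the file names here are hypothetical) and scale out one component at a time:

# tidb_only.yaml contains only the tidb_servers section,
# tiflash_only.yaml contains only the tiflash_servers section
tiup cluster scale-out xx tidb_only.yaml
tiup cluster scale-out xx tiflash_only.yaml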

From your display output, the tiflash on .19 looks like it has been taken offline? Does the whole cluster still have any tiflash replicas or nodes?

Before scaling out, did you check the cluster for existing risks and fix them?

Yes, I ran check; the risks are all already known:

Node        Check         Result  Message
----        -----         ------  -------
10.29.0.20  memory        Pass    memory size is 32768MB
10.29.0.20  limits        Fail    soft limit of 'stack' for user 'root' is not set or too low
10.29.0.20  limits        Fail    soft limit of 'nofile' for user 'root' is not set or too low
10.29.0.20  limits        Fail    hard limit of 'nofile' for user 'root' is not set or too low
10.29.0.20  thp           Pass    THP is disabled
10.29.0.20  service       Fail    service irqbalance is not running
10.29.0.20  command       Pass    numactl: policy: default
10.29.0.20  timezone      Pass    time zone is the same as the first PD machine: Asia/Shanghai
10.29.0.20  os-version    Fail    os vendor alinux not supported
10.29.0.20  cpu-cores     Pass    number of CPU cores / threads: 8
10.29.0.20  cpu-governor  Warn    Unable to determine current CPU frequency governor policy
10.29.0.20  selinux       Pass    SELinux is disabled
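
(For reference, the Fail items above can usually be repaired before retrying, either with tiup cluster check xx --apply or manually; the limit values below follow the common TiDB recommendations and are assumptions for this host:)

# /etc/security/limits.conf
root    soft    nofile    1000000
root    hard    nofile    1000000
root    soft    stack     32768

# start the missing service
systemctl start irqbalance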

The tiflash on .19 was taken offline.
The whole cluster has no tiflash replicas left.

Run select * from information_schema.tiflash_replica to double-check that there are no tiflash replicas left.
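
For example (an empty result set means no table still requests a tiflash replica; any rows left can be cleared with alter table ... set tiflash replica 0):

select table_schema, table_name, replica_count, available, progress
from information_schema.tiflash_replica;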

After a tiflash node is taken offline, it should in theory no longer show up in display. Did you skip tiup prune? https://docs.pingcap.com/zh/tidb/dev/tiup-component-cluster-prune
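
Roughly:

tiup cluster display xx   # an offlined tiflash should show Tombstone here
tiup cluster prune xx     # clean Tombstone instances out of the cluster metadata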

Also, the key in the log indeed cannot be decoded:

(root@127.0.0.1) [(none)]>select tidb_decode_key("6D426F6F7473747261FF704B657900000000FB0000000000000073");
+---------------------------------------------------------------------------+
| tidb_decode_key("6D426F6F7473747261FF704B657900000000FB0000000000000073") |
+---------------------------------------------------------------------------+
| 6D426F6F7473747261FF704B657900000000FB0000000000000073                    |
+---------------------------------------------------------------------------+
1 row in set, 1 warning (0.00 sec)

(root@127.0.0.1) [(none)]>show warnings;
+---------+------+---------------------------------------------------------------------+
| Level   | Code | Message                                                             |
+---------+------+---------------------------------------------------------------------+
| Warning | 1105 | invalid key: 6D426F6F7473747261FF704B657900000000FB0000000000000073 |
+---------+------+---------------------------------------------------------------------+
1 row in set (0.00 sec)

Let me walk through my operation flow again.
Attempt 1: I had already removed all tidb and tiflash instances from the cluster, and tiup cluster display xx no longer showed tidb or tiflash; on that basis I tried to install tidb and tiflash, and got the error above.
Attempt 2: tiup cluster display xx showed one tidb and one tiflash, both in Up status; on that basis I tried to install tidb and tiflash, and got the same error above.

Even with the configuration fully filled out, I still get the error above.

Is there production data in the cluster?