TiDB single node fails to start

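Starting a single TiDB node fails
[TiDB environment] (Production / Test / PoC) Production
[TiDB version] 5.4
[Problem] tiup cluster start tidb-slb -N xxxx.153:4000 fails to start the single node
[Steps to reproduce] After running a SQL script, the CPU was maxed out and one node went down; subsequent attempts to start that node failed
[Symptoms and impact]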

Error reported after running the start command:
root@:/data/tidb/tidb-deploy/tidb-4000/log# tiup cluster start tidb-slb -N xxxxx:4000
tiup is checking updates for component cluster …
A new version of cluster is available:
The latest version: v1.11.0
Local installed version: v1.10.1
Update current component: tiup update cluster
Update all components: tiup update --all

Starting component cluster: /root/.tiup/components/cluster/v1.10.1/tiup-cluster start tidb-slb -N xxxxx:4000
Starting cluster tidb-slb…

  • [ Serial ] - SSHKeySet: privateKey=/root/.tiup/storage/cluster/clusters/tidb-slb/ssh/id_rsa, publicKey=/root/.tiup/storage/cluster/clusters/tidb-slb/ssh/id_rsa.pub
  • [Parallel] - UserSSH: user=tidb, host=xxxxx
  • [Parallel] - UserSSH: user=tidb, host=xxxxx
  • [Parallel] - UserSSH: user=tidb, host=xxxxx
  • [Parallel] - UserSSH: user=tidb, host=xxxxx
  • [Parallel] - UserSSH: user=tidb, host=xxxxx
  • [Parallel] - UserSSH: user=tidb, host=xxxxx
  • [Parallel] - UserSSH: user=tidb, host=xxxxx
  • [Parallel] - UserSSH: user=tidb, host=xxxxx
  • [Parallel] - UserSSH: user=tidb, host=xxxxx
  • [Parallel] - UserSSH: user=tidb, host=xxxxx
  • [Parallel] - UserSSH: user=tidb, host=xxxxx
  • [Parallel] - UserSSH: user=tidb, host=xxxxx
  • [Parallel] - UserSSH: user=tidb, host=xxxxx
  • [Parallel] - UserSSH: user=tidb, host=xxxxx
  • [Parallel] - UserSSH: user=tidb, host=xxxxx
  • [Parallel] - UserSSH: user=tidb, host=xxxxx
  • [Parallel] - UserSSH: user=tidb, host=xxxxx
  • [Parallel] - UserSSH: user=tidb, host=xxxxx
  • [Parallel] - UserSSH: user=tidb, host=xxxxx
  • [Parallel] - UserSSH: user=tidb, host=xxxxx
  • [Parallel] - UserSSH: user=tidb, host=xxxxx
  • [ Serial ] - StartCluster
    Starting component tidb
    Starting instance xxxxx:4000

Error: failed to start tidb: failed to start: xxxxx tidb-4000.service, please check the instance's log(/data/tidb/tidb-deploy/tidb-4000/log) for more detail.: timed out waiting for port 4000 to be started after 2m0s

Verbose debug logs has been written to /root/.tiup/logs/tiup-cluster-debug-2022-10-12-17-03-19.log.
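
The tiup error above only says that port 4000 never came up within 2 minutes. As a rough first check, a small script like the following can confirm from the tiup control machine whether the TiDB port, and the PD client port that TiDB depends on, accept TCP connections at all. This is only a sketch: the addresses in the demo are placeholders, and 2379 is assumed to be the PD client port (the default), which may differ in this cluster.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder addresses: replace with the real TiDB and PD hosts.
    # 4000 is the TiDB port from the post; 2379 is assumed (PD default).
    targets = [("127.0.0.1", 4000), ("127.0.0.1", 2379)]
    for host, port in targets:
        state = "open" if port_is_open(host, port) else "closed or unreachable"
        print(f"{host}:{port} is {state}")
```

A closed port 4000 is expected while tidb-server is down; a closed PD port would point at the network or at PD itself rather than at TiDB.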

TiDB log errors:
[2022/10/12 16:25:11.548 +08:00] [WARN] [pd.go:152] ["get timestamp too slow"] ["cost time"=79.927587ms]
[2022/10/12 16:25:13.309 +08:00] [WARN] [pd.go:152] ["get timestamp too slow"] ["cost time"=58.264043ms]
[2022/10/12 16:25:13.511 +08:00] [WARN] [pd.go:152] ["get timestamp too slow"] ["cost time"=46.059301ms]
[2022/10/12 16:25:13.512 +08:00] [WARN] [pd.go:152] ["get timestamp too slow"] ["cost time"=46.152008ms]
[2022/10/12 16:26:00.453 +08:00] [ERROR] [client.go:502] ["[pd] tso request is canceled due to timeout"] [dc-location=global] [error="[PD:client:ErrClientGetTSOTimeout]get TSO timeout"]
[2022/10/12 16:27:50.612 +08:00] [ERROR] [client.go:786] ["[pd] getTS error"] [dc-location=global] [error="[PD:client:ErrClientGetTSO]EOF: EOF"]
[2022/10/12 16:29:18.147 +08:00] [INFO] [client.go:730] ["[pd] tso stream is not ready"] [dc=global]
[2022/10/12 16:27:02.731 +08:00] [ERROR] [pd.go:236] ["updateTS error"] [txnScope=global] [error=EOF]
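
When tidb.log is long, it can help to group the WARN and ERROR entries by message before reading them one by one. The sketch below does that for lines in the format shown above. The regular expression is my own reading of that format and not an official parser, and the curly-quote normalization is only there because logs pasted into a forum often have their quotes converted.

```python
import re
from collections import Counter

# Matches: [timestamp] [LEVEL] [file.go:line] ["message"] ...
LINE_RE = re.compile(
    r'^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<src>[^\]]+)\] '
    r'\["(?P<msg>(?:[^"\\]|\\.)*)"\]'
)


def parse_line(line):
    """Return a dict with ts, level, src and msg, or None if the line does not match."""
    # Forum rendering often turns straight quotes into curly ones; undo that.
    line = line.replace("\u201c", '"').replace("\u201d", '"')
    m = LINE_RE.match(line.strip())
    return m.groupdict() if m else None


def summarize(lines, levels=("WARN", "ERROR")):
    """Count log entries per (level, message) for the given levels."""
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec and rec["level"] in levels:
            counts[(rec["level"], rec["msg"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '[2022/10/12 16:25:11.548 +08:00] [WARN] [pd.go:152] ["get timestamp too slow"] ["cost time"=79.927587ms]',
        '[2022/10/12 16:26:00.453 +08:00] [ERROR] [client.go:502] ["[pd] tso request is canceled due to timeout"] [dc-location=global]',
        '[2022/10/12 16:29:18.147 +08:00] [INFO] [client.go:730] ["[pd] tso stream is not ready"] [dc=global]',
    ]
    for (level, msg), n in summarize(sample).most_common():
        print(f"{n:4d}  {level:5s}  {msg}")
```

In the excerpt posted here, every WARN and ERROR entry comes from the PD client (slow or failed TSO requests), which is consistent with the later suggestion to check PD.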


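Why did the later log entries end up in the middle? Run systemctl status tidb-4000 and see whether the service reports any error; the log does not show the cause of the failure. Is there a tidb process at all? In principle it should be restarted automatically.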

Service status:
● tidb-4000.service - tidb service
Loaded: loaded (/etc/systemd/system/tidb-4000.service; enabled; vendor preset: enabled)
Active: activating (auto-restart) (Result: exit-code) since Wed 2022-10-12 17:31:18 CST; 13s ago
Process: 1147008 ExecStart=/bin/bash -c /data/tidb/tidb-deploy/tidb-4000/scripts/run_tidb.sh (code=exited, status=1/FAILURE)
Main PID: 1147008 (code=exited, status=1/FAILURE)
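
A note on reading this output: "activating (auto-restart)" together with "Result: exit-code" means the process exited with a non-zero status and systemd is waiting to start it again, so the unit is stuck in a restart loop and is not running. If you want to watch for this state from a script, the "Active:" line can be parsed along the following lines. This is a minimal sketch; the function names are made up for illustration and the text is assumed to come from systemctl status.

```python
import re

# Matches e.g. "Active: activating (auto-restart) (Result: exit-code) since ..."
ACTIVE_RE = re.compile(
    r"Active:\s+(?P<state>\S+)"
    r"(?:\s+\((?P<sub>[^)]+)\))?"
    r"(?:\s+\(Result:\s+(?P<result>[^)]+)\))?"
)


def parse_active(status_text):
    """Return state, sub-state and result from the 'Active:' line, or None."""
    for line in status_text.splitlines():
        m = ACTIVE_RE.search(line)
        if m:
            return m.groupdict()
    return None


def is_crash_looping(status_text):
    """True if the unit is waiting for an automatic restart after exiting."""
    info = parse_active(status_text)
    return bool(info) and info["state"] == "activating" and info["sub"] == "auto-restart"


if __name__ == "__main__":
    text = "Active: activating (auto-restart) (Result: exit-code) since Wed 2022-10-12 17:31:18 CST; 13s ago"
    print(parse_active(text), is_crash_looping(text))
```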

The service itself doesn't seem to show any problem.

You can go to the scripts directory under the TiDB deploy directory on that node, start it manually from there, and see what error is reported.

[2022/10/12 18:15:06.236 +08:00] [ERROR] [terror.go:307] ["encountered error"] [error="[server:1045]Access denied for user 'root'@'xxx.157' (using password: YES)"] [stack="github.com/pingcap/tidb/parser/terror.Log\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/parser/terror/terror.go:307\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/server.go:516"]
[2022/10/12 18:15:13.775 +08:00] [WARN] [cache.go:925] ["net.LookupAddr returned an error during auth check"] [host=xxx.166] [error="lookup xxx.166.in-addr.arpa. on 127.0.0.53:53: no such host"]
[2022/10/12 18:15:13.775 +08:00] [WARN] [conn.go:720] ["failed to check the user authplugin"] [conn=9] [error="[server:1045]Access denied for user 'data_read'@'xxx.166' (using password: YES)"]
[2022/10/12 18:15:13.775 +08:00] [ERROR] [terror.go:307] ["encountered error"] [error="[server:1045]Access denied for user 'data_read'@'xxx.166' (using password: YES)"] [stack="github.com/pingcap/tidb/parser/terror.Log\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/parser/terror/terror.go:307\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/tidb/server/server.go:516"]

The xxxx IP is the address of another one of our servers.

/bin/bash -c /data/tidb/tidb-deploy/tidb-4000/scripts/run_tidb.sh

What do you get if you run the command above?

Also, take a look at the status of PD.

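That status is not normal; it is not running. With firewall and similar issues ruled out, I can't see any other key log entries. I'd suggest simply scaling this TiDB node in and then scaling it back out.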

The log output I posted above (the Access denied errors) is what I got after running the script manually.

Please confirm the status of PD; judging from the log, TiDB cannot start because it cannot reach PD.
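
Besides tiup cluster display tidb-slb, one way to confirm the PD status is to query PD's HTTP health endpoint directly. The sketch below assumes the endpoint is /pd/api/v1/health on the PD client port and that it returns a JSON list of members, each with "name" and "health" fields; please verify this against your PD version. The address in the demo is a placeholder.

```python
import json
from urllib.request import urlopen


def fetch_pd_health(pd_addr, timeout=3.0):
    """Fetch the raw health JSON from a PD node given as host:port.

    Assumes the /pd/api/v1/health endpoint; check it for your PD version.
    """
    with urlopen(f"http://{pd_addr}/pd/api/v1/health", timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def unhealthy_members(health_json):
    """Return the names of PD members that are not reported as healthy."""
    members = json.loads(health_json)
    return [m.get("name", "?") for m in members if not m.get("health", False)]


if __name__ == "__main__":
    # Placeholder address: replace with a real PD host:port.
    try:
        bad = unhealthy_members(fetch_pd_health("127.0.0.1:2379"))
        print("unhealthy PD members:", bad or "none")
    except OSError as exc:
        # Not being able to reach PD at all is itself a useful finding.
        print("could not reach PD:", exc)
```

If this call fails from the TiDB host but works from another machine, the problem is more likely the network path between TiDB and PD than PD itself.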

In the end there was no other way: I took that TiDB node out of the cluster and added it back.
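
For anyone who ends up at the same workaround, removing and re-adding a TiDB node with tiup is a scale-in followed by a scale-out. The helper below only builds the two commands and a minimal topology fragment so they can be reviewed before being run. It is a sketch: the host value is a placeholder, 10080 is assumed as the status port (the default), and a real topology file may need deploy_dir and other settings that match the existing cluster.

```python
def scale_in_cmd(cluster, node):
    """Command that removes one node (host:port) from the cluster."""
    return ["tiup", "cluster", "scale-in", cluster, "-N", node]


def scale_out_topology(host, port=4000, status_port=10080):
    """Minimal topology fragment for adding one TiDB server back.

    status_port=10080 is an assumed default; adjust to your deployment.
    """
    return (
        "tidb_servers:\n"
        f"  - host: {host}\n"
        f"    port: {port}\n"
        f"    status_port: {status_port}\n"
    )


def scale_out_cmd(cluster, topology_file):
    """Command that adds the nodes described in topology_file."""
    return ["tiup", "cluster", "scale-out", cluster, topology_file]


if __name__ == "__main__":
    host = "<tidb-host>"  # placeholder, not a real address
    print(" ".join(scale_in_cmd("tidb-slb", f"{host}:4000")))
    print(scale_out_topology(host), end="")
    print(" ".join(scale_out_cmd("tidb-slb", "scale-out.yaml")))
```

Since tidb-server is stateless, scaling a TiDB node in and out does not move any data, but clients connected to that node need to reconnect through the load balancer or another node.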