TiDB nodes fail to start with an error


【TiDB Version】v4.0.6

【Problem Description】
The cluster was started with tiup cluster restart test-cluster. Two TiDB nodes failed to start, and tidb.log reports the following error:
[FATAL] [terror.go:348] ["unexpected error"] [error="[privilege:8049]mysql.db"] [stack="github.com/pingcap/parser/terror.MustNil\n\t/home/jenkins/agent/workspace/tidb_v4.0.6/go/pkg/mod/github.com/pingcap/parser@v0.0.0-20200911054040-258297116c4b/terror/terror.go:348\nmain.createStoreAndDomain\n\t/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/tidb-server/main.go:259\nmain.main\n\t/home/jenkins/agent/workspace/tidb_v4.0.6/go/src/github.com/pingcap/tidb/tidb-server/main.go:179\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:203"]



Could you tell us what operations were performed before the restart?

One host was forcibly taken offline before the restart.


The error says the node holding region id 25 cannot be reached, so I took a look at the region status.
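(For reference, region status can be queried with pd-ctl; the PD address below is a placeholder, and pd-ctl may need to be invoked through `tiup ctl pd`:)

```bash
# Show the peers and state of region 25 as PD sees them.
pd-ctl -u http://<pd-ip>:2379 region 25
```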

Of these, 24590972 and 38833310 are both fine.


38546296 is the host that was forcibly taken offline earlier. I tried running the unsafe-recover command on the other two healthy TiKV nodes, but it had no effect.

Why was a TiKV node forcibly taken offline before the restart? Also, how exactly was unsafe-recover run?

That node wouldn't start, and the other nodes' logs showed errors communicating with it, so they couldn't come up either. Counting that node there were five TiKV nodes, so even after forcing it offline there should still have been two replicas left; that's why we forced it offline. unsafe-recover was run on 147 and 240, the two machines that still hold replicas of region 25, roughly like this: tikv-ctl unsafe-recover -s 38546296 -all region
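(The full form of that command, as spelled out later in this thread, looks like the following; the --db path is the node's TiKV data directory, and the tikv-server process must be stopped before running it:)

```bash
# Run on each surviving TiKV node while its tikv-server is stopped.
# Removes the peers belonging to the failed store 38546296 from all
# regions held locally on this node.
./tikv-ctl --db /home/tidb/deploy/data/db \
  unsafe-recover remove-fail-stores -s 38546296 --all-regions
```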

Please share the output of "tiup cluster display {cluster-name}".
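(With the cluster name used earlier in this thread, that would be:)

```bash
# Show the topology and status of every component in the cluster.
tiup cluster display test-cluster
```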


I also tried several commands to delete the problematic peers, but each produced the errors shown in the screenshots.
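(The exact commands are only visible in the screenshots; purely as an illustration, one common way to attempt this from pd-ctl is an operator such as the following, using the region and store ids mentioned above:)

```bash
# Ask PD to schedule removal of the peer of region 25 that sits on store 38546296.
pd-ctl -u http://<pd-ip>:2379 operator add remove-peer 25 38546296
```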

1. From the screenshots above, two TiKV nodes are in Down state. Were both of these nodes forcibly taken offline?
2. Please share the complete `store` output from pd-ctl.
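(Assuming pd-ctl is available, for example through `tiup ctl pd`, the store information can be dumped like this; the PD address is a placeholder:)

```bash
# Print the full store list: state, capacity, leader/region counts and scores.
pd-ctl -u http://<pd-ip>:2379 store
```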

Hello, we have just resolved the region issue, but now TiDB is reporting a different error.

TiDB says its connection to PD times out, but as shown in the screenshot all three of our PD nodes are Up.
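(One quick cross-check that PD itself is reachable, assuming the default client port 2379:)

```bash
# Query the PD health endpoint over HTTP for each PD member.
curl http://<pd-ip>:2379/pd/api/v1/health
```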

Are the TiKV nodes all back to normal now? Check the PD and TiKV node logs to see what specific errors they report.
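(Assuming the default log location under the /home/tidb/deploy directory seen elsewhere in this thread, recent errors can be scanned like this:)

```bash
# Look for recent errors in the PD and TiKV logs on each node
# (paths are an assumption based on the deploy layout of this cluster).
grep -iE "error|fatal" /home/tidb/deploy/log/pd.log   | tail -n 50
grep -iE "error|fatal" /home/tidb/deploy/log/tikv.log | tail -n 50
```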

There are three PD nodes in total; their logs from the cluster restart just now are shown below for each node. 113


114

115

The TiKV nodes are still all in Disconnected state; the log on 147 is as follows:

The TiDB cluster startup order is PD -> TiKV -> TiDB. Since TiKV has not started properly yet, TiDB access will of course report errors.
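(In tiup terms, that order can be enforced by starting components role by role, using the cluster name from this thread:)

```bash
# Start components in dependency order: PD first, then TiKV, then TiDB.
tiup cluster start test-cluster -R pd
tiup cluster start test-cluster -R tikv
tiup cluster start test-cluster -R tidb
```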

OK, we are restarting the cluster now and will report back right away.

I just tried running ./tikv-ctl --db /home/tidb/deploy/data/db unsafe-recover remove-fail-stores -s 38546296 --all-regions on each healthy TiKV node, trying to remove the regions of the node that had already been forcibly taken offline, and then restarted the cluster. The log of TiKV node 240 is shown in the screenshot.


The store output is as follows:
store
{
  "count": 5,
  "stores": [
    {
      "store": {
        "id": 38833310,
        "address": "10.12.5.147:20160",
        "version": "4.0.6",
        "status_address": "10.12.5.147:20180",
        "git_hash": "ca2475bfbcb49a7c34cf783596acb3edd05fc88f",
        "start_timestamp": 1617248660,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1617278938753846245,
        "state_name": "Up"
      },
      "status": {
        "capacity": "5.952TiB",
        "available": "3.685TiB",
        "used_size": "2.078TiB",
        "leader_count": 3456,
        "leader_weight": 2,
        "leader_score": 1728,
        "leader_size": 210224,
        "region_count": 98428,
        "region_weight": 2,
        "region_score": 393091.5,
        "region_size": 786183,
        "sending_snap_count": 2,
        "receiving_snap_count": 1,
        "start_ts": "2021-04-01T03:44:20Z",
        "last_heartbeat_ts": "2021-04-01T12:08:58.753846245Z",
        "uptime": "8h24m38.753846245s"
      }
    },
    {
      "store": {
        "id": 256634687,
        "address": "10.12.5.12:20160",
        "version": "4.0.6",
        "status_address": "10.12.5.12:20180",
        "git_hash": "ca2475bfbcb49a7c34cf783596acb3edd05fc88f",
        "start_timestamp": 1617248849,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1617277699444568860,
        "state_name": "Disconnected"
      },
      "status": {
        "capacity": "2.952TiB",
        "available": "2.35TiB",
        "used_size": "604.9GiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 26342,
        "region_weight": 1,
        "region_score": 18281,
        "region_size": 18281,
        "start_ts": "2021-04-01T03:47:29Z",
        "last_heartbeat_ts": "2021-04-01T11:48:19.44456886Z",
        "uptime": "8h0m50.44456886s"
      }
    },
    {
      "store": {
        "id": 24478148,
        "address": "10.12.5.236:20160",
        "version": "4.0.6",
        "status_address": "10.12.5.236:20180",
        "git_hash": "ca2475bfbcb49a7c34cf783596acb3edd05fc88f",
        "start_timestamp": 1617248731,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1617278932097802241,
        "state_name": "Up"
      },
      "status": {
        "capacity": "5.952TiB",
        "available": "2.803TiB",
        "used_size": "2.033TiB",
        "leader_count": 12024,
        "leader_weight": 2,
        "leader_score": 6012,
        "leader_size": 720637,
        "region_count": 101746,
        "region_weight": 2,
        "region_score": 416939,
        "region_size": 833878,
        "sending_snap_count": 1,
        "start_ts": "2021-04-01T03:45:31Z",
        "last_heartbeat_ts": "2021-04-01T12:08:52.097802241Z",
        "uptime": "8h23m21.097802241s"
      }
    },
    {
      "store": {
        "id": 24480822,
        "address": "10.12.5.239:20160",
        "version": "4.0.6",
        "status_address": "10.12.5.239:20180",
        "git_hash": "ca2475bfbcb49a7c34cf783596acb3edd05fc88f",
        "start_timestamp": 1617248780,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1617277712905685930,
        "state_name": "Disconnected"
      },
      "status": {
        "capacity": "5.952TiB",
        "available": "3.861TiB",
        "used_size": "2TiB",
        "leader_count": 3,
        "leader_weight": 2,
        "leader_score": 1.5,
        "leader_size": 134,
        "region_count": 88650,
        "region_weight": 2,
        "region_score": 88273.5,
        "region_size": 176547,
        "start_ts": "2021-04-01T03:46:20Z",
        "last_heartbeat_ts": "2021-04-01T11:48:32.90568593Z",
        "uptime": "8h2m12.90568593s"
      }
    },
    {
      "store": {
        "id": 24590972,
        "address": "10.12.5.240:20160",
        "version": "4.0.6",
        "status_address": "10.12.5.240:20180",
        "git_hash": "ca2475bfbcb49a7c34cf783596acb3edd05fc88f",
        "start_timestamp": 1617248859,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1617277713365695442,
        "state_name": "Disconnected"
      },
      "status": {
        "capacity": "5.952TiB",
        "available": "4.029TiB",
        "used_size": "1.828TiB",
        "leader_count": 0,
        "leader_weight": 2,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 85146,
        "region_weight": 2,
        "region_score": 340616.5,
        "region_size": 681233,
        "start_ts": "2021-04-01T03:47:39Z",
        "last_heartbeat_ts": "2021-04-01T11:48:33.365695442Z",
        "uptime": "8h0m54.365695442s"
      }
    }
  ]
}

The cluster originally had seven nodes: 142, 147, 236, 239, 240, 11, and 12. We took nodes 142 and 11 offline, which is why the store count is now 5. The id of node 142 is 38546296, and the id of node 11 is 256634153.

If two TiKV nodes were taken offline at the same time, multiple replicas may have been lost. You need to run unsafe-recover remove-fail-stores against stores 38546296 and 256634153 on every surviving TiKV node, and each TiKV instance must be stopped before running it.
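(A sketch of that procedure, reusing the data path already shown in this thread; step 2 must be run on every surviving TiKV node:)

```bash
# 1. Stop the TiKV processes first; tikv-ctl in local mode must not run
#    against a data directory that a live tikv-server is using.
tiup cluster stop test-cluster -R tikv

# 2. On each surviving TiKV node, drop the peers that belong to the two
#    failed stores. -s accepts a comma-separated list of store ids.
./tikv-ctl --db /home/tidb/deploy/data/db \
  unsafe-recover remove-fail-stores -s 38546296,256634153 --all-regions

# 3. Bring the cluster back up.
tiup cluster start test-cluster
```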

OK, I'll give it a try.


I ran both unsafe-recover commands on every TiKV node. When restarting, TiDB reported the following error: