Three TiKV nodes: after rm-deleting the deploy directory on two of them, TiDB kept working normally, but re-adding the nodes failed

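  • [TiDB version]: 3.0.6 (this build's version string is missing the "v" prefix; it has not yet been replaced with the latest v3.0.6)
  • [Problem description]:

    The screenshot shows the latest startup error.
    Five test machines: 53 and 54 run TiDB; 53, 54, and 57 run PD; 57, 58, and 59 run TiKV. The data volume is tiny: a single table with a few rows.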

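I wanted to test recovery from the loss of multiple TiKV replicas. After manually rm-deleting every file under the deploy installation directory on 58 and 59, the cluster kept running normally. After waiting a while, I manually killed the tikv-server process on node 57 and found that it restarted automatically. I then stopped this TiKV from the control machine with stop.yml, checked the regions with tikv-ctl, and got "all regions are healthy".

Next I rebooted the virtual machines of all three TiKV nodes. After the reboot, the TiKV on 57, the only one left, still started normally.

From the control machine I ran deploy.yml --tag tikv -l 10.16.160.58,10.16.160.59 to reinstall the two deleted TiKV nodes. After the installation finished they could not start and reported an error.

Because this 3.0.6 build has the version issue, rolling_update failed with a version number conflict. I manually shut down the whole cluster, ran ansible-playbook deploy.yml against the whole cluster, and then start.yml; the error was the same as the latest one.

I tried taking 58 and 59 offline with pd-ctl store delete, but they stay stuck in the Offline state.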

[tidb@tidb01 bin]$ ./pd-ctl -u 10.16.160.57:2379 store
{
  "count": 3,
  "stores": [
    {
      "store": {
        "id": 1,
        "address": "10.16.160.57:20160",
        "version": "3.0.6",
        "state_name": "Up"
      },
      "status": {
        "capacity": "46.83GiB",
        "available": "40.26GiB",
        "leader_weight": 1,
        "region_weight": 1,
        "start_ts": "2019-12-10T17:15:30+08:00",
        "last_heartbeat_ts": "2019-12-10T17:39:30.476500826+08:00",
        "uptime": "24m0.476500826s"
      }
    },
    {
      "store": {
        "id": 4,
        "address": "10.16.160.59:20160",
        "state": 1,
        "version": "3.0.6",
        "state_name": "Offline"
      },
      "status": {
        "leader_weight": 1,
        "region_weight": 1,
        "start_ts": "1970-01-01T08:00:00+08:00"
      }
    },
    {
      "store": {
        "id": 5,
        "address": "10.16.160.58:20160",
        "state": 1,
        "version": "3.0.6",
        "state_name": "Offline"
      },
      "status": {
        "leader_weight": 1,
        "region_weight": 1,
        "start_ts": "1970-01-01T08:00:00+08:00"
      }
    }
  ]
}
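
The stuck stores can be picked out of this output mechanically. A minimal Python sketch, assuming the `pd-ctl store` JSON is available as a string with plain ASCII quotes; the helper name `stores_not_up` is hypothetical, and the sample is trimmed to the fields the check needs.

```python
import json
from typing import List, Tuple

# Trimmed copy of the `pd-ctl store` output (only the fields used here).
STORE_OUTPUT = """
{
  "count": 3,
  "stores": [
    {"store": {"id": 1, "address": "10.16.160.57:20160", "state_name": "Up"}},
    {"store": {"id": 4, "address": "10.16.160.59:20160", "state_name": "Offline"}},
    {"store": {"id": 5, "address": "10.16.160.58:20160", "state_name": "Offline"}}
  ]
}
"""


def stores_not_up(pd_store_json: str) -> List[Tuple[int, str, str]]:
    """Return (id, address, state_name) for every store whose state is not Up."""
    data = json.loads(pd_store_json)
    return [
        (s["store"]["id"], s["store"]["address"], s["store"]["state_name"])
        for s in data["stores"]
        if s["store"]["state_name"] != "Up"
    ]


if __name__ == "__main__":
    for store_id, address, state in stores_not_up(STORE_OUTPUT):
        print(f"store {store_id} at {address} is {state}")
```

The store IDs it reports (4 and 5 here) are the values any later per-store command would need.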

At this point neither of the two TiDB nodes will start, and neither will the two TiKV nodes.
Below is the log from TiKV 58.

Hello:

  1. Try starting TiDB and TiKV, and upload the TiDB and TiKV logs (all of them).
  2. Open the pd-ctl command line, run store, member, config show all, and region, and post the current results.

tidblog.zip (2.8 MB) tikvlog.zip (2.2 MB)

[tidb@t03 bin]$ ./pd-ctl -u 10.16.160.57:2379 store
{
  "count": 3,
  "stores": [
    {
      "store": {
        "id": 1,
        "address": "10.16.160.57:20160",
        "version": "3.0.6",
        "state_name": "Up"
      },
      "status": {
        "capacity": "46.83GiB",
        "available": "40.25GiB",
        "leader_weight": 1,
        "region_weight": 1,
        "start_ts": "2019-12-10T17:15:30+08:00",
        "last_heartbeat_ts": "2019-12-10T18:06:30.621218188+08:00",
        "uptime": "51m0.621218188s"
      }
    },
    {
      "store": {
        "id": 4,
        "address": "10.16.160.59:20160",
        "state": 1,
        "version": "3.0.6",
        "state_name": "Offline"
      },
      "status": {
        "leader_weight": 1,
        "region_weight": 1,
        "start_ts": "1970-01-01T08:00:00+08:00"
      }
    },
    {
      "store": {
        "id": 5,
        "address": "10.16.160.58:20160",
        "state": 1,
        "version": "3.0.6",
        "state_name": "Offline"
      },
      "status": {
        "leader_weight": 1,
        "region_weight": 1,
        "start_ts": "1970-01-01T08:00:00+08:00"
      }
    }
  ]
}

[tidb@t03 bin]$ ./pd-ctl -u 10.16.160.57:2379 member
{
  "header": {
    "cluster_id": 6765671312664933091
  },
  "members": [
    {
      "name": "pd_tidb01",
      "member_id": 2053682350972264432,
      "peer_urls": [
        "http://10.16.160.53:2380"
      ],
      "client_urls": [
        "http://10.16.160.53:2379"
      ]
    },
    {
      "name": "pd_t02",
      "member_id": 3365773935136632150,
      "peer_urls": [
        "http://10.16.160.54:2380"
      ],
      "client_urls": [
        "http://10.16.160.54:2379"
      ]
    },
    {
      "name": "pd_t03",
      "member_id": 18203173743506467089,
      "peer_urls": [
        "http://10.16.160.57:2380"
      ],
      "client_urls": [
        "http://10.16.160.57:2379"
      ]
    }
  ],
  "leader": {
    "name": "pd_tidb01",
    "member_id": 2053682350972264432,
    "peer_urls": [
      "http://10.16.160.53:2380"
    ],
    "client_urls": [
      "http://10.16.160.53:2379"
    ]
  },
  "etcd_leader": {
    "name": "pd_tidb01",
    "member_id": 2053682350972264432,
    "peer_urls": [
      "http://10.16.160.53:2380"
    ],
    "client_urls": [
      "http://10.16.160.53:2379"
    ]
  }
}

[tidb@t03 bin]$ ./pd-ctl -u 10.16.160.57:2379 config show all
{
  "client-urls": "http://10.16.160.53:2379",
  "peer-urls": "http://10.16.160.53:2380",
  "advertise-client-urls": "http://10.16.160.53:2379",
  "advertise-peer-urls": "http://10.16.160.53:2380",
  "name": "pd_tidb01",
  "data-dir": "/home/tidb/deploy/data.pd",
  "force-new-cluster": false,
  "enable-grpc-gateway": true,
  "initial-cluster": "pd_t03=http://10.16.160.57:2380,pd_tidb01=http://10.16.160.53:2380,pd_t02=http://10.16.160.54:2380",
  "initial-cluster-state": "new",
  "join": "",
  "lease": 3,
  "log": {
    "level": "info",
    "format": "text",
    "disable-timestamp": false,
    "file": {
      "filename": "/home/tidb/deploy/log/pd.log",
      "log-rotate": true,
      "max-size": 300,
      "max-days": 0,
      "max-backups": 0
    },
    "development": false,
    "disable-caller": false,
    "disable-stacktrace": false,
    "disable-error-verbose": true,
    "sampling": null
  },
  "log-file": "",
  "log-level": "",
  "tso-save-interval": "3s",
  "metric": {
    "job": "pd_tidb01",
    "address": "",
    "interval": "15s"
  },
  "schedule": {
    "max-snapshot-count": 3,
    "max-pending-peer-count": 16,
    "max-merge-region-size": 20,
    "max-merge-region-keys": 200000,
    "split-merge-interval": "1h0m0s",
    "enable-one-way-merge": "false",
    "patrol-region-interval": "100ms",
    "max-store-down-time": "30m0s",
    "leader-schedule-limit": 4,
    "region-schedule-limit": 4,
    "replica-schedule-limit": 8,
    "merge-schedule-limit": 8,
    "hot-region-schedule-limit": 2,
    "hot-region-cache-hits-threshold": 3,
    "store-balance-rate": 1,
    "tolerant-size-ratio": 5,
    "low-space-ratio": 0.8,
    "high-space-ratio": 0.6,
    "scheduler-max-waiting-operator": 3,
    "disable-raft-learner": "false",
    "disable-remove-down-replica": "false",
    "disable-replace-offline-replica": "false",
    "disable-make-up-replica": "false",
    "disable-remove-extra-replica": "false",
    "disable-location-replacement": "false",
    "disable-namespace-relocation": "false",
    "schedulers-v2": [
      {
        "type": "balance-region",
        "args": null,
        "disable": false
      },
      {
        "type": "balance-leader",
        "args": null,
        "disable": false
      },
      {
        "type": "hot-region",
        "args": null,
        "disable": false
      },
      {
        "type": "label",
        "args": null,
        "disable": false
      }
    ]
  },
  "replication": {
    "max-replicas": 3,
    "location-labels": "",
    "strictly-match-label": "false"
  },
  "namespace": {},
  "pd-server": {
    "use-region-storage": "true"
  },
  "cluster-version": "3.0.6",
  "quota-backend-bytes": "0B",
  "auto-compaction-mode": "periodic",
  "auto-compaction-retention-v2": "1h",
  "TickInterval": "500ms",
  "ElectionInterval": "3s",
  "PreVote": true,
  "security": {
    "cacert-path": "",
    "cert-path": "",
    "key-path": ""
  },
  "label-property": {},
  "WarningMsgs": null,
  "namespace-classifier": "table",
  "LeaderPriorityCheckInterval": "1m0s"
}

[tidb@t03 bin]$ ./pd-ctl -u 10.16.160.57:2379 region
{
  "count": 21,
  "regions": [
    {
      "id": 24,
      "start_key": "7480000000000000FF1500000000000000F8",
      "end_key": "7480000000000000FF1700000000000000F8",
      "epoch": {
        "conf_ver": 5,
        "version": 11
      },
      "peers": [
        {
          "id": 25,
          "store_id": 1
        },
        {
          "id": 69,
          "store_id": 4
        },
        {
          "id": 76,
          "store_id": 5
        }
      ]
    },
    {
      "id": 26,
      "start_key": "7480000000000000FF1700000000000000F8",
      "end_key": "7480000000000000FF1900000000000000F8",
      "epoch": {
        "conf_ver": 5,
        "version": 12
      },
      "peers": [
        {
          "id": 27,
          "store_id": 1
        },
        {
          "id": 77,
          "store_id": 4
        },
        {
          "id": 84,
          "store_id": 5
        }
      ]
    },
    {
      "id": 34,
      "start_key": "7480000000000000FF1F00000000000000F8",
      "end_key": "7480000000000000FF2100000000000000F8",
      "epoch": {
        "conf_ver": 5,
        "version": 16
      },
      "peers": [
        {
          "id": 35,
          "store_id": 1
        },
        {
          "id": 92,
          "store_id": 5
        },
        {
          "id": 97,
          "store_id": 4
        }
      ]
    },
    {
      "id": 37,
      "start_key": "7480000000000000FF2100000000000000F8",
      "end_key": "7480000000000000FF2300000000000000F8",
      "epoch": {
        "conf_ver": 5,
        "version": 17
      },
      "peers": [
        {
          "id": 38,
          "store_id": 1
        },
        {
          "id": 93,
          "store_id": 4
        },
        {
          "id": 100,
          "store_id": 5
        }
      ]
    },
    {
      "id": 40,
      "start_key": "7480000000000000FF2300000000000000F8",
      "end_key": "7480000000000000FF2500000000000000F8",
      "epoch": {
        "conf_ver": 5,
        "version": 18
      },
      "peers": [
        {
          "id": 41,
          "store_id": 1
        },
        {
          "id": 101,
          "store_id": 4
        },
        {
          "id": 108,
          "store_id": 5
        }
      ]
    },
    {
      "id": 1001,
      "start_key": "7480000000000000FF2500000000000000F8",
      "end_key": "7480000000000000FF2700000000000000F8",
      "epoch": {
        "conf_ver": 5,
        "version": 19
      },
      "peers": [
        {
          "id": 1002,
          "store_id": 1
        },
        {
          "id": 1003,
          "store_id": 5
        },
        {
          "id": 1004,
          "store_id": 4
        }
      ]
    },
    {
      "id": 2,
      "start_key": "7480000000000000FF2B00000000000000F8",
      "end_key": "",
      "epoch": {
        "conf_ver": 5,
        "version": 21
      },
      "peers": [
        {
          "id": 3,
          "store_id": 1
        },
        {
          "id": 104,
          "store_id": 5
        },
        {
          "id": 107,
          "store_id": 4
        }
      ]
    },
    {
      "id": 20,
      "start_key": "7480000000000000FF1100000000000000F8",
      "end_key": "7480000000000000FF1300000000000000F8",
      "epoch": {
        "conf_ver": 5,
        "version": 9
      },
      "peers": [
        {
          "id": 21,
          "store_id": 1
        },
        {
          "id": 66,
          "store_id": 4
        },
        {
          "id": 72,
          "store_id": 5
        }
      ]
    },
    {
      "id": 1009,
      "start_key": "7480000000000000FF2900000000000000F8",
      "end_key": "7480000000000000FF2B00000000000000F8",
      "epoch": {
        "conf_ver": 5,
        "version": 21
      },
      "peers": [
        {
          "id": 1010,
          "store_id": 1
        },
        {
          "id": 1011,
          "store_id": 5
        },
        {
          "id": 1012,
          "store_id": 4
        }
      ]
    },
    {
      "id": 28,
      "start_key": "7480000000000000FF1900000000000000F8",
      "end_key": "7480000000000000FF1B00000000000000F8",
      "epoch": {
        "conf_ver": 5,
        "version": 13
      },
      "peers": [
        {
          "id": 29,
          "store_id": 1
        },
        {
          "id": 80,
          "store_id": 5
        },
        {
          "id": 85,
          "store_id": 4
        }
      ]
    },
    {
      "id": 14,
      "start_key": "7480000000000000FF0B00000000000000F8",
      "end_key": "7480000000000000FF0D00000000000000F8",
      "epoch": {
        "conf_ver": 5,
        "version": 6
      },
      "peers": [
        {
          "id": 15,
          "store_id": 1
        },
        {
          "id": 52,
          "store_id": 4
        },
        {
          "id": 60,
          "store_id": 5
        }
      ]
    },
    {
      "id": 18,
      "start_key": "7480000000000000FF0F00000000000000F8",
      "end_key": "7480000000000000FF1100000000000000F8",
      "epoch": {
        "conf_ver": 5,
        "version": 8
      },
      "peers": [
        {
          "id": 19,
          "store_id": 1
        },
        {
          "id": 64,
          "store_id": 5
        },
        {
          "id": 105,
          "store_id": 4
        }
      ]
    },
    {
      "id": 1005,
      "start_key": "7480000000000000FF2700000000000000F8",
      "end_key": "7480000000000000FF2900000000000000F8",
      "epoch": {
        "conf_ver": 5,
        "version": 20
      },
      "peers": [
        {
          "id": 1006,
          "store_id": 1
        },
        {
          "id": 1007,
          "store_id": 5
        },
        {
          "id": 1008,
          "store_id": 4
        }
      ]
    },
    {
      "id": 10,
      "start_key": "7480000000000000FF0700000000000000F8",
      "end_key": "7480000000000000FF0900000000000000F8",
      "epoch": {
        "conf_ver": 5,
        "version": 4
      },
      "peers": [
        {
          "id": 11,
          "store_id": 1
        },
        {
          "id": 44,
          "store_id": 4
        },
        {
          "id": 54,
          "store_id": 5
        }
      ]
    },
    {
      "id": 32,
      "start_key": "7480000000000000FF1D00000000000000F8",
      "end_key": "7480000000000000FF1F00000000000000F8",
      "epoch": {
        "conf_ver": 5,
        "version": 15
      },
      "peers": [
        {
          "id": 33,
          "store_id": 1
        },
        {
          "id": 89,
          "store_id": 4
        },
        {
          "id": 96,
          "store_id": 5
        }
      ]
    },
    {
      "id": 12,
      "start_key": "7480000000000000FF0900000000000000F8",
      "end_key": "7480000000000000FF0B00000000000000F8",
      "epoch": {
        "conf_ver": 5,
        "version": 5
      },
      "peers": [
        {
          "id": 13,
          "store_id": 1
        },
        {
          "id": 45,
          "store_id": 5
        },
        {
          "id": 56,
          "store_id": 4
        }
      ]
    },
    {
      "id": 16,
      "start_key": "7480000000000000FF0D00000000000000F8",
      "end_key": "7480000000000000FF0F00000000000000F8",
      "epoch": {
        "conf_ver": 5,
        "version": 7
      },
      "peers": [
        {
          "id": 17,
          "store_id": 1
        },
        {
          "id": 58,
          "store_id": 5
        },
        {
          "id": 62,
          "store_id": 4
        }
      ]
    },
    {
      "id": 22,
      "start_key": "7480000000000000FF1300000000000000F8",
      "end_key": "7480000000000000FF1500000000000000F8",
      "epoch": {
        "conf_ver": 5,
        "version": 10
      },
      "peers": [
        {
          "id": 23,
          "store_id": 1
        },
        {
          "id": 68,
          "store_id": 5
        },
        {
          "id": 73,
          "store_id": 4
        }
      ]
    },
    {
      "id": 30,
      "start_key": "7480000000000000FF1B00000000000000F8",
      "end_key": "7480000000000000FF1D00000000000000F8",
      "epoch": {
        "conf_ver": 5,
        "version": 14
      },
      "peers": [
        {
          "id": 31,
          "store_id": 1
        },
        {
          "id": 81,
          "store_id": 4
        },
        {
          "id": 88,
          "store_id": 5
        }
      ]
    },
    {
      "id": 6,
      "start_key": "",
      "end_key": "7480000000000000FF0500000000000000F8",
      "epoch": {
        "conf_ver": 5,
        "version": 2
      },
      "peers": [
        {
          "id": 7,
          "store_id": 1
        },
        {
          "id": 36,
          "store_id": 4
        },
        {
          "id": 50,
          "store_id": 5
        }
      ]
    },
    {
      "id": 8,
      "start_key": "7480000000000000FF0500000000000000F8",
      "end_key": "7480000000000000FF0700000000000000F8",
      "epoch": {
        "conf_ver": 5,
        "version": 3
      },
      "peers": [
        {
          "id": 9,
          "store_id": 1
        },
        {
          "id": 42,
          "store_id": 5
        },
        {
          "id": 48,
          "store_id": 4
        }
      ]
    }
  ]
}
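
Every Region in this listing has one peer on store 1 and its other two peers on stores 4 and 5. A Raft group needs strictly more than half of its peers to make progress, so with the data on stores 4 and 5 gone, a Region with this layout has no majority left. A minimal Python sketch of that check, assuming the `pd-ctl region` JSON is available as a string with plain ASCII quotes; the function name `regions_without_quorum` is hypothetical, and the sample holds only the first two Regions, trimmed to the fields used.

```python
import json
from typing import List, Set

# First two Regions from the `pd-ctl region` output, trimmed to id and peers.
SAMPLE = """
{"count": 2, "regions": [
  {"id": 24, "peers": [{"id": 25, "store_id": 1},
                       {"id": 69, "store_id": 4},
                       {"id": 76, "store_id": 5}]},
  {"id": 26, "peers": [{"id": 27, "store_id": 1},
                       {"id": 77, "store_id": 4},
                       {"id": 84, "store_id": 5}]}
]}
"""


def regions_without_quorum(pd_region_json: str, failed_stores: Set[int]) -> List[int]:
    """Return IDs of Regions whose surviving peers are not a strict majority."""
    data = json.loads(pd_region_json)
    lost = []
    for region in data["regions"]:
        peers = region["peers"]
        alive = [p for p in peers if p["store_id"] not in failed_stores]
        # Raft needs strictly more than half of the peers to elect a leader.
        if len(alive) * 2 <= len(peers):
            lost.append(region["id"])
    return lost


if __name__ == "__main__":
    print(regions_without_quorum(SAMPLE, {4, 5}))
```

With only one of the two stores failed, the same function returns an empty list, since two of three peers still form a majority.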

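One moment, let me summarize and reply.

  1. Stop the healthy TiKV instance.
  2. Run a command along these lines: ./tikv-ctl --db <deploy_dir>/data/db unsafe-recover remove-fail-stores -s x,x --all-regions (replace x,x with the store IDs of the two stores that were rm-deleted).
  3. Start the TiKV instance and check whether its log looks normal.
  4. Add the other two instances back.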
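
For this cluster the two removed stores are IDs 4 and 5 according to the pd-ctl store output earlier in the thread. A minimal Python sketch that only assembles the command from step 2 so it can be reviewed before anyone runs it; the helper name `remove_fail_stores_cmd` is hypothetical, and the deploy directory `/home/tidb/deploy` is an assumption borrowed from the PD paths in the config output, so check it against the real TiKV deploy directory.

```python
import shlex
from typing import Iterable, List


def remove_fail_stores_cmd(deploy_dir: str, failed_store_ids: Iterable[int]) -> List[str]:
    """Build the tikv-ctl unsafe-recover argument list from step 2 (does not run it)."""
    ids = ",".join(str(store_id) for store_id in failed_store_ids)
    return [
        "./tikv-ctl",
        "--db", f"{deploy_dir}/data/db",
        "unsafe-recover", "remove-fail-stores",
        "-s", ids,
        "--all-regions",
    ]


if __name__ == "__main__":
    # Assumed deploy directory; store IDs 4 and 5 come from `pd-ctl store`.
    cmd = remove_fail_stores_cmd("/home/tidb/deploy", [4, 5])
    print(" ".join(shlex.quote(part) for part in cmd))
```

The command is meant to be run on the surviving TiKV node while its tikv-server process is stopped, as step 1 requires.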
TiKV has started normally, TiDB is up as well, and monitoring has recovered. Thanks.

:+1::+1::+1: