Three TiKV nodes: after deleting the deploy directory on two of them with rm, TiDB kept working, but re-adding the nodes failed

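  • 【TiDB version】: 3.0.6 (this build's version string is missing the "v" prefix; it has not yet been replaced with the latest v3.0.6)
  • 【Problem description】: The screenshot shows the latest startup error. There are five test machines: 53 and 54 run TiDB; 53, 54, and 57 run PD; 57, 58, and 59 run TiKV. The data volume is tiny: a single table with a few rows.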

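I wanted to test recovery from the loss of multiple TiKV replicas. I manually ran rm to delete every file under the deploy installation directory on 58 and 59, and found that the cluster kept running normally. After waiting a while, I manually killed the tikv-server process on node 57 and found that the process restarted automatically. After stopping this TiKV from the control machine with stop.yml, I checked the regions with tikv-ctl, which reported "all regions are healthy".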

I then rebooted the virtual machines of all three TiKV nodes. After the reboot, 57, the only remaining TiKV, still started normally.

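From the control machine I reinstalled the two deleted TiKV nodes with deploy.yml --tag tikv -l 10.16.160.58,10.16.160.59. After the installation finished, they failed to start and reported: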

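Because this 3.0.6 build has the version-string problem, running rolling_update failed with a version number conflict. I manually shut down the whole cluster, ran ansible-playbook deploy.yml against the whole cluster, and then ran start.yml. The error was the same as the latest one.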

I tried to take 58 and 59 offline with pd-ctl delete, but they stay stuck in the Offline state.

[tidb@tidb01 bin]$ ./pd-ctl -u 10.16.160.57:2379 store
{
  "count": 3,
  "stores": [
    {
      "store": { "id": 1, "address": "10.16.160.57:20160", "version": "3.0.6", "state_name": "Up" },
      "status": { "capacity": "46.83GiB", "available": "40.26GiB", "leader_weight": 1, "region_weight": 1, "start_ts": "2019-12-10T17:15:30+08:00", "last_heartbeat_ts": "2019-12-10T17:39:30.476500826+08:00", "uptime": "24m0.476500826s" }
    },
    {
      "store": { "id": 4, "address": "10.16.160.59:20160", "state": 1, "version": "3.0.6", "state_name": "Offline" },
      "status": { "leader_weight": 1, "region_weight": 1, "start_ts": "1970-01-01T08:00:00+08:00" }
    },
    {
      "store": { "id": 5, "address": "10.16.160.58:20160", "state": 1, "version": "3.0.6", "state_name": "Offline" },
      "status": { "leader_weight": 1, "region_weight": 1, "start_ts": "1970-01-01T08:00:00+08:00" }
    }
  ]
}
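For anyone who wants to script this check, here is a minimal Python sketch that picks out the stores PD does not report as Up. The `sample` JSON is trimmed from the `pd-ctl store` output above (status fields omitted), and `stores_not_up` is a hypothetical helper name, not part of pd-ctl.

```python
import json

# Trimmed from the `pd-ctl store` output in this thread (status fields omitted).
sample = """
{"stores": [
  {"store": {"id": 1, "address": "10.16.160.57:20160", "state_name": "Up"}},
  {"store": {"id": 4, "address": "10.16.160.59:20160", "state": 1, "state_name": "Offline"}},
  {"store": {"id": 5, "address": "10.16.160.58:20160", "state": 1, "state_name": "Offline"}}
]}
"""

def stores_not_up(store_json):
    """Return (id, address, state_name) for every store whose state is not Up."""
    data = json.loads(store_json)
    return [
        (s["store"]["id"], s["store"]["address"], s["store"]["state_name"])
        for s in data["stores"]
        if s["store"]["state_name"] != "Up"
    ]

for store_id, address, state in stores_not_up(sample):
    print(store_id, address, state)
```

In this thread the two stores it reports, 4 and 5, are the ones whose deploy directories were deleted.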

Now both TiDB nodes fail to start, and two of the TiKV nodes fail to start. The image below is the log from TiKV 58.

Hello: 1. Try starting TiDB and TiKV, and upload the TiDB and TiKV logs (all of them). 2. Enter the pd-ctl command line, run store, member, config show all, and region, and post the current results.

tidblog.zip (2.8 MB) tikvlog.zip (2.2 MB)

[tidb@t03 bin]$ ./pd-ctl -u 10.16.160.57:2379 store
{
  "count": 3,
  "stores": [
    {
      "store": { "id": 1, "address": "10.16.160.57:20160", "version": "3.0.6", "state_name": "Up" },
      "status": { "capacity": "46.83GiB", "available": "40.25GiB", "leader_weight": 1, "region_weight": 1, "start_ts": "2019-12-10T17:15:30+08:00", "last_heartbeat_ts": "2019-12-10T18:06:30.621218188+08:00", "uptime": "51m0.621218188s" }
    },
    {
      "store": { "id": 4, "address": "10.16.160.59:20160", "state": 1, "version": "3.0.6", "state_name": "Offline" },
      "status": { "leader_weight": 1, "region_weight": 1, "start_ts": "1970-01-01T08:00:00+08:00" }
    },
    {
      "store": { "id": 5, "address": "10.16.160.58:20160", "state": 1, "version": "3.0.6", "state_name": "Offline" },
      "status": { "leader_weight": 1, "region_weight": 1, "start_ts": "1970-01-01T08:00:00+08:00" }
    }
  ]
}

[tidb@t03 bin]$ ./pd-ctl -u 10.16.160.57:2379 member
{
  "header": { "cluster_id": 6765671312664933091 },
  "members": [
    { "name": "pd_tidb01", "member_id": 2053682350972264432, "peer_urls": [ "http://10.16.160.53:2380" ], "client_urls": [ "http://10.16.160.53:2379" ] },
    { "name": "pd_t02", "member_id": 3365773935136632150, "peer_urls": [ "http://10.16.160.54:2380" ], "client_urls": [ "http://10.16.160.54:2379" ] },
    { "name": "pd_t03", "member_id": 18203173743506467089, "peer_urls": [ "http://10.16.160.57:2380" ], "client_urls": [ "http://10.16.160.57:2379" ] }
  ],
  "leader": { "name": "pd_tidb01", "member_id": 2053682350972264432, "peer_urls": [ "http://10.16.160.53:2380" ], "client_urls": [ "http://10.16.160.53:2379" ] },
  "etcd_leader": { "name": "pd_tidb01", "member_id": 2053682350972264432, "peer_urls": [ "http://10.16.160.53:2380" ], "client_urls": [ "http://10.16.160.53:2379" ] }
}

[tidb@t03 bin]$ ./pd-ctl -u 10.16.160.57:2379 config show all
{
  "client-urls": "http://10.16.160.53:2379",
  "peer-urls": "http://10.16.160.53:2380",
  "advertise-client-urls": "http://10.16.160.53:2379",
  "advertise-peer-urls": "http://10.16.160.53:2380",
  "name": "pd_tidb01",
  "data-dir": "/home/tidb/deploy/data.pd",
  "force-new-cluster": false,
  "enable-grpc-gateway": true,
  "initial-cluster": "pd_t03=http://10.16.160.57:2380,pd_tidb01=http://10.16.160.53:2380,pd_t02=http://10.16.160.54:2380",
  "initial-cluster-state": "new",
  "join": "",
  "lease": 3,
  "log": {
    "level": "info",
    "format": "text",
    "disable-timestamp": false,
    "file": { "filename": "/home/tidb/deploy/log/pd.log", "log-rotate": true, "max-size": 300, "max-days": 0, "max-backups": 0 },
    "development": false,
    "disable-caller": false,
    "disable-stacktrace": false,
    "disable-error-verbose": true,
    "sampling": null
  },
  "log-file": "",
  "log-level": "",
  "tso-save-interval": "3s",
  "metric": { "job": "pd_tidb01", "address": "", "interval": "15s" },
  "schedule": {
    "max-snapshot-count": 3,
    "max-pending-peer-count": 16,
    "max-merge-region-size": 20,
    "max-merge-region-keys": 200000,
    "split-merge-interval": "1h0m0s",
    "enable-one-way-merge": "false",
    "patrol-region-interval": "100ms",
    "max-store-down-time": "30m0s",
    "leader-schedule-limit": 4,
    "region-schedule-limit": 4,
    "replica-schedule-limit": 8,
    "merge-schedule-limit": 8,
    "hot-region-schedule-limit": 2,
    "hot-region-cache-hits-threshold": 3,
    "store-balance-rate": 1,
    "tolerant-size-ratio": 5,
    "low-space-ratio": 0.8,
    "high-space-ratio": 0.6,
    "scheduler-max-waiting-operator": 3,
    "disable-raft-learner": "false",
    "disable-remove-down-replica": "false",
    "disable-replace-offline-replica": "false",
    "disable-make-up-replica": "false",
    "disable-remove-extra-replica": "false",
    "disable-location-replacement": "false",
    "disable-namespace-relocation": "false",
    "schedulers-v2": [
      { "type": "balance-region", "args": null, "disable": false },
      { "type": "balance-leader", "args": null, "disable": false },
      { "type": "hot-region", "args": null, "disable": false },
      { "type": "label", "args": null, "disable": false }
    ]
  },
  "replication": { "max-replicas": 3, "location-labels": "", "strictly-match-label": "false" },
  "namespace": {},
  "pd-server": { "use-region-storage": "true" },
  "cluster-version": "3.0.6",
  "quota-backend-bytes": "0B",
  "auto-compaction-mode": "periodic",
  "auto-compaction-retention-v2": "1h",
  "TickInterval": "500ms",
  "ElectionInterval": "3s",
  "PreVote": true,
  "security": { "cacert-path": "", "cert-path": "", "key-path": "" },
  "label-property": {},
  "WarningMsgs": null,
  "namespace-classifier": "table",
  "LeaderPriorityCheckInterval": "1m0s"
}

[tidb@t03 bin]$ ./pd-ctl -u 10.16.160.57:2379 region
{
  "count": 21,
  "regions": [
    { "id": 24, "start_key": "7480000000000000FF1500000000000000F8", "end_key": "7480000000000000FF1700000000000000F8", "epoch": { "conf_ver": 5, "version": 11 }, "peers": [ { "id": 25, "store_id": 1 }, { "id": 69, "store_id": 4 }, { "id": 76, "store_id": 5 } ] },
    { "id": 26, "start_key": "7480000000000000FF1700000000000000F8", "end_key": "7480000000000000FF1900000000000000F8", "epoch": { "conf_ver": 5, "version": 12 }, "peers": [ { "id": 27, "store_id": 1 }, { "id": 77, "store_id": 4 }, { "id": 84, "store_id": 5 } ] },
    { "id": 34, "start_key": "7480000000000000FF1F00000000000000F8", "end_key": "7480000000000000FF2100000000000000F8", "epoch": { "conf_ver": 5, "version": 16 }, "peers": [ { "id": 35, "store_id": 1 }, { "id": 92, "store_id": 5 }, { "id": 97, "store_id": 4 } ] },
    { "id": 37, "start_key": "7480000000000000FF2100000000000000F8", "end_key": "7480000000000000FF2300000000000000F8", "epoch": { "conf_ver": 5, "version": 17 }, "peers": [ { "id": 38, "store_id": 1 }, { "id": 93, "store_id": 4 }, { "id": 100, "store_id": 5 } ] },
    { "id": 40, "start_key": "7480000000000000FF2300000000000000F8", "end_key": "7480000000000000FF2500000000000000F8", "epoch": { "conf_ver": 5, "version": 18 }, "peers": [ { "id": 41, "store_id": 1 }, { "id": 101, "store_id": 4 }, { "id": 108, "store_id": 5 } ] },
    { "id": 1001, "start_key": "7480000000000000FF2500000000000000F8", "end_key": "7480000000000000FF2700000000000000F8", "epoch": { "conf_ver": 5, "version": 19 }, "peers": [ { "id": 1002, "store_id": 1 }, { "id": 1003, "store_id": 5 }, { "id": 1004, "store_id": 4 } ] },
    { "id": 2, "start_key": "7480000000000000FF2B00000000000000F8", "end_key": "", "epoch": { "conf_ver": 5, "version": 21 }, "peers": [ { "id": 3, "store_id": 1 }, { "id": 104, "store_id": 5 }, { "id": 107, "store_id": 4 } ] },
    { "id": 20, "start_key": "7480000000000000FF1100000000000000F8", "end_key": "7480000000000000FF1300000000000000F8", "epoch": { "conf_ver": 5, "version": 9 }, "peers": [ { "id": 21, "store_id": 1 }, { "id": 66, "store_id": 4 }, { "id": 72, "store_id": 5 } ] },
    { "id": 1009, "start_key": "7480000000000000FF2900000000000000F8", "end_key": "7480000000000000FF2B00000000000000F8", "epoch": { "conf_ver": 5, "version": 21 }, "peers": [ { "id": 1010, "store_id": 1 }, { "id": 1011, "store_id": 5 }, { "id": 1012, "store_id": 4 } ] },
    { "id": 28, "start_key": "7480000000000000FF1900000000000000F8", "end_key": "7480000000000000FF1B00000000000000F8", "epoch": { "conf_ver": 5, "version": 13 }, "peers": [ { "id": 29, "store_id": 1 }, { "id": 80, "store_id": 5 }, { "id": 85, "store_id": 4 } ] },
    { "id": 14, "start_key": "7480000000000000FF0B00000000000000F8", "end_key": "7480000000000000FF0D00000000000000F8", "epoch": { "conf_ver": 5, "version": 6 }, "peers": [ { "id": 15, "store_id": 1 }, { "id": 52, "store_id": 4 }, { "id": 60, "store_id": 5 } ] },
    { "id": 18, "start_key": "7480000000000000FF0F00000000000000F8", "end_key": "7480000000000000FF1100000000000000F8", "epoch": { "conf_ver": 5, "version": 8 }, "peers": [ { "id": 19, "store_id": 1 }, { "id": 64, "store_id": 5 }, { "id": 105, "store_id": 4 } ] },
    { "id": 1005, "start_key": "7480000000000000FF2700000000000000F8", "end_key": "7480000000000000FF2900000000000000F8", "epoch": { "conf_ver": 5, "version": 20 }, "peers": [ { "id": 1006, "store_id": 1 }, { "id": 1007, "store_id": 5 }, { "id": 1008, "store_id": 4 } ] },
    { "id": 10, "start_key": "7480000000000000FF0700000000000000F8", "end_key": "7480000000000000FF0900000000000000F8", "epoch": { "conf_ver": 5, "version": 4 }, "peers": [ { "id": 11, "store_id": 1 }, { "id": 44, "store_id": 4 }, { "id": 54, "store_id": 5 } ] },
    { "id": 32, "start_key": "7480000000000000FF1D00000000000000F8", "end_key": "7480000000000000FF1F00000000000000F8", "epoch": { "conf_ver": 5, "version": 15 }, "peers": [ { "id": 33, "store_id": 1 }, { "id": 89, "store_id": 4 }, { "id": 96, "store_id": 5 } ] },
    { "id": 12, "start_key": "7480000000000000FF0900000000000000F8", "end_key": "7480000000000000FF0B00000000000000F8", "epoch": { "conf_ver": 5, "version": 5 }, "peers": [ { "id": 13, "store_id": 1 }, { "id": 45, "store_id": 5 }, { "id": 56, "store_id": 4 } ] },
    { "id": 16, "start_key": "7480000000000000FF0D00000000000000F8", "end_key": "7480000000000000FF0F00000000000000F8", "epoch": { "conf_ver": 5, "version": 7 }, "peers": [ { "id": 17, "store_id": 1 }, { "id": 58, "store_id": 5 }, { "id": 62, "store_id": 4 } ] },
    { "id": 22, "start_key": "7480000000000000FF1300000000000000F8", "end_key": "7480000000000000FF1500000000000000F8", "epoch": { "conf_ver": 5, "version": 10 }, "peers": [ { "id": 23, "store_id": 1 }, { "id": 68, "store_id": 5 }, { "id": 73, "store_id": 4 } ] },
    { "id": 30, "start_key": "7480000000000000FF1B00000000000000F8", "end_key": "7480000000000000FF1D00000000000000F8", "epoch": { "conf_ver": 5, "version": 14 }, "peers": [ { "id": 31, "store_id": 1 }, { "id": 81, "store_id": 4 }, { "id": 88, "store_id": 5 } ] },
    { "id": 6, "start_key": "", "end_key": "7480000000000000FF0500000000000000F8", "epoch": { "conf_ver": 5, "version": 2 }, "peers": [ { "id": 7, "store_id": 1 }, { "id": 36, "store_id": 4 }, { "id": 50, "store_id": 5 } ] },
    { "id": 8, "start_key": "7480000000000000FF0500000000000000F8", "end_key": "7480000000000000FF0700000000000000F8", "epoch": { "conf_ver": 5, "version": 3 }, "peers": [ { "id": 9, "store_id": 1 }, { "id": 42, "store_id": 5 }, { "id": 48, "store_id": 4 } ] }
  ]
}
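Every region in this output has one peer on store 1 and one each on stores 4 and 5. With stores 4 and 5 gone, each region is left with one of three peers, which is below a Raft majority. A minimal Python sketch of that check follows, using three regions copied from the output above with keys and epochs omitted. `regions_without_quorum` is a hypothetical helper name, not part of pd-ctl.

```python
import json

# Three regions trimmed from the `pd-ctl region` output in this thread.
sample = """
{"regions": [
  {"id": 24, "peers": [{"id": 25, "store_id": 1}, {"id": 69, "store_id": 4}, {"id": 76, "store_id": 5}]},
  {"id": 2,  "peers": [{"id": 3,  "store_id": 1}, {"id": 104, "store_id": 5}, {"id": 107, "store_id": 4}]},
  {"id": 6,  "peers": [{"id": 7,  "store_id": 1}, {"id": 36, "store_id": 4}, {"id": 50, "store_id": 5}]}
]}
"""

def regions_without_quorum(region_json, failed_stores):
    """Return IDs of regions whose surviving peers are fewer than a majority."""
    failed = set(failed_stores)
    lost = []
    for region in json.loads(region_json)["regions"]:
        peers = region["peers"]
        alive = sum(1 for p in peers if p["store_id"] not in failed)
        majority = len(peers) // 2 + 1
        if alive < majority:
            lost.append(region["id"])
    return lost

print(regions_without_quorum(sample, [4, 5]))
```

Losing only one of the two stores would leave two of three peers per region, so the same function returns an empty list for `[4]`.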

One moment, let me put together a summary reply.

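  1. Stop the healthy TiKV instance.
  2. Use this command as a reference: ./tikv-ctl --db <deploy_dir>/data/db unsafe-recover remove-fail-stores -s x,x --all-regions (replace x,x with the store IDs of the two stores that were deleted with rm).
  3. Start the TiKV instance and check whether its log looks normal.
  4. Add the other two instances back.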
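As a rough illustration of step 2, the Python sketch below only assembles the command line and prints it; it does not run tikv-ctl. The deploy directory `/home/tidb/deploy` and the store IDs 4 and 5 are assumptions taken from the pd-ctl output earlier in the thread (the Offline stores), and `build_remove_fail_stores_cmd` is a hypothetical helper name.

```python
import shlex

def build_remove_fail_stores_cmd(deploy_dir, failed_store_ids):
    """Build the tikv-ctl argument list quoted in step 2; nothing is executed."""
    if not failed_store_ids:
        raise ValueError("at least one failed store id is required")
    return [
        "./tikv-ctl",
        "--db", f"{deploy_dir}/data/db",
        "unsafe-recover", "remove-fail-stores",
        "-s", ",".join(str(i) for i in failed_store_ids),
        "--all-regions",
    ]

# Assumptions: deploy_dir is /home/tidb/deploy, and 4 and 5 are the deleted stores.
cmd = build_remove_fail_stores_cmd("/home/tidb/deploy", [4, 5])
print(shlex.join(cmd))
```

Per step 1, the printed command is meant to be run on the surviving TiKV node while that instance is stopped.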

TiKV has started normally, TiDB is up as well, and monitoring has recovered. Thank you.

:+1::+1::+1: