A machine went down, TiKV cannot start, and the cluster is abnormal and unavailable

To speed things up, please provide the following information when asking a question; clearly described problems get a faster response.

  • [TiDB version]: 4.0.5
  • [Problem description]:
    6 machines, each running two TiKV instances. After one machine crashed, the whole TiDB cluster became abnormal and unavailable. After the failed machine was rebooted, the two TiKV instances on it could not start: [2020/10/26 20:48:43.223 +08:00] [FATAL] [server.rs:591] ["failed to start node: EngineTraits(Other(\"[components/raftstore/src/store/fsm/store.rs:837]: [components/raftstore/src/store/peer_storage.rs:385]: [region 13024] entry at apply index 4531624 doesn't exist, may lose data.\"))"]

Scheduling did not seem to kick in: region leaders were not transferred off the failed node. Taking the failed node offline with tiup scale-in did not help, and neither did restarting the whole cluster.
sync-log setting: raftstore.sync-log: false

pd-ctl config show all:

» config show all
{
“client-urls”: “http://0.0.0.0:2379”,
“peer-urls”: “http://172.20.53.141:2380”,
“advertise-client-urls”: “http://172.20.53.141:2379”,
“advertise-peer-urls”: “http://172.20.53.141:2380”,
“name”: “pd-53-141-2379”,
“data-dir”: “/apps/dba/data/tidb/pd_2379”,
“force-new-cluster”: false,
“enable-grpc-gateway”: true,
“initial-cluster”: “pd-53-129-2379=http://172.20.53.129:2380,pd-53-141-2379=http://172.20.53.141:2380,pd-53-138-2379=http://172.20.53.138:2380”,
“initial-cluster-state”: “new”,
“join”: “”,
“lease”: 3,
“log”: {
“level”: “info”,
“format”: “text”,
“disable-timestamp”: false,
“file”: {
“filename”: “/apps/dba/logs/tidb/pd_2379/pd.log”,
“max-size”: 100,
“max-days”: 30,
“max-backups”: 7
},
“development”: false,
“disable-caller”: false,
“disable-stacktrace”: false,
“disable-error-verbose”: true,
“sampling”: null
},
“tso-save-interval”: “3s”,
“metric”: {
“job”: “pd-53-141-2379”,
“address”: “”,
“interval”: “15s”
},
“schedule”: {
“max-snapshot-count”: 3,
“max-pending-peer-count”: 16,
“max-merge-region-size”: 20,
“max-merge-region-keys”: 200000,
“split-merge-interval”: “1h0m0s”,
“enable-one-way-merge”: “false”,
“enable-cross-table-merge”: “false”,
“patrol-region-interval”: “100ms”,
“max-store-down-time”: “3m0s”,
“leader-schedule-limit”: 64,
“leader-schedule-policy”: “count”,
“region-schedule-limit”: 4096,
“replica-schedule-limit”: 64,
“merge-schedule-limit”: 16,
“hot-region-schedule-limit”: 0,
“hot-region-cache-hits-threshold”: 3,
“store-limit”: {
“1”: {
“add-peer”: 15,
“remove-peer”: 15
},
“12”: {
“add-peer”: 15,
“remove-peer”: 15
},
“15”: {
“add-peer”: 15,
“remove-peer”: 15
},
“2”: {
“add-peer”: 15,
“remove-peer”: 15
},
“20”: {
“add-peer”: 15,
“remove-peer”: 15
},
“21”: {
“add-peer”: 15,
“remove-peer”: 15
},
“22”: {
“add-peer”: 15,
“remove-peer”: 15
},
“23”: {
“add-peer”: 15,
“remove-peer”: 15
},
“24”: {
“add-peer”: 15,
“remove-peer”: 15
},
“3”: {
“add-peer”: 15,
“remove-peer”: 15
},
“5”: {
“add-peer”: 15,
“remove-peer”: 15
},
“8”: {
“add-peer”: 15,
“remove-peer”: 15
}
},
“tolerant-size-ratio”: 0,
“low-space-ratio”: 0.9,
“high-space-ratio”: 0.8,
“scheduler-max-waiting-operator”: 5,
“enable-remove-down-replica”: “true”,
“enable-replace-offline-replica”: “true”,
“enable-make-up-replica”: “true”,
“enable-remove-extra-replica”: “true”,
“enable-location-replacement”: “true”,
“enable-debug-metrics”: “false”,
“schedulers-v2”: [
{
“type”: “balance-region”,
“args”: null,
“disable”: false,
“args-payload”: “”
},
{
“type”: “balance-leader”,
“args”: null,
“disable”: false,
“args-payload”: “”
},
{
“type”: “hot-region”,
“args”: null,
“disable”: false,
“args-payload”: “”
},
{
“type”: “label”,
“args”: null,
“disable”: false,
“args-payload”: “”
}
],
“schedulers-payload”: {
“balance-hot-region-scheduler”: null,
“balance-leader-scheduler”: {
“name”: “balance-leader-scheduler”,
“ranges”: [
{
“end-key”: “”,
“start-key”: “”
}
]
},
“balance-region-scheduler”: {
“name”: “balance-region-scheduler”,
“ranges”: [
{
“end-key”: “”,
“start-key”: “”
}
]
},
“label-scheduler”: {
“name”: “label-scheduler”,
“ranges”: [
{
“end-key”: “”,
“start-key”: “”
}
]
}
},
“store-limit-mode”: “manual”
},
“replication”: {
“max-replicas”: 2,
“location-labels”: “host,iport”,
“strictly-match-label”: “false”,
“enable-placement-rules”: “true”
},
“pd-server”: {
“use-region-storage”: “true”,
“max-gap-reset-ts”: “24h0m0s”,
“key-type”: “table”,
“runtime-services”: “”,
“metric-storage”: “http://172.20.51.225:9090”,
“dashboard-address”: “http://172.20.53.129:2379”,
“trace-region-flow”: “false”
},
“cluster-version”: “4.0.5”,
“quota-backend-bytes”: “8GiB”,
“auto-compaction-mode”: “periodic”,
“auto-compaction-retention-v2”: “1h”,
“TickInterval”: “500ms”,
“ElectionInterval”: “15s”,
“PreVote”: true,
“security”: {
“cacert-path”: “”,
“cert-path”: “”,
“key-path”: “”,
“cert-allowed-cn”: null
},
“label-property”: {},
“WarningMsgs”: [
“disable-telemetry in conf/pd.toml is deprecated, use enable-telemetry instead”,
“Config contains undefined item: enable-dynamic-config, use-region-storage”
],
“DisableStrictReconfigCheck”: false,
“HeartbeatStreamBindInterval”: “1m0s”,
“LeaderPriorityCheckInterval”: “1m0s”,
“dashboard”: {
“tidb-cacert-path”: “”,
“tidb-cert-path”: “”,
“tidb-key-path”: “”,
“public-path-prefix”: “”,
“internal-proxy”: true,
“enable-telemetry”: false,
“disable-telemetry”: true
},
“replication-mode”: {
“replication-mode”: “majority”,
“dr-auto-sync”: {
“label-key”: “”,
“primary”: “”,
“dr”: “”,
“primary-replicas”: 0,
“dr-replicas”: 0,
“wait-store-timeout”: “1m0s”,
“wait-sync-timeout”: “1m0s”
}
}
}

»

tiup cluster display:

Found cluster newer version:

The latest version:         v1.2.1
Local installed version:    v1.1.1
Update current component:   tiup update cluster
Update all components:      tiup update --all

Starting component cluster: /home/apps/.tiup/components/cluster/v1.1.1/tiup-cluster display tidblepro
tidb Cluster: tidblepro
tidb Version: v4.0.5
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir


172.20.51.89:9093 alertmanager 172.20.51.89 9093/9094 linux/x86_64 inactive /apps/dba/data/tidb/alert_9093 /apps/dba/svr/tidb/alert_9093
172.20.51.89:3000 grafana 172.20.51.89 3000 linux/x86_64 inactive - /apps/dba/svr/tidb/grafana_3000
172.20.53.129:2379 pd 172.20.53.129 2379/2380 linux/x86_64 Up|UI /apps/dba/data/tidb/pd_2379 /apps/dba/svr/tidb/pd_2379
172.20.53.138:2379 pd 172.20.53.138 2379/2380 linux/x86_64 Up|L /apps/dba/data/tidb/pd_2379 /apps/dba/svr/tidb/pd_2379
172.20.53.141:2379 pd 172.20.53.141 2379/2380 linux/x86_64 Up /apps/dba/data/tidb/pd_2379 /apps/dba/svr/tidb/pd_2379
172.20.51.225:9090 prometheus 172.20.51.225 9090 linux/x86_64 inactive /data/dba/data/tidb/prometheus_9090 /data/dba/svr/tidb/prometheus_9090
172.20.53.131:4000 tidb 172.20.53.131 4000/10080 linux/x86_64 Down - /apps/dba/svr/tidb/tidb_4000
172.20.53.133:4000 tidb 172.20.53.133 4000/10080 linux/x86_64 Down - /apps/dba/svr/tidb/tidb_4000
172.20.53.135:4000 tidb 172.20.53.135 4000/10080 linux/x86_64 Down - /apps/dba/svr/tidb/tidb_4000
172.20.53.141:4000 tidb 172.20.53.141 4000/10080 linux/x86_64 Down - /apps/dba/svr/tidb/tidb_4000
172.20.55.100:20160 tikv 172.20.55.100 20160/20180 linux/x86_64 Up /apps/dba/data/tidb/tikv_20160 /apps/dba/svr/tidb/tikv_20160
172.20.55.100:20161 tikv 172.20.55.100 20161/20181 linux/x86_64 Up /apps/dba/data/tidb/tikv_20161 /apps/dba/svr/tidb/tikv_20161
172.20.55.101:20160 tikv 172.20.55.101 20160/20180 linux/x86_64 Up /apps/dba/data/tidb/tikv_20160 /apps/dba/svr/tidb/tikv_20160
172.20.55.101:20161 tikv 172.20.55.101 20161/20181 linux/x86_64 Up /apps/dba/data/tidb/tikv_20161 /apps/dba/svr/tidb/tikv_20161
172.20.55.102:20160 tikv 172.20.55.102 20160/20180 linux/x86_64 Pending Offline /apps/dba/data/tidb/tikv_20160 /apps/dba/svr/tidb/tikv_20160
172.20.55.102:20161 tikv 172.20.55.102 20161/20181 linux/x86_64 Pending Offline /apps/dba/data/tidb/tikv_20161 /apps/dba/svr/tidb/tikv_20161
172.20.55.103:20160 tikv 172.20.55.103 20160/20180 linux/x86_64 Up /apps/dba/data/tidb/tikv_20160 /apps/dba/svr/tidb/tikv_20160
172.20.55.103:20161 tikv 172.20.55.103 20161/20181 linux/x86_64 Up /apps/dba/data/tidb/tikv_20161 /apps/dba/svr/tidb/tikv_20161
172.20.55.104:20160 tikv 172.20.55.104 20160/20180 linux/x86_64 Up /apps/dba/data/tidb/tikv_20160 /apps/dba/svr/tidb/tikv_20160
172.20.55.104:20161 tikv 172.20.55.104 20161/20181 linux/x86_64 Up /apps/dba/data/tidb/tikv_20161 /apps/dba/svr/tidb/tikv_20161
172.20.55.105:20160 tikv 172.20.55.105 20160/20180 linux/x86_64 Up /apps/dba/data/tidb/tikv_20160 /apps/dba/svr/tidb/tikv_20160
172.20.55.105:20161 tikv 172.20.55.105 20161/20181 linux/x86_64 Up /apps/dba/data/tidb/tikv_20161 /apps/dba/svr/tidb/tikv_20161

1. Please disable the PD scheduling-related schedulers first:

  • config set region-schedule-limit 0
  • config set replica-schedule-limit 0
  • config set leader-schedule-limit 0
  • config set merge-schedule-limit 0

2. Please run tiup cluster display {cluster_name} and share the result
3. Please get the store information with pd-ctl: pd-ctl store
4. Please get the information of the regions located on the failed stores with pd-ctl:
pd-ctl -u http://{pd_ip}:2379 -d region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==({failed_store_id_1},{failed_store_id_2}) then . else empty end) | length>=$total-length)}'
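
A concrete end-to-end sketch of steps 1 and 4, assuming pd-ctl is on the PATH, {pd_ip} is one of the PD addresses shown above, and the failed store IDs turn out to be 5 and 15 (the two stores on 172.20.55.102, as confirmed later in this thread):

    # step 1: pause scheduling so PD does not fight the recovery
    pd-ctl -u http://{pd_ip}:2379 -d config set region-schedule-limit 0
    pd-ctl -u http://{pd_ip}:2379 -d config set replica-schedule-limit 0
    pd-ctl -u http://{pd_ip}:2379 -d config set leader-schedule-limit 0
    pd-ctl -u http://{pd_ip}:2379 -d config set merge-schedule-limit 0

    # step 4: list regions that have at least half of their peers on the failed stores 5 and 15
    pd-ctl -u http://{pd_ip}:2379 -d region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(5,15) then . else empty end) | length>=$total-length)}'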

After restarting the cluster, all 4 TiDB servers became abnormal, and the TiKV logs were also full of errors about connections failing. I stopped the cluster for now to preserve the scene. I will start the cluster first now.

The result of tiup cluster display tidblepro is as follows:
Found cluster newer version:

The latest version:         v1.2.1
Local installed version:    v1.1.1
Update current component:   tiup update cluster
Update all components:      tiup update --all

Starting component cluster: /home/apps/.tiup/components/cluster/v1.1.1/tiup-cluster display tidblepro
tidb Cluster: tidblepro
tidb Version: v4.0.5
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir


172.20.51.89:9093 alertmanager 172.20.51.89 9093/9094 linux/x86_64 inactive /apps/dba/data/tidb/alert_9093 /apps/dba/svr/tidb/alert_9093
172.20.51.89:3000 grafana 172.20.51.89 3000 linux/x86_64 inactive - /apps/dba/svr/tidb/grafana_3000
172.20.53.129:2379 pd 172.20.53.129 2379/2380 linux/x86_64 Up|UI /apps/dba/data/tidb/pd_2379 /apps/dba/svr/tidb/pd_2379
172.20.53.138:2379 pd 172.20.53.138 2379/2380 linux/x86_64 Up|L /apps/dba/data/tidb/pd_2379 /apps/dba/svr/tidb/pd_2379
172.20.53.141:2379 pd 172.20.53.141 2379/2380 linux/x86_64 Up /apps/dba/data/tidb/pd_2379 /apps/dba/svr/tidb/pd_2379
172.20.51.225:9090 prometheus 172.20.51.225 9090 linux/x86_64 inactive /data/dba/data/tidb/prometheus_9090 /data/dba/svr/tidb/prometheus_9090
172.20.53.131:4000 tidb 172.20.53.131 4000/10080 linux/x86_64 Down - /apps/dba/svr/tidb/tidb_4000
172.20.53.133:4000 tidb 172.20.53.133 4000/10080 linux/x86_64 Down - /apps/dba/svr/tidb/tidb_4000
172.20.53.135:4000 tidb 172.20.53.135 4000/10080 linux/x86_64 Down - /apps/dba/svr/tidb/tidb_4000
172.20.53.141:4000 tidb 172.20.53.141 4000/10080 linux/x86_64 Down - /apps/dba/svr/tidb/tidb_4000
172.20.55.100:20160 tikv 172.20.55.100 20160/20180 linux/x86_64 Up /apps/dba/data/tidb/tikv_20160 /apps/dba/svr/tidb/tikv_20160
172.20.55.100:20161 tikv 172.20.55.100 20161/20181 linux/x86_64 Up /apps/dba/data/tidb/tikv_20161 /apps/dba/svr/tidb/tikv_20161
172.20.55.101:20160 tikv 172.20.55.101 20160/20180 linux/x86_64 Up /apps/dba/data/tidb/tikv_20160 /apps/dba/svr/tidb/tikv_20160
172.20.55.101:20161 tikv 172.20.55.101 20161/20181 linux/x86_64 Up /apps/dba/data/tidb/tikv_20161 /apps/dba/svr/tidb/tikv_20161
172.20.55.102:20160 tikv 172.20.55.102 20160/20180 linux/x86_64 Pending Offline /apps/dba/data/tidb/tikv_20160 /apps/dba/svr/tidb/tikv_20160
172.20.55.102:20161 tikv 172.20.55.102 20161/20181 linux/x86_64 Pending Offline /apps/dba/data/tidb/tikv_20161 /apps/dba/svr/tidb/tikv_20161
172.20.55.103:20160 tikv 172.20.55.103 20160/20180 linux/x86_64 Up /apps/dba/data/tidb/tikv_20160 /apps/dba/svr/tidb/tikv_20160
172.20.55.103:20161 tikv 172.20.55.103 20161/20181 linux/x86_64 Up /apps/dba/data/tidb/tikv_20161 /apps/dba/svr/tidb/tikv_20161
172.20.55.104:20160 tikv 172.20.55.104 20160/20180 linux/x86_64 Up /apps/dba/data/tidb/tikv_20160 /apps/dba/svr/tidb/tikv_20160
172.20.55.104:20161 tikv 172.20.55.104 20161/20181 linux/x86_64 Up /apps/dba/data/tidb/tikv_20161 /apps/dba/svr/tidb/tikv_20161
172.20.55.105:20160 tikv 172.20.55.105 20160/20180 linux/x86_64 Up /apps/dba/data/tidb/tikv_20160 /apps/dba/svr/tidb/tikv_20160
172.20.55.105:20161 tikv 172.20.55.105 20161/20181 linux/x86_64 Up /apps/dba/data/tidb/tikv_20161 /apps/dba/svr/tidb/tikv_20161

The result of pd-ctl store is as follows:

» store
{
“count”: 12,
“stores”: [
{
“store”: {
“id”: 8,
“address”: “172.20.55.103:20161”,
“labels”: [
{
“key”: “host”,
“value”: “TIKV_LG_55_103”
},
{
“key”: “iport”,
“value”: “p21061”
}
],
“version”: “4.0.5”,
“status_address”: “172.20.55.103:20181”,
“git_hash”: “f39927a3529d40a6bb4e6c54854a94fdac996e92”,
“start_timestamp”: 1603720233,
“deploy_path”: “/apps/dba/svr/tidb/tikv_20161/bin”,
“last_heartbeat”: 1603720567244335674,
“state_name”: “Up”
},
“status”: {
“capacity”: “3.223TiB”,
“available”: “2.264TiB”,
“used_size”: “981.3GiB”,
“leader_count”: 5672,
“leader_weight”: 1,
“leader_score”: 5672,
“leader_size”: 1946630,
“region_count”: 13328,
“region_weight”: 1,
“region_score”: 3695578,
“region_size”: 3695578,
“start_ts”: “2020-10-26T21:50:33+08:00”,
“last_heartbeat_ts”: “2020-10-26T21:56:07.244335674+08:00”,
“uptime”: “5m34.244335674s”
}
},
{
“store”: {
“id”: 12,
“address”: “172.20.55.103:20160”,
“labels”: [
{
“key”: “host”,
“value”: “TIKV_LG_55_103”
},
{
“key”: “iport”,
“value”: “p21060”
}
],
“version”: “4.0.5”,
“status_address”: “172.20.55.103:20180”,
“git_hash”: “f39927a3529d40a6bb4e6c54854a94fdac996e92”,
“start_timestamp”: 1603720233,
“deploy_path”: “/apps/dba/svr/tidb/tikv_20160/bin”,
“last_heartbeat”: 1603720567342874199,
“state_name”: “Up”
},
“status”: {
“capacity”: “3.223TiB”,
“available”: “2.248TiB”,
“used_size”: “997.7GiB”,
“leader_count”: 5548,
“leader_weight”: 1,
“leader_score”: 5548,
“leader_size”: 1938655,
“region_count”: 13324,
“region_weight”: 1,
“region_score”: 3884163,
“region_size”: 3884163,
“start_ts”: “2020-10-26T21:50:33+08:00”,
“last_heartbeat_ts”: “2020-10-26T21:56:07.342874199+08:00”,
“uptime”: “5m34.342874199s”
}
},
{
“store”: {
“id”: 15,
“address”: “172.20.55.102:20161”,
“state”: 1,
“labels”: [
{
“key”: “host”,
“value”: “TIKV_LG_55_102”
},
{
“key”: “iport”,
“value”: “p21061”
}
],
“version”: “4.0.5”,
“status_address”: “172.20.55.102:20181”,
“git_hash”: “f39927a3529d40a6bb4e6c54854a94fdac996e92”,
“start_timestamp”: 1600390141,
“deploy_path”: “/apps/dba/svr/tidb/tikv_20161/bin”,
“last_heartbeat”: 1603710801035334664,
“state_name”: “Offline”
},
“status”: {
“capacity”: “0B”,
“available”: “0B”,
“used_size”: “0B”,
“leader_count”: 0,
“leader_weight”: 1,
“leader_score”: 0,
“leader_size”: 0,
“region_count”: 12881,
“region_weight”: 1,
“region_score”: 1,
“region_size”: 1,
“start_ts”: “2020-09-18T08:49:01+08:00”,
“last_heartbeat_ts”: “2020-10-26T19:13:21.035334664+08:00”,
“uptime”: “922h24m20.035334664s”
}
},
{
“store”: {
“id”: 20,
“address”: “172.20.55.100:20160”,
“labels”: [
{
“key”: “host”,
“value”: “TIKV_LG_55_100”
},
{
“key”: “iport”,
“value”: “p21060”
}
],
“version”: “4.0.5”,
“status_address”: “172.20.55.100:20180”,
“git_hash”: “f39927a3529d40a6bb4e6c54854a94fdac996e92”,
“start_timestamp”: 1603720233,
“deploy_path”: “/apps/dba/svr/tidb/tikv_20160/bin”,
“last_heartbeat”: 1603720567205475324,
“state_name”: “Up”
},
“status”: {
“capacity”: “3.223TiB”,
“available”: “2.266TiB”,
“used_size”: “980GiB”,
“leader_count”: 5852,
“leader_weight”: 1,
“leader_score”: 5852,
“leader_size”: 2013749,
“region_count”: 13302,
“region_weight”: 1,
“region_score”: 3669004,
“region_size”: 3669004,
“start_ts”: “2020-10-26T21:50:33+08:00”,
“last_heartbeat_ts”: “2020-10-26T21:56:07.205475324+08:00”,
“uptime”: “5m34.205475324s”
}
},
{
“store”: {
“id”: 1,
“address”: “172.20.55.101:20161”,
“labels”: [
{
“key”: “host”,
“value”: “TIKV_LG_55_101”
},
{
“key”: “iport”,
“value”: “p21061”
}
],
“version”: “4.0.5”,
“status_address”: “172.20.55.101:20181”,
“git_hash”: “f39927a3529d40a6bb4e6c54854a94fdac996e92”,
“start_timestamp”: 1603720233,
“deploy_path”: “/apps/dba/svr/tidb/tikv_20161/bin”,
“last_heartbeat”: 1603720567311285635,
“state_name”: “Up”
},
“status”: {
“capacity”: “3.223TiB”,
“available”: “2.275TiB”,
“used_size”: “969.9GiB”,
“leader_count”: 5188,
“leader_weight”: 1,
“leader_score”: 5188,
“leader_size”: 1724342,
“region_count”: 13821,
“region_weight”: 1,
“region_score”: 3559117,
“region_size”: 3559117,
“start_ts”: “2020-10-26T21:50:33+08:00”,
“last_heartbeat_ts”: “2020-10-26T21:56:07.311285635+08:00”,
“uptime”: “5m34.311285635s”
}
},
{
“store”: {
“id”: 2,
“address”: “172.20.55.101:20160”,
“labels”: [
{
“key”: “host”,
“value”: “TIKV_LG_55_101”
},
{
“key”: “iport”,
“value”: “p21060”
}
],
“version”: “4.0.5”,
“status_address”: “172.20.55.101:20180”,
“git_hash”: “f39927a3529d40a6bb4e6c54854a94fdac996e92”,
“start_timestamp”: 1603720233,
“deploy_path”: “/apps/dba/svr/tidb/tikv_20160/bin”,
“last_heartbeat”: 1603720567359342012,
“state_name”: “Up”
},
“status”: {
“capacity”: “3.223TiB”,
“available”: “2.269TiB”,
“used_size”: “976.6GiB”,
“leader_count”: 5364,
“leader_weight”: 1,
“leader_score”: 5364,
“leader_size”: 1772909,
“region_count”: 13974,
“region_weight”: 1,
“region_score”: 3724074,
“region_size”: 3724074,
“start_ts”: “2020-10-26T21:50:33+08:00”,
“last_heartbeat_ts”: “2020-10-26T21:56:07.359342012+08:00”,
“uptime”: “5m34.359342012s”
}
},
{
“store”: {
“id”: 3,
“address”: “172.20.55.105:20161”,
“labels”: [
{
“key”: “host”,
“value”: “TIKV_LG_55_105”
},
{
“key”: “iport”,
“value”: “p21061”
}
],
“version”: “4.0.5”,
“status_address”: “172.20.55.105:20181”,
“git_hash”: “f39927a3529d40a6bb4e6c54854a94fdac996e92”,
“start_timestamp”: 1603720233,
“deploy_path”: “/apps/dba/svr/tidb/tikv_20161/bin”,
“last_heartbeat”: 1603720567333756326,
“state_name”: “Up”
},
“status”: {
“capacity”: “3.223TiB”,
“available”: “2.258TiB”,
“used_size”: “987.8GiB”,
“leader_count”: 5398,
“leader_weight”: 1,
“leader_score”: 5398,
“leader_size”: 1913772,
“region_count”: 13067,
“region_weight”: 1,
“region_score”: 3820576,
“region_size”: 3820576,
“start_ts”: “2020-10-26T21:50:33+08:00”,
“last_heartbeat_ts”: “2020-10-26T21:56:07.333756326+08:00”,
“uptime”: “5m34.333756326s”
}
},
{
“store”: {
“id”: 5,
“address”: “172.20.55.102:20160”,
“state”: 1,
“labels”: [
{
“key”: “host”,
“value”: “TIKV_LG_55_102”
},
{
“key”: “iport”,
“value”: “p21060”
}
],
“version”: “4.0.5”,
“status_address”: “172.20.55.102:20180”,
“git_hash”: “f39927a3529d40a6bb4e6c54854a94fdac996e92”,
“start_timestamp”: 1600390089,
“deploy_path”: “/apps/dba/svr/tidb/tikv_20160/bin”,
“last_heartbeat”: 1603710798832597065,
“state_name”: “Offline”
},
“status”: {
“capacity”: “0B”,
“available”: “0B”,
“used_size”: “0B”,
“leader_count”: 0,
“leader_weight”: 1,
“leader_score”: 0,
“leader_size”: 0,
“region_count”: 0,
“region_weight”: 1,
“region_score”: 0,
“region_size”: 0,
“start_ts”: “2020-09-18T08:48:09+08:00”,
“last_heartbeat_ts”: “2020-10-26T19:13:18.832597065+08:00”,
“uptime”: “922h25m9.832597065s”
}
},
{
“store”: {
“id”: 23,
“address”: “172.20.55.104:20161”,
“labels”: [
{
“key”: “host”,
“value”: “TIKV_LG_55_104”
},
{
“key”: “iport”,
“value”: “p21061”
}
],
“version”: “4.0.5”,
“status_address”: “172.20.55.104:20181”,
“git_hash”: “f39927a3529d40a6bb4e6c54854a94fdac996e92”,
“start_timestamp”: 1603720233,
“deploy_path”: “/apps/dba/svr/tidb/tikv_20161/bin”,
“last_heartbeat”: 1603720567401089105,
“state_name”: “Up”
},
“status”: {
“capacity”: “3.223TiB”,
“available”: “2.272TiB”,
“used_size”: “973.9GiB”,
“leader_count”: 5052,
“leader_weight”: 1,
“leader_score”: 5052,
“leader_size”: 1681746,
“region_count”: 13812,
“region_weight”: 1,
“region_score”: 3710526,
“region_size”: 3710526,
“start_ts”: “2020-10-26T21:50:33+08:00”,
“last_heartbeat_ts”: “2020-10-26T21:56:07.401089105+08:00”,
“uptime”: “5m34.401089105s”
}
},
{
“store”: {
“id”: 21,
“address”: “172.20.55.100:20161”,
“labels”: [
{
“key”: “host”,
“value”: “TIKV_LG_55_100”
},
{
“key”: “iport”,
“value”: “p21061”
}
],
“version”: “4.0.5”,
“status_address”: “172.20.55.100:20181”,
“git_hash”: “f39927a3529d40a6bb4e6c54854a94fdac996e92”,
“start_timestamp”: 1603720233,
“deploy_path”: “/apps/dba/svr/tidb/tikv_20161/bin”,
“last_heartbeat”: 1603720567236981767,
“state_name”: “Up”
},
“status”: {
“capacity”: “3.223TiB”,
“available”: “2.245TiB”,
“used_size”: “1001GiB”,
“leader_count”: 5371,
“leader_weight”: 1,
“leader_score”: 5371,
“leader_size”: 1961727,
“region_count”: 12649,
“region_weight”: 1,
“region_score”: 3638956,
“region_size”: 3638956,
“start_ts”: “2020-10-26T21:50:33+08:00”,
“last_heartbeat_ts”: “2020-10-26T21:56:07.236981767+08:00”,
“uptime”: “5m34.236981767s”
}
},
{
“store”: {
“id”: 22,
“address”: “172.20.55.105:20160”,
“labels”: [
{
“key”: “host”,
“value”: “TIKV_LG_55_105”
},
{
“key”: “iport”,
“value”: “p21060”
}
],
“version”: “4.0.5”,
“status_address”: “172.20.55.105:20180”,
“git_hash”: “f39927a3529d40a6bb4e6c54854a94fdac996e92”,
“start_timestamp”: 1603720233,
“deploy_path”: “/apps/dba/svr/tidb/tikv_20160/bin”,
“last_heartbeat”: 1603720567265100141,
“state_name”: “Up”
},
“status”: {
“capacity”: “3.223TiB”,
“available”: “2.262TiB”,
“used_size”: “984GiB”,
“leader_count”: 5411,
“leader_weight”: 1,
“leader_score”: 5411,
“leader_size”: 1908874,
“region_count”: 13082,
“region_weight”: 1,
“region_score”: 3655404,
“region_size”: 3655404,
“start_ts”: “2020-10-26T21:50:33+08:00”,
“last_heartbeat_ts”: “2020-10-26T21:56:07.265100141+08:00”,
“uptime”: “5m34.265100141s”
}
},
{
“store”: {
“id”: 24,
“address”: “172.20.55.104:20160”,
“labels”: [
{
“key”: “host”,
“value”: “TIKV_LG_55_104”
},
{
“key”: “iport”,
“value”: “p21060”
}
],
“version”: “4.0.5”,
“status_address”: “172.20.55.104:20180”,
“git_hash”: “f39927a3529d40a6bb4e6c54854a94fdac996e92”,
“start_timestamp”: 1603720233,
“deploy_path”: “/apps/dba/svr/tidb/tikv_20160/bin”,
“last_heartbeat”: 1603720567456765759,
“state_name”: “Up”
},
“status”: {
“capacity”: “3.223TiB”,
“available”: “2.271TiB”,
“used_size”: “974.7GiB”,
“leader_count”: 4976,
“leader_weight”: 1,
“leader_score”: 4976,
“leader_size”: 1672851,
“region_count”: 13713,
“region_weight”: 1,
“region_score”: 3715693,
“region_size”: 3715693,
“start_ts”: “2020-10-26T21:50:33+08:00”,
“last_heartbeat_ts”: “2020-10-26T21:56:07.456765759+08:00”,
“uptime”: “5m34.456765759s”
}
}
]
}

»

The region information on the failed stores is too long to paste here; please see the attachment prolem_peers.txt (902.6 KB)

1. Please first check whether there are regions in the current environment that have no leader or have lost replicas. Reference commands:

  • Regions without a leader
    pd-ctl -u http://{pd_ip}:2379 -d region --jq '.regions[]|select(has("leader")|not)|{id: .id,peer_stores: [.peers[].store_id]}'

  • Regions with missing replicas
    pd-ctl -u http://{pd_ip}:2379 region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length != 2)}" // max-replicas is 2 in the current environment

2. If there are regions without a leader and it is confirmed that multiple replicas of those regions are lost, please follow the data repair steps below:

(1) Stop the failed stores, store 15 and store 5, first and do not start them; these two stores must stay stopped for the entire duration of repair steps (1)~(3) below

(2) For regions that have lost multiple replicas, the following asktug post can be used as a reference:
https://asktug.com/t/topic/34246
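
The core of that post is tikv-ctl unsafe-recover. A minimal sketch for this cluster, assuming the kv RocksDB lives at <data_dir>/db under the data directories shown in the display output, and that each surviving TiKV instance is stopped while the command runs against it (please double-check the flags with tikv-ctl --help for v4.0.5):

    # run on every surviving TiKV node, once per stopped instance, to strip
    # the peers that live on the failed stores 5 and 15 from all regions
    tikv-ctl --db /apps/dba/data/tidb/tikv_20160/db unsafe-recover remove-fail-stores -s 5,15 --all-regions
    tikv-ctl --db /apps/dba/data/tidb/tikv_20161/db unsafe-recover remove-fail-stores -s 5,15 --all-regions

After the commands finish, restart the surviving TiKV instances and verify with the checks in step (3).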

(3) When the cluster has been recovered using the approach in the post above, the reference criteria for recovery are:

  • There are no regions with missing replicas in the current environment, i.e. the following command returns nothing:
    pd-ctl -u http://{pd_ip}:2379 region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length != 2)}"

  • There are no regions without a leader in the current environment, i.e. the following command returns nothing:
    pd-ctl -u http://{pd_ip}:2379 region --jq '.regions[]|select(has("leader")|not)|{id: .id,peer_stores: [.peers[].store_id]}'

(4) At this point the original failed stores, store 15 and store 5, must still remain stopped. Use the commands below to set the failed stores to Tombstone and clean up their information:

  • curl -X POST 'http://<pd_ip>:<pd_port>/pd/api/v1/store/{store_id}/state?state=Tombstone' // {store_id} is each failed store, i.e. 5 and 15
  • Query the tombstone store information: curl pd-addr:port/pd/api/v1/stores?state=2
  • Call the remove-tombstone API to clean up the tombstone store information, as sketched below
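
A sketch of that cleanup call; either form should work, and the address is a placeholder:

    # via pd-ctl
    pd-ctl -u http://{pd_ip}:2379 -d store remove-tombstone
    # or via the PD HTTP API
    curl -X DELETE http://{pd_ip}:2379/pd/api/v1/stores/remove-tombstone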

(5) Set the sync-log parameter to true and set the region replica count to 3 to prevent this situation from happening again; a sketch follows after step (6).
(6) Clean up the data on the original failed stores 15 and 5, then scale out to add the capacity back.
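
A sketch of steps (5) and (6) with pd-ctl and tiup; the topology file name is a placeholder, and the assumption is that raftstore.sync-log is set under server_configs.tikv in the cluster topology:

    # (5) raise the replica count back to 3
    pd-ctl -u http://{pd_ip}:2379 -d config set max-replicas 3

    # (5) turn fsync back on: run `tiup cluster edit-config tidblepro`, set
    #     raftstore.sync-log: true under server_configs.tikv, then reload the TiKV role
    tiup cluster reload tidblepro -R tikv

    # (6) after wiping the data directories on 172.20.55.102, add the two
    #     instances back with a scale-out topology file (name is a placeholder)
    tiup cluster scale-out tidblepro scale-out-102.yaml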

Thanks, the cluster is recovering now.
One question: if replicas must not be fewer than 3, why not enforce that it cannot be set below 3?

If we configured everything exactly according to the official recommendations, the cost would be too high and we probably could never have promoted TiDB inside the company in the first place. For example, our current setup uses 1G NICs, and the database machines all use ordinary SSDs in RAID 5; going live with TiDB would mean replacing everything from the switches to the machines to the disks. Before a trial has proven its value, it is hard to build a TiDB cluster to that standard. We bring in over ten TB of data as soon as we onboard, with tens of millions of rows synced to TiDB every day, so the required cost is too large. A leading company in another niche industry that I know well also stopped rolling out TiDB because of cost.

I initially set sync-log to true, but writes were too slow and the OPS fell short of what was required, so I had to set it to false. We think losing one or two minutes of data is not a problem: we only need to move the MySQL binlog position that DM syncs from back a few minutes and replay those few minutes of data. But in the current situation, after a node failed, regions were lost and data from possibly long ago may be gone, so the only option is to re-sync the entire cluster's data, which costs far too much time and effort. Even with replica set to 3, two or more machines could still fail, lose data, and force a full cluster re-sync. So I hope it can be allowed to drop part of the most recent data so that TiKV can start normally and the cluster can return to normal. A disaster recovery that can only recover data up to an indeterminate point in time is practically the same as losing the whole cluster's data; it is very hard to accept that the salvaged data is scattered across time.

I hope a parameter or feature can be added to force a problematic TiKV to start, and a cluster-level feature to force the remaining region peers to be promoted directly to leader, so the cluster can recover quickly. Losing the last few minutes or even hours of data is basically fine for everyone, since it can be backfilled quickly; but data lost at indeterminate, scattered points in time is something almost no one can accept.

First of all, thank you for the suggestions :handshake:

1. As I understand it, replicas is not forced to 3 because a test environment has no strong requirement for data high availability, so it can be set to 1 according to the actual server capacity; this lets the application side run basic functional tests against TiDB while also lowering the deployment cost of the test environment to some extent.

2. Setting sync-log to true does increase the write cost. The upcoming 5.0 release has planned optimizations for disk write performance, so stay tuned ~~

3. With replicas set to 3, PD in theory will not place two peers of the same region on the same server when scheduling (label the instances when deploying multiple instances per machine). Also, in practice the probability of two servers going down at the same time is lower than that of a single server going down, and Raft provides high availability by design; for example, TiDB offers the two-locations-three-data-centers deployment, or TiCDC, to achieve high availability across multiple data centers.

4. If the replicas of some TiKV regions become abnormal, different solutions are provided for different situations:

1) For example, commit index is out of range. This error occurs when the region's leader sends a message telling a peer to commit a certain Raft log entry, but that peer lost the entry due to the power failure. In this case, you can clear the commit index is out of range error by doing a rolling restart of the TiKV instances that are currently Up, and then try to start the TiKV nodes that could not start after the power loss.

2) If it is an error from the Raft state machine, such as a last index error, you can use the tikv-ctl bad-regions command to find the damaged regions on the failed node, set those regions to tombstone, and then try to bring the failed node back up; a command sketch follows after 3).

3) If multiple replicas are lost in the cluster, the unsafe-recover command is needed for the repair. This case differs from the previous ones in that the failed nodes cannot be brought up directly. After running the command, the region peers on the surviving nodes are forcibly promoted to leader and continue to serve; PD then adds replicas until each region reaches max-replica copies, and afterwards you scale in or out as the situation requires.
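
A sketch of the commands referred to in 1) and 2), using this cluster's layout; the kv RocksDB path is assumed to be <data_dir>/db, and the flags should be confirmed with tikv-ctl --help for the exact version:

    # 1) restart the TiKV instances that are currently Up (restrict tiup to the tikv role)
    tiup cluster restart tidblepro -R tikv

    # 2) with the failed instance stopped, list the regions whose raft state is damaged
    tikv-ctl --db /apps/dba/data/tidb/tikv_20160/db bad-regions

    # 2) set a damaged region (e.g. region 13024 from the FATAL log above) to tombstone,
    #    then try to start the instance again
    tikv-ctl --db /apps/dba/data/tidb/tikv_20160/db tombstone -p 172.20.53.141:2379 -r 13024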

With replicas set to 3, if three machines in the cluster go down, any region whose three replicas happen to sit exactly on those three machines is completely lost, and so is its data. I hope the following becomes possible:
1) Recover the cluster with unsafe-recover
2) Force-start the TiKV instances on the three failed machines; once they start and rejoin the cluster, any region whose peers cannot be found on the other healthy nodes is promoted to leader on them and returns to normal. Recovering the lost data this way should only lose the last few minutes of data that had not yet been synced.

For a DBA, this capability would be the last lifeline.

After the cluster recovered, I set max-replica to 3, but the peer count stopped growing once it reached 2.



This situation does indeed happen in production, for example when all 3 replicas of a region are lost. If that happens, find the regions that have lost all of their replicas and then rebuild them with recreate-region, so that part of the table data of the affected database objects becomes accessible again; the data that is already lost has to be repaired from the application or business side. You can find the help text for this command with tikv-ctl --help.
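
A sketch of that last-resort repair, assuming the affected region ID has already been identified, the TiKV instance is stopped while the command runs, and the kv RocksDB lives at <data_dir>/db (check tikv-ctl --help for the exact flags):

    # re-create an empty region {region_id} on this store; its key range becomes
    # accessible again, but the lost rows must be backfilled from the upstream source
    tikv-ctl --db /apps/dba/data/tidb/tikv_20160/db recreate-region -p 172.20.53.141:2379 -r {region_id}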
