TiDB fails to start, and I suspect it was caused by an abnormal shutdown or power outage. Could you please help take a look?

Initially, after the power outage / shutdown, ansible-playbook start.yml could not bring TiDB up and got stuck at "waiting for TiKV Port Up". tikv.log shows the following:
["[region 107073] 10304415 unexpected raft log index: last_index 5 < applied_index 12335"]
I then found a blog post suggesting that the failed TiKV store be deleted with pd-ctl (sketched below). After doing so, the cluster still would not start, now hanging at "waiting for TiDB Port up".
I then found in the official docs that Regions can be forcibly recovered from a multi-replica failure state: the unsafe-recover remove-fail-stores command removes the failed stores from the peer list of the specified Regions.
tikv-ctl --db /path/to/tikv/db unsafe-recover remove-fail-stores -s 4,5 --all-regions
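For reference, the store deletion mentioned above was done with pd-ctl, roughly as follows (a sketch; the PD address and store id are placeholders):

./pd-ctl -u http://<pd-ip>:2379 store delete <store_id>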


One question remains: what does it mean to run this command "on all healthy stores"? Isn't tikv-ctl only available on the control machine? After running it, the cluster still does not come up.
The current tidb.log is as follows:
[2020/06/04 10:55:21.311 +08:00] [INFO] [region_cache.go:564] [“switch region peer to next due to NotLeader with NULL leader”] [currIdx=1] [regionID=10597951]
[2020/06/04 10:55:21.440 +08:00] [INFO] [region_cache.go:324] [“invalidate current region, because others failed on same store”] [region=10597951] [store=10.12.5.222:20160]
Could anyone tell me what the problem is?

(attached: tidb.log, tikv.log, pd.log)

tikv-ctl can be copied to the other nodes.

OK, thanks. After copying it over, do I just run the same command?

tikv-ctl can be thought of as a standalone executable; the way you run it stays the same.
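For example, copying the binary from the control machine to a TiKV node could look like this (a sketch; the source path and target host/path are assumptions for illustration):

scp /home/tidb/tidb-ansible/resources/bin/tikv-ctl tidb@<tikv-node-ip>:/home/tidb/bin/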

OK, thank you. I'll give it a try.

:ok_hand:

After running unsafe-recover remove-fail-stores on every TiKV node, I restarted the database services with ansible-playbook (the usual sequence, sketched below), but it still timed out waiting for the TiDB port. In addition, the failed TiKV's entry has already been removed from inventory.ini.
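For reference, the restart sequence was the standard tidb-ansible one (a sketch, assuming the default tidb-ansible directory on the control machine):

cd /home/tidb/tidb-ansible
ansible-playbook stop.yml
ansible-playbook start.yml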

The corresponding tidb.log:

[2020/06/04 14:14:38.873 +08:00] [INFO] [region_cache.go:564] ["switch region peer to next due to NotLeader with NULL leader"] [currIdx=1] [regionID=10597951]
[2020/06/04 14:14:39.373 +08:00] [INFO] [region_cache.go:324] ["invalidate current region, because others failed on same store"] [region=10597951] [store=10.12.5.222:20160]
[2020/06/04 14:14:44.874 +08:00] [WARN] [client_batch.go:223] ["init create streaming fail"] [target=10.12.5.222:20160] [error="context deadline exceeded"]
[2020/06/04 14:14:44.874 +08:00] [INFO] [region_cache.go:937] ["mark store's regions need be refill"] [store=10.12.5.222:20160]
[2020/06/04 14:14:44.875 +08:00] [INFO] [region_cache.go:430] ["switch region peer to next due to send request fail"] [current="region ID: 10597951, meta: id:10597951 end_key:\"mDDLJobLi\\377st\\000\\000\\000\\000\\000\\000\\371\\000\\000\\000\\000\\000\\000\\000l\\200\\000\\000\\000\\000\\000\\000\\000\" region_epoch:<conf_ver:93 version:3 > peers:<id:10597953 store_id:407939 > peers:<id:10657288 store_id:1597655 > , peer: id:10597953 store_id:407939 , addr: 10.12.5.222:20160, idx: 0"] [needReload=false] [error="context deadline exceeded"] [errorVerbose="context deadline exceeded\
github.com/pingcap/errors.AddStack\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/pkg/mod/github.com/pingcap/errors@v0.11.4/errors.go:174\
github.com/pingcap/errors.Trace\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/pkg/mod/github.com/pingcap/errors@v0.11.4/juju_adaptor.go:15\
github.com/pingcap/tidb/store/tikv.sendBatchRequest\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/client_batch.go:585\
github.com/pingcap/tidb/store/tikv.(*rpcClient).SendRequest\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/client.go:287\
github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).sendReqToRegion\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:169\
github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).SendReqCtx\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:133\
github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).SendReq\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:74\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).get\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:324\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).Get\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:282\
github.com/pingcap/tidb/kv.(*unionStore).Get\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/kv/union_store.go:194\
github.com/pingcap/tidb/store/tikv.(*tikvTxn).Get\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/txn.go:135\
github.com/pingcap/tidb/structure.(*TxStructure).Get\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/structure/string.go:35\
github.com/pingcap/tidb/structure.(*TxStructure).GetInt64\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/structure/string.go:44\
github.com/pingcap/tidb/meta.(*Meta).GetBootstrapVersion\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/meta/meta.go:697\
github.com/pingcap/tidb/session.getStoreBootstrapVersion.func1\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/session/session.go:1656\
github.com/pingcap/tidb/kv.RunInNewTxn\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/kv/txn.go:50\
github.com/pingcap/tidb/session.getStoreBootstrapVersion\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/session/session.go:1653\
github.com/pingcap/tidb/session.BootstrapSession\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/session/session.go:1492\
main.createStoreAndDomain\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/tidb-server/main.go:210\
main.main\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/tidb-server/main.go:172\
runtime.main\
\t/usr/local/go/src/runtime/proc.go:203\
runtime.goexit\
\t/usr/local/go/src/runtime/asm_amd64.s:1357"]

pd.log (it looks like PD is simply carrying out the migration work that follows the region delete):

[2020/06/04 06:19:34.875 +00:00] [INFO] [operator_controller.go:391] ["send schedule command"] [region-id=9375280] [step="promote learner peer 10692179 on store 407940 to voter"] [source="active push"]

[2020/06/04 06:19:35.875 +00:00] [INFO] [operator_controller.go:391] ["send schedule command"] [region-id=2759084] [step="promote learner peer 10692174 on store 407940 to voter"] [source="active push"]

@rongyilong-PingCAP

Hello,

Please provide the following information to help with troubleshooting.

  1. The output of pd-ctl store, config show, and member (commands sketched below). We need to confirm the TiKV store ids, the number of TiKV nodes, and the max-replicas setting.

Let's assess the situation first, and then decide on further actions.
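For completeness, the information can be gathered from the control machine roughly like this (a sketch; the PD address is a placeholder):

./pd-ctl -u http://<pd-ip>:2379 store
./pd-ctl -u http://<pd-ip>:2379 config show
./pd-ctl -u http://<pd-ip>:2379 member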

  1. store: 10.12.5.222 is the failed TiKV and has been taken offline; its TiKV store id is 407939. There are still 10 TiKV nodes.
{
  "count": 12,
  "stores": [
    {
      "store": {
        "id": 6506924,
        "address": "10.12.5.229:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "1007GiB",
        "available": "330.8GiB",
        "leader_count": 5016,
        "leader_weight": 1,
        "leader_score": 770346,
        "leader_size": 770346,
        "region_count": 11296,
        "region_weight": 1,
        "region_score": 384767395.9635596,
        "region_size": 1555911,
        "start_ts": "2020-06-04T02:22:20Z",
        "last_heartbeat_ts": "2020-06-04T06:50:56.220964548Z",
        "uptime": "4h28m36.220964548s"
      }
    },
    {
      "store": {
        "id": 6506925,
        "address": "10.12.5.228:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "1007GiB",
        "available": "330.2GiB",
        "leader_count": 6197,
        "leader_weight": 1,
        "leader_score": 770226,
        "leader_size": 770226,
        "region_count": 13371,
        "region_weight": 1,
        "region_score": 387859637.23420715,
        "region_size": 1557574,
        "start_ts": "2020-06-04T02:22:21Z",
        "last_heartbeat_ts": "2020-06-04T06:50:57.862368459Z",
        "uptime": "4h28m36.862368459s"
      }
    },
    {
      "store": {
        "id": 484920,
        "address": "10.12.5.223:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "293.5GiB",
        "leader_count": 5508,
        "leader_weight": 1,
        "leader_score": 770518,
        "leader_size": 770518,
        "region_count": 10547,
        "region_weight": 1,
        "region_score": 367414078.65778494,
        "region_size": 1347938,
        "start_ts": "2020-06-04T02:22:11Z",
        "last_heartbeat_ts": "2020-06-04T06:50:50.911310527Z",
        "uptime": "4h28m39.911310527s"
      }
    },
    {
      "store": {
        "id": 665678,
        "address": "127.0.0.1:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "leader_weight": 1,
        "region_weight": 1,
        "start_ts": "1970-01-01T00:00:00Z"
      }
    },
    {
      "store": {
        "id": 6506926,
        "address": "10.12.5.231:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "1007GiB",
        "available": "331.9GiB",
        "leader_count": 6739,
        "leader_weight": 1,
        "leader_score": 770205,
        "leader_size": 770205,
        "region_count": 15007,
        "region_weight": 1,
        "region_score": 378896581.11973715,
        "region_size": 1571935,
        "start_ts": "2020-06-04T02:22:16Z",
        "last_heartbeat_ts": "2020-06-04T06:50:53.425848093Z",
        "uptime": "4h28m37.425848093s"
      }
    },
    {
      "store": {
        "id": 407938,
        "address": "10.12.5.221:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Disconnected"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "297.7GiB",
        "leader_count": 4,
        "leader_weight": 1,
        "leader_score": 359,
        "leader_size": 359,
        "region_count": 9164,
        "region_weight": 1,
        "region_score": 342058816.292902,
        "region_size": 1275134,
        "start_ts": "2020-06-04T06:50:22Z",
        "last_heartbeat_ts": "2020-06-04T06:50:32.703960133Z",
        "uptime": "10.703960133s"
      }
    },
    {
      "store": {
        "id": 640552,
        "address": "10.12.5.224:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "289.9GiB",
        "leader_count": 5406,
        "leader_weight": 1,
        "leader_score": 770866,
        "leader_size": 770866,
        "region_count": 11157,
        "region_weight": 1,
        "region_score": 389546982.23695135,
        "region_size": 1434670,
        "start_ts": "2020-06-04T02:22:14Z",
        "last_heartbeat_ts": "2020-06-04T06:50:52.039137696Z",
        "uptime": "4h28m38.039137696s"
      }
    },
    {
      "store": {
        "id": 1597655,
        "address": "10.12.5.226:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "835.7GiB",
        "available": "277GiB",
        "leader_count": 4705,
        "leader_weight": 1,
        "leader_score": 770281,
        "leader_size": 770281,
        "region_count": 7494,
        "region_weight": 1,
        "region_score": 368231699.461524,
        "region_size": 1129123,
        "start_ts": "2020-06-04T02:22:20Z",
        "last_heartbeat_ts": "2020-06-04T06:50:52.675496454Z",
        "uptime": "4h28m32.675496454s"
      }
    },
    {
      "store": {
        "id": 2026701,
        "address": "10.12.5.227:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "293.9GiB",
        "leader_count": 5270,
        "leader_weight": 1,
        "leader_score": 770872,
        "leader_size": 770872,
        "region_count": 10448,
        "region_weight": 1,
        "region_score": 365173181.9040165,
        "region_size": 1401498,
        "start_ts": "2020-06-04T02:22:21Z",
        "last_heartbeat_ts": "2020-06-04T06:50:54.346553527Z",
        "uptime": "4h28m33.346553527s"
      }
    },
    {
      "store": {
        "id": 335855,
        "address": "10.12.5.230:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "286.9GiB",
        "leader_count": 6130,
        "leader_weight": 1,
        "leader_score": 770383,
        "leader_size": 770383,
        "region_count": 12575,
        "region_weight": 1,
        "region_score": 407578515.8973632,
        "region_size": 1428141,
        "start_ts": "2020-06-04T02:22:10Z",
        "last_heartbeat_ts": "2020-06-04T06:50:49.849598233Z",
        "uptime": "4h28m39.849598233s"
      }
    },
    {
      "store": {
        "id": 407939,
        "address": "10.12.5.222:20160",
        "state": 1,
        "version": "3.1.0-beta.1",
        "state_name": "Offline"
      },
      "status": {
        "leader_weight": 1,
        "region_count": 9314,
        "region_weight": 1,
        "region_score": 261,
        "region_size": 261,
        "start_ts": "1970-01-01T00:00:00Z"
      }
    },
    {
      "store": {
        "id": 407940,
        "address": "10.12.5.220:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "865.2GiB",
        "available": "293.8GiB",
        "leader_count": 5186,
        "leader_weight": 1,
        "leader_score": 770787,
        "leader_size": 770787,
        "region_count": 8562,
        "region_weight": 1,
        "region_score": 324775021.7264414,
        "region_size": 1167762,
        "start_ts": "2020-06-04T02:22:09Z",
        "last_heartbeat_ts": "2020-06-04T06:50:54.53754892Z",
        "uptime": "4h28m45.53754892s"
      }
    }
  ]
}
  2. config show: max-replicas = 2, sync-log = false
{
  "replication": {
    "location-labels": "",
    "max-replicas": 2,
    "strictly-match-label": "false"
  },
  "schedule": {
    "disable-location-replacement": "false",
    "disable-make-up-replica": "false",
    "disable-namespace-relocation": "false",
    "disable-raft-learner": "false",
    "disable-remove-down-replica": "false",
    "disable-remove-extra-replica": "false",
    "disable-replace-offline-replica": "false",
    "enable-one-way-merge": "false",
    "high-space-ratio": 0.6,
    "hot-region-cache-hits-threshold": 3,
    "hot-region-schedule-limit": 4,
    "leader-schedule-limit": 4,
    "low-space-ratio": 0.8,
    "max-merge-region-keys": 200000,
    "max-merge-region-size": 20,
    "max-pending-peer-count": 16,
    "max-snapshot-count": 3,
    "max-store-down-time": "30m0s",
    "merge-schedule-limit": 8,
    "patrol-region-interval": "100ms",
    "region-schedule-limit": 4,
    "replica-schedule-limit": 8,
    "scheduler-max-waiting-operator": 3,
    "schedulers-v2": [
      {
        "args": null,
        "disable": false,
        "type": "balance-region"
      },
      {
        "args": null,
        "disable": false,
        "type": "balance-leader"
      },
      {
        "args": null,
        "disable": false,
        "type": "hot-region"
      },
      {
        "args": null,
        "disable": false,
        "type": "label"
      },
      {
        "args": [
          "1"
        ],
        "disable": false,
        "type": "evict-leader"
      },
      {
        "args": [
          "10"
        ],
        "disable": false,
        "type": "evict-leader"
      },
      {
        "args": [
          "127001"
        ],
        "disable": false,
        "type": "evict-leader"
      },
      {
        "args": [
          "151120"
        ],
        "disable": false,
        "type": "evict-leader"
      },
      {
        "args": [
          "4"
        ],
        "disable": false,
        "type": "evict-leader"
      }
    ],
    "split-merge-interval": "1h0m0s",
    "store-balance-rate": 15,
    "tolerant-size-ratio": 5
  }
}
  3. member
{
  "header": {
    "cluster_id": 6807312755917103041
  },
  "members": [
    {
      "name": "pd_pd3",
      "member_id": 2579653654541892389,
      "peer_urls": [
        "http://10.12.5.115:2380"
      ],
      "client_urls": [
        "http://10.12.5.115:2379"
      ]
    },
    {
      "name": "pd_pd2",
      "member_id": 3717199249823848643,
      "peer_urls": [
        "http://10.12.5.114:2380"
      ],
      "client_urls": [
        "http://10.12.5.114:2379"
      ]
    },
    {
      "name": "pd_pd1",
      "member_id": 4691481983733508901,
      "peer_urls": [
        "http://10.12.5.113:2380"
      ],
      "client_urls": [
        "http://10.12.5.113:2379"
      ]
    }
  ],
  "leader": {
    "name": "pd_pd2",
    "member_id": 3717199249823848643,
    "peer_urls": [
      "http://10.12.5.114:2380"
    ],
    "client_urls": [
      "http://10.12.5.114:2379"
    ]
  },
  "etcd_leader": {
    "name": "pd_pd2",
    "member_id": 3717199249823848643,
    "peer_urls": [
      "http://10.12.5.114:2380"
    ],
    "client_urls": [
      "http://10.12.5.114:2379"
    ]
  }
}

Let me confirm: with a replica count of 2, how many TiKV nodes are actually down? If only one is down, do not perform this operation; after taking it offline, the cluster can recover by itself. If two or more are down, data will be lost. In the future please use 3 replicas whenever possible; 2 replicas do not provide high availability.
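Raising the replica count later can be done through pd-ctl, for example (a sketch; the PD address is a placeholder):

./pd-ctl -u http://<pd-ip>:2379 config set max-replicas 3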

Let me also confirm the operation steps with you; please check whether anything differs from what you did. Thanks.

  1. Disable scheduling

(1) Record the current scheduling limits:

./pd-ctl config show all -u http://172.16.5.90:2679 | grep schedule-limit

"leader-schedule-limit": 4,
"region-schedule-limit": 4,
"replica-schedule-limit": 8,
"merge-schedule-limit": 8,
"hot-region-schedule-limit": 4,

(2) Disable scheduling:

./pd-ctl -u :<pd_client_port> -i

config set leader-schedule-limit 0
config set region-schedule-limit 0
config set replica-schedule-limit 0
config set merge-schedule-limit 0   
config set hot-region-schedule-limit 0
  2. You have already removed the powered-off store's Regions on the healthy nodes; please check whether your steps were correct.

On every instance that did not suffer the power failure, remove all Peers located on the failed node from all Regions (this can be done one TiKV at a time).

Requirement: run this on the machines that did not lose power, with TiKV shut down.

You need to stop TiKV first:

cd /scripts

./stop_tikv.sh

Then run (assuming 1 and 22 are the failed store ids):

tikv-ctl --db /path/to/tikv-data/db unsafe-recover remove-fail-stores -s 1,22 --all-regions

  3. After a successful restart, replicas will be replenished automatically; afterwards, restore the scheduling limits from step 1 to their original values (see the sketch below).
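A sketch of restoring the limits recorded in step 1, using the values shown above (the PD address is a placeholder):

./pd-ctl -u http://<pd-ip>:2379 config set leader-schedule-limit 4
./pd-ctl -u http://<pd-ip>:2379 config set region-schedule-limit 4
./pd-ctl -u http://<pd-ip>:2379 config set replica-schedule-limit 8
./pd-ctl -u http://<pd-ip>:2379 config set merge-schedule-limit 8
./pd-ctl -u http://<pd-ip>:2379 config set hot-region-schedule-limit 4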

Following your advice, I have successfully changed the current scheduling limits.

Regarding step 2, I am not sure whether "you need to stop TiKV first" refers to stopping the failed TiKV. I have already taken that node offline and removed it from the inventory.ini file.

Also, when I ran tikv-ctl --db ..., not every node returned the expected "success!". Some nodes aborted, or reported that the current TiKV has no Region with that id. Is this normal? For example:

removing stores [407939] from configrations...

Debugger::remove_fail_stores: Not Found "No store ident key"

After going through steps 1-3, the restart is still stuck on the tidb component, and the corresponding log is still:

[2020/06/04 19:48:17.753 +08:00] [INFO] [region_cache.go:564] ["switch region peer to next due to NotLeader with NULL leader"] [currIdx=1] [regionID=10597951]

[2020/06/04 19:48:18.253 +08:00] [INFO] [region_cache.go:324] ["invalidate current region, because others failed on same store"] [region=10597951] [store=10.12.5.222:20160]

[2020/06/04 19:48:23.754 +08:00] [WARN] [client_batch.go:223] ["init create streaming fail"] [target=10.12.5.222:20160] [error="context deadline exceeded"]

[2020/06/04 19:48:23.754 +08:00] [INFO] [region_cache.go:937] ["mark store's regions need be refill"] [store=10.12.5.222:20160]

[2020/06/04 19:48:23.754 +08:00] [INFO] [region_cache.go:430] ["switch region peer to next due to send request fail"] [current="region ID: 10597951, meta: id:10597951 end_key:\"mDDLJobLi\\377st\\000\\000\\000\\000\\000\\000\\371\\000\\000\\000\\000\\000\\000\\000l\\200\\000\\000\\000\\000\\000\\000\\000\" region_epoch:<conf_ver:93 version:3 > peers:<id:10597953 store_id:407939 > peers:<id:10657288 store_id:1597655 > , peer: id:10597953 store_id:407939 , addr: 10.12.5.222:20160, idx: 0"] [needReload=false] [error="context deadline exceeded"] [errorVerbose="context deadline exceeded\
github.com/pingcap/errors.AddStack\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/pkg/mod/github.com/pingcap/errors@v0.11.4/errors.go:174\
github.com/pingcap/errors.Trace\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/pkg/mod/github.com/pingcap/errors@v0.11.4/juju_adaptor.go:15\
github.com/pingcap/tidb/store/tikv.sendBatchRequest\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/client_batch.go:585\
github.com/pingcap/tidb/store/tikv.(*rpcClient).SendRequest\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/client.go:287\
github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).sendReqToRegion\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:169\
github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).SendReqCtx\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:133\
github.com/pingcap/tidb/store/tikv.(*RegionRequestSender).SendReq\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/region_request.go:74\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).get\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:324\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).Get\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:282\
github.com/pingcap/tidb/kv.(*unionStore).Get\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/kv/union_store.go:194\
github.com/pingcap/tidb/store/tikv.(*tikvTxn).Get\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/store/tikv/txn.go:135\
github.com/pingcap/tidb/structure.(*TxStructure).Get\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/structure/string.go:35\
github.com/pingcap/tidb/structure.(*TxStructure).GetInt64\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/structure/string.go:44\
github.com/pingcap/tidb/meta.(*Meta).GetBootstrapVersion\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/meta/meta.go:697\
github.com/pingcap/tidb/session.getStoreBootstrapVersion.func1\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/session/session.go:1656\
github.com/pingcap/tidb/kv.RunInNewTxn\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/kv/txn.go:50\
github.com/pingcap/tidb/session.getStoreBootstrapVersion\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/session/session.go:1653\
github.com/pingcap/tidb/session.BootstrapSession\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/session/session.go:1492\
main.createStoreAndDomain\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/tidb-server/main.go:210\
main.main\
\t/home/jenkins/agent/workspace/tidb_v3.1.0-beta.1/go/src/github.com/pingcap/tidb/tidb-server/main.go:172\
runtime.main\
\t/usr/local/go/src/runtime/proc.go:203\
runtime.goexit\
\t/usr/local/go/src/runtime/asm_amd64.s:1357"]

Also, for the failed TiKV node, does the data under /path/to/db/ need to be cleared?

  1. It means stopping the healthy TiKV instances; these operations remove the failed node's Regions on the healthy TiKV instances.
  2. How many TiKV instances are having problems?
  3. For the failed TiKV, since you have already cleaned it up, the data there is no longer useful and can be removed.
  4. If you run into this kind of problem again, there is no need to wipe the whole TiKV's data; cleaning up only the affected Regions' data is enough (see the sketch below).
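One way to deal with individual damaged Regions without wiping the whole store is tikv-ctl's tombstone subcommand, roughly as below (a sketch; the data path, PD address, and region id are placeholders, and the TiKV instance must be stopped first):

tikv-ctl --db /path/to/tikv/db tombstone -p <pd-ip>:2379 -r <region_id>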

Only 1 TiKV has a problem. As you suggested, I stopped each TiKV with the scripts again and re-ran the command below, but problems remain.

tikv-ctl --db /path/to/tikv-data/db unsafe-recover remove-fail-stores

Some TiKV nodes returned:

tidb@two:~$ sudo bin/tikv-ctl --db /home/tidb/db unsafe-recover remove-fail-stores -s 407939 --all-regions

removing stores [407939] from configrations...

Debugger::remove_fail_stores: Not Found "No store ident key"

pure virtual method called

terminate called without an active exception

Also, I looked at the pd.log. Does this mean PD is currently replicating the Regions that were held by the failed TiKV? Do I need to wait for that replication to finish before the tidb service can start successfully?

[2020/06/05 01:27:45.048 +00:00] [INFO] [operator_controller.go:391] ["send schedule command"] [region-id=9375280] [step="promote learner peer 10692179 on store 407940 to voter"] [source="active push"]

OK, I will reply later. Thanks.

The problem is solved. Here is a summary: First, while updating data, keep an eye on whether the servers are healthy (do not lose power suddenly like we did). Second, when the node hosting a TiKV fails, remove the affected Region ids (recommended) or all Regions on the failed node; when doing so, note that 1) each TiKV must be stopped before running tikv-ctl --db ..., 2) /path/to/tikv/db must point to the correct data directory, and 3) use as many replicas as possible to stay safe. Next, when a problem occurs, check deploy/log/xx.log to determine the cause. Finally, many thanks to the staff for their patient answers.
