After shutting down the whole TiDB cluster and starting it again, the TiKV nodes keep printing logs and stay in Disconnected state — how can this be resolved?

  • 【TiDB version】: 3.0.12
  • 【Problem description】:
    After the whole TiDB cluster was shut down and started again, the TiKV nodes keep printing the log messages below and their status remains disconnected. How can this be resolved?
    [2020/09/26 18:05:04.235 +08:00] [INFO] [pd.rs:549] ["try to change peer"] [peer="id: 1457510 store_id: 1352078 is_learner: true"] [change_type=AddLearnerNode] [region_id=141613]
    [2020/09/26 18:05:04.235 +08:00] [INFO] [pd.rs:549] ["try to change peer"] [peer="id: 1457511 store_id: 1352078 is_learner: true"] [change_type=AddLearnerNode] [region_id=591409]
    [2020/09/26 18:05:04.235 +08:00] [INFO] [pd.rs:549] ["try to change peer"] [peer="id: 1457572 store_id: 1352078 is_learner: true"] [change_type=AddLearnerNode] [region_id=200268]
    [2020/09/26 18:05:04.235 +08:00] [INFO] [pd.rs:549] ["try to change peer"] [peer="id: 1457573 store_id: 1352078 is_learner: true"] [change_type=AddLearnerNode] [region_id=685552]
    [2020/09/26 18:05:04.235 +08:00] [INFO] [pd.rs:549] ["try to change peer"] [peer="id: 1457517 store_id: 1352078 is_learner: true"] [change_type=AddLearnerNode] [region_id=671663]
    [2020/09/26 18:05:04.235 +08:00] [INFO] [pd.rs:549] ["try to change peer"] [peer="id: 1457519 store_id: 1352078 is_learner: true"] [change_type=AddLearnerNode] [region_id=170061]
    [2020/09/26 18:05:04.235 +08:00] [INFO] [pd.rs:549] ["try to change peer"] [peer="id: 1457520 store_id: 1352078 is_learner: true"] [change_type=AddLearnerNode] [region_id=226661]
    [2020/09/26 18:05:04.235 +08:00] [INFO] [pd.rs:549] ["try to change peer"] [peer="id: 1457583 store_id: 1352078 is_learner: true"] [change_type=AddLearnerNode] [region_id=121022]
    [2020/09/26 18:05:04.235 +08:00] [INFO] [pd.rs:549] ["try to change peer"] [peer="id: 1457527 store_id: 1352078 is_learner: true"] [change_type=AddLearnerNode] [region_id=192334]

Hi, are all TiKV nodes failing to start, or is only a single TiKV instance failing? The lines you posted are all INFO-level messages; please provide the complete log.
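
To help collect that information, here is a minimal sketch of how to check which stores PD considers Disconnected and how to filter the TiKV log for anything above INFO. The PD address and log path below are placeholders; adjust them to your deployment.

# List every store and its state_name ("Up", "Disconnected", "Down", ...) as seen by PD
# (127.0.0.1:2379 is a placeholder PD address)
pd-ctl -u http://127.0.0.1:2379 -d store

# Show only WARN/ERROR/FATAL lines from the TiKV log
# (/data/deploy/log/tikv.log is a placeholder path; use the log_dir from your inventory)
grep -E '\[(WARN|ERROR|FATAL)\]' /data/deploy/log/tikv.log | tail -n 50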

Because CPU usage on the cluster was high to begin with, we restarted the cluster with Ansible. After the restart, 3 of the 5 TiKV nodes keep printing "try to change peer" log lines and their status is "state_name": "Disconnected". pd-ctl shows that many regions on these nodes have pending peers, for example:

{
  "id": 659960,
  "start_key": "7480000000000004FF9D5F698000000000FF00000101C55A44ECFF85E04280FFA85B6AFF3AFE4D80F7FF0000FF000000000000F701FFFF0203245CE3579DFFFF51300000000000FF00F90382825D68D7FF79F60B0000000000FA",
  "end_key": "7480000000000004FF9D5F698000000000FF00000101C55A44ECFF85E04280FFA85B6AFF3AFE4D80F7FF0000FF000000000000F701FFFF0203245CE3579DFFFF51300000000000FF00F903828531BCE0FF52E9B40000000000FA",
  "epoch": {
    "conf_ver": 203,
    "version": 782
  },
  "peers": [
    {
      "id": 659962,
      "store_id": 6
    },
    {
      "id": 659963,
      "store_id": 234966
    },
    {
      "id": 1300428,
      "store_id": 4
    }
  ],
  "leader": {
    "id": 659962,
    "store_id": 6
  },
  "pending_peers": [
    {
      "id": 1300428,
      "store_id": 4
    }
  ],
  "approximate_size": 104,
  "approximate_keys": 973700

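
(For reference, a sketch of the pd-ctl queries that produce output like the region above; the PD address is a placeholder.)

# List all regions that currently have pending peers
pd-ctl -u http://127.0.0.1:2379 -d region check pending-peer

# Inspect a single region by its id
pd-ctl -u http://127.0.0.1:2379 -d region 659960
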
Please confirm a few things:

  1. Is the cluster currently accessible? Do SQL statements return any errors?
  2. Upload the Overview dashboard metrics.
  3. Upload the complete tikv.log file.
  4. Have you tried a rolling_update restart of the cluster to see whether it recovers? (See the example commands after this list.)
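
For step 4, a rough sketch of the rolling restart with TiDB Ansible (run from your tidb-ansible directory; the path below is a placeholder):

cd /home/tidb/tidb-ansible

# Rolling-restart only the TiKV services
ansible-playbook rolling_update.yml --tags=tikv

# Or rolling-restart the whole cluster
ansible-playbook rolling_update.yml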

Steps to export the metrics:

  1. Open the Overview dashboard and select the last 3 hours as the time range.
  2. In the Grafana dashboard, press d and then E to expand the panels in all rows, then wait a while for the page to finish loading.
  3. Use the tool at https://metricstool.pingcap.com/ to export the Grafana data as a snapshot.

For details, see the document: [FAQ] Exporting and Importing Grafana Metrics Pages