TiDB fails to start after forcibly taking TiKV nodes offline

[TiDB version] v4.0.6

Today, two TiKV nodes had been stuck in the Pending Offline state. I saw that their leader counts had already dropped to 0, so I removed both of them with `scale-in --force`.
When I then tried to restart the cluster, the TiDB nodes failed to start.
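For reference, a forced scale-in of this kind is done with a command along the following lines. This is only a sketch: the cluster name and node addresses are placeholders, not taken from the post.

```bash
# --force removes the nodes from the topology immediately, without waiting for
# their Region replicas to be migrated away first, which is what leads to the
# replica loss described below. Cluster name and addresses are placeholders.
tiup cluster scale-in <cluster-name> --node <tikv-ip-1>:20160,<tikv-ip-2>:20160 --force
```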


At the same time, the TiKV node logs also show errors:

The leader counts of the nodes in the store list have all dropped to 0:

{
  "count": 6,
  "stores": [
    {
      "store": {
        "id": 38833310,
        "address": "10.12.5.147:20160",
        "version": "4.0.6",
        "status_address": "10.12.5.147:20180",
        "git_hash": "ca2475bfbcb49a7c34cf783596acb3edd05fc88f",
        "start_timestamp": 1617101326,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1617129874605847067,
        "state_name": "Up"
      },
      "status": {
        "capacity": "0B",
        "available": "0B",
        "used_size": "0B",
        "leader_count": 0,
        "leader_weight": 2,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 94661,
        "region_weight": 2,
        "region_score": 8681,
        "region_size": 17362,
        "start_ts": "2021-03-30T10:48:46Z",
        "last_heartbeat_ts": "2021-03-30T18:44:34.605847067Z",
        "uptime": "7h55m48.605847067s"
      }
    },
    {
      "store": {
        "id": 256634153,
        "address": "10.12.5.11:20160",
        "state": 1,
        "version": "4.0.6",
        "status_address": "10.12.5.11:20180",
        "git_hash": "ca2475bfbcb49a7c34cf783596acb3edd05fc88f",
        "start_timestamp": 1617003174,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1617056393120558835,
        "state_name": "Offline"
      },
      "status": {
        "capacity": "0B",
        "available": "0B",
        "used_size": "0B",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 0,
        "region_weight": 1,
        "region_score": 0,
        "region_size": 0,
        "start_ts": "2021-03-29T07:32:54Z",
        "last_heartbeat_ts": "2021-03-29T22:19:53.120558835Z",
        "uptime": "14h46m59.120558835s"
      }
    },
    {
      "store": {
        "id": 256634687,
        "address": "10.12.5.12:20160",
        "version": "4.0.6",
        "status_address": "10.12.5.12:20180",
        "git_hash": "ca2475bfbcb49a7c34cf783596acb3edd05fc88f",
        "start_timestamp": 1617101516,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1617102184699004054,
        "state_name": "Up"
      },
      "status": {
        "capacity": "2.952TiB",
        "available": "2.929TiB",
        "used_size": "14.98GiB",
        "leader_count": 421,
        "leader_weight": 1,
        "leader_score": 421,
        "leader_size": 30916,
        "region_count": 463,
        "region_weight": 1,
        "region_score": 30916,
        "region_size": 30916,
        "start_ts": "2021-03-30T10:51:56Z",
        "last_heartbeat_ts": "2021-03-30T11:03:04.699004054Z",
        "uptime": "11m8.699004054s"
      }
    },
    {
      "store": {
        "id": 24478148,
        "address": "10.12.5.236:20160",
        "version": "4.0.6",
        "status_address": "10.12.5.236:20180",
        "git_hash": "ca2475bfbcb49a7c34cf783596acb3edd05fc88f",
        "start_timestamp": 1617101400,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1617128494296419927,
        "state_name": "Up"
      },
      "status": {
        "capacity": "0B",
        "available": "0B",
        "used_size": "0B",
        "leader_count": 0,
        "leader_weight": 2,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 95298,
        "region_weight": 2,
        "region_score": 5293,
        "region_size": 10586,
        "start_ts": "2021-03-30T10:50:00Z",
        "last_heartbeat_ts": "2021-03-30T18:21:34.296419927Z",
        "uptime": "7h31m34.296419927s"
      }
    },
    {
      "store": {
        "id": 24480822,
        "address": "10.12.5.239:20160",
        "version": "4.0.6",
        "status_address": "10.12.5.239:20180",
        "git_hash": "ca2475bfbcb49a7c34cf783596acb3edd05fc88f",
        "start_timestamp": 1617101450,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1617128502506331755,
        "state_name": "Up"
      },
      "status": {
        "capacity": "0B",
        "available": "0B",
        "used_size": "0B",
        "leader_count": 2,
        "leader_weight": 2,
        "leader_score": 1,
        "leader_size": 117,
        "region_count": 86212,
        "region_weight": 2,
        "region_score": 7610,
        "region_size": 15220,
        "start_ts": "2021-03-30T10:50:50Z",
        "last_heartbeat_ts": "2021-03-30T18:21:42.506331755Z",
        "uptime": "7h30m52.506331755s"
      }
    },
    {
      "store": {
        "id": 24590972,
        "address": "10.12.5.240:20160",
        "version": "4.0.6",
        "status_address": "10.12.5.240:20180",
        "git_hash": "ca2475bfbcb49a7c34cf783596acb3edd05fc88f",
        "start_timestamp": 1617101530,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1617101423422266377,
        "state_name": "Disconnected"
      },
      "status": {
        "capacity": "5.952TiB",
        "available": "4.033TiB",
        "used_size": "1.829TiB",
        "leader_count": 0,
        "leader_weight": 2,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 84954,
        "region_weight": 2,
        "region_score": 13013.5,
        "region_size": 26027,
        "start_ts": "2021-03-30T10:52:10Z",
        "last_heartbeat_ts": "2021-03-30T10:50:23.422266377Z"
      }
    }
  ]
}
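The store list above is the output of the pd-ctl `store` command. Assuming a v4.0.6 deployment managed by tiup, it can be reproduced with something like the following (the PD endpoint is a placeholder):

```bash
# Query PD for the full store list; any reachable PD member works.
pd-ctl -u http://<pd-ip>:2379 store
# Equivalent invocation through tiup:
tiup ctl:v4.0.6 pd -u http://<pd-ip>:2379 store
```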

The PD log shows the following:

The log of the other TiDB node shows a different error:

The leader count is 0 but the region count is not, which means those stores still hold Region peers. Forcibly scaling in two TiKV instances at the same time causes some Regions to lose the majority of their replicas; they can no longer satisfy the Raft majority requirement and therefore cannot serve requests.

You can refer to this document and try to recover: TiDB Disaster Recovery Drill - Multiple Replica Loss
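The first step in that document is to identify the Regions whose majority of replicas sit on the removed stores. A sketch of that query, with the PD endpoint and the two removed store IDs as placeholders (the jq filter follows the pattern used in the multi-replica-loss recovery docs; jq must be installed):

```bash
# List Regions where at least half of the peers are on the failed stores.
# Replace the two placeholders with the store IDs of the force-removed TiKV nodes.
pd-ctl -u http://<pd-ip>:2379 region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(<failed-store-id-1>,<failed-store-id-2>) then . else empty end) | length>=$total-length)}'
```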

Thanks, the region-miss issue has now been resolved.
The current problem is that when TiDB starts, its log reports pd server timeout.


The PD log also shows connection refused.

PD and TiDB run on different ports of the same virtual machine.
The topology is as follows:

The clocks on the two machines, 114 and 115, are synchronized.

Could I add you on WeChat? :pleading_face::pleading_face::pleading_face::pleading_face:

The other PD nodes' logs also show connection-level problems:

The error message contains ERROR 9001 (HY000): PD Server Timeout.

This means the request to PD timed out. Please check the PD server's status, monitoring, and logs, as well as the network between the TiDB server and the PD server.

If needed, you can contact a community expert.
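Before digging further, it is worth confirming that PD itself is healthy and reachable from the TiDB host. A minimal check, with the PD address as a placeholder:

```bash
# Check PD member health from pd-ctl.
pd-ctl -u http://<pd-ip>:2379 health
pd-ctl -u http://<pd-ip>:2379 member
# The same information is exposed over HTTP, which also verifies basic connectivity
# from the TiDB host to the PD client port.
curl http://<pd-ip>:2379/pd/api/v1/health
```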

We have two servers here, each with both PD and TiDB deployed. The two machines can communicate with each other, and the logs show that even the TiDB instance on the same server fails to connect, so we suspect this is not a network-level problem.


The PD log shows this as well.

Alternatively, if I want to add new TiDB nodes, how can I skip starting the two existing TiDB instances during scale-out? They cannot start, so I have no way to bring new nodes into the cluster.


This pd server timeout should be reported from the TiKV layer. Forcibly scaling in two TiKV instances leaves some Regions unavailable, so TiDB also fails when it accesses TiKV during startup. It is recommended to first follow the disaster recovery drill document above and restore the Region state in the cluster.

Analyzing the Regions, we found the following:

  1. A large number of Regions are missing a leader.
  2. There are Regions with one, two, and four replicas.

I did not find the corresponding recovery steps in the document.

For 1-replica Regions, you can follow the steps in the document for the case where 2 TiKV nodes go down at the same time.

For Regions with >= 2 replicas, in theory a leader can be elected automatically. You can pick a few Regions and check their details with `region ${region_id}`.
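For example, to inspect a single Region and see which stores its peers and current leader (if any) are on, where the Region ID and PD address are placeholders:

```bash
# Print the full Region record, including its peers and the current leader if one exists.
pd-ctl -u http://<pd-ip>:2379 region <region_id>
```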

We have already forced the two TiKV stores into the Tombstone state and then forcibly cleaned up the tombstones. The cluster, the monitoring, and the store list no longer contain these two machines, and the machines have been shut down. However, some Regions still have replicas on these two machines.
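For context, the post does not show the exact cleanup commands used; the tombstone cleanup step is normally something like the following (PD address is a placeholder):

```bash
# Remove all stores that are already in Tombstone state from PD's metadata.
pd-ctl -u http://<pd-ip>:2379 store remove-tombstone
```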

We ran the relevant commands per the document:

stop TiKV => run unsafe-recover (as shown in the screenshot) => stop PD => restart PD and TiKV
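The unsafe-recover step in that flow is run with tikv-ctl on each TiKV node while its tikv-server process is stopped. A sketch, where the data path and the failed store IDs are placeholders:

```bash
# Run on every surviving TiKV node while tikv-server is stopped.
# <tikv-data-dir> is that node's data directory (placeholder); the two store IDs
# are the force-removed stores whose peers should be dropped from the Raft groups.
tikv-ctl --db <tikv-data-dir>/db unsafe-recover remove-fail-stores \
  -s <failed-store-id-1>,<failed-store-id-2> --all-regions
```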


The monitoring now shows a large number of miss-peer regions.

But when we query the Region information, the peer store list still contains the store IDs that were set to Tombstone and removed (3854* and 2566* in the screenshot).

My questions:

  1. The results above suggest that our cleanup of the problematic replicas did not succeed. Which step went wrong?
  2. For Regions that are missing a leader but already have three replicas, do we need to add a leader manually? If so, how? This covers two cases:
    1) The store IDs of the Region's replicas include a store ID that was set to Tombstone and removed.
    2) The store IDs of the Region's replicas are all healthy store IDs.
  1. Did you run the unsafe-recover operation on all of the healthy TiKV nodes?
  2. The pd-ctl query results still look wrong because the underlying Regions have not elected a leader, so they cannot report heartbeats to PD, and the Region information stored in PD has not been updated.
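One way to confirm this is to list the Regions that currently have no leader directly from pd-ctl. The jq filter below follows the one used in the recovery document (PD address is a placeholder; jq must be installed):

```bash
# List Regions that have no elected leader, together with the stores holding their peers.
pd-ctl -u http://<pd-ip>:2379 region --jq='.regions[] | select(has("leader") | not) | {id: .id, peer_stores: [.peers[].store_id]}'
```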

Please upload the recent tikv.log text files from a few of the healthy nodes so we can take a look. If the files are large, you can compress them. Screenshots are not convenient for reading logs.

  1. The unsafe-recover operation was run on all TiKV nodes.
  2. The cluster does not seem to be processing the miss-peer regions; their number has not decreased.
  3. The logs of the relevant nodes are too large, so we put them on Baidu Netdisk:
    https://pan.baidu.com/s/1wx5UlCgQ0vsYmoBREyAM8Q password: 0rgw

Where did you get the store id 38546296 from?

In the pd-ctl output provided earlier, there is one node in the Offline state with store id 256634153,
and one node in the Disconnected state with store id 24590972. There is no node with store id 38546296.

The number of miss-peer regions will only decrease if a Region has a leader peer and its state is healthy enough to satisfy the Raft protocol; in that case the missing replicas are replenished automatically. If a Region has no leader, the missing replicas will not be replenished.
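To watch whether PD is actually generating repair operators for those Regions, you can check the current operators and the scheduling configuration (PD address is a placeholder):

```bash
# Show the operators PD is currently executing; replica makeup shows up here.
pd-ctl -u http://<pd-ip>:2379 operator show
# Show the scheduling configuration; if replica-related limits were set to 0
# during recovery, PD will not replenish missing replicas until they are restored.
pd-ctl -u http://<pd-ip>:2379 config show
```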

38546296 and 256634153 are the store ids of the two stores we forcibly took offline earlier.

The current store information is as follows:
{
  "count": 5,
  "stores": [
    {
      "store": {
        "id": 24478148,
        "address": "10.12.5.236:20160",
        "version": "4.0.6",
        "status_address": "10.12.5.236:20180",
        "git_hash": "ca2475bfbcb49a7c34cf783596acb3edd05fc88f",
        "start_timestamp": 1617267392,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1617267420362977924,
        "state_name": "Down"
      },
      "status": {
        "capacity": "0B",
        "available": "0B",
        "used_size": "0B",
        "leader_count": 0,
        "leader_weight": 2,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 102579,
        "region_weight": 2,
        "region_score": 648925.5,
        "region_size": 1297851,
        "start_ts": "2021-04-01T08:56:32Z",
        "last_heartbeat_ts": "2021-04-01T08:57:00.362977924Z",
        "uptime": "28.362977924s"
      }
    },
    {
      "store": {
        "id": 24480822,
        "address": "10.12.5.239:20160",
        "version": "4.0.6",
        "status_address": "10.12.5.239:20180",
        "git_hash": "ca2475bfbcb49a7c34cf783596acb3edd05fc88f",
        "start_timestamp": 1617267444,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1617267465155871371,
        "state_name": "Down"
      },
      "status": {
        "capacity": "0B",
        "available": "0B",
        "used_size": "0B",
        "leader_count": 0,
        "leader_weight": 2,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 88892,
        "region_weight": 2,
        "region_score": 187676,
        "region_size": 375352,
        "start_ts": "2021-04-01T08:57:24Z",
        "last_heartbeat_ts": "2021-04-01T08:57:45.155871371Z",
        "uptime": "21.155871371s"
      }
    },
    {
      "store": {
        "id": 24590972,
        "address": "10.12.5.240:20160",
        "version": "4.0.6",
        "status_address": "10.12.5.240:20180",
        "git_hash": "ca2475bfbcb49a7c34cf783596acb3edd05fc88f",
        "start_timestamp": 1617267523,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1617267459680097985,
        "state_name": "Down"
      },
      "status": {
        "capacity": "0B",
        "available": "0B",
        "used_size": "0B",
        "leader_count": 0,
        "leader_weight": 2,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 85336,
        "region_weight": 2,
        "region_score": 182276.5,
        "region_size": 364553,
        "start_ts": "2021-04-01T08:58:43Z",
        "last_heartbeat_ts": "2021-04-01T08:57:39.680097985Z"
      }
    },
    {
      "store": {
        "id": 38833310,
        "address": "10.12.5.147:20160",
        "version": "4.0.6",
        "status_address": "10.12.5.147:20180",
        "git_hash": "ca2475bfbcb49a7c34cf783596acb3edd05fc88f",
        "start_timestamp": 1617267322,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1617303971141548886,
        "state_name": "Up"
      },
      "status": {
        "capacity": "5.952TiB",
        "available": "3.644TiB",
        "used_size": "2.118TiB",
        "leader_count": 4540,
        "leader_weight": 2,
        "leader_score": 2270,
        "leader_size": 268013,
        "region_count": 99702,
        "region_weight": 2,
        "region_score": 579267,
        "region_size": 1158534,
        "start_ts": "2021-04-01T08:55:22Z",
        "last_heartbeat_ts": "2021-04-01T19:06:11.141548886Z",
        "uptime": "10h10m49.141548886s"
      }
    },
    {
      "store": {
        "id": 256634687,
        "address": "10.12.5.12:20160",
        "version": "4.0.6",
        "status_address": "10.12.5.12:20180",
        "git_hash": "ca2475bfbcb49a7c34cf783596acb3edd05fc88f",
        "start_timestamp": 1617267512,
        "deploy_path": "/home/tidb/deploy/bin",
        "last_heartbeat": 1617303976264168093,
        "state_name": "Up"
      },
      "status": {
        "capacity": "2.952TiB",
        "available": "2.331TiB",
        "used_size": "623.3GiB",
        "leader_count": 24001,
        "leader_weight": 1,
        "leader_score": 24001,
        "leader_size": 1329895,
        "region_count": 26631,
        "region_weight": 1,
        "region_score": 1435620,
        "region_size": 1435620,
        "start_ts": "2021-04-01T08:58:32Z",
        "last_heartbeat_ts": "2021-04-01T19:06:16.264168093Z",
        "uptime": "10h7m44.264168093s"
      }
    }
  ]
}

$ tail -n 3 tikv239.log
[2021/04/01 10:26:24.363 +00:00] [INFO] [properties.rs:164] ["decode to RangeProperties failed with err: KeyNotFound, try to decode to SizeProperties, maybe upgrade from v2.0 or older version?"]
[2021/04/01 10:26:24.723 +00:00] [INFO] [properties.rs:164] ["decode to RangeProperties failed with err: KeyNotFound, try to decode to SizeProperties, maybe upgrade from v2.0 or older version?"]
[2021/04/01 10:26:24.817 +00:00] [INFO] [properties.rs:164] ["decode to RangeProperties failed with err: KeyNotFound, try to decode to SizeProperties, maybe upgrade from v2.0 or older version?"]

Is there a time-zone difference between the timestamps in the TiKV log and the actual time, or are the logs you provided really from the morning?

  1. Sorry, that was because we had set limit = 0 as instructed by the document; the miss-peer region count is now decreasing.
  2. Yes, the timestamps have a time-zone offset.
  3. The current operators are mainly balance-leader and makeup-replica.
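If the scheduling limits were lowered to 0 during recovery per the document, they can be restored afterwards so that replica makeup and rebalancing proceed at normal speed. A sketch, where the PD address is a placeholder and the values are only examples, not this cluster's original settings:

```bash
# Restore scheduling limits after recovery (example values; adjust to the
# cluster's original configuration).
pd-ctl -u http://<pd-ip>:2379 config set region-schedule-limit 2048
pd-ctl -u http://<pd-ip>:2379 config set replica-schedule-limit 64
pd-ctl -u http://<pd-ip>:2379 config set leader-schedule-limit 4
pd-ctl -u http://<pd-ip>:2379 config set merge-schedule-limit 8
```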