- [TiDB version]: 3.0.12
- [Problem description]:
After the whole TiDB cluster was shut down and started again, the TiKV nodes keep printing the log below and their status stays Disconnected. How can this be resolved?
[2020/09/26 18:05:04.235 +08:00] [INFO] [pd.rs:549] ["try to change peer"] [peer="id: 1457510 store_id: 1352078 is_learner: true"] [change_type=AddLearnerNode] [region_id=141613]
[2020/09/26 18:05:04.235 +08:00] [INFO] [pd.rs:549] ["try to change peer"] [peer="id: 1457511 store_id: 1352078 is_learner: true"] [change_type=AddLearnerNode] [region_id=591409]
[2020/09/26 18:05:04.235 +08:00] [INFO] [pd.rs:549] ["try to change peer"] [peer="id: 1457572 store_id: 1352078 is_learner: true"] [change_type=AddLearnerNode] [region_id=200268]
[2020/09/26 18:05:04.235 +08:00] [INFO] [pd.rs:549] ["try to change peer"] [peer="id: 1457573 store_id: 1352078 is_learner: true"] [change_type=AddLearnerNode] [region_id=685552]
[2020/09/26 18:05:04.235 +08:00] [INFO] [pd.rs:549] ["try to change peer"] [peer="id: 1457517 store_id: 1352078 is_learner: true"] [change_type=AddLearnerNode] [region_id=671663]
[2020/09/26 18:05:04.235 +08:00] [INFO] [pd.rs:549] ["try to change peer"] [peer="id: 1457519 store_id: 1352078 is_learner: true"] [change_type=AddLearnerNode] [region_id=170061]
[2020/09/26 18:05:04.235 +08:00] [INFO] [pd.rs:549] ["try to change peer"] [peer="id: 1457520 store_id: 1352078 is_learner: true"] [change_type=AddLearnerNode] [region_id=226661]
[2020/09/26 18:05:04.235 +08:00] [INFO] [pd.rs:549] ["try to change peer"] [peer="id: 1457583 store_id: 1352078 is_learner: true"] [change_type=AddLearnerNode] [region_id=121022]
[2020/09/26 18:05:04.235 +08:00] [INFO] [pd.rs:549] ["try to change peer"] [peer="id: 1457527 store_id: 1352078 is_learner: true"] [change_type=AddLearnerNode] [region_id=192334]
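To gauge how much replica churn these INFO lines represent, the bracketed fields can be tallied per target store. This is only a convenience sketch: the regex is written against the log format in the excerpt above, and the sample lines are copied from it.

```python
import re
from collections import Counter

# Matches the peer/store/region fields in the TiKV pd.rs lines above, e.g.
# [peer="id: 1457510 store_id: 1352078 is_learner: true"] ... [region_id=141613]
LINE_RE = re.compile(
    r'\[peer="id: (?P<peer_id>\d+) store_id: (?P<store_id>\d+)[^\]]*\]'
    r'.*\[region_id=(?P<region_id>\d+)\]'
)

def change_peer_counts(lines):
    """Count 'try to change peer' events per target store."""
    per_store = Counter()
    for line in lines:
        if "try to change peer" not in line:
            continue
        match = LINE_RE.search(line)
        if match:
            per_store[match.group("store_id")] += 1
    return per_store

# Two lines copied from the excerpt above.
sample = [
    '[2020/09/26 18:05:04.235 +08:00] [INFO] [pd.rs:549] ["try to change peer"] '
    '[peer="id: 1457510 store_id: 1352078 is_learner: true"] '
    '[change_type=AddLearnerNode] [region_id=141613]',
    '[2020/09/26 18:05:04.235 +08:00] [INFO] [pd.rs:549] ["try to change peer"] '
    '[peer="id: 1457511 store_id: 1352078 is_learner: true"] '
    '[change_type=AddLearnerNode] [region_id=591409]',
]
counts = change_peer_counts(sample)
print(counts)  # in this excerpt every event targets store 1352078
```

If one store dominates the counts, that is the store PD is still trying to repopulate with learner replicas.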
Hi, are all TiKV nodes failing to start, or is only a single TiKV instance failing? The log lines you posted are all INFO messages; could you provide the complete log?
Because the cluster's CPU usage was high to begin with, we restarted the cluster with Ansible. After the restart, 3 of the 5 TiKV nodes keep printing "try to change peer" logs and their status is "state_name": "Disconnected". pd-ctl shows that many regions on these nodes have peers in a pending state. For example:
{
  "id": 659960,
  "start_key": "7480000000000004FF9D5F698000000000FF00000101C55A44ECFF85E04280FFA85B6AFF3AFE4D80F7FF0000FF000000000000F701FFFF0203245CE3579DFFFF51300000000000FF00F90382825D68D7FF79F60B0000000000FA",
  "end_key": "7480000000000004FF9D5F698000000000FF00000101C55A44ECFF85E04280FFA85B6AFF3AFE4D80F7FF0000FF000000000000F701FFFF0203245CE3579DFFFF51300000000000FF00F903828531BCE0FF52E9B40000000000FA",
  "epoch": {
    "conf_ver": 203,
    "version": 782
  },
  "peers": [
    {
      "id": 659962,
      "store_id": 6
    },
    {
      "id": 659963,
      "store_id": 234966
    },
    {
      "id": 1300428,
      "store_id": 4
    }
  ],
  "leader": {
    "id": 659962,
    "store_id": 6
  },
  "pending_peers": [
    {
      "id": 1300428,
      "store_id": 4
    }
  ],
  "approximate_size": 104,
  "approximate_keys": 973700
}
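Rather than eyeballing each region, the pending peers can be grouped by store from the `pd-ctl` region JSON. A minimal sketch, assuming the output has the shape shown above (only the fields this sketch reads are kept from the sample region):

```python
import json
from collections import defaultdict

def pending_peers_by_store(regions):
    """Group each region's pending peers by the store that hosts them."""
    by_store = defaultdict(list)
    for region in regions:
        for peer in region.get("pending_peers", []):
            by_store[peer["store_id"]].append(
                {"region_id": region["id"], "peer_id": peer["id"]}
            )
    return dict(by_store)

# The sample region from above, trimmed to the relevant fields.
region = json.loads("""
{
  "id": 659960,
  "pending_peers": [
    {"id": 1300428, "store_id": 4}
  ]
}
""")
result = pending_peers_by_store([region])
print(result)  # pending peer 1300428 of region 659960 sits on store 4
```

A store that accumulates a large share of the pending peers after a restart is usually the one worth inspecting first.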
Please confirm a few things:
- Is access to the cluster currently normal? Do SQL statements return any errors?
- Upload the Overview dashboard metrics
- Upload the complete tikv.log file
- Have you tried a rolling_update restart of the cluster to see whether that resolves the issue?
Steps to export the metrics:
- Open the Overview dashboard and set the monitoring time range to the last 3 hours
- In the Grafana dashboard, press d and then E to expand the panels of all rows (wait a while for the page to finish loading)
- Use the tool at https://metricstool.pingcap.com/ to export the Grafana data as a snapshot
For details, see the doc: [FAQ] Exporting and importing Grafana metrics pages