Recently the database stores keep going Disconnected or Down, and "no space" failures occur frequently

To help us respond faster, please provide the information below when asking a question; clearly described problems get priority.

  • 【TiDB version】: 3.1.0-beta.1
  • 【Problem description】:
    The database has recently been under heavy read/write load and performance has degraded. Operations fail with "tikv timeout" or "region is unavailable" errors.

The TiKV nodes are frequently unstable. For example, the current states are:

tidb@three:~/tidb-ansible/resources/bin$ ./pd-ctl -u "http://10.12.5.113:2379" store|grep state_name
        "state_name": "Up"
        "state_name": "Up"
        "state_name": "Disconnected"
        "state_name": "Up"
        "state_name": "Down"
        "state_name": "Disconnected"
        "state_name": "Disconnected"
        "state_name": "Disconnected"
        "state_name": "Up"
        "state_name": "Up"
        "state_name": "Disconnected"
        "state_name": "Down"
        "state_name": "Down"

tidb@three:~/tidb-ansible/resources/bin$ ./pd-ctl -u "http://10.12.5.113:2379" store

{
  "count": 13,
  "stores": [
    {
      "store": {
        "id": 335855,
        "address": "10.12.5.230:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "199GiB",
        "leader_count": 9390,
        "leader_weight": 1,
        "leader_score": 1043499,
        "leader_size": 1043499,
        "region_count": 17293,
        "region_weight": 1,
        "region_score": 940071725.5503502,
        "region_size": 1643391,
        "start_ts": "2020-08-23T02:10:16Z",
        "last_heartbeat_ts": "2020-08-23T02:46:07.227896007Z",
        "uptime": "35m51.227896007s"
      }
    },
    {
      "store": {
        "id": 484920,
        "address": "10.12.5.223:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "203.2GiB",
        "leader_count": 11427,
        "leader_weight": 1,
        "leader_score": 1213914,
        "leader_size": 1213914,
        "region_count": 18726,
        "region_weight": 1,
        "region_score": 914446997.4288011,
        "region_size": 1770229,
        "start_ts": "2020-08-23T02:09:57Z",
        "last_heartbeat_ts": "2020-08-23T02:46:21.554811523Z",
        "uptime": "36m24.554811523s"
      }
    },
    {
      "store": {
        "id": 17388737,
        "address": "10.12.5.214:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Disconnected"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "193.8GiB",
        "leader_count": 18,
        "leader_weight": 1,
        "leader_score": 1355,
        "leader_size": 1355,
        "region_count": 15416,
        "region_weight": 1,
        "region_score": 971808761.6416945,
        "region_size": 1261973,
        "start_ts": "2020-08-23T02:43:19Z",
        "last_heartbeat_ts": "2020-08-23T02:43:30.469837181Z",
        "uptime": "11.469837181s"
      }
    },
    {
      "store": {
        "id": 407938,
        "address": "10.12.5.221:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "428.9GiB",
        "leader_count": 2279,
        "leader_weight": 1,
        "leader_score": 1155710,
        "leader_size": 1155710,
        "region_count": 2634,
        "region_weight": 1,
        "region_score": 1182197,
        "region_size": 1182197,
        "sending_snap_count": 1,
        "start_ts": "2020-08-23T02:09:32Z",
        "last_heartbeat_ts": "2020-08-23T02:46:15.375248999Z",
        "uptime": "36m43.375248999s"
      }
    },
    {
      "store": {
        "id": 665678,
        "address": "127.0.0.1:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "leader_weight": 1,
        "region_weight": 1,
        "start_ts": "1970-01-01T00:00:00Z"
      }
    },
    {
      "store": {
        "id": 640552,
        "address": "10.12.5.224:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "201.1GiB",
        "leader_count": 9736,
        "leader_weight": 1,
        "leader_score": 1086020,
        "leader_size": 1086020,
        "region_count": 18895,
        "region_weight": 1,
        "region_score": 927204434.4555326,
        "region_size": 1657957,
        "start_ts": "2020-08-23T02:10:04Z",
        "last_heartbeat_ts": "2020-08-23T02:46:23.339818555Z",
        "uptime": "36m19.339818555s"
      }
    },
    {
      "store": {
        "id": 2026701,
        "address": "10.12.5.227:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Disconnected"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "198.6GiB",
        "leader_count": 2774,
        "leader_weight": 1,
        "leader_score": 292477,
        "leader_size": 292477,
        "region_count": 18189,
        "region_weight": 1,
        "region_score": 942237682.7523708,
        "region_size": 1502493,
        "start_ts": "2020-08-23T02:11:10Z",
        "last_heartbeat_ts": "2020-08-23T02:26:06.180979812Z",
        "uptime": "14m56.180979812s"
      }
    },
    {
      "store": {
        "id": 6506926,
        "address": "10.12.5.231:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Disconnected"
      },
      "status": {
        "capacity": "1007GiB",
        "available": "311.1GiB",
        "leader_count": 8196,
        "leader_weight": 1,
        "leader_score": 1000551,
        "leader_size": 1000551,
        "region_count": 10832,
        "region_weight": 1,
        "region_score": 489578759.9719987,
        "region_size": 2138334,
        "start_ts": "2020-08-23T02:09:53Z",
        "last_heartbeat_ts": "2020-08-23T02:24:24.524230709Z",
        "uptime": "14m31.524230709s"
      }
    },
    {
      "store": {
        "id": 10968962,
        "address": "10.12.5.233:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "199.6GiB",
        "leader_count": 6514,
        "leader_weight": 1,
        "leader_score": 745635,
        "leader_size": 745635,
        "region_count": 14893,
        "region_weight": 1,
        "region_score": 936286208.6025124,
        "region_size": 2306434,
        "start_ts": "2020-08-23T02:10:19Z",
        "last_heartbeat_ts": "2020-08-23T02:46:22.090152418Z",
        "uptime": "36m3.090152418s"
      }
    },
    {
      "store": {
        "id": 6506924,
        "address": "10.12.5.229:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "capacity": "1007GiB",
        "available": "223.2GiB",
        "leader_count": 2,
        "leader_weight": 1,
        "leader_score": 152,
        "leader_size": 152,
        "region_count": 20252,
        "region_weight": 1,
        "region_score": 957450920.5420499,
        "region_size": 734939,
        "start_ts": "2020-08-23T02:10:01Z",
        "last_heartbeat_ts": "2020-08-23T02:10:11.957212698Z",
        "uptime": "10.957212698s"
      }
    },
    {
      "store": {
        "id": 6506925,
        "address": "10.12.5.228:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "capacity": "1007GiB",
        "available": "222.5GiB",
        "leader_weight": 1,
        "region_count": 19097,
        "region_weight": 1,
        "region_score": 961158063.6551847,
        "region_size": 430576,
        "start_ts": "2020-08-23T02:09:49Z",
        "last_heartbeat_ts": "2020-08-23T02:09:59.935448794Z",
        "uptime": "10.935448794s"
      }
    },
    {
      "store": {
        "id": 407940,
        "address": "10.12.5.220:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "865.2GiB",
        "available": "197GiB",
        "leader_count": 14264,
        "leader_weight": 1,
        "leader_score": 1448249,
        "leader_size": 1448249,
        "region_count": 18524,
        "region_weight": 1,
        "region_score": 925391192.2735996,
        "region_size": 1906767,
        "start_ts": "2020-08-23T02:09:39Z",
        "last_heartbeat_ts": "2020-08-23T02:46:20.320183413Z",
        "uptime": "36m41.320183413s"
      }
    },
    {
      "store": {
        "id": 1597655,
        "address": "10.12.5.226:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Disconnected"
      },
      "status": {
        "capacity": "835.7GiB",
        "available": "188.7GiB",
        "leader_count": 4064,
        "leader_weight": 1,
        "leader_score": 400408,
        "leader_size": 400408,
        "region_count": 13935,
        "region_weight": 1,
        "region_score": 935338040.0638361,
        "region_size": 1155387,
        "start_ts": "2020-08-23T02:10:41Z",
        "last_heartbeat_ts": "2020-08-23T02:21:35.213273663Z",
        "uptime": "10m54.213273663s"
      }
    }
  ]
}

In addition, "no space" errors occur frequently even though the disk actually has more than 100 GB free. The usual workaround is to clear a few GB of logs and delete the last_tikv.toml file, after which df -h on that disk shows the free space is indeed there.
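A minimal sketch of the checks behind this workaround, assuming the default tidb-ansible layout under /home/tidb/deploy (paths and log-name patterns are assumptions; adjust to your deployment):

    # Compare real disk usage with what TiKV reports (paths are assumptions).
    df -h /home/tidb/deploy/data
    du -sh /home/tidb/deploy/* | sort -h

    # The workaround described above: locate the rotated TiKV logs (a few GB)
    # and the last_tikv.toml file before clearing them and restarting the store.
    find /home/tidb/deploy -maxdepth 2 \( -name 'tikv.log.*' -o -name 'last_tikv.toml' \)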

Given the above, I would like to know what is causing the current poor performance and how to fix it.

For performance tuning or troubleshooting questions, please download and run the diagnostic script, then select all of the terminal output and paste it when you upload.

  1. Why are you running the 3.1 beta? We recommend upgrading to the 4.0 GA release.
  2. For the Down and Disconnected stores, check the TiKV logs for the errors reported at that time (see the sketch below).
  3. Check the monitoring during periods of high read/write traffic to see whether CPU, memory, or IO is being exhausted.
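For points 2 and 3, a rough sketch of the checks; the log path follows the default tidb-ansible layout and the date filter is only an example:

    # Point 2: errors in the TiKV log around the time a store went Down/Disconnected.
    grep -E '\[ERROR\]|\[FATAL\]' /home/tidb/deploy/log/tikv.log | grep '2020/08/23'

    # Point 3: CPU, memory, and disk IO pressure during heavy read/write.
    top -b -n 1 | head -n 20
    free -h
    iostat -x 1 5    # from the sysstat package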

Hello. While planning the 3.1 --> 4.0 upgrade, I tried to migrate from tidb-ansible to tiup and ran into a problem:

The documentation says this should be done while all components are healthy. One TiKV node is currently stuck in Disconnected state; can the upgrade proceed in this situation?

tikv.log from the Disconnected node:

[2020/08/24 11:12:29.799 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 17188974, leader may None\" not_leader { region_id: 17188974 }"]

[2020/08/24 11:12:30.321 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 17188974, leader may None\" not_leader { region_id: 17188974 }"]

[2020/08/24 11:12:30.822 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 17188974, leader may None\" not_leader { region_id: 17188974 }"]

[2020/08/24 11:12:31.331 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 17188974, leader may None\" not_leader { region_id: 17188974 }"]

[2020/08/24 11:12:31.831 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 17188974, leader may None\" not_leader { region_id: 17188974 }"]

[2020/08/24 11:12:32.343 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 17188974, leader may None\" not_leader { region_id: 17188974 }"]

[2020/08/24 11:12:32.844 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 17188974, leader may None\" not_leader { region_id: 17188974 }"]

[2020/08/24 11:12:33.351 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 17188974, leader may None\" not_leader { region_id: 17188974 }"]

[2020/08/24 11:12:33.852 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 17188974, leader may None\" not_leader { region_id: 17188974 }"]

[2020/08/24 11:12:34.354 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 17188974, leader may Some(id: 18696000 store_id: 1597655)\" not_leader { region_id: 17188974 leader { id: 18696000 store_id: 1597655 } }"]



tidb@three:~/tidb-ansible/resources/bin$ ./pd-ctl -u "http://10.12.5.113:2379"  region 17188974
    {
      "id": 17188974,
      "start_key": "7480000000000000FF1500000000000000F8",
      "end_key": "7480000000000000FF155F698000000000FF0000020380000000FF0000D93A00000000FB",
      "epoch": {
        "conf_ver": 125,
        "version": 12
      },
      "peers": [
        {
          "id": 18695900,
          "store_id": 17388737
        },
        {
          "id": 18696000,
          "store_id": 1597655
        }
      ],
      "leader": {
        "id": 18696000,
        "store_id": 1597655
      },
      "approximate_size": 58,
      "approximate_keys": 997151
    }

From the store information you posted, several stores are Disconnected or Down, so some regions may not have enough replicas to serve. Could you first try to start the Down and Disconnected stores?

Hi, I tried starting the Down and Disconnected stores with ansible-playbook start.yml -l [ip], but after waiting a long time they are still in the abnormal states shown above.

TiKV log output:

[2020/08/24 14:25:32.497 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 37, leader may None\" not_leader { region_id: 37 }"]
[2020/08/24 14:25:33.503 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 37, leader may None\" not_leader { region_id: 37 }"]
[2020/08/24 14:25:34.514 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 37, leader may None\" not_leader { region_id: 37 }"]
[2020/08/24 14:25:34.970 +00:00] [INFO] [peer.rs:724] ["failed to schedule peer tick"] [err="sending on a disconnected channel"] [tick=RAFT] [peer_id=19924898] [region_id=19924897]
[2020/08/24 14:25:39.875 +00:00] [INFO] [peer.rs:724] ["failed to schedule peer tick"] [err="sending on a disconnected channel"] [tick=RAFT_LOG_GC] [peer_id=19924898] [region_id=19924897]
[2020/08/24 14:25:39.875 +00:00] [INFO] [peer.rs:724] ["failed to schedule peer tick"] [err="sending on a disconnected channel"] [tick=SPLIT_REGION_CHECK] [peer_id=19924898] [region_id=19924897]
[2020/08/24 14:25:44.700 +00:00] [INFO] [peer.rs:724] ["failed to schedule peer tick"] [err="sending on a disconnected channel"] [tick=CHECK_MERGE] [peer_id=19924898] [region_id=19924897]

Your cluster has 3 Down stores, 5 Disconnected stores, and 5 Up stores. With 3 replicas, losing 2 nodes already affects some regions; with 8 problematic stores, a lot of data becomes inaccessible. If they cannot be started at all, refer to https://docs.pingcap.com/zh/tidb/stable/tikv-control#tikv-control-使用说明 and other threads on how to remove the regions that cannot be recovered. Since more than 2 stores are involved, some data loss is possible (a sketch of the tikv-ctl invocation follows the store list below).

    "state_name": "Up"
    "state_name": "Up"
    "state_name": "Disconnected"
    "state_name": "Up"
    "state_name": "Down"
    "state_name": "Disconnected"
    "state_name": "Disconnected"
    "state_name": "Disconnected"
    "state_name": "Up"
    "state_name": "Up"
    "state_name": "Disconnected"
    "state_name": "Down"
    "state_name": "Down

Currently only one node is Down. Can I just delete the regions on that node?
Status of all TiKV nodes:

    "state_name": "Up"
    "state_name": "Up"
    "state_name": "Up"
    "state_name": "Down"
    "state_name": "Up"
    "state_name": "Up"
    "state_name": "Up"
    "state_name": "Up"
    "state_name": "Up"
    "state_name": "Up"
    "state_name": "Up"
    "state_name": "Up"

From the region information on that node, the problematic region has peers on two stores:
    "peers": [
      {
        "id": 18695900,
        "store_id": 17388737
      },
      {
        "id": 18696000,
        "store_id": 1597655
      }
    ],

Also, checking the region status on the failed (Down) node:

tidb@thirteen:~$ ./tikv-ctl --db /home/tidb/deploy/data/db bad-regions
all regions are healthy

If only one node is Down, you can try to scale it in. However, the stores above show only about 199 GiB available each, so please estimate whether the remaining capacity would be enough; we suggest scaling out a new store first and then scaling in the Down one.
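A sketch of the scale-out-then-scale-in flow under tidb-ansible 3.x; <new_tikv_ip>, <down_tikv_ip> and <store_id> are placeholders, and the exact steps should be checked against the scale-out/scale-in documentation for your version:

    # 1. Scale out: add the new TiKV host to inventory.ini, then deploy and start only it.
    ansible-playbook bootstrap.yml -l <new_tikv_ip>
    ansible-playbook deploy.yml -l <new_tikv_ip>
    ansible-playbook start.yml -l <new_tikv_ip>

    # 2. Scale in: ask PD to evict the Down store and wait for it to become Tombstone.
    ./pd-ctl -u "http://10.12.5.113:2379" store delete <store_id>
    ./pd-ctl -u "http://10.12.5.113:2379" store <store_id>

    # 3. Once Tombstone, stop the retired node and remove it from inventory.ini.
    ansible-playbook stop.yml -l <down_tikv_ip>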

I tried restarting the services with ansible-playbook stop.yml / start.yml, and a large number of TiKV nodes ended up Down or Disconnected. After waiting several hours, the current state is:

tidb@three:~/tidb-ansible/resources/bin$ ./pd-ctl -u "http://10.12.5.113:2379" store | grep state_name -C 4
      "store": {
        "id": 2026701,
        "address": "10.12.5.227:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "207.1GiB",
--
      "store": {
        "id": 6506924,
        "address": "10.12.5.229:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "capacity": "1007GiB",
        "available": "234.1GiB",
--
      "store": {
        "id": 6506926,
        "address": "10.12.5.231:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "capacity": "1007GiB",
        "available": "234.1GiB",
--
      "store": {
        "id": 407938,
        "address": "10.12.5.221:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "204.8GiB",
--
      "store": {
        "id": 407940,
        "address": "10.12.5.220:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "865.2GiB",
        "available": "201.8GiB",
--
      "store": {
        "id": 484920,
        "address": "10.12.5.223:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Disconnected"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "210.7GiB",
--
      "store": {
        "id": 17388737,
        "address": "10.12.5.214:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "196.8GiB",
--
      "store": {
        "id": 665678,
        "address": "127.0.0.1:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "leader_weight": 1,
        "region_weight": 1,
--
      "store": {
        "id": 640552,
        "address": "10.12.5.224:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "206.7GiB",
--
      "store": {
        "id": 1597655,
        "address": "10.12.5.226:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "835.7GiB",
        "available": "194.4GiB",
--
      "store": {
        "id": 335855,
        "address": "10.12.5.230:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "208.5GiB",
--
      "store": {
        "id": 6506925,
        "address": "10.12.5.228:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "capacity": "1007GiB",
        "available": "235GiB",
--
      "store": {
        "id": 10968962,
        "address": "10.12.5.233:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "205.9GiB",

  1. For "peer is not leader", see this document:

https://docs.pingcap.com/zh/tidb/stable/tidb-troubleshooting-map#44-某些-tikv-大量掉-leader

  2. You reported that the stores are sometimes fine and sometimes all Down. Try pulling up a single node by hand with the start_tikv.sh script under the deploy/scripts/ directory on the TiKV host and see whether it starts (see the sketch below).

  3. Please upload the TiKV log from the problematic node covering the period from startup to failure. Thanks.
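For point 2, a minimal sketch of pulling up one TiKV by hand and watching its log, assuming the deploy directory used elsewhere in this thread:

    # On the problematic TiKV host (e.g. 10.12.5.214):
    cd /home/tidb/deploy/scripts && ./start_tikv.sh

    # Watch how far startup gets before it fails.
    tail -f /home/tidb/deploy/log/tikv.log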

All nodes are currently Up. Still, here is the error information from the node that had long been Disconnected/Down:

tikv_214.log (1.4 MB)

If all nodes are still Up after 12 hours, does that indicate the whole TiDB service is healthy? And on that basis, can I go ahead with the upgrade from tidb-ansible (v3.1) to tiup (v4.0)?

  1. The error shows that no leader could be elected ("peer is not leader for region 37, leader may None"). Quite a few stores were Down or Disconnected at the time, which can cause this.
  2. If the stores stay Up for a long time, please also check the TiKV logs for errors; if there are none, you can go ahead with the upgrade. Thanks.

The 214 node is Disconnected again. This time the error is: failed to handle split req

The detailed error log is as follows:

[2020/08/26 11:39:22.604 +00:00] [ERROR] [split_observer.rs:149] ["failed to handle split req"] [err="\"no valid key found for split.\""] [region_id=7069891]
[2020/08/26 11:39:22.604 +00:00] [WARN] [peer.rs:2131] ["skip proposal"] [err="Coprocessor(Other(\"[src/raftstore/coprocessor/split_observer.rs:154]: no valid key found for split.\"))"] [peer_id=17400787] [region_id=7069891]
[2020/08/26 11:39:22.604 +00:00] [WARN] [split_observer.rs:86] ["invalid key, skip"] [err="\"key 7480000000000001FFF15F728000000003FF7221D90000000000FA should be in (7480000000000001FFF15F728000000003FF7221D90000000000FA, 7480000000000001FFF15F728000000003FF72275C0000000000FA)\""] [index=0] [region_id=3354532]
[2020/08/26 11:39:22.605 +00:00] [ERROR] [split_observer.rs:149] ["failed to handle split req"] [err="\"no valid key found for split.\""] [region_id=3354532]
[2020/08/26 11:39:22.605 +00:00] [WARN] [peer.rs:2131] ["skip proposal"] [err="Coprocessor(Other(\"[src/raftstore/coprocessor/split_observer.rs:154]: no valid key found for split.\"))"] [peer_id=18521850] [region_id=3354532]
[2020/08/26 11:39:22.605 +00:00] [WARN] [split_observer.rs:86] ["invalid key, skip"] [err="\"key 7480000000000001FFF15F728000000001FFE9175B0000000000FA should be in (7480000000000001FFF15F728000000001FFE9175B0000000000FA, 7480000000000001FFF15F728000000001FFE9175C0000000000FA)\""] [index=0] [region_id=429447]
[2020/08/26 11:39:22.605 +00:00] [WARN] [split_observer.rs:86] ["invalid key, skip"] [err="\"key 7480000000000001FFF15F728000000001FFE9175B0000000000FA should be in (7480000000000001FFF15F728000000001FFE9175B0000000000FA, 7480000000000001FFF15F728000000001FFE9175C0000000000FA)\""] [index=1] [region_id=429447]
[2020/08/26 11:39:22.605 +00:00] [ERROR] [split_observer.rs:149] ["failed to handle split req"] [err="\"no valid key found for split.\""] [region_id=429447]
[2020/08/26 11:39:22.605 +00:00] [WARN] [peer.rs:2131] ["skip proposal"] [err="Coprocessor(Other(\"[src/raftstore/coprocessor/split_observer.rs:154]: no valid key found for split.\"))"] [peer_id=17551636] [region_id=429447]

new.log (478.2 KB)

"no valid key found for split" means the split failed because the given split key is not inside the region's key range, which is usually caused by PD dispatching duplicate split tasks. Is the cluster currently under heavy load?
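One way to see whether PD is indeed dispatching many split/balance operators is to list them with pd-ctl; a sketch using the same PD endpoint as above:

    # Operators currently pending or being applied.
    ./pd-ctl -u "http://10.12.5.113:2379" operator show

    # Schedulers that generate these operators.
    ./pd-ctl -u "http://10.12.5.113:2379" scheduler show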

We are bulk-loading data at the moment. The volume feels about the same as previous loads, which did not trigger this.

However, a short while after the load finished, the node state returned to normal. So can this kind of error be ignored?

If the application is not reporting errors, it can be ignored.

Hello. Database access is frequently failing with "tikv server timeout". Looking at the TiKV cluster status, the 214 node is Disconnected again, with the following log:

[2020/08/28 09:03:28.781 +00:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=10.12.5.115:2379]

[2020/08/28 09:03:28.781 +00:00] [INFO] [subchannel.cc:841] ["New connected subchannel at 0x7f3d4de20a00 for subchannel 0x7f3d4de21e00"]

[2020/08/28 09:03:28.783 +00:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=http://10.12.5.115:2379]

[2020/08/28 09:03:28.783 +00:00] [INFO] [subchannel.cc:841] ["New connected subchannel at 0x7f3d4de20b80 for subchannel 0x7f3d4de12a00"]

[2020/08/28 09:03:28.784 +00:00] [WARN] [client.rs:55] ["validate PD endpoints failed"] [err="Other(\"[src/pd/util.rs:462]: failed to connect to [name: \\\"pd_pd3\\\" member_id: 2579653654541892389 peer_urls: \\\"http://10.12.5.115:2380\\\" client_urls: \\\"http://10.12.5.115:2379\\\", name: \\\"pd_pd2\\\" member_id: 3717199249823848643 peer_urls: \\\"http://10.12.5.114:2380\\\" client_urls: \\\"http://10.12.5.114:2379\\\", name: \\\"pd_pd1\\\" member_id: 4691481983733508901 peer_urls: \\\"http://10.12.5.113:2380\\\" client_urls: \\\"http://10.12.5.113:2379\\\"]\")"]

Correspondingly, checking health with pd-ctl shows that the PD at 10.12.5.115 reports health = false.
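For reference, a sketch of the pd-ctl checks mentioned here, using the same PD endpoint as earlier in the thread:

    # Per-member health; the 10.12.5.115 member is the one reporting false.
    ./pd-ctl -u "http://10.12.5.113:2379" health

    # List the PD members and show the current leader.
    ./pd-ctl -u "http://10.12.5.113:2379" member
    ./pd-ctl -u "http://10.12.5.113:2379" member leader show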

Because the cluster is still unstable, we have not performed the upgrade yet.

The logs of the three PD nodes are attached:

pd_113_log.log (1.1 MB)
pd_114_log.log (1.2 MB)
pd_115_log.log (1.2 MB)

Got it, we will take a look.

One PD node's log contains "Got too many pings from the client, closing the connection". Please check the network from that PD to the other two nodes and the load on each PD. Also, the 214 TiKV is still Disconnected with the error "failed to connect to ", which looks like a problem reaching PD. If the cluster issue persists, please provide the PD & TiKV monitoring and logs, as follows:

(1) Install this Chrome extension: https://chrome.google.com/webstore/detail/full-page-screen-capture/fdpohaocaechififmbbbbbknoalclacl

(2) With the mouse focused on the Grafana dashboard, press ? to show all shortcuts; press d then E to expand all row panels, and wait for the page to finish loading.

(3) Use the full-page-screen-capture extension to capture and save the full-page screenshot.

PD monitoring information: