Recently the database stores keep going Disconnected or Down, and "no space" failures occur frequently

To help us respond faster, please provide the information below when asking a question; clearly described problems get priority.

  • 【TiDB version】: 3.1.0-beta.1
  • 【Problem description】:
    The database has recently been under heavy read/write load and performance has degraded. Operations fail with "tikv timeout" or "region is unavailable" errors.

The TiKV nodes are frequently unstable. For example, the current states are:

tidb@three:~/tidb-ansible/resources/bin$ ./pd-ctl -u "http://10.12.5.113:2379" store|grep state_name
        "state_name": "Up"
        "state_name": "Up"
        "state_name": "Disconnected"
        "state_name": "Up"
        "state_name": "Down"
        "state_name": "Disconnected"
        "state_name": "Disconnected"
        "state_name": "Disconnected"
        "state_name": "Up"
        "state_name": "Up"
        "state_name": "Disconnected"
        "state_name": "Down"
        "state_name": "Down"

tidb@three:~/tidb-ansible/resources/bin$ ./pd-ctl -u "http://10.12.5.113:2379" store

{
  "count": 13,
  "stores": [
    {
      "store": {
        "id": 335855,
        "address": "10.12.5.230:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "199GiB",
        "leader_count": 9390,
        "leader_weight": 1,
        "leader_score": 1043499,
        "leader_size": 1043499,
        "region_count": 17293,
        "region_weight": 1,
        "region_score": 940071725.5503502,
        "region_size": 1643391,
        "start_ts": "2020-08-23T02:10:16Z",
        "last_heartbeat_ts": "2020-08-23T02:46:07.227896007Z",
        "uptime": "35m51.227896007s"
      }
    },
    {
      "store": {
        "id": 484920,
        "address": "10.12.5.223:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "203.2GiB",
        "leader_count": 11427,
        "leader_weight": 1,
        "leader_score": 1213914,
        "leader_size": 1213914,
        "region_count": 18726,
        "region_weight": 1,
        "region_score": 914446997.4288011,
        "region_size": 1770229,
        "start_ts": "2020-08-23T02:09:57Z",
        "last_heartbeat_ts": "2020-08-23T02:46:21.554811523Z",
        "uptime": "36m24.554811523s"
      }
    },
    {
      "store": {
        "id": 17388737,
        "address": "10.12.5.214:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Disconnected"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "193.8GiB",
        "leader_count": 18,
        "leader_weight": 1,
        "leader_score": 1355,
        "leader_size": 1355,
        "region_count": 15416,
        "region_weight": 1,
        "region_score": 971808761.6416945,
        "region_size": 1261973,
        "start_ts": "2020-08-23T02:43:19Z",
        "last_heartbeat_ts": "2020-08-23T02:43:30.469837181Z",
        "uptime": "11.469837181s"
      }
    },
    {
      "store": {
        "id": 407938,
        "address": "10.12.5.221:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "428.9GiB",
        "leader_count": 2279,
        "leader_weight": 1,
        "leader_score": 1155710,
        "leader_size": 1155710,
        "region_count": 2634,
        "region_weight": 1,
        "region_score": 1182197,
        "region_size": 1182197,
        "sending_snap_count": 1,
        "start_ts": "2020-08-23T02:09:32Z",
        "last_heartbeat_ts": "2020-08-23T02:46:15.375248999Z",
        "uptime": "36m43.375248999s"
      }
    },
    {
      "store": {
        "id": 665678,
        "address": "127.0.0.1:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "leader_weight": 1,
        "region_weight": 1,
        "start_ts": "1970-01-01T00:00:00Z"
      }
    },
    {
      "store": {
        "id": 640552,
        "address": "10.12.5.224:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "201.1GiB",
        "leader_count": 9736,
        "leader_weight": 1,
        "leader_score": 1086020,
        "leader_size": 1086020,
        "region_count": 18895,
        "region_weight": 1,
        "region_score": 927204434.4555326,
        "region_size": 1657957,
        "start_ts": "2020-08-23T02:10:04Z",
        "last_heartbeat_ts": "2020-08-23T02:46:23.339818555Z",
        "uptime": "36m19.339818555s"
      }
    },
    {
      "store": {
        "id": 2026701,
        "address": "10.12.5.227:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Disconnected"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "198.6GiB",
        "leader_count": 2774,
        "leader_weight": 1,
        "leader_score": 292477,
        "leader_size": 292477,
        "region_count": 18189,
        "region_weight": 1,
        "region_score": 942237682.7523708,
        "region_size": 1502493,
        "start_ts": "2020-08-23T02:11:10Z",
        "last_heartbeat_ts": "2020-08-23T02:26:06.180979812Z",
        "uptime": "14m56.180979812s"
      }
    },
    {
      "store": {
        "id": 6506926,
        "address": "10.12.5.231:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Disconnected"
      },
      "status": {
        "capacity": "1007GiB",
        "available": "311.1GiB",
        "leader_count": 8196,
        "leader_weight": 1,
        "leader_score": 1000551,
        "leader_size": 1000551,
        "region_count": 10832,
        "region_weight": 1,
        "region_score": 489578759.9719987,
        "region_size": 2138334,
        "start_ts": "2020-08-23T02:09:53Z",
        "last_heartbeat_ts": "2020-08-23T02:24:24.524230709Z",
        "uptime": "14m31.524230709s"
      }
    },
    {
      "store": {
        "id": 10968962,
        "address": "10.12.5.233:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "199.6GiB",
        "leader_count": 6514,
        "leader_weight": 1,
        "leader_score": 745635,
        "leader_size": 745635,
        "region_count": 14893,
        "region_weight": 1,
        "region_score": 936286208.6025124,
        "region_size": 2306434,
        "start_ts": "2020-08-23T02:10:19Z",
        "last_heartbeat_ts": "2020-08-23T02:46:22.090152418Z",
        "uptime": "36m3.090152418s"
      }
    },
    {
      "store": {
        "id": 6506924,
        "address": "10.12.5.229:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "capacity": "1007GiB",
        "available": "223.2GiB",
        "leader_count": 2,
        "leader_weight": 1,
        "leader_score": 152,
        "leader_size": 152,
        "region_count": 20252,
        "region_weight": 1,
        "region_score": 957450920.5420499,
        "region_size": 734939,
        "start_ts": "2020-08-23T02:10:01Z",
        "last_heartbeat_ts": "2020-08-23T02:10:11.957212698Z",
        "uptime": "10.957212698s"
      }
    },
    {
      "store": {
        "id": 6506925,
        "address": "10.12.5.228:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "capacity": "1007GiB",
        "available": "222.5GiB",
        "leader_weight": 1,
        "region_count": 19097,
        "region_weight": 1,
        "region_score": 961158063.6551847,
        "region_size": 430576,
        "start_ts": "2020-08-23T02:09:49Z",
        "last_heartbeat_ts": "2020-08-23T02:09:59.935448794Z",
        "uptime": "10.935448794s"
      }
    },
    {
      "store": {
        "id": 407940,
        "address": "10.12.5.220:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "865.2GiB",
        "available": "197GiB",
        "leader_count": 14264,
        "leader_weight": 1,
        "leader_score": 1448249,
        "leader_size": 1448249,
        "region_count": 18524,
        "region_weight": 1,
        "region_score": 925391192.2735996,
        "region_size": 1906767,
        "start_ts": "2020-08-23T02:09:39Z",
        "last_heartbeat_ts": "2020-08-23T02:46:20.320183413Z",
        "uptime": "36m41.320183413s"
      }
    },
    {
      "store": {
        "id": 1597655,
        "address": "10.12.5.226:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Disconnected"
      },
      "status": {
        "capacity": "835.7GiB",
        "available": "188.7GiB",
        "leader_count": 4064,
        "leader_weight": 1,
        "leader_score": 400408,
        "leader_size": 400408,
        "region_count": 13935,
        "region_weight": 1,
        "region_score": 935338040.0638361,
        "region_size": 1155387,
        "start_ts": "2020-08-23T02:10:41Z",
        "last_heartbeat_ts": "2020-08-23T02:21:35.213273663Z",
        "uptime": "10m54.213273663s"
      }
    }
  ]
}

In addition, "no space" errors occur frequently even though the disk actually has more than 100 GB free. The usual workaround is to clear a few GB of logs and delete the last_tikv.toml file, after which df -h on that disk shows the free space is indeed there.
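A minimal sketch of the checks behind this workaround, assuming the default tidb-ansible layout under /home/tidb/deploy (paths and log-name patterns are assumptions; adjust to your deployment):

    # Compare real disk usage with what TiKV reports (paths are assumptions).
    df -h /home/tidb/deploy/data
    du -sh /home/tidb/deploy/* | sort -h

    # The workaround described above: locate the rotated TiKV logs (a few GB)
    # and the last_tikv.toml file before clearing them and restarting the store.
    find /home/tidb/deploy -maxdepth 2 \( -name 'tikv.log.*' -o -name 'last_tikv.toml' \)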

Given the above, I would like to know what is causing the current poor performance and how to fix it.

For performance tuning or troubleshooting questions, please download and run the diagnostic script, then select all of the terminal output and paste it when you upload.

  1. Why are you running the 3.1 beta? We recommend upgrading to the 4.0 GA release.
  2. For the Down and Disconnected stores, check the TiKV logs for the errors reported at that time (see the sketch below).
  3. Check the monitoring during periods of high read/write traffic to see whether CPU, memory, or IO is being exhausted.
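For points 2 and 3, a rough sketch of the checks; the log path follows the default tidb-ansible layout and the date filter is only an example:

    # Point 2: errors in the TiKV log around the time a store went Down/Disconnected.
    grep -E '\[ERROR\]|\[FATAL\]' /home/tidb/deploy/log/tikv.log | grep '2020/08/23'

    # Point 3: CPU, memory, and disk IO pressure during heavy read/write.
    top -b -n 1 | head -n 20
    free -h
    iostat -x 1 5    # from the sysstat package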

Hello. While planning the 3.1 --> 4.0 upgrade, I tried to migrate from tidb-ansible to tiup and ran into a problem:

The documentation says this should be done while all components are healthy. One TiKV node is currently stuck in Disconnected state; can the upgrade proceed in this situation?

tikv.log from the Disconnected node:

[2020/08/24 11:12:29.799 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 17188974, leader may None\" not_leader { region_id: 17188974 }"]

[2020/08/24 11:12:30.321 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 17188974, leader may None\" not_leader { region_id: 17188974 }"]

[2020/08/24 11:12:30.822 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 17188974, leader may None\" not_leader { region_id: 17188974 }"]

[2020/08/24 11:12:31.331 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 17188974, leader may None\" not_leader { region_id: 17188974 }"]

[2020/08/24 11:12:31.831 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 17188974, leader may None\" not_leader { region_id: 17188974 }"]

[2020/08/24 11:12:32.343 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 17188974, leader may None\" not_leader { region_id: 17188974 }"]

[2020/08/24 11:12:32.844 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 17188974, leader may None\" not_leader { region_id: 17188974 }"]

[2020/08/24 11:12:33.351 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 17188974, leader may None\" not_leader { region_id: 17188974 }"]

[2020/08/24 11:12:33.852 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 17188974, leader may None\" not_leader { region_id: 17188974 }"]

[2020/08/24 11:12:34.354 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 17188974, leader may Some(id: 18696000 store_id: 1597655)\" not_leader { region_id: 17188974 leader { id: 18696000 store_id: 1597655 } }"]



tidb@three:~/tidb-ansible/resources/bin$ ./pd-ctl -u "http://10.12.5.113:2379"  region 17188974
    {
      "id": 17188974,
      "start_key": "7480000000000000FF1500000000000000F8",
      "end_key": "7480000000000000FF155F698000000000FF0000020380000000FF0000D93A00000000FB",
      "epoch": {
        "conf_ver": 125,
        "version": 12
      },
      "peers": [
        {
          "id": 18695900,
          "store_id": 17388737
        },
        {
          "id": 18696000,
          "store_id": 1597655
        }
      ],
      "leader": {
        "id": 18696000,
        "store_id": 1597655
      },
      "approximate_size": 58,
      "approximate_keys": 997151
    }

From the store information you posted, several stores are Disconnected or Down, so some regions may not have enough replicas to serve. Could you first try to start the Down and Disconnected stores?

Hi, I tried starting the Down and Disconnected stores with ansible-playbook start.yml -l [ip], but after waiting a long time they are still in the abnormal states shown above.

TiKV log output:

[2020/08/24 14:25:32.497 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 37, leader may None\" not_leader { region_id: 37 }"]
[2020/08/24 14:25:33.503 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 37, leader may None\" not_leader { region_id: 37 }"]
[2020/08/24 14:25:34.514 +00:00] [WARN] [endpoint.rs:454] [error-response] [err="region message: \"peer is not leader for region 37, leader may None\" not_leader { region_id: 37 }"]
[2020/08/24 14:25:34.970 +00:00] [INFO] [peer.rs:724] ["failed to schedule peer tick"] [err="sending on a disconnected channel"] [tick=RAFT] [peer_id=19924898] [region_id=19924897]
[2020/08/24 14:25:39.875 +00:00] [INFO] [peer.rs:724] ["failed to schedule peer tick"] [err="sending on a disconnected channel"] [tick=RAFT_LOG_GC] [peer_id=19924898] [region_id=19924897]
[2020/08/24 14:25:39.875 +00:00] [INFO] [peer.rs:724] ["failed to schedule peer tick"] [err="sending on a disconnected channel"] [tick=SPLIT_REGION_CHECK] [peer_id=19924898] [region_id=19924897]
[2020/08/24 14:25:44.700 +00:00] [INFO] [peer.rs:724] ["failed to schedule peer tick"] [err="sending on a disconnected channel"] [tick=CHECK_MERGE] [peer_id=19924898] [region_id=19924897]

Your cluster has 3 Down stores, 5 Disconnected stores, and 5 Up stores. With 3 replicas, losing 2 nodes already affects some regions; with 8 problematic stores, a lot of data becomes inaccessible. If they cannot be started at all, refer to https://docs.pingcap.com/zh/tidb/stable/tikv-control#tikv-control-使用说明 and other threads on how to remove the regions that cannot be recovered. Since more than 2 stores are involved, some data loss is possible (a sketch of the tikv-ctl invocation follows the store list below).

    "state_name": "Up"
    "state_name": "Up"
    "state_name": "Disconnected"
    "state_name": "Up"
    "state_name": "Down"
    "state_name": "Disconnected"
    "state_name": "Disconnected"
    "state_name": "Disconnected"
    "state_name": "Up"
    "state_name": "Up"
    "state_name": "Disconnected"
    "state_name": "Down"
    "state_name": "Down

Currently only one node is Down. Can I just delete the regions on that node?
Status of all TiKV nodes:

    "state_name": "Up"
    "state_name": "Up"
    "state_name": "Up"
    "state_name": "Down"
    "state_name": "Up"
    "state_name": "Up"
    "state_name": "Up"
    "state_name": "Up"
    "state_name": "Up"
    "state_name": "Up"
    "state_name": "Up"
    "state_name": "Up"

From the region information on that node, the problematic region has peers on two stores:
    "peers": [
      {
        "id": 18695900,
        "store_id": 17388737
      },
      {
        "id": 18696000,
        "store_id": 1597655
      }
    ],

Also, checking the region status on the failed (Down) node:

tidb@thirteen:~$ ./tikv-ctl --db /home/tidb/deploy/data/db bad-regions
all regions are healthy

If only one node is Down, you can try to scale it in. However, the stores above show only about 199 GiB available each, so please estimate whether the remaining capacity would be enough; we suggest scaling out a new store first and then scaling in the Down one.
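A sketch of the scale-out-then-scale-in flow under tidb-ansible 3.x; <new_tikv_ip>, <down_tikv_ip> and <store_id> are placeholders, and the exact steps should be checked against the scale-out/scale-in documentation for your version:

    # 1. Scale out: add the new TiKV host to inventory.ini, then deploy and start only it.
    ansible-playbook bootstrap.yml -l <new_tikv_ip>
    ansible-playbook deploy.yml -l <new_tikv_ip>
    ansible-playbook start.yml -l <new_tikv_ip>

    # 2. Scale in: ask PD to evict the Down store and wait for it to become Tombstone.
    ./pd-ctl -u "http://10.12.5.113:2379" store delete <store_id>
    ./pd-ctl -u "http://10.12.5.113:2379" store <store_id>

    # 3. Once Tombstone, stop the retired node and remove it from inventory.ini.
    ansible-playbook stop.yml -l <down_tikv_ip>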

I tried restarting the services with ansible-playbook stop.yml / start.yml, and a large number of TiKV nodes ended up Down or Disconnected. After waiting several hours, the current state is:

tidb@three:~/tidb-ansible/resources/bin$ ./pd-ctl -u "http://10.12.5.113:2379" store | grep state_name -C 4
      "store": {
        "id": 2026701,
        "address": "10.12.5.227:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "207.1GiB",
--
      "store": {
        "id": 6506924,
        "address": "10.12.5.229:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "capacity": "1007GiB",
        "available": "234.1GiB",
--
      "store": {
        "id": 6506926,
        "address": "10.12.5.231:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "capacity": "1007GiB",
        "available": "234.1GiB",
--
      "store": {
        "id": 407938,
        "address": "10.12.5.221:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "204.8GiB",
--
      "store": {
        "id": 407940,
        "address": "10.12.5.220:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "865.2GiB",
        "available": "201.8GiB",
--
      "store": {
        "id": 484920,
        "address": "10.12.5.223:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Disconnected"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "210.7GiB",
--
      "store": {
        "id": 17388737,
        "address": "10.12.5.214:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "196.8GiB",
--
      "store": {
        "id": 665678,
        "address": "127.0.0.1:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "leader_weight": 1,
        "region_weight": 1,
--
      "store": {
        "id": 640552,
        "address": "10.12.5.224:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "206.7GiB",
--
      "store": {
        "id": 1597655,
        "address": "10.12.5.226:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "835.7GiB",
        "available": "194.4GiB",
--
      "store": {
        "id": 335855,
        "address": "10.12.5.230:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "208.5GiB",
--
      "store": {
        "id": 6506925,
        "address": "10.12.5.228:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "capacity": "1007GiB",
        "available": "235GiB",
--
      "store": {
        "id": 10968962,
        "address": "10.12.5.233:20160",
        "version": "3.1.0-beta.1",
        "state_name": "Up"
      },
      "status": {
        "capacity": "884.9GiB",
        "available": "205.9GiB",

  1. For "peer is not leader", see this document:

https://docs.pingcap.com/zh/tidb/stable/tidb-troubleshooting-map#44-某些-tikv-大量掉-leader

  2. You reported that the stores are sometimes fine and sometimes all Down. Try pulling up a single node by hand with the start_tikv.sh script under the deploy/scripts/ directory on the TiKV host and see whether it starts (see the sketch below).

  3. Please upload the TiKV log from the problematic node covering the period from startup to failure. Thanks.
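For point 2, a minimal sketch of pulling up one TiKV by hand and watching its log, assuming the deploy directory used elsewhere in this thread:

    # On the problematic TiKV host (e.g. 10.12.5.214):
    cd /home/tidb/deploy/scripts && ./start_tikv.sh

    # Watch how far startup gets before it fails.
    tail -f /home/tidb/deploy/log/tikv.log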

All nodes are currently Up. Still, here is the error information from the node that had long been Disconnected/Down:

tikv_214.log (1.4 MB)

If all nodes are still Up after 12 hours, does that indicate the whole TiDB service is healthy? And on that basis, can I go ahead with the upgrade from tidb-ansible (v3.1) to tiup (v4.0)?

  1. The error shows that no leader could be elected ("peer is not leader for region 37, leader may None"). Quite a few stores were Down or Disconnected at the time, which can cause this.
  2. If the stores stay Up for a long time, please also check the TiKV logs for errors; if there are none, you can go ahead with the upgrade. Thanks.

The 214 node is Disconnected again. This time the error is: failed to handle split req

The detailed error log is as follows:

[2020/08/26 11:39:22.604 +00:00] [ERROR] [split_observer.rs:149] ["failed to handle split req"] [err="\"no valid key found for split.\""] [region_id=7069891]
[2020/08/26 11:39:22.604 +00:00] [WARN] [peer.rs:2131] ["skip proposal"] [err="Coprocessor(Other(\"[src/raftstore/coprocessor/split_observer.rs:154]: no valid key found for split.\"))"] [peer_id=17400787] [region_id=7069891]
[2020/08/26 11:39:22.604 +00:00] [WARN] [split_observer.rs:86] ["invalid key, skip"] [err="\"key 7480000000000001FFF15F728000000003FF7221D90000000000FA should be in (7480000000000001FFF15F728000000003FF7221D90000000000FA, 7480000000000001FFF15F728000000003FF72275C0000000000FA)\""] [index=0] [region_id=3354532]
[2020/08/26 11:39:22.605 +00:00] [ERROR] [split_observer.rs:149] ["failed to handle split req"] [err="\"no valid key found for split.\""] [region_id=3354532]
[2020/08/26 11:39:22.605 +00:00] [WARN] [peer.rs:2131] ["skip proposal"] [err="Coprocessor(Other(\"[src/raftstore/coprocessor/split_observer.rs:154]: no valid key found for split.\"))"] [peer_id=18521850] [region_id=3354532]
[2020/08/26 11:39:22.605 +00:00] [WARN] [split_observer.rs:86] ["invalid key, skip"] [err="\"key 7480000000000001FFF15F728000000001FFE9175B0000000000FA should be in (7480000000000001FFF15F728000000001FFE9175B0000000000FA, 7480000000000001FFF15F728000000001FFE9175C0000000000FA)\""] [index=0] [region_id=429447]
[2020/08/26 11:39:22.605 +00:00] [WARN] [split_observer.rs:86] ["invalid key, skip"] [err="\"key 7480000000000001FFF15F728000000001FFE9175B0000000000FA should be in (7480000000000001FFF15F728000000001FFE9175B0000000000FA, 7480000000000001FFF15F728000000001FFE9175C0000000000FA)\""] [index=1] [region_id=429447]
[2020/08/26 11:39:22.605 +00:00] [ERROR] [split_observer.rs:149] ["failed to handle split req"] [err="\"no valid key found for split.\""] [region_id=429447]
[2020/08/26 11:39:22.605 +00:00] [WARN] [peer.rs:2131] ["skip proposal"] [err="Coprocessor(Other(\"[src/raftstore/coprocessor/split_observer.rs:154]: no valid key found for split.\"))"] [peer_id=17551636] [region_id=429447]

new.log (478.2 KB)

"no valid key found for split" means the split failed because the given split key is not inside the region's key range, which is usually caused by PD dispatching duplicate split tasks. Is the cluster currently under heavy load?
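One way to see whether PD is indeed dispatching many split/balance operators is to list them with pd-ctl; a sketch using the same PD endpoint as above:

    # Operators currently pending or being applied.
    ./pd-ctl -u "http://10.12.5.113:2379" operator show

    # Schedulers that generate these operators.
    ./pd-ctl -u "http://10.12.5.113:2379" scheduler show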

We are bulk-loading data at the moment. The volume feels about the same as previous loads, which did not trigger this.

However, a short while after the load finished, the node state returned to normal. So can this kind of error be ignored?

If the application is not reporting errors, it can be ignored.

Hello. Database access is frequently failing with "tikv server timeout". Looking at the TiKV cluster status, the 214 node is Disconnected again, with the following log:

[2020/08/28 09:03:28.781 +00:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=10.12.5.115:2379]

[2020/08/28 09:03:28.781 +00:00] [INFO] [subchannel.cc:841] ["New connected subchannel at 0x7f3d4de20a00 for subchannel 0x7f3d4de21e00"]

[2020/08/28 09:03:28.783 +00:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=http://10.12.5.115:2379]

[2020/08/28 09:03:28.783 +00:00] [INFO] [subchannel.cc:841] ["New connected subchannel at 0x7f3d4de20b80 for subchannel 0x7f3d4de12a00"]

[2020/08/28 09:03:28.784 +00:00] [WARN] [client.rs:55] ["validate PD endpoints failed"] [err="Other(\"[src/pd/util.rs:462]: failed to connect to [name: \\\"pd_pd3\\\" member_id: 2579653654541892389 peer_urls: \\\"http://10.12.5.115:2380\\\" client_urls: \\\"http://10.12.5.115:2379\\\", name: \\\"pd_pd2\\\" member_id: 3717199249823848643 peer_urls: \\\"http://10.12.5.114:2380\\\" client_urls: \\\"http://10.12.5.114:2379\\\", name: \\\"pd_pd1\\\" member_id: 4691481983733508901 peer_urls: \\\"http://10.12.5.113:2380\\\" client_urls: \\\"http://10.12.5.113:2379\\\"]\")"]

Correspondingly, checking health with pd-ctl shows that the PD at 10.12.5.115 reports health = false.
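For reference, a sketch of the pd-ctl checks mentioned here, using the same PD endpoint as earlier in the thread:

    # Per-member health; the 10.12.5.115 member is the one reporting false.
    ./pd-ctl -u "http://10.12.5.113:2379" health

    # List the PD members and show the current leader.
    ./pd-ctl -u "http://10.12.5.113:2379" member
    ./pd-ctl -u "http://10.12.5.113:2379" member leader show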

Because the cluster is still unstable, we have not performed the upgrade yet.

The logs of the three PD nodes are attached:

pd_113_log.log (1.1 MB)
pd_114_log.log (1.2 MB)
pd_115_log.log (1.2 MB)

Got it, we will take a look.

One PD node's log contains "Got too many pings from the client, closing the connection". Please check the network from that PD to the other two nodes and the load on each PD. Also, the 214 TiKV is still Disconnected with the error "failed to connect to ", which looks like a problem reaching PD. If the cluster issue persists, please provide the PD & TiKV monitoring and logs, as follows:

(1) Install this Chrome extension: https://chrome.google.com/webstore/detail/full-page-screen-capture/fdpohaocaechififmbbbbbknoalclacl

(2) With the mouse focused on the Grafana dashboard, press ? to show all shortcuts; press d then E to expand all row panels, and wait for the page to finish loading.

(3) Use the full-page-screen-capture extension to capture and save the full-page screenshot.

PD monitoring information: