tikv down

【TiDB 版本】
v4.0.4
【问题描述】
上周集群状态正常,周一来查看集群状态,发现tikv3个store down(3副本,4node)


image
查看tikv pod日志没发现error明显错误,截图如下,大佬帮分析下

pod状态,tikv pod状态正常。tiflash 重启21次可忽略,问题已解决。

1 个赞

pd-ctl 执行下 store 命令看下结果
检查一下 tikv pod 与 pd pod 之间的网络情况

pd-ctl store结果如下:
/ # ./pd-ctl store
{
“count”: 7,
“stores”: [
{
“store”: {
“id”: 12,
“address”: “tidb-test-tikv-2.tidb-test-tikv-peer.default.svc:20160”,
“labels”: [
{
“key”: “host”,
“value”: “mgmt12.hh-b.brainpp.cn
}
],
“version”: “4.0.4”,
“status_address”: “tidb-test-tikv-2.tidb-test-tikv-peer.default.svc:20180”,
“git_hash”: “28e3d44b00700137de4fa933066ab83e5f8306cf”,
“start_timestamp”: 1618997318,
“deploy_path”: “/”,
“last_heartbeat”: 1621204797467564724,
“state_name”: “Down”
},
“status”: {
“capacity”: “600GiB”,
“available”: “268.7GiB”,
“used_size”: “309.3GiB”,
“leader_count”: 12657,
“leader_weight”: 1,
“leader_score”: 12657,
“leader_size”: 912919,
“region_count”: 13355,
“region_weight”: 1,
“region_score”: 957026,
“region_size”: 957026,
“start_ts”: “2021-04-21T09:28:38Z”,
“last_heartbeat_ts”: “2021-05-16T22:39:57.467564724Z”,
“uptime”: “613h11m19.467564724s”
}
},
{
“store”: {
“id”: 45,
“address”: “tidb-test-tikv-1.tidb-test-tikv-peer.default.svc:20160”,
“labels”: [
{
“key”: “host”,
“value”: “mgmt11.hh-b.brainpp.cn
}
],
“version”: “4.0.4”,
“status_address”: “tidb-test-tikv-1.tidb-test-tikv-peer.default.svc:20180”,
“git_hash”: “28e3d44b00700137de4fa933066ab83e5f8306cf”,
“start_timestamp”: 1618997318,
“deploy_path”: “/”,
“last_heartbeat”: 1621006849391385689,
“state_name”: “Down”
},
“status”: {
“capacity”: “600GiB”,
“available”: “69.3GiB”,
“used_size”: “530.7GiB”,
“leader_count”: 5,
“leader_weight”: 1,
“leader_score”: 5,
“leader_size”: 346,
“region_count”: 26778,
“region_weight”: 1,
“region_score”: 990616004.5679679,
“region_size”: 1922093,
“start_ts”: “2021-04-21T09:28:38Z”,
“last_heartbeat_ts”: “2021-05-14T15:40:49.391385689Z”,
“uptime”: “558h12m11.391385689s”
}
},
{
“store”: {
“id”: 141064,
“address”: “tidb-test-tikv-3.tidb-test-tikv-peer.default.svc:20160”,
“labels”: [
{
“key”: “host”,
“value”: “mgmt12.hh-b.brainpp.cn
}
],
“version”: “4.0.4”,
“status_address”: “tidb-test-tikv-3.tidb-test-tikv-peer.default.svc:20180”,
“git_hash”: “28e3d44b00700137de4fa933066ab83e5f8306cf”,
“start_timestamp”: 1618997316,
“deploy_path”: “/”,
“last_heartbeat”: 1621230192889010374,
“state_name”: “Up”
},
“status”: {
“capacity”: “600GiB”,
“available”: “285.7GiB”,
“used_size”: “267.3GiB”,
“leader_count”: 10397,
“leader_weight”: 1,
“leader_score”: 10397,
“leader_size”: 742590,
“region_count”: 13626,
“region_weight”: 1,
“region_score”: 971118,
“region_size”: 971118,
“start_ts”: “2021-04-21T09:28:36Z”,
“last_heartbeat_ts”: “2021-05-17T05:43:12.889010374Z”,
“uptime”: “620h14m36.889010374s”
}
},
{
“store”: {
“id”: 24156011,
“address”: “tidb-test-tiflash-1.tidb-test-tiflash-peer.default.svc:3930”,
“labels”: [
{
“key”: “engine”,
“value”: “tiflash”
},
{
“key”: “host”,
“value”: “mgmt13.hh-b.brainpp.cn
}
],
“version”: “v4.0.4”,
“peer_address”: “tidb-test-tiflash-1.tidb-test-tiflash-peer.default.svc:20170”,
“status_address”: “tidb-test-tiflash-1.tidb-test-tiflash-peer.default.svc:20292”,
“git_hash”: “bfa9128f59cf800e129152f06b12480ad78adafd”,
“start_timestamp”: 1620903735,
“deploy_path”: “/tiflash”,
“last_heartbeat”: 1621230194163462734,
“state_name”: “Up”
},
“status”: {
“capacity”: “918.7GiB”,
“available”: “182.8GiB”,
“used_size”: “41.96GiB”,
“leader_count”: 0,
“leader_weight”: 1,
“leader_score”: 0,
“leader_size”: 0,
“region_count”: 2167,
“region_weight”: 1,
“region_score”: 542076607.5407978,
“region_size”: 179462,
“start_ts”: “2021-05-13T11:02:15Z”,
“last_heartbeat_ts”: “2021-05-17T05:43:14.163462734Z”,
“uptime”: “90h40m59.163462734s”
}
},
{
“store”: {
“id”: 24156012,
“address”: “tidb-test-tiflash-0.tidb-test-tiflash-peer.default.svc:3930”,
“labels”: [
{
“key”: “engine”,
“value”: “tiflash”
},
{
“key”: “host”,
“value”: “mgmt11.hh-b.brainpp.cn
}
],
“version”: “v4.0.4”,
“peer_address”: “tidb-test-tiflash-0.tidb-test-tiflash-peer.default.svc:20170”,
“status_address”: “tidb-test-tiflash-0.tidb-test-tiflash-peer.default.svc:20292”,
“git_hash”: “bfa9128f59cf800e129152f06b12480ad78adafd”,
“start_timestamp”: 1620903735,
“deploy_path”: “/tiflash”,
“last_heartbeat”: 1621230194884411819,
“state_name”: “Up”
},
“status”: {
“capacity”: “918.7GiB”,
“available”: “310.6GiB”,
“used_size”: “52.92GiB”,
“leader_count”: 0,
“leader_weight”: 1,
“leader_score”: 0,
“leader_size”: 0,
“region_count”: 2786,
“region_weight”: 1,
“region_score”: 231135,
“region_size”: 231135,
“start_ts”: “2021-05-13T11:02:15Z”,
“last_heartbeat_ts”: “2021-05-17T05:43:14.884411819Z”,
“uptime”: “90h40m59.884411819s”
}
},
{
“store”: {
“id”: 24156013,
“address”: “tidb-test-tiflash-2.tidb-test-tiflash-peer.default.svc:3930”,
“state”: 1,
“labels”: [
{
“key”: “engine”,
“value”: “tiflash”
},
{
“key”: “host”,
“value”: “mgmt12.hh-b.brainpp.cn
}
],
“version”: “v4.0.4”,
“peer_address”: “tidb-test-tiflash-2.tidb-test-tiflash-peer.default.svc:20170”,
“status_address”: “tidb-test-tiflash-2.tidb-test-tiflash-peer.default.svc:20292”,
“git_hash”: “bfa9128f59cf800e129152f06b12480ad78adafd”,
“start_timestamp”: 1620903735,
“deploy_path”: “/tiflash”,
“last_heartbeat”: 1621230194128329418,
“state_name”: “Offline”
},
“status”: {
“capacity”: “918.7GiB”,
“available”: “285.7GiB”,
“used_size”: “28.74GiB”,
“leader_count”: 0,
“leader_weight”: 1,
“leader_score”: 0,
“leader_size”: 0,
“region_count”: 1438,
“region_weight”: 1,
“region_score”: 120494,
“region_size”: 120494,
“start_ts”: “2021-05-13T11:02:15Z”,
“last_heartbeat_ts”: “2021-05-17T05:43:14.128329418Z”,
“uptime”: “90h40m59.128329418s”
}
},
{
“store”: {
“id”: 1,
“address”: “tidb-test-tikv-0.tidb-test-tikv-peer.default.svc:20160”,
“labels”: [
{
“key”: “host”,
“value”: “mgmt13.hh-b.brainpp.cn
}
],
“version”: “4.0.4”,
“status_address”: “tidb-test-tikv-0.tidb-test-tikv-peer.default.svc:20180”,
“git_hash”: “28e3d44b00700137de4fa933066ab83e5f8306cf”,
“start_timestamp”: 1618997317,
“deploy_path”: “/”,
“last_heartbeat”: 1621090849017378690,
“state_name”: “Down”
},
“status”: {
“capacity”: “600GiB”,
“available”: “70.13GiB”,
“used_size”: “529.9GiB”,
“leader_count”: 3675,
“leader_weight”: 1,
“leader_score”: 3675,
“leader_size”: 267232,
“region_count”: 26749,
“region_weight”: 1,
“region_score”: 983161298.89606,
“region_size”: 1918901,
“start_ts”: “2021-04-21T09:28:37Z”,
“last_heartbeat_ts”: “2021-05-15T15:00:49.01737869Z”,
“uptime”: “581h32m12.01737869s”
}
}
]
}

1 个赞

pd-2是主和tikv-0在同物理机上,也down了,查看网络正常。各节点网络正常,能相互平通

业务访问集群有没有什么报错信息或者异常的?

1 个赞

业务已经不查询了,原因是olap场景,磁盘满tikv扩容后, tiflash 一直不可用。周一来发现tikv 又跪了,能否从数据库层面定位问题呢,感觉tikv down了,也没发现error类日志打印,问题定位不明确,初学tidb,请大佬帮分析

先简单通过 client 连接下数据库,使用下数据库看是否有问题
因为你这个只是 tikv 的状态显示为 down 了,看进程应该是没有问题的,因为 pod 没有一直重启。

需要先确认是状态显示异常的问题还是 tikv 进程本身就有问题。

连接数据库正常,能查出部分表数据,但是某些表select查询时报错,报错信息是ERROR 9001 (HY000): PD server timeout,麻烦您抽丝剥茧,再帮分析下

从现象上来和这个一样 使用tidb官方工具:mydumper导出,loader导入工具,进行导入数据时,报错如下: **Error 9001: PD server timeout** - #13,来自 vcdog ,检查kv pd db版本是一致的,
client 查询部分表数据报错如下(其它部分表查询正常):
image
查看tidb日志 error日志如下:


pd 狂刷日志如下:


这个集群之前有做过缩容下线 TiKV 节点或者 TiFlash 节点的操作么?或者通过 pd-recover 恢复 PD 的操作?

pd 日志中刷的是正常调度 region 时产生的日志,可以忽略

因存储满,tikv做过扩容,没做缩容,增加了一台服务器 。目前是3副本,4 nodes 。tiflash available 为0 做过缩容节点删除操作,重新添加节点操作。

能否尝试重启一下集群然后在执行下 SQL 看下是否还有 PD server timeout 的问题?
从日志看,PD server timeout 是因为,想要访问 store id 571705 这个节点的信息,但是 PD 中并没有这个节点的信息,导致访问 pd 的 rpc 请求一直在重启,最后达到最大的重试超时时间,报了 PD server timeout 的错误。

至于 tidb 为什么要访问 store id 为 571705 的节点信息,怀疑是 region cache 引起的,重启集群的话,可以让 tidb 节点重新刷新 region cache 。

重启所有pod后 查询该表还是报错:
ERROR 9001 (HY000): PD server timeout

在 PD 日志中 grep 一下 571705 这个 store id 看下,看能不能看到历史的记录,确认一下是 tikv 节点还是 tiflash 节点

grep 3个pd pod没有发现这个571705 ,pod重启多次,log已丢失很多。可以删除所有 tiflash (replica 0)来验证

当前store情况:
/ # ./pd-ctl store
{
“count”: 4,
“stores”: [
{
“store”: {
“id”: 141064,
“address”: “tidb-test-tikv-3.tidb-test-tikv-peer.default.svc:20160”,
“labels”: [
{
“key”: “host”,
“value”: “mgmt12.hh-b.brainpp.cn
}
],
“version”: “4.0.4”,
“status_address”: “tidb-test-tikv-3.tidb-test-tikv-peer.default.svc:20180”,
“git_hash”: “28e3d44b00700137de4fa933066ab83e5f8306cf”,
“start_timestamp”: 1621411882,
“deploy_path”: “/”,
“last_heartbeat”: 1621427225815682765,
“state_name”: “Up”
},
“status”: {
“capacity”: “650GiB”,
“available”: “314.4GiB”,
“used_size”: “266.9GiB”,
“leader_count”: 9540,
“leader_weight”: 1,
“leader_score”: 9540,
“leader_size”: 681245,
“region_count”: 13620,
“region_weight”: 1,
“region_score”: 972253,
“region_size”: 972253,
“start_ts”: “2021-05-19T08:11:22Z”,
“last_heartbeat_ts”: “2021-05-19T12:27:05.815682765Z”,
“uptime”: “4h15m43.815682765s”
}
},
{
“store”: {
“id”: 1,
“address”: “tidb-test-tikv-0.tidb-test-tikv-peer.default.svc:20160”,
“labels”: [
{
“key”: “host”,
“value”: “mgmt13.hh-b.brainpp.cn
}
],
“version”: “4.0.4”,
“status_address”: “tidb-test-tikv-0.tidb-test-tikv-peer.default.svc:20180”,
“git_hash”: “28e3d44b00700137de4fa933066ab83e5f8306cf”,
“start_timestamp”: 1621411722,
“deploy_path”: “/”,
“last_heartbeat”: 1621427218230705969,
“state_name”: “Up”
},
“status”: {
“capacity”: “650GiB”,
“available”: “120.3GiB”,
“used_size”: “529.7GiB”,
“leader_count”: 5260,
“leader_weight”: 1,
“leader_score”: 5260,
“leader_size”: 387512,
“region_count”: 26804,
“region_weight”: 1,
“region_score”: 617922067.8012838,
“region_size”: 1938099,
“start_ts”: “2021-05-19T08:08:42Z”,
“last_heartbeat_ts”: “2021-05-19T12:26:58.230705969Z”,
“uptime”: “4h18m16.230705969s”
}
},
{
“store”: {
“id”: 12,
“address”: “tidb-test-tikv-2.tidb-test-tikv-peer.default.svc:20160”,
“labels”: [
{
“key”: “host”,
“value”: “mgmt12.hh-b.brainpp.cn
}
],
“version”: “4.0.4”,
“status_address”: “tidb-test-tikv-2.tidb-test-tikv-peer.default.svc:20180”,
“git_hash”: “28e3d44b00700137de4fa933066ab83e5f8306cf”,
“start_timestamp”: 1621411841,
“deploy_path”: “/”,
“last_heartbeat”: 1621427231174412304,
“state_name”: “Up”
},
“status”: {
“capacity”: “650GiB”,
“available”: “314.4GiB”,
“used_size”: “307.3GiB”,
“leader_count”: 6424,
“leader_weight”: 1,
“leader_score”: 6424,
“leader_size”: 463125,
“region_count”: 13295,
“region_weight”: 1,
“region_score”: 966252,
“region_size”: 966252,
“start_ts”: “2021-05-19T08:10:41Z”,
“last_heartbeat_ts”: “2021-05-19T12:27:11.174412304Z”,
“uptime”: “4h16m30.174412304s”
}
},
{
“store”: {
“id”: 45,
“address”: “tidb-test-tikv-1.tidb-test-tikv-peer.default.svc:20160”,
“labels”: [
{
“key”: “host”,
“value”: “mgmt11.hh-b.brainpp.cn
}
],
“version”: “4.0.4”,
“status_address”: “tidb-test-tikv-1.tidb-test-tikv-peer.default.svc:20180”,
“git_hash”: “28e3d44b00700137de4fa933066ab83e5f8306cf”,
“start_timestamp”: 1621411842,
“deploy_path”: “/”,
“last_heartbeat”: 1621417218284338225,
“state_name”: “Down”
},
“status”: {
“capacity”: “650GiB”,
“available”: “119.4GiB”,
“used_size”: “530.6GiB”,
“leader_count”: 5515,
“leader_weight”: 1,
“leader_score”: 5515,
“leader_size”: 406340,
“region_count”: 26786,
“region_weight”: 1,
“region_score”: 625211696.157073,
“region_size”: 1937939,
“start_ts”: “2021-05-19T08:10:42Z”,
“last_heartbeat_ts”: “2021-05-19T09:40:18.284338225Z”,
“uptime”: “1h29m36.284338225s”
}
}
]
}
当前pod情况


重试报错如下:

1 个赞

上传一下完整的 tidb.log 日志文件看下

两个pod 的tidb log,包括昨天重启以后的所有日志。请帮分析
tidb.log (6.3 MB)

tikv 隔断时间会离线,重启pod后一段时间内会在线
image

重启pod之后,dashboard显示tikv正常。
image
image