TiKV node is down and keeps OOMing on restart: how to recover properly

tidb 5.0.3

Write concurrency was too high and TiKV load was heavy; one node went down and keeps OOMing without coming back up. pd store shows it as disconnected, and that store holds a large number of leader regions. How should I get things back to normal?


Check the log from the TiKV restart OOM for slow query records, and use the Dashboard's slow query page and the heatmap to locate hotspot issues. These are usually caused by hotspot business queries, so the slow queries need a closer look.


It's not caused by queries; it's caused by high-concurrency writes. But that's beside the point. The key issue is that one TiKV is down. How do I recover now? That's the most urgent thing.





Can you check the TiKV startup log for errors?


tikv.tar.gz (3.5 MB)
tikv1.tar.gz (18.8 MB)


@Lucien The logs show these two ERRORs. What I see from monitoring and top is: TiKV starts, memory spikes to nearly 100%, then it crashes and restarts. There are far too many pending and down regions now; we want to get it running normally as soon as possible.



Tried that; it doesn't help either, still OOM.


The cluster should be in an available state right now. If you're worried about cluster availability, we recommend scaling out a new TiKV node first. Is the 91 node the only one hitting OOM? Is it a VM or a physical machine? How much memory does it have?
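A minimal tiup scale-out sketch, assuming the cluster is named tidb-test and the new host is 10.18.253.100 (cluster name, host, ports, and directories are all placeholders):

# scale-out.yaml -- topology for the new TiKV node
tikv_servers:
  - host: 10.18.253.100
    port: 20160
    status_port: 20180
    deploy_dir: /data/tidb-deploy/tikv-20160
    data_dir: /data/tidb-data/tikv-20160

# apply it
tiup cluster scale-out tidb-test scale-out.yaml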


Only this node OOMs. Physical machine, 16 GB of memory. The cluster is available, but there are still some in-flight read and write jobs; I'm worried that if another node goes down, the cluster is done for.
That node just crashed and restarted, and luckily it came back up. If it hadn't, this cluster would have been finished...


Is the 91 node that OOMed still failing to start? The log only shows a welcome at 12:00, with no recent status.


The server clock has a time zone offset, so that is the most recent entry. It still OOMs and won't start. I just saw that another TiKV went down too; it did restart successfully, but the CPU on a third TiKV is also spiking now. I feel this cluster is in real danger. Should I just scale the node in with --force, then change the port and scale it back out, so that we're back to 3 healthy nodes first?


Yes, that works. That said, 16 GB really is a bit small; consider scaling up to 32 GB of memory.
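For reference, a sketch of that procedure with tiup, assuming the cluster is named tidb-test (placeholder name). --force abandons the node's data in place, so the cluster must rebuild the lost replicas from the surviving peers afterwards:

# forcibly remove the stuck TiKV node (treated as permanently down)
tiup cluster scale-in tidb-test --node 10.18.253.91:20160 --force

# after wiping the old data directory, scale the host back in on a new port
# (e.g. port 20161 / status_port 20181) via a scale-out topology file
tiup cluster scale-out tidb-test scale-out.yaml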


OK, we're planning to upgrade the hardware. First we'll get the replicas replenished.


I raised every PD scheduling parameter I could, but scheduling still feels very slow: in 12 hours it only got through this much, and that was with zero jobs running on the cluster. The parameters are set very high, yet CPU, memory, and IO usage are all still low; it feels like these parameters aren't having any effect...


{
  "replication": {
    "enable-placement-rules": "true",
    "isolation-level": "",
    "location-labels": "host",
    "max-replicas": 3,
    "strictly-match-label": "false"
  },
  "schedule": {
    "enable-cross-table-merge": "true",
    "enable-debug-metrics": "false",
    "enable-joint-consensus": "true",
    "enable-location-replacement": "true",
    "enable-make-up-replica": "true",
    "enable-one-way-merge": "false",
    "enable-remove-down-replica": "true",
    "enable-remove-extra-replica": "true",
    "enable-replace-offline-replica": "true",
    "high-space-ratio": 0.7,
    "hot-region-cache-hits-threshold": 3,
    "hot-region-schedule-limit": 256,
    "leader-schedule-limit": 400,
    "leader-schedule-policy": "count",
    "low-space-ratio": 0.8,
    "max-merge-region-keys": 200000,
    "max-merge-region-size": 20,
    "max-pending-peer-count": 300,
    "max-snapshot-count": 30,
    "max-store-down-time": "30m0s",
    "merge-schedule-limit": 80,
    "patrol-region-interval": "100ms",
    "region-schedule-limit": 2048,
    "region-score-formula-version": "v2",
    "replica-schedule-limit": 1280,
    "scheduler-max-waiting-operator": 40,
    "split-merge-interval": "1h0m0s",
    "store-limit-mode": "manual",
    "tolerant-size-ratio": 0
  }
}

Have you adjusted the store limit? If not, raise the store limit for all stores. See:
https://docs.pingcap.com/zh/tidb/v5.0/configure-store-limit
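For example, in pd-ctl (the rate of 64 below is only illustrative, not a tuned recommendation):

# show the current limits for every store
store limit

# allow every store to add/remove up to 64 peers per minute
store limit all 64

# or raise a single store, e.g. store id 5
store limit 5 64 add-peer
store limit 5 64 remove-peer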

Which command is it that you're starting with force? Thanks.

It's not a force start. I was asking whether I can take the node offline forcibly with --force, then change the port and scale it back in, and whether that would lose data.

While the replicas were being replenished, TiKV kept trying to connect to the two offline TiKV nodes, and now TiKV has stopped serving. Have some regions been lost? But we run 3 replicas and nothing was forced offline; I replenished the replicas first. Why would regions become unavailable? I checked and there are no leaderless regions.





1. Is there sufficient disk space on each node? The monitoring panel at the top reports a store in lowspace, so scheduling may never finish because of insufficient disk space. Also, 16 GB of memory is below the standard deployment requirement; we recommend scaling out a TiKV node directly on a new machine.
2. Use pd-ctl to check the two stores in Offline state and see whether their region_count and leader_count are decreasing (see the pd-ctl sketch after this list).
3. Check how many regions currently have fewer than 3 or fewer than 2 replicas; a normal scale-in generally does not cause region data loss.
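For steps 2 and 3, a pd-ctl sketch (store IDs 1 and 4 are the two Offline stores in the output further down; region check is a built-in pd-ctl subcommand):

# step 2: watch region_count / leader_count on the offline stores
store 1
store 4

# step 3: list regions whose replica count is below max-replicas
region check miss-peer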

1. Disk space is plentiful everywhere; the lowspace report is a monitoring-data issue.
2. The counts on node 27 are not decreasing (that one was kicked out earlier with --force, and its actual data is already gone); node 91's counts are decreasing.
3. No regions have fewer than 3 replicas, and monitoring shows no miss regions either, but when a program reads data it reports regions unavailable.
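One way to cross-check the unavailable-region errors from pd-ctl (a sketch; the jq expressions follow the pattern used in the official disaster-recovery docs, and store IDs 1 and 4 are the two Offline stores in the output below):

# regions that still keep a peer on offline store 1 or 4
region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(any(. == 1 or . == 4))}"

# regions where the offline stores hold half or more of the replicas (quorum at risk)
region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(1,4) then . else empty end) | length >= $total-length)}"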

{
  "count": 8,
  "stores": [
    {
      "store": {
        "id": 54,
        "address": "10.18.253.10:3930",
        "labels": [
          {
            "key": "engine",
            "value": "tiflash"
          }
        ],
        "version": "v5.0.3",
        "peer_address": "10.18.253.10:20170",
        "status_address": "10.18.253.10:20292",
        "git_hash": "0194cb4b59438d8d46fc05a4b1abd85eeb69972f",
        "start_timestamp": 1632965500,
        "deploy_path": "/data/tidb-deploy/tiflash-9000/bin/tiflash",
        "last_heartbeat": 1632983482633975896,
        "state_name": "Up"
      },
      "status": {
        "capacity": "499.9GiB",
        "available": "319.8GiB",
        "used_size": "180.1GiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 6797,
        "region_weight": 1,
        "region_score": 1058371.249245507,
        "region_size": 604918,
        "start_ts": "2021-09-30T01:31:40Z",
        "last_heartbeat_ts": "2021-09-30T06:31:22.633975896Z",
        "uptime": "4h59m42.633975896s"
      }
    },
    {
      "store": {
        "id": 55,
        "address": "10.18.253.24:3930",
        "labels": [
          {
            "key": "engine",
            "value": "tiflash"
          }
        ],
        "version": "v5.0.3",
        "peer_address": "10.18.253.24:20170",
        "status_address": "10.18.253.24:20292",
        "git_hash": "0194cb4b59438d8d46fc05a4b1abd85eeb69972f",
        "start_timestamp": 1632965498,
        "deploy_path": "/data/tidb-deploy/tiflash-9000/bin/tiflash",
        "last_heartbeat": 1632983485966696536,
        "state_name": "Up"
      },
      "status": {
        "capacity": "499.9GiB",
        "available": "313.8GiB",
        "used_size": "186.1GiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 6808,
        "region_weight": 1,
        "region_score": 1062285.9464633858,
        "region_size": 604100,
        "start_ts": "2021-09-30T01:31:38Z",
        "last_heartbeat_ts": "2021-09-30T06:31:25.966696536Z",
        "uptime": "4h59m47.966696536s"
      }
    },
    {
      "store": {
        "id": 56,
        "address": "10.18.253.57:3930",
        "labels": [
          {
            "key": "engine",
            "value": "tiflash"
          }
        ],
        "version": "v5.0.3",
        "peer_address": "10.18.253.57:20170",
        "status_address": "10.18.253.57:20292",
        "git_hash": "0194cb4b59438d8d46fc05a4b1abd85eeb69972f",
        "start_timestamp": 1632974871,
        "deploy_path": "/data/tidb-deploy/tiflash-9000/bin/tiflash",
        "last_heartbeat": 1632983480532386631,
        "state_name": "Up"
      },
      "status": {
        "capacity": "499.9GiB",
        "available": "301.3GiB",
        "used_size": "198.6GiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 6569,
        "region_weight": 1,
        "region_score": 1031402.8865471387,
        "region_size": 579941,
        "start_ts": "2021-09-30T04:07:51Z",
        "last_heartbeat_ts": "2021-09-30T06:31:20.532386631Z",
        "uptime": "2h23m29.532386631s"
      }
    },
    {
      "store": {
        "id": 404205,
        "address": "10.18.253.27:20161",
        "labels": [
          {
            "key": "host",
            "value": "tikv4"
          }
        ],
        "version": "5.0.3",
        "status_address": "10.18.253.27:20181",
        "git_hash": "63b63edfbb9bbf8aeb875aad28c59f082eeb55d4",
        "start_timestamp": 1632967435,
        "deploy_path": "/data/tidb-deploy/tikv-20161/bin",
        "last_heartbeat": 1632983488698100808,
        "state_name": "Up"
      },
      "status": {
        "capacity": "849.9GiB",
        "available": "343.7GiB",
        "used_size": "455.6GiB",
        "leader_count": 6747,
        "leader_weight": 3,
        "leader_score": 2249,
        "leader_size": 615156,
        "region_count": 20767,
        "region_weight": 1,
        "region_score": 3032835.002259687,
        "region_size": 1895776,
        "start_ts": "2021-09-30T02:03:55Z",
        "last_heartbeat_ts": "2021-09-30T06:31:28.698100808Z",
        "uptime": "4h27m33.698100808s"
      }
    },
    {
      "store": {
        "id": 12918110,
        "address": "10.18.253.91:20161",
        "labels": [
          {
            "key": "host",
            "value": "tikv5"
          }
        ],
        "version": "5.0.3",
        "status_address": "10.18.253.91:20181",
        "git_hash": "63b63edfbb9bbf8aeb875aad28c59f082eeb55d4",
        "start_timestamp": 1632924943,
        "deploy_path": "/data/tidb-deploy/tikv-20161/bin",
        "last_heartbeat": 1632983489206887362,
        "state_name": "Up"
      },
      "status": {
        "capacity": "1.66TiB",
        "available": "708.6GiB",
        "used_size": "287.6GiB",
        "leader_count": 6751,
        "leader_weight": 3,
        "leader_score": 2250.3333333333335,
        "leader_size": 621627,
        "region_count": 13068,
        "region_weight": 4,
        "region_score": 387715.6971647727,
        "region_size": 1198936,
        "start_ts": "2021-09-29T14:15:43Z",
        "last_heartbeat_ts": "2021-09-30T06:31:29.206887362Z",
        "uptime": "16h15m46.206887362s"
      }
    },
    {
      "store": {
        "id": 1,
        "address": "10.18.253.91:20160",
        "state": 1,
        "labels": [
          {
            "key": "host",
            "value": "tikv3"
          }
        ],
        "version": "5.0.3",
        "status_address": "10.18.253.91:20180",
        "git_hash": "63b63edfbb9bbf8aeb875aad28c59f082eeb55d4",
        "start_timestamp": 1632920959,
        "deploy_path": "/data/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1632915326930575684,
        "state_name": "Offline"
      },
      "status": {
        "capacity": "849.9GiB",
        "available": "251.3GiB",
        "used_size": "525.7GiB",
        "leader_count": 57,
        "leader_weight": 1,
        "leader_score": 57,
        "leader_size": 57,
        "region_count": 9985,
        "region_weight": 1,
        "region_score": 1483945.8200369256,
        "region_size": 814446,
        "start_ts": "2021-09-29T13:09:19Z",
        "last_heartbeat_ts": "2021-09-29T11:35:26.930575684Z"
      }
    },
    {
      "store": {
        "id": 4,
        "address": "10.18.253.27:20160",
        "state": 1,
        "labels": [
          {
            "key": "host",
            "value": "tikv2"
          }
        ],
        "version": "5.0.3",
        "status_address": "10.18.253.27:20180",
        "git_hash": "63b63edfbb9bbf8aeb875aad28c59f082eeb55d4",
        "start_timestamp": 1632818691,
        "deploy_path": "/data/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1632774762533225969,
        "state_name": "Offline"
      },
      "status": {
        "capacity": "0B",
        "available": "0B",
        "used_size": "0B",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 2291,
        "region_weight": 1,
        "region_score": 117831,
        "region_size": 117831,
        "start_ts": "2021-09-28T08:44:51Z",
        "last_heartbeat_ts": "2021-09-27T20:32:42.533225969Z"
      }
    },
    {
      "store": {
        "id": 5,
        "address": "10.18.253.90:20160",
        "labels": [
          {
            "key": "host",
            "value": "tikv1"
          }
        ],
        "version": "5.0.3",
        "status_address": "10.18.253.90:20180",
        "git_hash": "63b63edfbb9bbf8aeb875aad28c59f082eeb55d4",
        "start_timestamp": 1632920623,
        "deploy_path": "/data/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1632983489593600579,
        "state_name": "Up"
      },
      "status": {
        "capacity": "849.9GiB",
        "available": "242.1GiB",
        "used_size": "531.5GiB",
        "leader_count": 9498,
        "leader_weight": 1,
        "leader_score": 9498,
        "leader_size": 776542,
        "region_count": 23053,
        "region_weight": 1,
        "region_score": 3641234.0464543332,
        "region_size": 2013382,
        "start_ts": "2021-09-29T13:03:43Z",
        "last_heartbeat_ts": "2021-09-30T06:31:29.593600579Z",
        "uptime": "17h27m46.593600579s"
      }
    }
  ]
}