Two TiKV instances down, need help

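[TiDB environment] Production
[TiDB version] v3.0.0-rc.1-309-g8c20289c7
[Problem encountered]
We have four machines, each running two TiKV services on separate disks. On one of the machines both TiKV instances went down: first one instance went down, and later we found that the other had gone down as well.
[Reproduction path] Which operations led to the problem
[Symptoms and impact]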

I. First TiKV instance: tikv-20161

  1. First, a query returned this error:

ERROR: other error: [src/storage/kv/raftkv.rs:370]: RocksDb Corruption: block checksum mismatch: expected 3653267617, got 3969982044 in /data/tikv00/deploy/data/db/9673429.sst offset 4429149 size 12210

  2. Then I found the following in the system log:

May 28 08:41:31 tidb01 kernel: EXT4-fs error (device vdc1): ext4_iget:4339: inode #5573940: comm tikv-server: bad extended attribute block 63116356317695
May 28 08:41:31 tidb01 kernel: EXT4-fs error (device vdc1): ext4_iget:4199: inode #5573943: comm tikv-server: bad extra_isize (64125 != 256)
May 28 08:41:49 tidb01 kernel: EXT4-fs error (device vdc1): ext4_iget:4199: inode #5573937: comm tikv-server: bad extra_isize (29360 != 256)
May 28 08:41:49 tidb01 kernel: EXT4-fs error (device vdc1): ext4_iget:4199: inode #5573939: comm tikv-server: bad extra_isize (26035 != 256)
May 28 08:41:49 tidb01 kernel: EXT4-fs error (device vdc1): ext4_iget:4199: inode #5573942: comm tikv-server: bad extra_isize (57968 != 256)

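  3. I concluded that the disk was faulty, so I repaired it.
The service is managed by systemd. I checked with ps that no tikv process was running, so I did not stop the service first, and ran the following: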
umount /dev/vdc1
fsck -y /dev/vdc1
mount -a
After that, the log showed the following error:
[FATAL] [server.rs:176] [“failed to create kv engine: RocksDb Corruption: SST file is ahead of WALs”]

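II. Second TiKV instance: tikv-20161
After the first instance failed to start with the log above, the second instance, tikv-20161, kept running normally for a while, but I did not pay much attention to its log at the time. Heavy log output was filling up the root partition, so tikv.log was being cleaned up every 5 minutes. These are the earliest log entries that still exist: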

[2022/05/27 21:51:27.405 +08:00] [ERROR] [raft_client.rs:207] [“RaftClient fails to send”]
[2022/05/27 21:51:27.405 +08:00] [ERROR] [raft_client.rs:118] [“batch_raft RPC finished fail”] [err=“RpcFinished(Some(RpcStatus { status: Unavailable, details: Some(“Connect Failed”) }))”]
[2022/05/27 21:51:27.405 +08:00] [WARN] [raft_client.rs:132] [“batch_raft/raft RPC finally fail”] [err=“RpcFinished(Some(RpcStatus { status: Unavailable, details: Some(“Connect Failed”) }))”] [to_addr=172.17.53.28:20161]
[2022/05/27 21:51:27.482 +08:00] [ERROR] [raft_client.rs:207] [“RaftClient fails to send”]

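By the time I noticed that the tikv-20161 process had gone down, it was restarting continuously and producing large numbers of log entries like these: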

[2022/05/29 09:32:42.421 +08:00] [WARN] [store.rs:1122] [“set thread priority for raftstore failed”] [error=“Os { code: 13, kind: PermissionDenied, message: “Permission denied” }”]
[2022/05/29 09:33:03.989 +08:00] [WARN] [store.rs:1122] [“set thread priority for raftstore failed”] [error=“Os { code: 13, kind: PermissionDenied, message: “Permission denied” }”]
[2022/05/29 09:33:26.367 +08:00] [WARN] [store.rs:1122] [“set thread priority for raftstore failed”] [error=“Os { code: 13, kind: PermissionDenied, message: “Permission denied” }”]
[2022/05/29 09:33:47.718 +08:00] [WARN] [store.rs:1122] [“set thread priority for raftstore failed”] [error=“Os { code: 13, kind: PermissionDenied, message: “Permission denied” }”]
[2022/05/29 09:34:09.148 +08:00] [WARN] [store.rs:1122] [“set thread priority for raftstore failed”] [error=“Os { code: 13, kind: PermissionDenied, message: “Permission denied” }”]
[2022/05/29 09:34:30.501 +08:00] [WARN] [store.rs:1122] [“set thread priority for raftstore failed”] [error=“Os { code: 13, kind: PermissionDenied, message: “Permission denied” }”]
[2022/05/29 09:34:52.502 +08:00] [WARN] [store.rs:1122] [“set thread priority for raftstore failed”] [error=“Os { code: 13, kind: PermissionDenied, message: “Permission denied” }”]
[2022/05/29 09:34:52.704 +08:00] [ERROR] [process.rs:175] [“get snapshot failed”] [err=“Request(message: “peer is not leader” not_leader { region_id: 856938 })”] [cid=1]
[2022/05/29 09:34:52.778 +08:00] [ERROR] [process.rs:175] [“get snapshot failed”] [err=“Request(message: “peer is not leader” not_leader { region_id: 971502 })”] [cid=2]
[2022/05/29 09:35:14.392 +08:00] [WARN] [store.rs:1122] [“set thread priority for raftstore failed”] [error=“Os { code: 13, kind: PermissionDenied, message: “Permission denied” }”]
[2022/05/29 09:35:36.442 +08:00] [WARN] [store.rs:1122] [“set thread priority for raftstore failed”] [error=“Os { code: 13, kind: PermissionDenied, message: “Permission denied” }”]
[2022/05/29 09:35:57.817 +08:00] [WARN] [store.rs:1122] [“set thread priority for raftstore failed”] [error=“Os { code: 13, kind: PermissionDenied, message: “Permission denied” }”]
[2022/05/29 09:36:20.213 +08:00] [WARN] [store.rs:1122] [“set thread priority for raftstore failed”] [error=“Os { code: 13, kind: PermissionDenied, message: “Permission denied” }”]
[2022/05/29 09:36:41.541 +08:00] [WARN] [store.rs:1122] [“set thread priority for raftstore failed”] [error=“Os { code: 13, kind: PermissionDenied, message: “Permission denied” }”]
[2022/05/29 09:37:04.320 +08:00] [WARN] [store.rs:1122] [“set thread priority for raftstore failed”] [error=“Os { code: 13, kind: PermissionDenied, message: “Permission denied” }”]
[2022/05/29 09:37:27.555 +08:00] [WARN] [store.rs:1122] [“set thread priority for raftstore failed”] [error=“Os { code: 13, kind: PermissionDenied, message: “Permission denied” }”]
[2022/05/29 09:37:49.084 +08:00] [WARN] [store.rs:1122] [“set thread priority for raftstore failed”] [error=“Os { code: 13, kind: PermissionDenied, message: “Permission denied” }”]
[2022/05/29 09:38:11.239 +08:00] [WARN] [store.rs:1122] [“set thread priority for raftstore failed”] [error=“Os { code: 13, kind: PermissionDenied, message: “Permission denied” }”]
[2022/05/29 09:38:33.063 +08:00] [WARN] [store.rs:1122] [“set thread priority for raftstore failed”] [error=“Os { code: 13, kind: PermissionDenied, message: “Permission denied” }”]
[2022/05/29 09:38:55.145 +08:00] [WARN] [store.rs:1122] [“set thread priority for raftstore failed”] [error=“Os { code: 13, kind: PermissionDenied, message: “Permission denied” }”]
[2022/05/29 09:39:16.784 +08:00] [WARN] [store.rs:1122] [“set thread priority for raftstore failed”] [error=“Os { code: 13, kind: PermissionDenied, message: “Permission denied” }”]

III. Current store status

{
  "count":8,
  "stores":[
    Object{…},
    {
      "store":{
        "id":3384,
        "address":"172.17.53.28:20161",
        "labels":[
          {
            "key":"host",
            "value":"tikv01"
          }
        ],
        "version":"3.0.0-beta.1",
        "state_name":"Down"
      },
      "status":{
        "capacity":"836 GiB",
        "available":"135 GiB",
        "leader_weight":1,
        "region_count":13680,
        "region_weight":1,
        "region_score":1073603526.5664062,
        "region_size":861224,
        "start_ts":"2021-10-14T15:49:07+08:00",
        "last_heartbeat_ts":"2022-05-27T14:37:23.721634399+08:00",
        "uptime":"5398h48m16.721634399s"
      }
    },
    Object{…},
    Object{…},
    Object{…},
    Object{…},
    {
      "store":{
        "id":1,
        "address":"172.17.53.28:20160",
        "labels":[
          {
            "key":"host",
            "value":"tikv01"
          }
        ],
        "version":"3.0.0-beta.1",
        "state_name":"Down"
      },
      "status":{
        "capacity":"836 GiB",
        "available":"137 GiB",
        "leader_count":395,
        "leader_weight":1,
        "leader_score":25979,
        "leader_size":25979,
        "region_count":19115,
        "region_weight":1,
        "region_score":1073601332.5507812,
        "region_size":1183154,
        "start_ts":"2021-09-10T11:30:09+08:00",
        "last_heartbeat_ts":"2022-05-28T15:13:07.799823276+08:00",
        "uptime":"6243h42m58.799823276s"
      }
    },
    Object{…}
  ]
}
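
To pick the Down stores out of a long store dump like this, a small script can help. The following is a minimal sketch, assuming the PD store output has been saved as valid JSON (straight quotes, and no elided `Object{…}` entries). The function name `down_stores` and the third store in the sample (id 4, address `10.0.0.4:20160`) are hypothetical; only the field names and the two Down stores come from the output above.

```python
import json
from typing import List, Tuple


def down_stores(store_dump: str) -> List[Tuple[int, str, str]]:
    """Return (id, address, last_heartbeat_ts) for every store in state "Down"."""
    data = json.loads(store_dump)
    result = []
    for entry in data.get("stores", []):
        store = entry.get("store", {})
        if store.get("state_name") == "Down":
            status = entry.get("status", {})
            result.append(
                (store["id"], store["address"], status.get("last_heartbeat_ts", ""))
            )
    # Oldest heartbeat first. Timestamps in the same format and UTC offset
    # sort correctly as plain strings.
    return sorted(result, key=lambda item: item[2])


# Sample input: two stores copied from the post, plus one hypothetical healthy store.
sample_stores = json.dumps({
    "count": 3,
    "stores": [
        {"store": {"id": 1, "address": "172.17.53.28:20160", "state_name": "Down"},
         "status": {"last_heartbeat_ts": "2022-05-28T15:13:07.799823276+08:00"}},
        {"store": {"id": 4, "address": "10.0.0.4:20160", "state_name": "Up"},  # hypothetical
         "status": {"last_heartbeat_ts": "2022-05-29T09:40:00+08:00"}},
        {"store": {"id": 3384, "address": "172.17.53.28:20161", "state_name": "Down"},
         "status": {"last_heartbeat_ts": "2022-05-27T14:37:23.721634399+08:00"}},
    ],
})

for store_id, address, last_seen in down_stores(sample_stores):
    print(store_id, address, last_seen)
```

Sorting by last heartbeat shows which store dropped out first, which helps when matching each store to the log excerpts above.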

IV. Errors returned to business queries:
Query 1 ERROR: Region is unavailable
Query 1 ERROR: TiKV server timeout

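Instance 1 is a disk problem.
For instance 2, check whether there are any other ERROR or FATAL level logs; this WARN is probably not the key issue.
Judging from the errors, some Regions have probably lost a majority of their replicas, so unsafe-recover is needed.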

Will the cluster as a whole lose data? Can instance 1 be removed, with unsafe-recover run on instance 2? What are the exact steps?

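You may lose a small amount of data, or none at all. Both of these stores are now in the Down state, so Regions that lost only a minority of their replicas should already have had the missing replicas re-created on other TiKV nodes. Check which Regions still have a majority of their replicas on these two stores, and work out which tables might lose data: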
pd-ctl region --pd <pd_ip:2379> --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(1,3384) then . else empty end) | length>=$total-length) }'

curl http://<tidb_ip:10080>/regions/<region_id>
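
For readers less familiar with jq, the selection rule in the filter above can be sketched in Python. This is only an illustration, assuming the region list has been saved as JSON with a top-level `regions` array whose entries carry `id` and `peers[].store_id` (the fields the jq filter reads). The function name `regions_at_risk`, the region IDs 101 to 103, and the healthy store IDs 4, 5 and 6 are hypothetical; only store IDs 1 and 3384 come from this thread.

```python
import json
from typing import Dict, Iterable, List


def regions_at_risk(region_dump: str, failed_stores: Iterable[int]) -> List[Dict]:
    """Select regions whose peers on failed stores are at least as many as the rest."""
    failed = set(failed_stores)
    at_risk = []
    for region in json.loads(region_dump).get("regions", []):
        peer_stores = [peer["store_id"] for peer in region.get("peers", [])]
        on_failed = sum(1 for store_id in peer_stores if store_id in failed)
        # Same condition as the jq filter: failed peers >= remaining peers,
        # meaning the surviving replicas can no longer form a majority.
        if on_failed >= len(peer_stores) - on_failed:
            at_risk.append({"id": region["id"], "peer_stores": peer_stores})
    return at_risk


# Hypothetical sample: three regions with three replicas each.
sample_regions = json.dumps({
    "regions": [
        {"id": 101, "peers": [{"store_id": 1}, {"store_id": 3384}, {"store_id": 4}]},
        {"id": 102, "peers": [{"store_id": 1}, {"store_id": 4}, {"store_id": 5}]},
        {"id": 103, "peers": [{"store_id": 4}, {"store_id": 5}, {"store_id": 6}]},
    ]
})

for region in regions_at_risk(sample_regions, [1, 3384]):
    print(region["id"], region["peer_stores"])
```

In this sample only region 101 is selected, because two of its three replicas sit on the failed stores. Region 102 lost a single replica and still has a majority elsewhere.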

Then recover by following the SOP below.

I checked, and more than 10,000 regions are affected. Our TiDB version is quite old. Is there a recovery document for this version that we can follow?


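If we skip the recovery and instead build a new cluster and migrate the data into it, would that work? The version is so old and we are unfamiliar with the procedure, so the repair is a struggle for us.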

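Without recovery, some Regions will stay unavailable. If you can accept losing that data, you can migrate the data straight to a new cluster; otherwise you need to run the recovery first.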