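【TiDB environment】Production
【TiDB version】v3.0.0-rc.1-309-g8c20289c7
【Problem encountered】
We have four machines, each running two TiKV instances on separate disks. Both TiKV instances on one of these machines have gone down: at first only one instance was down, and later we found that the other one had gone down as well.
【Reproduction path】Which operations led to the problem
【Symptoms and impact】
1. The first TiKV instance: tikv-20161
- First, a query returned this error: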
ERROR: other error: [src/storage/kv/raftkv.rs:370]: RocksDb Corruption: block checksum mismatch: expected 3653267617, got 3969982044 in /data/tikv00/deploy/data/db/9673429.sst offset 4429149 size 12210
- Then I found the following in the system log:
May 28 08:41:31 tidb01 kernel: EXT4-fs error (device vdc1): ext4_iget:4339: inode #5573940: comm tikv-server: bad extended attribute block 63116356317695
May 28 08:41:31 tidb01 kernel: EXT4-fs error (device vdc1): ext4_iget:4199: inode #5573943: comm tikv-server: bad extra_isize (64125 != 256)
May 28 08:41:49 tidb01 kernel: EXT4-fs error (device vdc1): ext4_iget:4199: inode #5573937: comm tikv-server: bad extra_isize (29360 != 256)
May 28 08:41:49 tidb01 kernel: EXT4-fs error (device vdc1): ext4_iget:4199: inode #5573939: comm tikv-server: bad extra_isize (26035 != 256)
May 28 08:41:49 tidb01 kernel: EXT4-fs error (device vdc1): ext4_iget:4199: inode #5573942: comm tikv-server: bad extra_isize (57968 != 256)
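- I concluded that the disk had developed a fault, so I repaired the file system.
TiKV is managed by systemd. I checked with ps that no tikv process was running, but I did not stop the service before running the following: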
umount /dev/vdc1
fsck -y /dev/vdc1
mount -a
At this point the log reported the following error:
[FATAL] [server.rs:176] ["failed to create kv engine: RocksDb Corruption: SST file is ahead of WALs"]
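A minimal Python sketch, assuming the kernel messages keep exactly the layout of the EXT4-fs lines quoted above, of how the affected device and inode numbers could be collected from the system log before deciding what to repair. The regular expression and the function name `parse_ext4_errors` are illustrative assumptions, not part of any TiDB or kernel tool.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the kernel lines quoted in this post:
# "... EXT4-fs error (device vdc1): ext4_iget:4339: inode #5573940: comm tikv-server: bad ..."
EXT4_ERROR = re.compile(
    r"EXT4-fs error \(device (?P<device>[^)]+)\): "
    r"(?P<func>\w+):(?P<line>\d+): inode #(?P<inode>\d+): "
    r"comm (?P<comm>\S+): (?P<detail>.*)"
)


def parse_ext4_errors(lines):
    """Group the inode numbers flagged by EXT4-fs errors by device name."""
    inodes_by_device = defaultdict(set)
    for text in lines:
        match = EXT4_ERROR.search(text)
        if match:
            inodes_by_device[match["device"]].add(int(match["inode"]))
    # Sort so the result is stable regardless of log order.
    return {dev: sorted(inodes) for dev, inodes in inodes_by_device.items()}


# Two of the lines from the system log above.
sample = [
    "May 28 08:41:31 tidb01 kernel: EXT4-fs error (device vdc1): ext4_iget:4339: "
    "inode #5573940: comm tikv-server: bad extended attribute block 63116356317695",
    "May 28 08:41:31 tidb01 kernel: EXT4-fs error (device vdc1): ext4_iget:4199: "
    "inode #5573943: comm tikv-server: bad extra_isize (64125 != 256)",
]
print(parse_ext4_errors(sample))
```

Grouping by device makes it easy to confirm that every flagged inode is on the disk that is about to be unmounted and checked.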
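2. The second TiKV instance: tikv-20161
After the first instance failed to start with the error above, the second instance, tikv-20161, kept running normally for a while, although I did not pay much attention to its log at the time. Because the heavy log output was filling up the root partition, tikv.log was being cleared every 5 minutes, so I went back to the earliest log entries that still existed: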
[2022/05/27 21:51:27.405 +08:00] [ERROR] [raft_client.rs:207] ["RaftClient fails to send"]
[2022/05/27 21:51:27.405 +08:00] [ERROR] [raft_client.rs:118] ["batch_raft RPC finished fail"] [err="RpcFinished(Some(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))"]
[2022/05/27 21:51:27.405 +08:00] [WARN] [raft_client.rs:132] ["batch_raft/raft RPC finally fail"] [err="RpcFinished(Some(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))"] [to_addr=172.17.53.28:20161]
[2022/05/27 21:51:27.482 +08:00] [ERROR] [raft_client.rs:207] ["RaftClient fails to send"]
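By the time I noticed that the tikv-20161 process was down, it had been restarting continuously, producing a large number of log entries like the following: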
[2022/05/29 09:32:42.421 +08:00] [WARN] [store.rs:1122] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"]
[2022/05/29 09:33:03.989 +08:00] [WARN] [store.rs:1122] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"]
[2022/05/29 09:33:26.367 +08:00] [WARN] [store.rs:1122] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"]
[2022/05/29 09:33:47.718 +08:00] [WARN] [store.rs:1122] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"]
[2022/05/29 09:34:09.148 +08:00] [WARN] [store.rs:1122] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"]
[2022/05/29 09:34:30.501 +08:00] [WARN] [store.rs:1122] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"]
[2022/05/29 09:34:52.502 +08:00] [WARN] [store.rs:1122] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"]
[2022/05/29 09:34:52.704 +08:00] [ERROR] [process.rs:175] ["get snapshot failed"] [err="Request(message: "peer is not leader" not_leader { region_id: 856938 })"] [cid=1]
[2022/05/29 09:34:52.778 +08:00] [ERROR] [process.rs:175] ["get snapshot failed"] [err="Request(message: "peer is not leader" not_leader { region_id: 971502 })"] [cid=2]
[2022/05/29 09:35:14.392 +08:00] [WARN] [store.rs:1122] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"]
[2022/05/29 09:35:36.442 +08:00] [WARN] [store.rs:1122] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"]
[2022/05/29 09:35:57.817 +08:00] [WARN] [store.rs:1122] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"]
[2022/05/29 09:36:20.213 +08:00] [WARN] [store.rs:1122] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"]
[2022/05/29 09:36:41.541 +08:00] [WARN] [store.rs:1122] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"]
[2022/05/29 09:37:04.320 +08:00] [WARN] [store.rs:1122] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"]
[2022/05/29 09:37:27.555 +08:00] [WARN] [store.rs:1122] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"]
[2022/05/29 09:37:49.084 +08:00] [WARN] [store.rs:1122] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"]
[2022/05/29 09:38:11.239 +08:00] [WARN] [store.rs:1122] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"]
[2022/05/29 09:38:33.063 +08:00] [WARN] [store.rs:1122] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"]
[2022/05/29 09:38:55.145 +08:00] [WARN] [store.rs:1122] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"]
[2022/05/29 09:39:16.784 +08:00] [WARN] [store.rs:1122] ["set thread priority for raftstore failed"] [error="Os { code: 13, kind: PermissionDenied, message: "Permission denied" }"]
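To put a number on the restart loop, here is a rough Python sketch that measures the time between consecutive warnings from `store.rs:1122`. It assumes that this warning is written once per process start, which I have not verified against the TiKV source, and that every line begins with the `[timestamp] [LEVEL] [file:line]` prefix seen above. The function name `restart_gaps` is made up for illustration.

```python
import re
from datetime import datetime

# Prefix of each TiKV log line as it appears in this post:
# [2022/05/29 09:32:42.421 +08:00] [WARN] [store.rs:1122] ...
LOG_PREFIX = re.compile(r"^\[(?P<ts>[^\]]+)\] \[(?P<level>\w+)\] \[(?P<src>[^\]]+)\]")


def restart_gaps(lines, source="store.rs:1122"):
    """Return the seconds between consecutive log lines emitted from `source`."""
    stamps = []
    for text in lines:
        match = LOG_PREFIX.match(text)
        if match and match["src"] == source:
            # %z accepts the "+08:00" offset form on Python 3.7 and later.
            stamps.append(datetime.strptime(match["ts"], "%Y/%m/%d %H:%M:%S.%f %z"))
    return [(later - earlier).total_seconds() for earlier, later in zip(stamps, stamps[1:])]


# Four of the lines above, with the trailing fields omitted for brevity.
sample_log = [
    '[2022/05/29 09:34:30.501 +08:00] [WARN] [store.rs:1122] ["set thread priority for raftstore failed"]',
    '[2022/05/29 09:34:52.502 +08:00] [WARN] [store.rs:1122] ["set thread priority for raftstore failed"]',
    '[2022/05/29 09:34:52.704 +08:00] [ERROR] [process.rs:175] ["get snapshot failed"]',
    '[2022/05/29 09:35:14.392 +08:00] [WARN] [store.rs:1122] ["set thread priority for raftstore failed"]',
]
print(restart_gaps(sample_log))
```

On the lines above the gaps come out at roughly 22 seconds each, which fits a process that exits shortly after starting and is restarted by systemd.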
3. Current store status
{
  "count": 8,
  "stores": [
    Object{…},
    {
      "store": {
        "id": 3384,
        "address": "172.17.53.28:20161",
        "labels": [
          {
            "key": "host",
            "value": "tikv01"
          }
        ],
        "version": "3.0.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "capacity": "836 GiB",
        "available": "135 GiB",
        "leader_weight": 1,
        "region_count": 13680,
        "region_weight": 1,
        "region_score": 1073603526.5664062,
        "region_size": 861224,
        "start_ts": "2021-10-14T15:49:07+08:00",
        "last_heartbeat_ts": "2022-05-27T14:37:23.721634399+08:00",
        "uptime": "5398h48m16.721634399s"
      }
    },
    Object{…},
    Object{…},
    Object{…},
    Object{…},
    {
      "store": {
        "id": 1,
        "address": "172.17.53.28:20160",
        "labels": [
          {
            "key": "host",
            "value": "tikv01"
          }
        ],
        "version": "3.0.0-beta.1",
        "state_name": "Down"
      },
      "status": {
        "capacity": "836 GiB",
        "available": "137 GiB",
        "leader_count": 395,
        "leader_weight": 1,
        "leader_score": 25979,
        "leader_size": 25979,
        "region_count": 19115,
        "region_weight": 1,
        "region_score": 1073601332.5507812,
        "region_size": 1183154,
        "start_ts": "2021-09-10T11:30:09+08:00",
        "last_heartbeat_ts": "2022-05-28T15:13:07.799823276+08:00",
        "uptime": "6243h42m58.799823276s"
      }
    },
    Object{…}
  ]
}
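For reference, a small Python sketch of how the Down stores and their last heartbeat times could be listed from store status JSON of the shape shown above. It assumes the complete, valid JSON is available; the listing above has the other six stores collapsed, so the sample here contains only the two stores that are visible, trimmed to the fields used. The function name `down_stores` is hypothetical.

```python
import json


def down_stores(store_status_json):
    """Return (id, address, last_heartbeat_ts) for every store reported as Down."""
    doc = json.loads(store_status_json)
    result = []
    for item in doc.get("stores", []):
        store = item.get("store", {})
        if store.get("state_name") == "Down":
            status = item.get("status", {})
            result.append(
                (store.get("id"), store.get("address"), status.get("last_heartbeat_ts"))
            )
    return result


# Only the two stores visible in this post, trimmed to the fields read above.
sample = json.dumps({
    "stores": [
        {
            "store": {"id": 3384, "address": "172.17.53.28:20161", "state_name": "Down"},
            "status": {"last_heartbeat_ts": "2022-05-27T14:37:23.721634399+08:00"},
        },
        {
            "store": {"id": 1, "address": "172.17.53.28:20160", "state_name": "Down"},
            "status": {"last_heartbeat_ts": "2022-05-28T15:13:07.799823276+08:00"},
        },
    ]
})

for store_id, address, last_seen in down_stores(sample):
    print(store_id, address, last_seen)
```

Both Down stores carry the same host label (tikv01) and the same IP address, which matches the description of two instances on one machine.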
4. Errors returned to application queries:
Query 1 ERROR: Region is unavailable
Query 1 ERROR: TiKV server timeout