TiKV error: KV:Raft:StepLocalMsg

[TiDB environment] Production
[TiDB version] v7.1.0
[Problem: symptoms and impact]
The log on one of the TiKV nodes keeps reporting this error:
[2023/10/16 09:44:46.004 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=59226] [region_id=59223]
[2023/10/16 09:44:48.006 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=59226] [region_id=59223]
[2023/10/16 09:44:48.176 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=59226] [region_id=59223]
[2023/10/16 09:44:50.008 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=59226] [region_id=59223]
[2023/10/16 09:44:52.010 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=59226] [region_id=59223]
[2023/10/16 09:44:54.012 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=59226] [region_id=59223]

[Resource configuration] Go to TiDB Dashboard > Cluster Info > Hosts and take a screenshot of that page

Did a heartbeat time out? Check whether every node is healthy.


Use pd-ctl to check the status of region 59223.

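The "KV:Raft:StepLocalMsg" entries in the TiKV log are written while TiKV is processing Raft messages; per the error text, Raft could not step a local Raft message.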

Please also post the log lines around the error so we can take a look.

[2023/10/17 09:40:18.007 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=1033] [region_id=1030]
[2023/10/17 09:40:18.253 +08:00] [INFO] [apply.rs:1690] ["execute admin command"] [command="cmd_type: ChangePeerV2 change_peer_v2 { changes { change_type: AddLearnerNode peer { id: 791328 store_id: 3002 role: Learner } } }"] [index=364329] [term=7] [peer_id=59037] [region_id=59034]
[2023/10/17 09:40:18.253 +08:00] [INFO] [apply.rs:2283] ["exec ConfChangeV2"] [epoch="conf_ver: 278169 version: 224"] [kind=Simple] [peer_id=59037] [region_id=59034]
[2023/10/17 09:40:18.253 +08:00] [INFO] [apply.rs:2464] ["conf change successfully"] ["current region"="id: 59034 start_key: 7480000000000000FF6A5F720000000000FA end_key: 7480000000000000FF6B00000000000000F8 region_epoch { conf_ver: 278170 version: 224 } peers { id: 59035 store_id: 1 } peers { id: 59036 store_id: 5 } peers { id: 59037 store_id: 2 } peers { id: 791161 store_id: 231 role: Learner } peers { id: 791164 store_id: 3001 role: Learner } peers { id: 791328 store_id: 3002 role: Learner }"] ["original region"="id: 59034 start_key: 7480000000000000FF6A5F720000000000FA end_key: 7480000000000000FF6B00000000000000F8 region_epoch { conf_ver: 278169 version: 224 } peers { id: 59035 store_id: 1 } peers { id: 59036 store_id: 5 } peers { id: 59037 store_id: 2 } peers { id: 791161 store_id: 231 role: Learner } peers { id: 791164 store_id: 3001 role: Learner }"] [changes="[change_type: AddLearnerNode peer { id: 791328 store_id: 3002 role: Learner }]"] [peer_id=59037] [region_id=59034]
[2023/10/17 09:40:18.256 +08:00] [INFO] [raft.rs:2668] ["switched to configuration"] [config="Configuration { voters: Configuration { incoming: Configuration { voters: {59036, 59037, 59035} }, outgoing: Configuration { voters: {} } }, learners: {791164, 791161, 791328}, learners_next: {}, auto_leave: false }"] [raft_id=59037] [region_id=59034]
[2023/10/17 09:40:18.648 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=1033] [region_id=1030]
[2023/10/17 09:40:20.649 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=1033] [region_id=1030]
[2023/10/17 09:40:22.651 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=1033] [region_id=1030]
[2023/10/17 09:40:24.653 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=1033] [region_id=1030]
[2023/10/17 09:40:24.765 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=1033] [region_id=1030]
[2023/10/17 09:40:26.655 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=1033] [region_id=1030]
[2023/10/17 09:40:28.657 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=1033] [region_id=1030]
[2023/10/17 09:40:30.659 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=1033] [region_id=1030]
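
Since the error shows up for more than one region in this thread (59223, 1030, 59544), it may help to tally the error lines per region and peer before digging further. The following is only a minimal sketch: it assumes the raw log file uses straight quotes and the bracketed field layout shown above, and `tally_errors` is a hypothetical helper name, not part of any TiKV tool.

```python
import re
from collections import Counter

# Pulls err_code, peer_id and region_id out of one TiKV log line.
# Assumption: the fields appear in the order seen in the lines above.
ERROR_RE = re.compile(
    r"\[err_code=(?P<code>[^\]]+)\].*?"
    r"\[peer_id=(?P<peer>\d+)\] \[region_id=(?P<region>\d+)\]"
)


def tally_errors(lines, code="KV:Raft:StepLocalMsg"):
    """Count lines carrying the given err_code per (region_id, peer_id)."""
    counts = Counter()
    for line in lines:
        match = ERROR_RE.search(line)
        if match and match.group("code") == code:
            key = (int(match.group("region")), int(match.group("peer")))
            counts[key] += 1
    return counts


# Lines copied from the logs posted in this thread.
sample = [
    '[2023/10/17 09:40:18.007 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=1033] [region_id=1030]',
    '[2023/10/17 09:40:18.253 +08:00] [INFO] [apply.rs:2283] ["exec ConfChangeV2"] [epoch="conf_ver: 278169 version: 224"] [kind=Simple] [peer_id=59037] [region_id=59034]',
    '[2023/10/17 09:40:18.648 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=1033] [region_id=1030]',
    '[2023/10/17 10:35:34.611 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=59547] [region_id=59544]',
]

for (region_id, peer_id), n in sorted(tally_errors(sample).items()):
    print(f"region {region_id} peer {peer_id}: {n} error line(s)")
```

To run it against a real log, pass an open file object instead of `sample`; the INFO line is skipped because it has no `err_code` field.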

Check which machines host the regions that are reporting the error.

My hunch is that a disk has gone bad.
Just a hunch :laughing:

Are there any other errors, something like "Region is unavailable"?

No "Region is unavailable" errors, and tiup shows the cluster status as normal.

On the Grafana dashboards, disk read and write latency on the machines all looks normal.

Keep watching the logs; it could still be a disk problem.

Use pd-ctl to look at the status of the regions reporting the error.

[2023/10/17 10:35:34.611 +08:00] [ERROR] [peer.rs:618] ["handle raft message err"] [err_code=KV:Raft:StepLocalMsg] [err="Raft raft: cannot step raft local message"] [peer_id=59547] [region_id=59544]

Region status:
» region 59544
{
  "id": 59544,
  "start_key": "7480000000000000FF705F720000000000FA",
  "end_key": "7480000000000000FF7100000000000000F8",
  "epoch": {
    "conf_ver": 339484,
    "version": 236
  },
  "peers": [
    {
      "id": 59545,
      "store_id": 1,
      "role_name": "Voter"
    },
    {
      "id": 59546,
      "store_id": 5,
      "role_name": "Voter"
    },
    {
      "id": 59547,
      "store_id": 2,
      "role_name": "Voter"
    },
    {
      "id": 791334,
      "store_id": 3001,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    },
    {
      "id": 791340,
      "store_id": 231,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    },
    {
      "id": 791349,
      "store_id": 3002,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    }
  ],
  "leader": {
    "id": 59547,
    "store_id": 2,
    "role_name": "Voter"
  },
  "pending_peers": [
    {
      "id": 791349,
      "store_id": 3002,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    }
  ],
  "cpu_usage": 0,
  "written_bytes": 0,
  "read_bytes": 0,
  "written_keys": 0,
  "read_keys": 0,
  "approximate_size": 111,
  "approximate_keys": 664934
}
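
The region output above is long, so here is a rough sketch of how it could be condensed to the parts that matter for this discussion: which store holds the leader, where the voters and learners sit, and which peers are pending. It assumes the pd-ctl output has been saved as valid JSON with straight quotes; `summarize_region` is a hypothetical helper, and the sample is a trimmed copy of the output above.

```python
import json


def summarize_region(region):
    """Condense a pd-ctl region record into leader, voter, learner and pending info."""
    peers = region.get("peers", [])
    return {
        "leader_store": region.get("leader", {}).get("store_id"),
        "voter_stores": [p["store_id"] for p in peers if p.get("role_name") == "Voter"],
        "learner_stores": [p["store_id"] for p in peers if p.get("role_name") == "Learner"],
        "pending_peers": [p["id"] for p in region.get("pending_peers", [])],
    }


# Trimmed copy of the region 59544 output posted above.
region_json = """
{
  "id": 59544,
  "peers": [
    {"id": 59545, "store_id": 1, "role_name": "Voter"},
    {"id": 59546, "store_id": 5, "role_name": "Voter"},
    {"id": 59547, "store_id": 2, "role_name": "Voter"},
    {"id": 791334, "store_id": 3001, "role_name": "Learner"},
    {"id": 791340, "store_id": 231, "role_name": "Learner"},
    {"id": 791349, "store_id": 3002, "role_name": "Learner"}
  ],
  "leader": {"id": 59547, "store_id": 2, "role_name": "Voter"},
  "pending_peers": [
    {"id": 791349, "store_id": 3002, "role_name": "Learner"}
  ]
}
"""

summary = summarize_region(json.loads(region_json))
for key, value in summary.items():
    print(f"{key}: {value}")
```

For this region the summary shows the leader on store 2, three voters, three learners, and the learner on store 3002 still pending.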

So there are three TiFlash nodes? I see three learners :joy:

The error points at this store ID. Look up that machine's IP, then check whether the errors for the other regions point at the same IP.
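
One way to do that cross-check is to join the `peer_id` from each error line with the `peers` list in the matching pd-ctl region output. This is only a sketch under those assumptions: `store_of_peer` and `stores_for_errors` are hypothetical helper names, and regions whose pd-ctl output has not been collected yet are counted under `None` rather than guessed.

```python
from collections import Counter


def store_of_peer(region, peer_id):
    """Return the store_id hosting peer_id in a pd-ctl region record, or None."""
    for peer in region.get("peers", []):
        if peer["id"] == peer_id:
            return peer["store_id"]
    return None


def stores_for_errors(errors, regions):
    """Count errors per store.

    errors:  iterable of (region_id, peer_id) pairs taken from the log.
    regions: dict mapping region_id to its pd-ctl region record.
    """
    counts = Counter()
    for region_id, peer_id in errors:
        region = regions.get(region_id)
        counts[store_of_peer(region, peer_id) if region else None] += 1
    return counts


# Peers of region 59544, copied from the pd-ctl output in this thread.
regions = {
    59544: {
        "id": 59544,
        "peers": [
            {"id": 59545, "store_id": 1},
            {"id": 59546, "store_id": 5},
            {"id": 59547, "store_id": 2},
        ],
    }
}

# (region_id, peer_id) pairs from the error lines; region 1030 has no
# pd-ctl output in this thread, so it lands under None.
errors = [(59544, 59547), (1030, 1033)]

print(stores_for_errors(errors, regions))
```

Once a store ID stands out, pd-ctl's `store` command for that ID should show its address.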

"store_id": 2,

They point to the same one. How should I handle this?

Check whether this TiKV node has any corrupted SST files:
https://docs.pingcap.com/zh/tidb/v6.5/tikv-control#打印损坏的-sst-文件信息

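tikv-ctl prints the corrupted SST files; just repair them following the suggestions it gives.
If you have time, also run a scan of the machine's disks to make sure they are OK.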

[root@is-pcstore-pro-dc-tidb-01 ~]# tiup ctl:v7.1.0 tikv --data-dir /data/tidb-data/tikv-20160/ bad-ssts --pd 10.194.132.113:2379
Starting component ctl: /root/.tiup/components/ctl/v7.1.0/ctl tikv --data-dir /data/tidb-data/tikv-20160/ bad-ssts --pd 10.194.132.113:2379

start to print bad ssts; data_dir:/data/tidb-data/tikv-20160/; db:/data/tidb-data/tikv-20160/db

corruption analysis has completed

That's all it printed? Nothing else? :joy: