BR restore failed, TiKV node reports an error

Hi

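The cluster version is v4.0.1, and a restore using v4.0.16 reports that it failed. After investigating, I found that the TiKV node keeps trying to restart itself but never manages to come up. The exact error is in the TiKV error log below. Has anyone run into this problem? How did you solve it? Thanks!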

BR error:
Error: rpc error: code = Unavailable desc = transport is closing; rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp xxxxx:21164: i/o timeout"; rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp xxxxx:21164: i/o timeout"; rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp xxxxx:21164: i/o timeout"; rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp xxxxx:21164: i/o timeout"; rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp xxxxx:21164: i/o timeout"; rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp xxxxx:21164: i/o timeout"; rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp xxxxx:21164: i/o timeout"

TiKV node error log:

[2022/06/21 09:29:25.964 +08:00] [INFO] [raw_node.rs:222] ["RawNode created with id 1216662."] [id=1216662] [raft_id=1216662] [region_id=1216172]
[2022/06/21 09:29:25.964 +08:00] [INFO] [peer.rs:158] ["create peer"] [peer_id=1216781] [region_id=1216340]
[2022/06/21 09:29:25.964 +08:00] [INFO] [raft.rs:783] ["became follower at term 9"] [term=9] [raft_id=1216781] [region_id=1216340]
[2022/06/21 09:29:25.964 +08:00] [INFO] [raft.rs:285] [newRaft] [peers="[(1216781, Progress { matched: 20, next_idx: 21, state: Probe, paused: false, pending_snapshot: 0, pending_request_snapshot: 0, recent_active: false, ins: Inflights { start: 0, count: 0, buffer: [] } }), (1216782, Progress { matched: 0, next_idx: 21, state: Probe, paused: false, pending_snapshot: 0, pending_request_snapshot: 0, recent_active: false, ins: Inflights { start: 0, count: 0, buffer: [] } }), (1216783, Progress { matched: 0, next_idx: 21, state: Probe, paused: false, pending_snapshot: 0, pending_request_snapshot: 0, recent_active: false, ins: Inflights { start: 0, count: 0, buffer: [] } })]"] ["last term"=9] ["last index"=20] [applied=20] [commit=20] [term=9] [raft_id=1216781] [region_id=1216340]
[2022/06/21 09:29:25.964 +08:00] [INFO] [raw_node.rs:222] ["RawNode created with id 1216781."] [id=1216781] [raft_id=1216781] [region_id=1216340]
[2022/06/21 09:29:25.964 +08:00] [INFO] [peer.rs:158] ["create peer"] [peer_id=1216876] [region_id=1216472]
[2022/06/21 09:29:25.964 +08:00] [INFO] [raft.rs:783] ["became follower at term 7"] [term=7] [raft_id=1216876] [region_id=1216472]
[2022/06/21 09:29:25.964 +08:00] [INFO] [raft.rs:285] [newRaft] [peers="[(1216876, Progress { matched: 18, next_idx: 19, state: Probe, paused: false, pending_snapshot: 0, pending_request_snapshot: 0, recent_active: false, ins: Inflights { start: 0, count: 0, buffer: [] } }), (1216877, Progress { matched: 0, next_idx: 19, state: Probe, paused: false, pending_snapshot: 0, pending_request_snapshot: 0, recent_active: false, ins: Inflights { start: 0, count: 0, buffer: [] } }), (1216878, Progress { matched: 0, next_idx: 19, state: Probe, paused: false, pending_snapshot: 0, pending_request_snapshot: 0, recent_active: false, ins: Inflights { start: 0, count: 0, buffer: [] } })]"] ["last term"=7] ["last index"=18] [applied=18] [commit=18] [term=7] [raft_id=1216876] [region_id=1216472]
[2022/06/21 09:29:25.964 +08:00] [INFO] [raw_node.rs:222] ["RawNode created with id 1216876."] [id=1216876] [raft_id=1216876] [region_id=1216472]
[2022/06/21 09:29:25.964 +08:00] [INFO] [peer.rs:158] ["create peer"] [peer_id=1217557] [region_id=1217085]
[2022/06/21 09:29:25.964 +08:00] [FATAL] [server.rs:576] ["failed to start node: EngineTraits(Other("[components/raftstore/src/store/fsm/store.rs:812]: \"[components/raftstore/src/store/peer_storage.rs:578]: [region 1217085] 1217557 validate state fail: Other(\\\"[components/raftstore/src/store/peer_storage.rs:455]: log at recorded commit index [8] 17 doesn\\\\\\\'t exist, may lose data\\\")\""))"]
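
For a long tikv.log, a small filter may make it easier to find which regions are involved in the failing entries. The following is only a minimal sketch, assuming the log keeps the `[timestamp] [LEVEL] [file:line] ...` layout shown above; the function name `fatal_entries` and the regular expressions are illustrative and not part of any TiKV tooling.

```python
import re

# Assumed layout, taken from the log excerpt above:
#   [timestamp] [LEVEL] [file:line] [message] [key=value] ...
HEADER = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<src>[^\]]+)\] (?P<rest>.*)$"
)
# Region ids show up either as "[region 1217085]" or as "region_id=1217085".
REGION = re.compile(r"\[region (\d+)\]|region_id=(\d+)")


def fatal_entries(lines, levels=("FATAL", "ERROR")):
    """Return one dict per log line whose level is in `levels`.

    Each dict holds the timestamp, the source location, the sorted list of
    region ids mentioned on that line, and the remaining text of the line.
    Lines that do not match the assumed layout are skipped.
    """
    entries = []
    for line in lines:
        match = HEADER.match(line.strip())
        if match is None or match.group("level") not in levels:
            continue
        rest = match.group("rest")
        regions = sorted(
            {int(a or b) for a, b in REGION.findall(rest)}
        )
        entries.append(
            {
                "ts": match.group("ts"),
                "source": match.group("src"),
                "regions": regions,
                "text": rest,
            }
        )
    return entries


# Example usage (the path is a placeholder for your own log file):
#   with open("tikv.log", encoding="utf-8", errors="replace") as fh:
#       for entry in fatal_entries(fh):
#           print(entry["ts"], entry["source"], entry["regions"])
```

Run against the excerpt above, this would report the single FATAL line from server.rs:576 together with region 1217085, which is the region whose state validation fails.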

Did the restore fail partway through, or did it fail right at the start? Is there any problem with the network?

The restore started without any problem; this error appeared partway through. I checked the network and did not see anything wrong.
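
Since the BR error is an `i/o timeout` while dialing the TiKV port, one way to double-check reachability from the BR host is a plain TCP connect test. This is only a rough sketch: `can_connect` is a hypothetical helper, and the host and port in the usage comment are placeholders for the real TiKV address. A successful connect only shows that the port is open, not that TiKV is healthy; with a TiKV process that keeps crashing on startup, the port would be expected to be unreachable even when the network itself is fine.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


# Example usage (replace the placeholder with the real TiKV address):
#   print(can_connect("tikv-host-placeholder", 21164))
```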

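  1. https://docs.pingcap.com/zh/tidb/stable/backup-and-restore-faq#br-遇到错误信息-rpc-error-code--unavailable-desc-该如何处理 Does the new cluster have enough resources?
  2. Are you using BR to restore a v4.0.1 backup into v4.0.16? You could first try a BR restore into a v4.0.1 cluster and see whether that succeeds, then upgrade to v4.0.16.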

Judging from this log, the Raft log seems to be corrupted. Maybe ask in the TiKV section next door what is going on?