The entire cluster suddenly went down and will not start

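【TiDB environment】Production
【TiDB version】6.5
【Reproduction steps】Ran start again; the cluster does not come up
【Problem: symptoms and impact】Every component is in Down status and cannot be brought back up
The error log is as follows: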
[2024/06/21 13:37:37.232 +08:00] [ERROR] [client.go:172] ["region sync with leader meet error"] [error="[PD:grpc:ErrGRPCRecv]rpc error: code = Unavailable desc = server not started: rpc error: code = Unavailable desc = server not started"]
[2024/06/21 13:37:38.233 +08:00] [INFO] [client.go:168] ["server starts to synchronize with leader"] [server=pd-3.3.3.23-2379] [leader=pd-3.3.3.39-2379] [request-index=0]
[2024/06/21 13:37:38.233 +08:00] [ERROR] [client.go:172] ["region sync with leader meet error"] [error="[PD:grpc:ErrGRPCRecv]rpc error: code = Unavailable desc = server not started: rpc error: code = Unavailable desc = server not started"]
[2024/06/21 13:37:39.234 +08:00] [INFO] [client.go:168] ["server starts to synchronize with leader"] [server=pd-3.3.3.23-2379] [leader=pd-3.3.3.39-2379] [request-index=0]
[2024/06/21 13:37:39.235 +08:00] [ERROR] [client.go:172] ["region sync with leader meet error"] [error="[PD:grpc:ErrGRPCRecv]rpc error: code = Unavailable desc = server not started: rpc error: code = Unavailable desc = server not started"]
[2024/06/21 13:37:39.432 +08:00] [INFO] [trace.go:152] ["trace[2085091103] linearizableReadLoop"] [detail="{readStateIndex:63; appliedIndex:64; }"] [duration=118.150015ms] [start=2024/06/21 13:37:39.314 +08:00] [end=2024/06/21 13:37:39.432 +08:00] [steps="["trace[2085091103] 'read index received' (duration: 118.1459ms)","trace[2085091103] 'applied index is now lower than readState.Index' (duration: 3.371µs)"]"]
[2024/06/21 13:37:39.432 +08:00] [WARN] [util.go:163] ["apply request took too long"] [took=118.294517ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:"/tidb/br-stream/info/" range_end:"/tidb/br-stream/info0" revision:37 "] [response="range_response_count:0 size:4"]
[2024/06/21 13:37:39.432 +08:00] [INFO] [trace.go:152] ["trace[1691299179] range"] [detail="{range_begin:/tidb/br-stream/info/; range_end:/tidb/br-stream/info0; response_count:0; response_revision:37; }"] [duration=118.405893ms] [start=2024/06/21 13:37:39.314 +08:00] [end=2024/06/21 13:37:39.432 +08:00] [steps="["trace[1691299179] 'agreement among raft nodes before linearized reading' (duration: 118.270316ms)"]"]
[2024/06/21 13:37:40.177 +08:00] [WARN] [v3_server.go:814] ["waiting for ReadIndex response took too long, retrying"] [sent-request-id=17315373243108170272] [retry-timeout=500ms]
[2024/06/21 13:37:40.235 +08:00] [INFO] [client.go:168] ["server starts to synchronize with leader"] [server=pd-3.3.3.23-2379] [leader=pd-3.3.3.39-2379] [request-index=0]
[2024/06/21 13:37:40.236 +08:00] [ERROR] [client.go:172] ["region sync with leader meet error"] [error="[PD:grpc:ErrGRPCRecv]rpc error: code = Unavailable desc = server not started: rpc error: code = Unavailable desc = server not started"]
[2024/06/21 13:37:40.508 +08:00] [INFO] [trace.go:152] ["trace[1345035771] linearizableReadLoop"] [detail="{readStateIndex:65; appliedIndex:65; }"] [duration=831.915351ms] [start=2024/06/21 13:37:39.676 +08:00] [end=2024/06/21 13:37:40.508 +08:00] [steps="["trace[1345035771] 'read index received' (duration: 831.910612ms)","trace[1345035771] 'applied index is now lower than readState.Index' (duration: 3.848µs)"]"]
[2024/06/21 13:37:40.508 +08:00] [WARN] [util.go:163] ["apply request took too long"] [took=832.109277ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:"/tidb/br-stream/info/" range_end:"/tidb/br-stream/info0" revision:37 "] [response="range_response_count:0 size:4"]
[2024/06/21 13:37:40.508 +08:00] [INFO] [trace.go:152] ["trace[1562924158] range"] [detail="{range_begin:/tidb/br-stream/info/; range_end:/tidb/br-stream/info0; response_count:0; response_revision:37; }"] [duration=832.249636ms] [start=2024/06/21 13:37:39.676 +08:00] [end=2024/06/21 13:37:40.508 +08:00] [steps="["trace[1562924158] 'agreement among raft nodes before linearized reading' (duration: 832.139673ms)"]"]
[2024/06/21 13:37:40.509 +08:00] [WARN] [util.go:163] ["apply request took too long"] [took=500.925554ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:"/pd/7382826522070064087/config" "] [response="range_response_count:1 size:3670"]
[2024/06/21 13:37:40.509 +08:00] [INFO] [trace.go:152] ["trace[2025518191] range"] [detail="{range_begin:/pd/7382826522070064087/config; range_end:; response_count:1; response_revision:37; }"] [duration=501.007594ms] [start=2024/06/21 13:37:40.008 +08:00] [end=2024/06/21 13:37:40.509 +08:00] [steps="["trace[2025518191] 'agreement among raft nodes before linearized reading' (duration: 500.906239ms)"]"]
[2024/06/21 13:37:41.081 +08:00] [INFO] [trace.go:152] ["trace[494172490] linearizableReadLoop"] [detail="{readStateIndex:68; appliedIndex:68; }"] [duration=169.486939ms] [start=2024/06/21 13:37:40.911 +08:00] [end=2024/06/21 13:37:41.081 +08:00] [steps="["trace[494172490] 'read index received' (duration: 169.484512ms)","trace[494172490] 'applied index is now lower than readState.Index' (duration: 1.868µs)"]"]
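
A log excerpt like the one above is easier to triage once the repeated entries are counted. The following is only a rough sketch, not an official TiDB tool: it assumes each entry starts with the "[time] [LEVEL] [file:line] ["message"]" header seen in this excerpt, and the names HEADER and summarize are made up for illustration. The pattern also accepts curly quotes, since pasted logs often pick them up, and wrapped continuation lines are simply skipped.

```python
import re
from collections import Counter

# Header of one log entry as it appears in the excerpt:
# [timestamp] [LEVEL] [file:line] ["message"] ...
# Straight and curly double quotes are both accepted around the message.
HEADER = re.compile(
    r'^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<src>[^\]]+)\] '
    r'\[["“”](?P<msg>[^"“”]*)["“”]\]'
)


def summarize(lines):
    """Count (level, message) pairs; lines that do not match the header are ignored."""
    counts = Counter()
    for line in lines:
        match = HEADER.match(line.strip())
        if match:
            counts[(match.group("level"), match.group("msg"))] += 1
    return counts


# Example use (the file name pd.log is hypothetical):
# with open("pd.log", encoding="utf-8") as f:
#     for (level, msg), n in summarize(f).most_common():
#         print(n, level, msg)
```

On this excerpt, such a count would surface "region sync with leader meet error" as the only ERROR entry, which points the investigation at the PD leader rather than at the slow-read warnings.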

Take a look at this: Column - TiDB Cluster Database Disaster Recovery Manual | TiDB Community

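I would first check, based on the errors, whether there is a fix that gets the cluster to start. Consider a full recovery only as a last resort.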

Does any expert here know why it would suddenly go down?

Run tiup cluster display xxx and check what state your cluster is in right now.

If everything is down, check PD first.

This document is great, very helpful.

The worst-case recovery method is to keep the data and build a new cluster.

How was it resolved?

You could try rebooting the hosts.

Is connectivity between the PD nodes normal? Were any new restrictions added to the network policy?
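
A quick way to rule network restrictions in or out is to test plain TCP reachability of the PD ports from each host. This is only a minimal sketch: it assumes the member names in the log (pd-3.3.3.23-2379, pd-3.3.3.39-2379) encode host and client port, and the helper names can_connect and check_all are made up for illustration. A successful connection shows only that the port is open, not that PD itself is healthy.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_all(endpoints, timeout=3.0):
    """Map each (host, port) pair to its reachability."""
    return {(host, port): can_connect(host, port, timeout) for host, port in endpoints}


# Example use; the addresses are inferred from the member names in the log (an assumption):
# for (host, port), ok in check_all([("3.3.3.23", 2379), ("3.3.3.39", 2379)]).items():
#     print(host, port, "reachable" if ok else "unreachable")
```

Running it from every PD host toward every other PD host, and repeating it for the peer port (2380 by default, if it was not changed), would show whether a newly added firewall rule or network policy blocks traffic in one direction.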

A sudden outage is usually caused by human action. Has anyone changed any configuration on the servers?