Help: TiDB won't start

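【TiDB environment】Production / Test / PoC
【TiDB version】6.1.7
【Reproduction path】What operations led to the problem
【Problem: symptoms and impact】Another service run by the user misbehaved and filled the disks. The cluster originally had three TiKV nodes: 1.3, 1.4, and 1.5. The disks on 1.4 and 1.5 filled up, so I manually force-scaled-in 1.4 and 1.5. After that the TiDB node went down (unrelated to the cluster; its disk had failed). I replaced the disk, then scaled TiDB in and back out, but TiDB won't start. It still reports connection errors to 1.4 and 1.5, even though those two nodes have already been force-scaled-in. Is there any way to rescue this? Thanks!
【Resource configuration】Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of that page
【Attachments: screenshots / logs / monitoring】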

Post the logs too, buddy.

Try restarting it.

Run tiup cluster display <cluster-name> and check how many TiKV nodes are still online.

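Two of the three nodes are down, so the majority of replicas are gone and no leader can be elected. You will probably need unsafe recovery; try following this:
https://docs.pingcap.com/zh/tidb/stable/online-unsafe-recovery#online-unsafe-recovery-使用文档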
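
To make the majority argument concrete, here is a minimal sketch of the Raft quorum rule. It assumes the default of 3 replicas per Region and that each Region had one replica on each of the three TiKV stores; neither is confirmed in the thread. The function name has_quorum is made up for illustration.

```python
def has_quorum(total_replicas: int, surviving_replicas: int) -> bool:
    """Return True if a strict majority of a Raft group's replicas survive.

    A Raft group can elect a leader only when more than half of its
    replicas are alive.
    """
    majority = total_replicas // 2 + 1
    return surviving_replicas >= majority


# Assumed scenario: 3 replicas per Region, stores 1.4 and 1.5 are gone,
# so only the replica on 1.3 survives.
print(has_quorum(3, 1))  # False: no leader can be elected
print(has_quorum(3, 2))  # True: losing a single store would have been fine
```

Under these assumptions every Region is left with 1 of 3 replicas, which is why normal recovery cannot proceed without forcing the surviving replica to take over.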

[2024/01/23 14:06:50.237 +08:00] [INFO] [region_cache.go:2486] ["[health check] check health error"] [store=192.168.1.4:20160] [error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.4:20160: connect: connection refused""]
[2024/01/23 14:06:51.164 +08:00] [INFO] [region_cache.go:2486] ["[health check] check health error"] [store=192.168.1.5:20160] [error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.5:20160: connect: connection refused""]
[2024/01/23 14:06:51.237 +08:00] [INFO] [region_cache.go:2486] ["[health check] check health error"] [store=192.168.1.4:20160] [error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.4:20160: connect: connection refused""]
[2024/01/23 14:06:52.164 +08:00] [INFO] [region_cache.go:2486] ["[health check] check health error"] [store=192.168.1.5:20160] [error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.5:20160: connect: connection refused""]
[2024/01/23 14:06:52.237 +08:00] [INFO] [region_cache.go:2486] ["[health check] check health error"] [store=192.168.1.4:20160] [error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.4:20160: connect: connection refused""]
[2024/01/23 14:06:53.164 +08:00] [INFO] [region_cache.go:2486] ["[health check] check health error"] [store=192.168.1.5:20160] [error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.5:20160: connect: connection refused""]
[2024/01/23 14:06:53.236 +08:00] [INFO] [region_cache.go:2486] ["[health check] check health error"] [store=192.168.1.4:20160] [error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.4:20160: connect: connection refused""]
[2024/01/23 14:06:54.164 +08:00] [INFO] [region_cache.go:2486] ["[health check] check health error"] [store=192.168.1.5:20160] [error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.5:20160: connect: connection refused""]
[2024/01/23 14:06:54.236 +08:00] [INFO] [region_cache.go:2486] ["[health check] check health error"] [store=192.168.1.4:20160] [error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.4:20160: connect: connection refused""]
[2024/01/23 14:06:55.164 +08:00] [INFO] [region_cache.go:2486] ["[health check] check health error"] [store=192.168.1.5:20160] [error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.5:20160: connect: connection refused""]
[2024/01/23 14:06:55.236 +08:00] [INFO] [region_cache.go:2486] ["[health check] check health error"] [store=192.168.1.4:20160] [error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.4:20160: connect: connection refused""]
[2024/01/23 14:06:56.164 +08:00] [INFO] [region_cache.go:2486] ["[health check] check health error"] [store=192.168.1.5:20160] [error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.5:20160: connect: connection refused""]
[2024/01/23 14:06:56.236 +08:00] [INFO] [region_cache.go:2486] ["[health check] check health error"] [store=192.168.1.4:20160] [error="rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.4:20160: connect: connection refused""]

It's all like this.
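
When a log is this repetitive, it can help to collapse it into a per-store count before posting. The following is a rough sketch, not an official tool: it assumes each line carries a [store=host:port] field as in the excerpt above, and the helper name count_unreachable_stores is invented here. The sample lines are taken from the log above with the error field shortened.

```python
import re
from collections import Counter

# Matches the "[store=host:port]" field seen in the pasted tidb.log lines.
STORE_FIELD = re.compile(r"\[store=([^\]]+)\]")


def count_unreachable_stores(log_lines):
    """Count health-check failures per store address."""
    counts = Counter()
    for line in log_lines:
        if "check health error" not in line:
            continue
        match = STORE_FIELD.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Sample lines from the thread, with the error field shortened for brevity.
sample = [
    "[2024/01/23 14:06:50.237 +08:00] [INFO] [region_cache.go:2486] "
    "[health check] check health error [store=192.168.1.4:20160] [error=connection refused]",
    "[2024/01/23 14:06:51.164 +08:00] [INFO] [region_cache.go:2486] "
    "[health check] check health error [store=192.168.1.5:20160] [error=connection refused]",
    "[2024/01/23 14:06:51.237 +08:00] [INFO] [region_cache.go:2486] "
    "[health check] check health error [store=192.168.1.4:20160] [error=connection refused]",
]

for store, failures in sorted(count_unreachable_stores(sample).items()):
    print(store, failures)
```

Here only the two force-scaled-in stores show up, which matches the symptom described in the first post: TiDB is still trying to reach stores that no longer exist.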

You force-scaled-in two of them? Go with unsafe recovery.
https://docs.pingcap.com/zh/tidb/v6.1/online-unsafe-recovery

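Force scale-in is a dangerous operation. The scale-in command even warns you that data may be lost, and you went ahead anyway.
For now I'd suggest you stop touching anything. The advice everyone has given is sound, but find someone who knows TiDB before you do anything further.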

Consider whether building a fresh cluster and re-syncing the data into it might be faster.

See this column post: "TiKV缩容下线异常处理的三板斧" (three go-to techniques for handling abnormal TiKV scale-in and offlining) | TiDB Community

Did you buy the Enterprise Edition? If so, have official technical support take a look.

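If you have a backup plus incremental data, I'd recommend rebuilding the cluster and restoring; that way you won't lose data. If not, use unsafe recovery as suggested above, which will lose data.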

Was the data directory on the two scaled-in TiKV servers kept?

You're definitely going to lose data.


This is the right answer.

I read that document, and it looks really complicated.

Yes, but data repair is inherently meticulous work.

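This one needs data repair. With three replicas the data won't be lost; once the repair is done, just start the cluster. Whatever operation you perform, make sure the number of TiKV nodes stays >= the replica count.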
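
The ">= replica count" rule can be written as a quick pre-flight check before any scale-in. This is only a sketch of the rule as stated in this reply; the function name and parameters are hypothetical, and the replica count of 3 is an assumed default rather than something confirmed in the thread.

```python
def safe_to_scale_in(tikv_nodes: int, nodes_to_remove: int, replica_count: int = 3) -> bool:
    """Return True if enough TiKV nodes remain to hold every replica.

    Encodes the rule from the reply above: after the operation, the
    number of TiKV nodes must still be >= the replica count.
    """
    return tikv_nodes - nodes_to_remove >= replica_count


# The cluster in this thread: 3 TiKV nodes, 2 removed by force scale-in.
print(safe_to_scale_in(3, 2))  # False
# With 3 nodes and 3 replicas, even removing one node breaks the rule.
print(safe_to_scale_in(3, 1))  # False
# A 5-node cluster could give up two nodes and still satisfy it.
print(safe_to_scale_in(5, 2))  # True
```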

Go for a safe recovery.