【TiDB environment】Production / Test / PoC
【TiDB version】6.1.7
【Reproduction path】What operations were performed before the problem appeared
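【Problem encountered: symptoms and impact】Another service run by the users misbehaved and filled up the disks. The cluster originally had three TiKV nodes: 1.3, 1.4, and 1.5. The disks on 1.4 and 1.5 filled up, so I manually force-scaled-in 1.4 and 1.5. After that the TiDB node went down (unrelated to the cluster; its disk had failed). After replacing the disk I scaled TiDB in and then back out, but TiDB will not start. It still reports connection errors to 1.4 and 1.5, even though both nodes have already been force-scaled-in. Is there any way to rescue this? Thanks!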
【Resource configuration】Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of that page
【Attachments: screenshots / logs / monitoring】
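Post the logs too, buddy.
Try a restart.
Run tiup cluster display <cluster-name> and see how many TiKV nodes are still online.
Two of the three nodes are down, so the majority of replicas are gone and no leader can be elected. You will probably need unsafe recovery; try following this: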
https://docs.pingcap.com/zh/tidb/stable/online-unsafe-recovery#online-unsafe-recovery-使用文档
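The "majority is gone, so no leader can be elected" point above can be worked through as simple arithmetic. The following is only a minimal sketch of the Raft majority rule, assuming 3 replicas per Region (the usual default); the function names are hypothetical and are not part of any TiDB tool.

```python
def majority(replicas: int) -> int:
    """Smallest number of replicas that forms a Raft majority."""
    return replicas // 2 + 1


def has_quorum(replicas: int, alive: int) -> bool:
    """True if enough replicas are alive to elect a leader."""
    return alive >= majority(replicas)


if __name__ == "__main__":
    # Situation in this thread: 3 replicas, only the copy on 1.3 survives.
    print(majority(3))       # 2
    print(has_quorum(3, 1))  # False
```

With one surviving replica out of three, every Region that had copies on both 1.4 and 1.5 is below the majority of 2, which is why a plain restart does not help.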
[2024/01/23 14:06:50.237 +08:00] [INFO] [region_cache.go:2486] [“[health check] check health error”] [store=192.168.1.4:20160] [error=“rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.4:20160: connect: connection refused"”]
[2024/01/23 14:06:51.164 +08:00] [INFO] [region_cache.go:2486] [“[health check] check health error”] [store=192.168.1.5:20160] [error=“rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.5:20160: connect: connection refused"”]
[2024/01/23 14:06:51.237 +08:00] [INFO] [region_cache.go:2486] [“[health check] check health error”] [store=192.168.1.4:20160] [error=“rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.4:20160: connect: connection refused"”]
[2024/01/23 14:06:52.164 +08:00] [INFO] [region_cache.go:2486] [“[health check] check health error”] [store=192.168.1.5:20160] [error=“rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.5:20160: connect: connection refused"”]
[2024/01/23 14:06:52.237 +08:00] [INFO] [region_cache.go:2486] [“[health check] check health error”] [store=192.168.1.4:20160] [error=“rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.4:20160: connect: connection refused"”]
[2024/01/23 14:06:53.164 +08:00] [INFO] [region_cache.go:2486] [“[health check] check health error”] [store=192.168.1.5:20160] [error=“rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.5:20160: connect: connection refused"”]
[2024/01/23 14:06:53.236 +08:00] [INFO] [region_cache.go:2486] [“[health check] check health error”] [store=192.168.1.4:20160] [error=“rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.4:20160: connect: connection refused"”]
[2024/01/23 14:06:54.164 +08:00] [INFO] [region_cache.go:2486] [“[health check] check health error”] [store=192.168.1.5:20160] [error=“rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.5:20160: connect: connection refused"”]
[2024/01/23 14:06:54.236 +08:00] [INFO] [region_cache.go:2486] [“[health check] check health error”] [store=192.168.1.4:20160] [error=“rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.4:20160: connect: connection refused"”]
[2024/01/23 14:06:55.164 +08:00] [INFO] [region_cache.go:2486] [“[health check] check health error”] [store=192.168.1.5:20160] [error=“rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.5:20160: connect: connection refused"”]
[2024/01/23 14:06:55.236 +08:00] [INFO] [region_cache.go:2486] [“[health check] check health error”] [store=192.168.1.4:20160] [error=“rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.4:20160: connect: connection refused"”]
[2024/01/23 14:06:56.164 +08:00] [INFO] [region_cache.go:2486] [“[health check] check health error”] [store=192.168.1.5:20160] [error=“rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.5:20160: connect: connection refused"”]
[2024/01/23 14:06:56.236 +08:00] [INFO] [region_cache.go:2486] [“[health check] check health error”] [store=192.168.1.4:20160] [error=“rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.4:20160: connect: connection refused"”]
It's all lines like these.
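Since the log just repeats the same message, it can help to reduce it to "which stores are unreachable, and how often". Below is a rough sketch that does this with a regular expression; the function name and the sample variable are made up for illustration, and the two sample lines are copied from the log above.

```python
import re
from collections import Counter

# Matches the [store=host:port] field of a health-check log line.
STORE_RE = re.compile(r"\[store=([^\]]+)\]")


def unreachable_stores(log_text: str) -> Counter:
    """Count health-check errors per store address."""
    counts = Counter()
    for line in log_text.splitlines():
        if "check health error" not in line:
            continue
        match = STORE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


SAMPLE = '''[2024/01/23 14:06:50.237 +08:00] [INFO] [region_cache.go:2486] [“[health check] check health error”] [store=192.168.1.4:20160] [error=“rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.4:20160: connect: connection refused"”]
[2024/01/23 14:06:51.164 +08:00] [INFO] [region_cache.go:2486] [“[health check] check health error”] [store=192.168.1.5:20160] [error=“rpc error: code = Unavailable desc = connection error: desc = "transport: Error while dialing dial tcp 192.168.1.5:20160: connect: connection refused"”]'''

if __name__ == "__main__":
    for store, n in sorted(unreachable_stores(SAMPLE).items()):
        print(store, n)
```

On the log posted here this only ever yields 192.168.1.4:20160 and 192.168.1.5:20160, i.e. TiDB is still trying to reach exactly the two stores that were force-scaled-in.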
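Forced scale-in is a very dangerous operation. The tool even warns you that data may be lost when you do it, and you went ahead anyway.
For now I suggest you stop touching it. The suggestions in this thread are not wrong, but find someone who knows TiDB before you operate on it again.
Consider whether building a new cluster and re-syncing the data might be quicker.
Did you buy the Enterprise Edition? If so, ask official technical support to take a look.
If you have a full backup plus incremental backups, I suggest rebuilding and restoring from them; that way no data is lost. If you don't, do what was suggested above and use unsafe recovery, which will lose data.
Were the data directories on the two scaled-in TiKV servers preserved?
You are definitely going to lose data.
I read that document; it looks really complicated.
Yes, but data repair is meticulous work by nature.
This one needs data repair. With three replicas the data should not be lost; once the repair is done, just start the cluster. Whatever operation you perform, always make sure the number of TiKV nodes >= the number of replicas.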
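The rule of thumb above (TiKV nodes >= number of replicas) can be turned into a small pre-check to run before any scale-in. This is only an illustrative sketch: the function name is hypothetical, and the default of 3 replicas is an assumption that should be replaced with the cluster's actual replica setting.

```python
def scale_in_is_safe(tikv_nodes: int, to_remove: int, replicas: int = 3) -> bool:
    """True if the TiKV nodes left after scale-in can still hold every replica."""
    if to_remove < 0 or to_remove > tikv_nodes:
        raise ValueError("to_remove must be between 0 and tikv_nodes")
    return tikv_nodes - to_remove >= replicas


if __name__ == "__main__":
    # The case in this thread: 3 TiKV nodes, 2 removed, 3 replicas.
    print(scale_in_is_safe(3, 2))  # False
    # Adding nodes first (for example 5 in total) leaves room to remove 2.
    print(scale_in_is_safe(5, 2))  # True
```

By this check, the safer order would have been to scale out replacement TiKV nodes first and only then remove the full ones.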
Go for a safe recovery.