TiDB node fails to restart after forcibly scaling in two TiKV nodes

【TiDB Environment】Test
【TiDB Version】7.5.4
【Reproduction Path】Similar issue; the commands executed were:

tiup cluster scale-in tidb-test --node 192.168.8.150:20160 --force
tiup cluster scale-in tidb-test --node 192.168.8.153:20160 --force

【Problem encountered: symptoms and impact】
【Resource configuration】Go to TiDB Dashboard - Cluster Info - Hosts and screenshot that page
【Attachments: screenshots/logs/monitoring】
Cluster status:

When the tidb component is started on its own with the command below, its process does come up, but it keeps logging errors and its status in the cluster stays Down. It looks like the state of the removed TiKV nodes was never synced.

tiup cluster start tidb-test -N 192.168.8.185:4000 --wait-timeout 600

"check bootstrapped failed"] [error="[tikv:9005]Region is unavailable"]

Unavailable desc = connection error: desc = "transport: Error while dialing: dial tcp 192.168.8.150:20160: connect: no route to host\

What exactly do you need now?

Judging by your scale-in commands, you didn't wait for data migration to finish before removing the nodes, so for some data two of the three replicas were lost.
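
To confirm that, you can ask PD which Regions had a majority of their replicas on the removed stores. A sketch using the jq filter from the official docs, assuming for illustration that the two force-removed TiKV nodes had store IDs 2 and 5 (substitute the real store IDs):

 tiup ctl:v7.5.4 pd -u 192.168.8.200:2379 region --jq='.regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length as $total | map(if .==(2,5) then . else empty end) | length >= $total-length)}'

Any Region this prints has lost its Raft majority and cannot elect a leader, which matches the "Region is unavailable" errors in the TiDB log above.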

I just need to get the cluster usable again.

That's quite possible.

When scaling in, it's best to wait patiently until the scale-in actually completes.


 $ tiup ctl:v7.5.4 pd -u 192.168.8.200:2379 store                             
Starting component ctl: /Users/wangqing/.tiup/components/ctl/v7.5.4/ctl pd -u 192.168.8.200:2379 store
{
  "count": 5,
  "stores": [
    {
      "store": {
        "id": 60,
        "address": "192.168.8.200:3930",
        "labels": [
          {
            "key": "engine",
            "value": "tiflash"
          }
        ],
        "version": "v6.1.0",
        "peer_address": "192.168.8.200:20170",
        "status_address": "192.168.8.200:20292",
        "git_hash": "ebf7ce6d9fb4090011876352fe26b89668cbedc4",
        "start_timestamp": 1708510112,
        "deploy_path": "/opt/tidb-deploy/tiflash-9000/bin/tiflash",
        "last_heartbeat": 1708760881403697293,
        "state_name": "Offline"
      },
      "status": {
        "capacity": "0B",
        "available": "0B",
        "used_size": "0B",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 28161,
        "region_weight": 1,
        "region_score": 33950,
        "region_size": 33950,
        "learner_count": 28161,
        "start_ts": "2024-02-21T18:08:32+08:00",
        "last_heartbeat_ts": "2024-02-24T15:48:01.403697293+08:00",
        "uptime": "69h39m29.403697293s"
      }
    },
    {
      "store": {
        "id": 61,
        "address": "192.168.8.238:3930",
        "labels": [
          {
            "key": "engine",
            "value": "tiflash"
          }
        ],
        "version": "v6.1.0",
        "peer_address": "192.168.8.238:20170",
        "status_address": "192.168.8.238:20292",
        "git_hash": "ebf7ce6d9fb4090011876352fe26b89668cbedc4",
        "start_timestamp": 1708508386,
        "deploy_path": "/opt/tidb-deploy/tiflash-9000/bin/tiflash",
        "last_heartbeat": 1708756050565490261,
        "state_name": "Offline"
      },
      "status": {
        "capacity": "0B",
        "available": "0B",
        "used_size": "0B",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 28161,
        "region_weight": 1,
        "region_score": 33950,
        "region_size": 33950,
        "learner_count": 28161,
        "start_ts": "2024-02-21T17:39:46+08:00",
        "last_heartbeat_ts": "2024-02-24T14:27:30.565490261+08:00",
        "uptime": "68h47m44.565490261s"
      }
    },
    {
      "store": {
        "id": 447021,
        "address": "192.168.8.238:20160",
        "version": "7.5.4",
        "peer_address": "192.168.8.238:20160",
        "status_address": "192.168.8.238:20180",
        "git_hash": "b4bddeeb995e7bedc1973ce9e856eeb2d856ce9b",
        "start_timestamp": 1731047784,
        "deploy_path": "/opt/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1731048072441438483,
        "state_name": "Up"
      },
      "status": {
        "capacity": "7.972TiB",
        "available": "2.51TiB",
        "used_size": "65.39GiB",
        "leader_count": 19334,
        "leader_weight": 1,
        "leader_score": 19334,
        "leader_size": 153744,
        "region_count": 64053,
        "region_weight": 1,
        "region_score": 500858.5907984277,
        "region_size": 465200,
        "slow_score": 1,
        "slow_trend": {
          "cause_value": 500000,
          "cause_rate": 0,
          "result_value": 1.5,
          "result_rate": 0
        },
        "is_busy": true,
        "start_ts": "2024-11-08T14:36:24+08:00",
        "last_heartbeat_ts": "2024-11-08T14:41:12.441438483+08:00",
        "uptime": "4m48.441438483s"
      }
    },
    {
      "store": {
        "id": 475016,
        "address": "192.168.8.223:20160",
        "version": "7.5.4",
        "peer_address": "192.168.8.223:20160",
        "status_address": "192.168.8.223:20180",
        "git_hash": "b4bddeeb995e7bedc1973ce9e856eeb2d856ce9b",
        "start_timestamp": 1731047784,
        "deploy_path": "/opt/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1731048071058775934,
        "state_name": "Up"
      },
      "status": {
        "capacity": "7.273TiB",
        "available": "1.684TiB",
        "used_size": "63.67GiB",
        "leader_count": 32885,
        "leader_weight": 1,
        "leader_score": 32885,
        "leader_size": 222111,
        "region_count": 63850,
        "region_weight": 1,
        "region_score": 522192.3519421916,
        "region_size": 465200,
        "slow_score": 1,
        "slow_trend": {
          "cause_value": 500000,
          "cause_rate": 0,
          "result_value": 0,
          "result_rate": 0
        },
        "is_busy": true,
        "start_ts": "2024-11-08T14:36:24+08:00",
        "last_heartbeat_ts": "2024-11-08T14:41:11.058775934+08:00",
        "uptime": "4m47.058775934s"
      }
    },
    {
      "store": {
        "id": 476374,
        "address": "192.168.8.243:20160",
        "version": "7.5.4",
        "peer_address": "192.168.8.243:20160",
        "status_address": "192.168.8.243:20180",
        "git_hash": "b4bddeeb995e7bedc1973ce9e856eeb2d856ce9b",
        "start_timestamp": 1731047786,
        "deploy_path": "/opt/tidb-deploy/tikv-20160/bin",
        "last_heartbeat": 1731048073811981885,
        "state_name": "Up"
      },
      "status": {
        "capacity": "1.772TiB",
        "available": "770GiB",
        "used_size": "59.27GiB",
        "leader_count": 10565,
        "leader_weight": 1,
        "leader_score": 10565,
        "leader_size": 89345,
        "region_count": 63493,
        "region_weight": 1,
        "region_score": 591545.8411216858,
        "region_size": 465200,
        "slow_score": 1,
        "slow_trend": {
          "cause_value": 500000,
          "cause_rate": 0,
          "result_value": 0,
          "result_rate": 0
        },
        "is_busy": true,
        "start_ts": "2024-11-08T14:36:26+08:00",
        "last_heartbeat_ts": "2024-11-08T14:41:13.811981885+08:00",
        "uptime": "4m47.811981885s"
      }
    }
  ]
}
  1. Taking nodes offline with --force is not recommended, especially several at once; that is guaranteed to break the multi-replica protection.
  2. On newer versions, try repairing it with Online Unsafe Recovery: https://docs.pingcap.com/zh/tidb/stable/online-unsafe-recovery (a sketch of the workflow follows below).
    pd-ctl -u <pd_addr> unsafe remove-failed-stores --auto-detect
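
A minimal sketch of that workflow through tiup ctl, assuming PD is still reachable at 192.168.8.200:2379 as in the store output above; <store_id1>,<store_id2> are placeholders for the IDs of the two force-removed TiKV stores:

 # let PD auto-detect which stores are permanently gone and evict them from all Regions
 tiup ctl:v7.5.4 pd -u 192.168.8.200:2379 unsafe remove-failed-stores --auto-detect

 # or name the failed store IDs explicitly
 tiup ctl:v7.5.4 pd -u 192.168.8.200:2379 unsafe remove-failed-stores <store_id1>,<store_id2>

 # check the recovery progress
 tiup ctl:v7.5.4 pd -u 192.168.8.200:2379 unsafe remove-failed-stores show

Keep in mind that writes which only reached the lost replicas may not be recoverable; this restores the surviving data to a serviceable state.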

Also, why is your TiFlash version 6.1.0?

--force :cold_sweat:

That one has already been removed; we never actually used the feature, we only wanted to try it out at some point.

All that's left is to try the "three tricks"... See the column article: 专栏 - TiKV缩容下线异常处理的三板斧 | TiDB 社区


Give it a try next time; this new feature exists precisely for quickly repairing the loss of a majority of replicas.

With 5 machines and 3 replicas, forcibly scaling in 2 nodes leaves some Regions with only 1 replica;
the only way to recover is unsafe recovery.
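
If you want to gauge how many Regions are degraded (and verify they are healthy again after the repair), pd-ctl has built-in checks; a sketch against this cluster's PD:

 # Regions with fewer peers than the configured replica count
 tiup ctl:v7.5.4 pd -u 192.168.8.200:2379 region check miss-peer

 # Regions that still report down peers
 tiup ctl:v7.5.4 pd -u 192.168.8.200:2379 region check down-peer

 # the configured replica count (3 by default)
 tiup ctl:v7.5.4 pd -u 192.168.8.200:2379 config show replication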

This is your own test environment, right? Just rebuild it, and don't force scale-in again :joy:

--force is not recommended, especially scaling in 2 nodes at once; it's better to scale in only 1 node at a time. During a normal scale-in, Offline is just an intermediate state of the process; taking a node down takes some time, and only when the node's state changes to Tombstone has the scale-in succeeded.
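
For reference, a sketch of the normal flow for removing one TiKV node from this cluster without --force:

 # scale in a single node and let PD migrate its leaders and Regions away
 tiup cluster scale-in tidb-test --node 192.168.8.150:20160

 # watch the store state go Up -> Offline -> Tombstone (this can take a long time on large stores)
 tiup cluster display tidb-test
 tiup ctl:v7.5.4 pd -u 192.168.8.200:2379 store

 # only after it shows Tombstone, clean up the metadata
 tiup cluster prune tidb-test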

We suddenly had to move the machines away for production use :smiling_face_with_tear:

Then you're done for, bro :clown_face:

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.