How to rejoin a TiKV node to the cluster after it has been offline for 2 months?

【TiKV Environment】Production
【TiKV Version】v8.1.1
【Reproduction Steps】One TiKV node went offline for an unknown reason; two months later we noticed a large data-size gap between it and the other stores.
【Resource Configuration】
Three 48-core / 128 GB physical machines, each co-deploying one TiKV and one PD instance (on the affected machine only TiKV went down; PD is still healthy).
The two healthy TiKV nodes are fairly even at about 183 GB each, while the offline one sits at 393 GB.
【Store Info】

{
  "count": 3,
  "stores": [
    {
      "store": {
        "id": 1005,
        "address": "0.0.0.101:20160",
        "version": "8.1.1",
        "peer_address": "0.0.0.101:20160",
        "status_address": "0.0.0.101:20180",
        "git_hash": "7793f1d5dc40206fe406ca001be1e0d7f1b83a8f",
        "start_timestamp": 1751173377,
        "deploy_path": "/",
        "last_heartbeat": 1766821769530714325,
        "state_name": "Up"
      },
      "status": {
        "capacity": "1.345TiB",
        "available": "1.108TiB",
        "used_size": "139.5GiB",
        "leader_count": 3481,
        "leader_weight": 1,
        "leader_score": 3481,
        "leader_size": 7099016,
        "region_count": 8244,
        "region_weight": 1,
        "region_score": 21506574.567600116,
        "region_size": 17440914,
        "slow_score": 1,
        "slow_trend": {
          "cause_value": 250013.37751677854,
          "cause_rate": 0,
          "result_value": 35590.5,
          "result_rate": -55456.687326754385
        },
        "start_ts": "2025-06-29T13:02:57+08:00",
        "last_heartbeat_ts": "2025-12-27T15:49:29.530714325+08:00",
        "uptime": "4346h46m32.530714325s"
      }
    },
    {
      "store": {
        "id": 1004,
        "address": "0.0.0.102:20160",
        "version": "8.1.1",
        "peer_address": "0.0.0.102:20160",
        "status_address": "0.0.0.102:20180",
        "git_hash": "7793f1d5dc40206fe406ca001be1e0d7f1b83a8f",
        "start_timestamp": 1751173380,
        "deploy_path": "/",
        "last_heartbeat": 1760512413898838807,
        "state_name": "Down"
      },
      "status": {
        "capacity": "392.6GiB",
        "available": "0B",
        "used_size": "90.09GiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 8244,
        "region_weight": 1,
        "region_score": 10085360052.449165,
        "region_size": 17440914,
        "slow_score": 75,
        "slow_trend": {
          "cause_value": 250011.18456375838,
          "cause_rate": 0,
          "result_value": 3.5,
          "result_rate": 0
        },
        "start_ts": "2025-06-29T13:03:00+08:00",
        "last_heartbeat_ts": "2025-10-15T15:13:33.898838807+08:00",
        "uptime": "2594h10m33.898838807s"
      }
    },
    {
      "store": {
        "id": 1001,
        "address": "0.0.0.103:20160",
        "version": "8.1.1",
        "peer_address": "0.0.0.103:20160",
        "status_address": "0.0.0.103:20180",
        "git_hash": "7793f1d5dc40206fe406ca001be1e0d7f1b83a8f",
        "start_timestamp": 1751173381,
        "deploy_path": "/",
        "last_heartbeat": 1766821770487281780,
        "state_name": "Up"
      },
      "status": {
        "capacity": "1.345TiB",
        "available": "1.109TiB",
        "used_size": "138.7GiB",
        "leader_count": 4763,
        "leader_weight": 1,
        "leader_score": 4763,
        "leader_size": 10341898,
        "region_count": 8244,
        "region_weight": 1,
        "region_score": 21504760.077404324,
        "region_size": 17440914,
        "slow_score": 1,
        "slow_trend": {
          "cause_value": 250018.2701342282,
          "cause_rate": 0,
          "result_value": 43229,
          "result_rate": -99249.85309282198
        },
        "start_ts": "2025-06-29T13:03:01+08:00",
        "last_heartbeat_ts": "2025-12-27T15:49:30.48728178+08:00",
        "uptime": "4346h46m29.48728178s"
      }
    }
  ]
}
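
For reference, the listing above is standard pd-ctl output; a minimal sketch of how to pull it, with the PD address as a placeholder:

# Query store status via pd-ctl (PD address is a placeholder; adjust to your cluster)
tiup ctl:v8.1.1 pd -u http://<pd-host>:2379 store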

【Region Info】

"regions": [
        {
            "id": 336442013,
            "start_key": "6A66732D70726F64FFFD4112470B010000FF0000430000000000FE",
            "end_key": "6A66732D70726F64FFFD41125B39050000FF0000430000000700FE",
            "epoch": {
                "conf_ver": 5,
                "version": 102
            },
            "peers": [
                {
                    "role_name": "Voter",
                    "id": 336442014,
                    "store_id": 1001
                },
                {
                    "role_name": "Voter",
                    "id": 336442015,
                    "store_id": 1004
                },
                {
                    "role_name": "Voter",
                    "id": 336442016,
                    "store_id": 1005
                }
            ],
            "leader": {
                "role_name": "Voter",
                "id": 336442014,
                "store_id": 1001
            },
            "down_peers": [
                {
                    "peer": {
                        "role_name": "Voter",
                        "id": 336442015,
                        "store_id": 1004
                    },
                    "down_seconds": 6296086
                }
            ],
            "pending_peers": [
                {
                    "role_name": "Voter",
                    "id": 336442015,
                    "store_id": 1004
                }
            ],
            "cpu_usage": 0,
            "written_bytes": 3398,
            "read_bytes": 88877,
            "written_keys": 60,
            "read_keys": 7,
            "approximate_size": 94,
            "approximate_keys": 609756
        },
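
A listing like this can be pulled straight from pd-ctl as well; a minimal sketch, again with the PD address as a placeholder:

# Regions that still report a down peer (PD address is a placeholder)
tiup ctl:v8.1.1 pd -u http://<pd-host>:2379 region check down-peer
# Region peers hosted on the down store (store id 1004)
tiup ctl:v8.1.1 pd -u http://<pd-host>:2379 region store 1004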

Is there a recommended way to recover this node?

Thanks for the reply. Could you describe the steps in more detail?

I came across the blog post below. My case should count as an unplanned outage where the Raft majority is still intact, right? Can I simply bring the TiKV process back up?
Blog - TiKV存储节点计划内外停机,如何去处理? | TiDB Community

Just bring it back up directly.
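
A minimal sketch of what "bringing it back up" could look like with tiup; the cluster name is a placeholder and 0.0.0.102:20160 is the down instance from the store listing above:

# Start only the down TiKV instance (cluster name is a placeholder)
tiup cluster start <cluster-name> --node 0.0.0.102:20160
# Confirm the store returns to Up
tiup cluster display <cluster-name>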

With exactly the same configuration?

Thanks for the answer. Is there any risk in doing that on a production cluster? Alternatively, could I use pd-ctl to delete the down store, back up its 393G of data, and then start a fresh TiKV node and add it to the cluster?

From what I can see the regions have already been rebalanced, so deleting the store shouldn't be much of a burden, right? Then I only need to think about the rebalancing after the new node joins?
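
If it helps, a quick way to double-check what the down store still holds before deleting it; a sketch, with the PD address as a placeholder:

# How many region peers store 1004 still reports
tiup ctl:v8.1.1 pd -u http://<pd-host>:2379 store 1004
# Regions that still carry down or pending peers
tiup ctl:v8.1.1 pd -u http://<pd-host>:2379 region check down-peer
tiup ctl:v8.1.1 pd -u http://<pd-host>:2379 region check pending-peer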

Yes, exactly the same.

How did it take two months to notice the node was down? That's a bit worrying.

The down store no longer holds any leaders, so you can simply scale it in with tiup and then scale this node back out.
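
A sketch of that scale-in / scale-out flow; the cluster name and topology file are placeholders:

# Scale in the down TiKV instance (cluster name is a placeholder)
tiup cluster scale-in <cluster-name> --node 0.0.0.102:20160
# The store turns Tombstone once its region peers have been relocated; then clean it up
tiup cluster prune <cluster-name>
# Scale the node back out with a topology file describing the new TiKV instance
tiup cluster scale-out <cluster-name> scale-out.yaml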

It may be that after the node went offline, PD could not promptly clean up the stale data on it. Have you considered backing up the data, wiping the original TiKV node, and then adding a new TiKV node?
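
If you want to keep a copy of the old data before wiping the node, a simple sketch; the data directory path is a placeholder (check the cluster topology for the real data_dir):

# On the failed host: archive the old TiKV data directory before clearing it
tar czf /backup/tikv-store-1004-$(date +%F).tar.gz <tikv-data-dir>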

It feels like some configuration problem caused this.