Scaling in a TiKV node with tiup: the TiKV node stays in the Offline state

Because the cluster is going from 4 TiKV nodes to 3, the replicas on store 116845 need to be re-created on other stores before the node can finish going offline. This is expected behavior.
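You can watch the progress from PD, for example (a minimal sketch; the PD endpoint here is the one used elsewhere in this thread):

tiup ctl pd -u http://10.59.111.225:2379 store 116845                # remaining region_count / leader_count on the offline store
tiup ctl pd -u http://10.59.111.225:2379 region check offline-peer   # regions that still have a peer on an offline store

Once region_count on store 116845 drops to 0, the store turns to Tombstone and the scale-in completes.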

It has been 50 minutes and the node is still in the Offline state.

Please describe exactly which method you used to take the node offline.

1. Run tiup ctl pd -u http://172.16.4.107:2379 config show and share the output.
tiup clt pd -u http://10.59.111.225:2379 config show
Error: component `clt` does not support `linux/amd64` (see `tiup list --refresh`)

config show output:

» config show
{
  "replication": {
    "enable-placement-rules": "true",
    "location-labels": "",
    "max-replicas": 3,
    "strictly-match-label": "false"
  },
  "schedule": {
    "enable-cross-table-merge": "false",
    "enable-debug-metrics": "false",
    "enable-location-replacement": "true",
    "enable-make-up-replica": "true",
    "enable-one-way-merge": "false",
    "enable-remove-down-replica": "true",
    "enable-remove-extra-replica": "true",
    "enable-replace-offline-replica": "true",
    "high-space-ratio": 0.7,
    "hot-region-cache-hits-threshold": 3,
    "hot-region-schedule-limit": 16,
    "leader-schedule-limit": 16,
    "leader-schedule-policy": "count",
    "low-space-ratio": 0.8,
    "max-merge-region-keys": 200000,
    "max-merge-region-size": 20,
    "max-pending-peer-count": 16,
    "max-snapshot-count": 3,
    "max-store-down-time": "30m0s",
    "merge-schedule-limit": 8,
    "patrol-region-interval": "100ms",
    "region-schedule-limit": 2048,
    "replica-schedule-limit": 64,
    "scheduler-max-waiting-operator": 5,
    "split-merge-interval": "1h0m0s",
    "store-balance-rate": 15,
    "store-limit-mode": "manual",
    "tolerant-size-ratio": 0
  }
}

Please describe exactly which method you used to take the node offline.

The offline method currently in use is as described above; let's keep observing for now.

I have tried all of the methods above; none of them worked.

Please share the output of pd-ctl store, as well as the tikv.log of the TiKV node whose service was stopped.
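For reference, something like the following should collect both (a sketch; the log path is an assumption based on the default tiup deploy layout):

tiup ctl pd -u http://10.59.111.225:2379 store    # full store list in JSON
# tikv.log normally sits under the node's deploy directory, e.g. /data/tidb_deploy/tikv-20160/log/tikv.log (assumed path)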

1. store output:

{
  "count": 5,
  "stores": [
    {
      "store": {
        "id": 1,
        "address": "10.59.111.132:20160",
        "version": "4.0.0-rc.2",
        "status_address": "10.59.111.132:20180",
        "git_hash": "2fdb2804bf8ffaab4b18c4996970e19906296497",
        "start_timestamp": 1591929864,
        "deploy_path": "/data/tidb_deploy/tikv-20160/bin",
        "last_heartbeat": 1591943553978873939,
        "state_name": "Up"
      },
      "status": {
        "capacity": "639.7GiB",
        "available": "115GiB",
        "used_size": "486.7GiB",
        "leader_count": 9732,
        "leader_weight": 1,
        "leader_score": 9732,
        "leader_size": 713339,
        "region_count": 29185,
        "region_weight": 1,
        "region_score": 645777972.8961549,
        "region_size": 2139056,
        "start_ts": "2020-06-12T10:44:24+08:00",
        "last_heartbeat_ts": "2020-06-12T14:32:33.978873939+08:00",
        "uptime": "3h48m9.978873939s"
      }
    },
    {
      "store": {
        "id": 4,
        "address": "10.59.111.133:20160",
        "version": "4.0.0-rc.2",
        "status_address": "10.59.111.133:20180",
        "git_hash": "2fdb2804bf8ffaab4b18c4996970e19906296497",
        "start_timestamp": 1591929878,
        "deploy_path": "/data/tidb_deploy/tikv-20160/bin",
        "last_heartbeat": 1591943552741459627,
        "state_name": "Up"
      },
      "status": {
        "capacity": "639.7GiB",
        "available": "116.1GiB",
        "used_size": "488GiB",
        "leader_count": 9723,
        "leader_weight": 1,
        "leader_score": 9723,
        "leader_size": 713445,
        "region_count": 29185,
        "region_weight": 1,
        "region_score": 636551846.4256988,
        "region_size": 2139056,
        "start_ts": "2020-06-12T10:44:38+08:00",
        "last_heartbeat_ts": "2020-06-12T14:32:32.741459627+08:00",
        "uptime": "3h47m54.741459627s"
      }
    },
    {
      "store": {
        "id": 5,
        "address": "10.59.111.224:20160",
        "version": "4.0.0-rc.2",
        "status_address": "10.59.111.224:20180",
        "git_hash": "2fdb2804bf8ffaab4b18c4996970e19906296497",
        "start_timestamp": 1591929878,
        "deploy_path": "/data/tidb_deploy/tikv-20160/bin",
        "last_heartbeat": 1591943553716831978,
        "state_name": "Up"
      },
      "status": {
        "capacity": "639.7GiB",
        "available": "117.8GiB",
        "used_size": "487.9GiB",
        "leader_count": 9730,
        "leader_weight": 1,
        "leader_score": 9730,
        "leader_size": 712272,
        "region_count": 29185,
        "region_weight": 1,
        "region_score": 622396812.477663,
        "region_size": 2139056,
        "start_ts": "2020-06-12T10:44:38+08:00",
        "last_heartbeat_ts": "2020-06-12T14:32:33.716831978+08:00",
        "uptime": "3h47m55.716831978s"
      }
    },
    {
      "store": {
        "id": 46,
        "address": "10.59.111.10:3930",
        "labels": [
          {
            "key": "engine",
            "value": "tiflash"
          }
        ],
        "version": "v4.0.0-rc.2",
        "peer_address": "10.59.111.10:20170",
        "status_address": "10.59.111.10:20292",
        "git_hash": "09bd9e6b9a271b1fcd25c676083104a97f18739a",
        "start_timestamp": 1591845577,
        "last_heartbeat": 1591943550584551746,
        "state_name": "Up"
      },
      "status": {
        "capacity": "200GiB",
        "available": "180.3GiB",
        "used_size": "474.6KiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 0,
        "region_weight": 1,
        "region_score": 0,
        "region_size": 0,
        "start_ts": "2020-06-11T11:19:37+08:00",
        "last_heartbeat_ts": "2020-06-12T14:32:30.584551746+08:00",
        "uptime": "27h12m53.584551746s"
      }
    },
    {
      "store": {
        "id": 116845,
        "address": "10.59.111.10:20160",
        "state": 1,
        "version": "4.0.0-rc.2",
        "status_address": "10.59.111.10:20180",
        "git_hash": "2fdb2804bf8ffaab4b18c4996970e19906296497",
        "start_timestamp": 1591941751,
        "deploy_path": "/data/tidb_deploy/tikv-20160/bin",
        "last_heartbeat": 1591942591789709669,
        "state_name": "Offline"
      },
      "status": {
        "capacity": "200GiB",
        "available": "179.9GiB",
        "used_size": "364MiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 10,
        "region_weight": 1,
        "region_score": 799,
        "region_size": 799,
        "start_ts": "2020-06-12T14:02:31+08:00",
        "last_heartbeat_ts": "2020-06-12T14:16:31.789709669+08:00",
        "uptime": "14m0.789709669s"
      }
    }
  ]
}

2. Log of the Offline TiKV node:

tikv.tar.gz (125.8 KB)

Thanks for the information. Looking at the current TiKV log, here are some of the key entries:

[2020/06/12 14:16:40.158 +08:00] [ERROR] [kv.rs:613] ["KvService::batch_raft send response fail"] [err=RemoteStopped]
[2020/06/12 14:16:40.158 +08:00] [ERROR] [kv.rs:613] ["KvService::batch_raft send response fail"] [err=RemoteStopped]
[2020/06/12 14:16:40.158 +08:00] [ERROR] [kv.rs:613] ["KvService::batch_raft send response fail"] [err=RemoteStopped]

Our preliminary assessment is that this is a bug in the v4.0.0-rc.* versions: the raft message size limit is set too large, so raft messages exceed the gRPC transport limit and get stuck, which in turn blocks region scheduling.

Lower the raft-max-size-per-msg configuration of the TiKV cluster to reduce the raft message size, and then check whether region scheduling recovers.
Parameter location: https://github.com/tikv/tikv/blob/v4.0.0-rc.2/tests/integrations/config/test-custom.toml#L105
If scheduling does not recover, please provide a new TiKV log as well as the last_tikv.toml file from the TiKV data directory.

If it does recover, then after the node is taken offline we recommend upgrading the TiDB cluster to v4.0.0 first.
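One possible way to apply the change through tiup (a sketch; <cluster-name> is a placeholder and 512KB is only an example value):

tiup cluster edit-config <cluster-name>      # under server_configs.tikv add: raftstore.raft-max-size-per-msg: "512KB"
tiup cluster reload <cluster-name> -R tikv   # roll the new configuration out to the TiKV nodes only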

@SUN-PingCAP, thanks for the reply.

1. Setting raft-max-size-per-msg still did not help. The default in TiKV is 1MB; I set it to 128KB and it still did not work.

2. I noticed that every time I stop the TiKV node that is being taken offline, the log prints the following error:

[ERROR] [kv.rs:613] ["KvService::batch_raft send response fail"] [err=RemoteStopped]

3. The Offline TiKV node's log and its last_tikv.toml:

tikv.tar.gz (128.8 KB)

last_tikv.toml (13.6 KB)

Hello.

server_configs:
  tikv:
    raftstore.raft-max-size-per-msg: 0.5MB

Please try again with the configuration above and see whether it takes effect.

@户口舟亢, thanks.

I changed it to 0.5MB; so far it still does not work.

Thank you very much for the feedback. We will pass it along to the R&D team.

We are currently investigating what is blocking the scheduling.

You can try using operator commands to migrate the peers manually, to see whether that succeeds and whether the TiKV node can finish scaling in once all peers have been moved. On a cluster with TiFlash enabled, you can use the finer-grained add-peer operation to create the peer on the target TiKV first (be careful not to run this against the TiFlash node), and then use remove-peer to delete the corresponding peer on store 116845, which achieves the same effect as transfer-region. For example:

>> operator add add-peer 1 4          // add a replica of Region 1 on store 4
>> operator add remove-peer 1 116845  // remove the replica of Region 1 from store 116845
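To confirm the operators are actually being executed, you can also check (a sketch, reusing the hypothetical Region 1 from the example above):

>> operator show        // list operators currently pending in PD
>> region 1             // inspect Region 1's peer list to confirm the new peer landed on store 4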

@HunDunDM, thanks for the reply.

1. When adding a peer, the following error occurred:

» operator add add-peer 131043 1
Failed! [500] "region already has peer in store 1" 

The cause is probably that store 1 already has a peer of Region 131043, but that peer has stayed in the learner state.

2. Can I first remove the learner peer on store 1 and then add it again?

Hello,

If you have confirmed there are 4 replicas, you can try removing first and then adding.

1. Several days have passed and the TiKV node being scaled in is still in the Offline state.

2. Solution (the store being taken offline is 116845):

Step 1: there are two cases.
Case 1: the peer on the other store is stuck in the learner state. Just remove that learner peer; within a few seconds TiKV will automatically make up the replica and remove the region peer from the Offline TiKV node.

>> operator add remove-peer 52577 1

Case 2: there is no learner-state peer and the peer is purely an extra replica. In that case just remove the peer on the TiKV node that is going offline.

>> operator add remove-peer 127722 116845

Step 2: it takes a while before the count decreases by 1.

At this point, run region check offline-peer and you will see the count drop by 1. Then repeat Step 1 until all offline region peers have been removed.
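To tell which case a region falls into before removing anything, it can help to inspect it first (a sketch; the exact field that marks a learner in the JSON output may differ between versions):

>> region 52577     // if the peer on the healthy store is flagged as a learner, it is Case 1; otherwise treat it as Case 2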

3. Questions:

1. Can operator add remove-peer be run concurrently?

2. If there are a lot of offline regions, is there a faster way to remove them?

In your current environment it is fine to operate this way; if a region had only 1 replica, data could be lost, which is why the add-then-remove order is needed.

As long as the operators are not for the same region, multiple operators can run at the same time.

The commands can only be issued one at a time.

For now, you will need a script to do this in bulk.
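A rough sketch of such a script, assuming jq is installed and that the region check offline-peer output exposes the region IDs under .regions[].id (the field layout may differ by version); the PD endpoint and store ID are the ones from this thread:

#!/bin/bash
# Sketch: batch-remove the peers still left on the offline store 116845 (Case 2 above).
# Regions in Case 1 (learner peer stuck on another store) still need the per-region handling described earlier.
PD="http://10.59.111.225:2379"
OFFLINE_STORE=116845

# Assumption: the JSON returned by "region check offline-peer" has a .regions[].id field.
for region_id in $(tiup ctl pd -u "$PD" region check offline-peer | jq -r '.regions[].id'); do
  echo "removing peer of region ${region_id} from store ${OFFLINE_STORE}"
  tiup ctl pd -u "$PD" operator add remove-peer "${region_id}" "${OFFLINE_STORE}"
done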

You are still running an RC version. You can upgrade to 4.0.1, and 4.0.2 will be released at the end of the month. Please keep following the release notes on the TiDB official site.

We have fixed this issue in 4.0.2. It is confirmed to be a bug that appears after placement rules are enabled.


@户口舟亢, thanks for the reply.

:call_me_hand:

:ok_hand:
@HunDunDM-PingCAP, thanks to the R&D team for the hard work~