Because the cluster is going from 4 TiKV nodes to 3, the replicas on store 116845 need to be re-created on the other stores before the scale-in can complete; this is the expected behavior.
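For reference, a minimal way to watch that drain progress in pd-ctl (the store ID is the one being scaled in):
>> store 116845                // region_count here should keep dropping towards 0
>> region check offline-peer   // Regions that still have a peer on an Offline store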
It has been 50 minutes and the store is still in the Offline state.
tiup clt pd -u http://10.59.111.225:2379 config show
Error: component `clt` does not support `linux/amd64` (see `tiup list --refresh`)
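(The tiup component is named `ctl`, not `clt`; the intended command would be something like:)
tiup ctl pd -u http://10.59.111.225:2379 config show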
» config show
{
"replication": {
"enable-placement-rules": "true",
"location-labels": "",
"max-replicas": 3,
"strictly-match-label": "false"
},
"schedule": {
"enable-cross-table-merge": "false",
"enable-debug-metrics": "false",
"enable-location-replacement": "true",
"enable-make-up-replica": "true",
"enable-one-way-merge": "false",
"enable-remove-down-replica": "true",
"enable-remove-extra-replica": "true",
"enable-replace-offline-replica": "true",
"high-space-ratio": 0.7,
"hot-region-cache-hits-threshold": 3,
"hot-region-schedule-limit": 16,
"leader-schedule-limit": 16,
"leader-schedule-policy": "count",
"low-space-ratio": 0.8,
"max-merge-region-keys": 200000,
"max-merge-region-size": 20,
"max-pending-peer-count": 16,
"max-snapshot-count": 3,
"max-store-down-time": "30m0s",
"merge-schedule-limit": 8,
"patrol-region-interval": "100ms",
"region-schedule-limit": 2048,
"replica-schedule-limit": 64,
"scheduler-max-waiting-operator": 5,
"split-merge-interval": "1h0m0s",
"store-balance-rate": 15,
"store-limit-mode": "manual",
"tolerant-size-ratio": 0
}
}
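For reference, the settings above most relevant to draining an Offline store are enable-replace-offline-replica, replica-schedule-limit and, since store-limit-mode is manual, the per-store operator limits. A sketch of inspecting and raising them in pd-ctl (the values are only examples, not a recommendation from this thread):
>> store limit                             // show the current per-store scheduling limits
>> store limit all 30                      // example: raise the limit for all stores
>> config set replica-schedule-limit 64    // cap on concurrent replica operators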
Please clarify exactly how you are currently taking the node offline.
The scale-in method currently in use is as described above; let's keep observing for now.
I've tried all of the methods above and none of them worked.
Please share the output of pd-ctl store, and the tikv.log from the TiKV node whose service was stopped.
1. store output:
{
"count": 5,
"stores": [
{
"store": {
"id": 1,
"address": "10.59.111.132:20160",
"version": "4.0.0-rc.2",
"status_address": "10.59.111.132:20180",
"git_hash": "2fdb2804bf8ffaab4b18c4996970e19906296497",
"start_timestamp": 1591929864,
"deploy_path": "/data/tidb_deploy/tikv-20160/bin",
"last_heartbeat": 1591943553978873939,
"state_name": "Up"
},
"status": {
"capacity": "639.7GiB",
"available": "115GiB",
"used_size": "486.7GiB",
"leader_count": 9732,
"leader_weight": 1,
"leader_score": 9732,
"leader_size": 713339,
"region_count": 29185,
"region_weight": 1,
"region_score": 645777972.8961549,
"region_size": 2139056,
"start_ts": "2020-06-12T10:44:24+08:00",
"last_heartbeat_ts": "2020-06-12T14:32:33.978873939+08:00",
"uptime": "3h48m9.978873939s"
}
},
{
"store": {
"id": 4,
"address": "10.59.111.133:20160",
"version": "4.0.0-rc.2",
"status_address": "10.59.111.133:20180",
"git_hash": "2fdb2804bf8ffaab4b18c4996970e19906296497",
"start_timestamp": 1591929878,
"deploy_path": "/data/tidb_deploy/tikv-20160/bin",
"last_heartbeat": 1591943552741459627,
"state_name": "Up"
},
"status": {
"capacity": "639.7GiB",
"available": "116.1GiB",
"used_size": "488GiB",
"leader_count": 9723,
"leader_weight": 1,
"leader_score": 9723,
"leader_size": 713445,
"region_count": 29185,
"region_weight": 1,
"region_score": 636551846.4256988,
"region_size": 2139056,
"start_ts": "2020-06-12T10:44:38+08:00",
"last_heartbeat_ts": "2020-06-12T14:32:32.741459627+08:00",
"uptime": "3h47m54.741459627s"
}
},
{
"store": {
"id": 5,
"address": "10.59.111.224:20160",
"version": "4.0.0-rc.2",
"status_address": "10.59.111.224:20180",
"git_hash": "2fdb2804bf8ffaab4b18c4996970e19906296497",
"start_timestamp": 1591929878,
"deploy_path": "/data/tidb_deploy/tikv-20160/bin",
"last_heartbeat": 1591943553716831978,
"state_name": "Up"
},
"status": {
"capacity": "639.7GiB",
"available": "117.8GiB",
"used_size": "487.9GiB",
"leader_count": 9730,
"leader_weight": 1,
"leader_score": 9730,
"leader_size": 712272,
"region_count": 29185,
"region_weight": 1,
"region_score": 622396812.477663,
"region_size": 2139056,
"start_ts": "2020-06-12T10:44:38+08:00",
"last_heartbeat_ts": "2020-06-12T14:32:33.716831978+08:00",
"uptime": "3h47m55.716831978s"
}
},
{
"store": {
"id": 46,
"address": "10.59.111.10:3930",
"labels": [
{
"key": "engine",
"value": "tiflash"
}
],
"version": "v4.0.0-rc.2",
"peer_address": "10.59.111.10:20170",
"status_address": "10.59.111.10:20292",
"git_hash": "09bd9e6b9a271b1fcd25c676083104a97f18739a",
"start_timestamp": 1591845577,
"last_heartbeat": 1591943550584551746,
"state_name": "Up"
},
"status": {
"capacity": "200GiB",
"available": "180.3GiB",
"used_size": "474.6KiB",
"leader_count": 0,
"leader_weight": 1,
"leader_score": 0,
"leader_size": 0,
"region_count": 0,
"region_weight": 1,
"region_score": 0,
"region_size": 0,
"start_ts": "2020-06-11T11:19:37+08:00",
"last_heartbeat_ts": "2020-06-12T14:32:30.584551746+08:00",
"uptime": "27h12m53.584551746s"
}
},
{
"store": {
"id": 116845,
"address": "10.59.111.10:20160",
"state": 1,
"version": "4.0.0-rc.2",
"status_address": "10.59.111.10:20180",
"git_hash": "2fdb2804bf8ffaab4b18c4996970e19906296497",
"start_timestamp": 1591941751,
"deploy_path": "/data/tidb_deploy/tikv-20160/bin",
"last_heartbeat": 1591942591789709669,
"state_name": "Offline"
},
"status": {
"capacity": "200GiB",
"available": "179.9GiB",
"used_size": "364MiB",
"leader_count": 0,
"leader_weight": 1,
"leader_score": 0,
"leader_size": 0,
"region_count": 10,
"region_weight": 1,
"region_score": 799,
"region_size": 799,
"start_ts": "2020-06-12T14:02:31+08:00",
"last_heartbeat_ts": "2020-06-12T14:16:31.789709669+08:00",
"uptime": "14m0.789709669s"
}
}
]
}
2. Log from the offline TiKV node:
tikv.tar.gz (125.8 KB)
Thanks for the information. Here are some key entries from the current TiKV log:
[2020/06/12 14:16:40.158 +08:00] [ERROR] [kv.rs:613] ["KvService::batch_raft send response fail"] [err=RemoteStopped]
[2020/06/12 14:16:40.158 +08:00] [ERROR] [kv.rs:613] ["KvService::batch_raft send response fail"] [err=RemoteStopped]
[2020/06/12 14:16:40.158 +08:00] [ERROR] [kv.rs:613] ["KvService::batch_raft send response fail"] [err=RemoteStopped]
Our preliminary judgment is that this is a bug in the v4.0.0-rc.* releases: the Raft message size limit is set too large, so messages exceed the gRPC transport limit and get stuck, which blocks Region scheduling.
Try lowering the TiKV cluster's raft-max-size-per-msg setting to reduce the Raft message size, then watch whether Region scheduling recovers.
Parameter location: https://github.com/tikv/tikv/blob/v4.0.0-rc.2/tests/integrations/config/test-custom.toml#L105
If it does not recover, please provide the new TiKV log and the last_tikv.toml file from the TiKV data directory.
If it does recover, we recommend upgrading the TiDB cluster to v4.0.0 after the node has been taken offline.
@SUN-PingCAP, thanks for the reply.
1. Setting raft-max-size-per-msg still does not help; the TiKV default is 1MB, and lowering it to 128KB made no difference.
2. I noticed that every time I stop the TiKV node that is being taken offline, the log prints the following error:
[ERROR] [kv.rs:613] ["KvService::batch_raft send response fail"] [err=RemoteStopped]
3. The offline TiKV node's log and last_tikv.toml:
tikv.tar.gz (128.8 KB)
last_tikv.toml (13.6 KB)
Hello.
server_configs:
  tikv:
    raftstore.raft-max-size-per-msg: 0.5MB
Please try again with the configuration above and see whether it takes effect.
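If the change is pushed out with tiup (a sketch, assuming the cluster name is <cluster-name>):
tiup cluster edit-config <cluster-name>      # add the raftstore setting under server_configs.tikv
tiup cluster reload <cluster-name> -R tikv   # roll the new config out to the TiKV nodes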
Thank you very much for the feedback; we will pass it on to the R&D team.
We are currently investigating what is blocking the scheduling.
In the meantime, you can try using operator commands to migrate the peers by hand, and check whether that succeeds and whether the TiKV node can be scaled in once all peers have been moved. On a cluster with TiFlash enabled, you can first use the finer-grained add-peer operation to create a peer on a target TiKV store (be careful not to run this against the TiFlash node), and then use remove-peer to delete the corresponding peer on store 116845, which achieves the same effect as transfer-region. For example:
>> operator add add-peer 1 4 // add a replica of Region 1 on store 4
>> operator add remove-peer 1 116845 // remove the replica of Region 1 on store 116845
@HunDunDM, thanks for the reply.
1. When adding a peer I got the following error:
» operator add add-peer 131043 1
Failed! [500] "region already has peer in store 1"
The reason is probably that store 1 already has a peer of Region 131043, but it stays stuck in the learner state.
2. Can I remove the learner peer on store 1 first and then add it back?
Hello,
If you have confirmed there are 4 replicas, you can try removing first and then adding.
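For example, using the Region and store from the previous post (a sketch only; confirm the replica count first):
>> operator add remove-peer 131043 1   // drop the stuck learner peer of Region 131043 on store 1
>> operator add add-peer 131043 1      // then add a normal replica back on store 1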
1. Several days later, the TiKV node being taken offline is still in the Offline state.
2. Workaround (the store being taken offline is 116845):
Step 1: there are two cases.
Case 1: if the peer is in the learner state, just remove it; after a few seconds TiKV automatically makes up the replica and removes the Region from the offline TiKV node:
>> operator add remove-peer 52577 1
Case 2: if there is no learner peer and it is purely an extra Region, just remove the peer on the TiKV node being taken offline:
>> operator add remove-peer 127722 116845
Step 2: wait a while for the count to drop by 1.
At this point run region check offline-peer and the count goes down by 1; then repeat Step 1 until all offline Regions are gone.
3. Questions:
1. Can operator add remove-peer be run concurrently?
2. If there are a lot of offline Regions, is there a faster way to remove them?
Doing it this way is fine in your current environment; if a Region had only 1 replica, data could be lost, which is why you need to add before you remove.
As long as they are not for the same Region, multiple operators can run at the same time.
The operators can only be issued one command at a time, though,
so for now you need a script to drive this.
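A minimal sketch of such a script, assuming jq is available, that tiup ctl pd works non-interactively as in the earlier post, and that the region check offline-peer JSON carries regions[].id and peers[].store_id. It only covers the plain extra-peer case (Case 2 above); learner peers on other stores still need the Case 1 treatment:
#!/usr/bin/env bash
# Hypothetical helper (not from the original thread): queue a remove-peer operator
# for every Region that still has a peer on the offline store. Assumes every such
# Region already has enough replicas elsewhere (see the 1-replica caveat above).
PD="http://10.59.111.225:2379"   # PD address used in this thread
OFFLINE_STORE=116845             # the store being scaled in

tiup ctl pd -u "$PD" region check offline-peer \
  | jq -r --argjson s "$OFFLINE_STORE" \
      '.regions[] | select(any(.peers[]; .store_id == $s)) | .id' \
  | while read -r region_id; do
      echo "remove-peer: region=$region_id store=$OFFLINE_STORE"
      tiup ctl pd -u "$PD" operator add remove-peer "$region_id" "$OFFLINE_STORE"
    done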
You are still on an RC release; you can upgrade to 4.0.1, and 4.0.2 will be released at the end of the month. Please keep an eye on the release notes on the TiDB website.
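A sketch of that upgrade, assuming the tiup cluster is named <cluster-name>:
tiup update cluster                          # update the tiup cluster component first
tiup cluster upgrade <cluster-name> v4.0.1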
We have fixed this issue in 4.0.2. It is confirmed to be a bug that shows up after placement rules are enabled.
@户口舟亢, thanks for the reply.
@HunDunDM-PingCAP, thanks to the R&D folks for the hard work~