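After importing data from MySQL into a TiDB cluster, the Regions are unevenly distributed across the 3 TiKV nodes?

  • OS version & kernel version: CentOS-7.2.1511, 3.10.0-327.36.3.el7.x86_64
  • TiDB version: 3.0.1 (tidb, pd, tikv)
  • Disk type: ordinary mechanical (spinning) disks
  • Problem description (what I did):

The Region health monitoring panel shows as many as 10.54k Regions with missing replicas.

From the figure above, the Region count of the TiKV on 3.14 does differ from the Region counts of the TiKVs on 3.13 and 3.15 by more than 10,000.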

image

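How can I restore the 10,000+ Regions that are missing on the TiKV on 3.14? Thanks!

Troubleshooting approach

Use pd-ctl to find the IDs of the Regions that currently have only 2 replicas or 1 replica: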

region --jq=".regions[] | {id: .id, peer_stores: [.peers[].store_id] | select(length != 3)}"

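Normally, if 2 replicas in a Region's Raft group are still healthy, PD schedules an operation to add the missing replica. If replica repair seems slow, you can try raising replica-schedule-limit, max-snapshot-count, and max-pending-peer-count to speed up scheduling. Keep an eye on the impact on business traffic when you do this.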

{"id":4912,"peer_stores":[4,5]}
{"id":14900,"peer_stores":[4,5]}
{"id":26312,"peer_stores":[4,5]}
{"id":102832,"peer_stores":[4,5]}
{"id":16700,"peer_stores":[4,5]}
{"id":35864,"peer_stores":[4,5]}
{"id":89520,"peer_stores":[4,5]}
{"id":26656,"peer_stores":[4,5]}
{"id":49436,"peer_stores":[4,5]}
{"id":89880,"peer_stores":[4,5]}
{"id":94994,"peer_stores":[4,5]}
{"id":7128,"peer_stores":[4,5]}
{"id":73051,"peer_stores":[4,5]}
{"id":64946,"peer_stores":[4,5]}
{"id":92215,"peer_stores":[4,5]}
{"id":22692,"peer_stores":[4,5]}
{"id":48024,"peer_stores":[4,5]}
{"id":48072,"peer_stores":[4,5]}
{"id":27160,"peer_stores":[4,5]}
{"id":57173,"peer_stores":[4,5]}
{"id":44248,"peer_stores":[4,5]}
{"id":52092,"peer_stores":[4,5]}
{"id":20132,"peer_stores":[4,5]}
{"id":40480,"peer_stores":[4,5]}
{"id":51560,"peer_stores":[4,5]}
{"id":8600,"peer_stores":[4,5]}
{"id":10204,"peer_stores":[4,5]}
{"id":13024,"peer_stores":[4,5]}
{"id":15344,"peer_stores":[4,5]}
{"id":42544,"peer_stores":[4,5]}
{"id":48432,"peer_stores":[4,5]}
{"id":56985,"peer_stores":[4,5]}
{"id":8348,"peer_stores":[4,5]}
{"id":27232,"peer_stores":[4,5]}
{"id":29676,"peer_stores":[4,5]}
{"id":41080,"peer_stores":[4,5]}
{"id":19492,"peer_stores":[4,5]}
{"id":57277,"peer_stores":[4,5]}
{"id":72955,"peer_stores":[4,5]}
{"id":44660,"peer_stores":[4,5]}
{"id":3396,"peer_stores":[4,5]}
{"id":7116,"peer_stores":[4,5]}
{"id":26176,"peer_stores":[4,5]}
{"id":86245,"peer_stores":[4,5]}
{"id":44744,"peer_stores":[4,5]}
{"id":70668,"peer_stores":[4,5]}
{"id":72687,"peer_stores":[4,5]}
{"id":71568,"peer_stores":[4,5]}
{"id":88381,"peer_stores":[4,5]}

This is what the query returned.
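To see at a glance which store the under-replicated Regions are missing from, the query output can be tallied with a short script. This is only a sketch: it assumes the cluster's three stores have IDs 1, 4, and 5 (inferred from the output and logs in this thread), and the helper names `iter_json_objects` and `missing_store_counts` are made up for illustration. It also normalizes the curly quotes that forum pasting tends to introduce.

```python
import json
from collections import Counter

# Assumption: the three TiKV stores in this thread have IDs 1, 4 and 5.
EXPECTED_STORES = {1, 4, 5}


def iter_json_objects(text):
    """Yield JSON objects that are concatenated with arbitrary whitespace."""
    # Forum pastes often turn straight quotes into curly ones; undo that.
    text = text.replace("\u201c", '"').replace("\u201d", '"')
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        obj, pos = decoder.raw_decode(text, pos)
        yield obj


def missing_store_counts(jq_output, expected=EXPECTED_STORES):
    """Count, per store ID, how many Regions lack a peer on that store."""
    counts = Counter()
    for region in iter_json_objects(jq_output):
        for store in expected - set(region["peer_stores"]):
            counts[store] += 1
    return counts


sample = """
{"id":4912,"peer_stores":[4,5]}
{"id":14900,"peer_stores":[4,5]}
{"id":26312,"peer_stores":[4,5]}
"""
print(dict(missing_store_counts(sample)))  # for this sample: {1: 3}
```

If every Region in the output is missing the same store, as in the output above, the problem is most likely confined to that one TiKV node rather than to scheduling in general.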

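  • The result lists the IDs of the Regions that have fewer than 3 replicas, together with the stores their peers are on. Next, run the operator command in pd-ctl (for example, operator show) to confirm whether the missing replicas are currently being added.
  • If no replica-repair operators are running at all, check pd.log to confirm whether scheduling is working normally.
  • To speed up replica repair, adjust the corresponding parameters as suggested in the reply above.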

[tidb@LinuxCentos72KaiFa16316 ~]$ tail -f /home/tidb/deploy2/
backup/ bin/ conf/ log/ scripts/ tipd_data1/
[tidb@LinuxCentos72KaiFa16316 ~]$ tail -f /home/tidb/deploy2/tipd_data1/
backup/ bin/ conf/ data.pd/ log/ scripts/
[tidb@LinuxCentos72KaiFa16316 ~]$ tail -f /home/tidb/deploy2/tipd_data1/log/pd.log
2019/09/11 12:00:08.847 log.go:86: [warning] wal: [sync duration of 1.071471844s, expected less than 1s]
[2019/09/11 12:40:14.767 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=568260]
[2019/09/11 12:40:14.772 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=568260] [took=4.473322ms]
[2019/09/11 13:40:14.764 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=569471]
[2019/09/11 13:40:14.769 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=569471] [took=3.907367ms]
[2019/09/11 14:10:02.213 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=275.559213ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:"/tidb/store/gcworker/saved_safe_point" "] [response="range_response_count:1 size:79"] []
[2019/09/11 14:28:03.270 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=105.101058ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:"/tidb/store/gcworker/saved_safe_point" "] [response="range_response_count:1 size:79"] []
[2019/09/11 14:40:14.775 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=570683]
[2019/09/11 14:40:14.781 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=570683] [took=5.161239ms]
[2019/09/11 15:34:00.128 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=127.159348ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:"/tidb/store/gcworker/saved_safe_point" "] [response="range_response_count:1 size:79"] []

[tidb@LinuxCentos72KaiFa18318 ~]$ tail -f /home/tidb/deploy2/tipd_data1/log/pd.log
[2019/09/11 11:40:14.758 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=567049] [took=5.669227ms]
2019/09/11 12:00:08.842 log.go:86: [warning] wal: [sync duration of 1.072621587s, expected less than 1s]
[2019/09/11 12:40:14.776 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=568260]
[2019/09/11 12:40:14.782 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=568260] [took=5.769442ms]
[2019/09/11 12:55:02.087 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=105.87855ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:"/tidb/store/gcworker/saved_safe_point" "] [response="range_response_count:1 size:79"] []
[2019/09/11 13:00:02.457 +08:00] [WARN] [util.go:144] ["apply request took too long"] [took=162.341842ms] [expected-duration=100ms] [prefix="read-only range "] [request="key:"/tidb/store/gcworker/saved_safe_point" "] [response="range_response_count:1 size:79"] []
[2019/09/11 13:40:14.767 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=569471]
[2019/09/11 13:40:14.771 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=569471] [took=3.877803ms]
[2019/09/11 14:40:14.779 +08:00] [INFO] [index.go:190] ["compact tree index"] [revision=570683]
[2019/09/11 14:40:14.784 +08:00] [INFO] [kvstore_compaction.go:57] ["finished scheduled compaction"] [compact-revision=570683] [took=4.745838ms]

[tidb@LinuxCentos72KaiFa19319 ~]$ tail -f /home/tidb/deploy2/tipd_data1/log/pd.log
[2019/09/11 15:35:12.872 +08:00] [INFO] [operator_controller.go:386] ["send schedule command"] [region-id=40508] [step="add learner peer 104023 on store 1"] [source="active push"]
[2019/09/11 15:35:17.872 +08:00] [INFO] [operator_controller.go:386] ["send schedule command"] [region-id=40508] [step="add learner peer 104023 on store 1"] [source="active push"]
[2019/09/11 15:35:23.372 +08:00] [INFO] [operator_controller.go:386] ["send schedule command"] [region-id=40508] [step="add learner peer 104023 on store 1"] [source="active push"]
[2019/09/11 15:35:28.872 +08:00] [INFO] [operator_controller.go:386] ["send schedule command"] [region-id=40508] [step="add learner peer 104023 on store 1"] [source="active push"]
[2019/09/11 15:35:34.372 +08:00] [INFO] [operator_controller.go:107] ["operator timeout"] [region-id=40508] [operator=""make-up-replica (kind:region,replica, region:40508(194,6), createAt:2019-09-11 15:25:30.5709475 +0800 CST m=+89116.950813621, startAt:2019-09-11 15:25:30.571428152 +0800 CST m=+89116.951294494, currentStep:0, steps:[add learner peer 104023 on store 1 promote learner peer 104023 on store 1 to voter]) timeout""]
[2019/09/11 15:35:56.662 +08:00] [INFO] [operator_controller.go:284] ["add operator"] [region-id=103200] [operator=""balance-leader (kind:leader,balance, region:103200(265,8), createAt:2019-09-11 15:35:56.662819298 +0800 CST m=+89743.042685497, startAt:0001-01-01 00:00:00 +0000 UTC, currentStep:0, steps:[transfer leader from store 5 to store 1]) ""]
[2019/09/11 15:35:56.663 +08:00] [INFO] [operator_controller.go:386] ["send schedule command"] [region-id=103200] [step="transfer leader from store 5 to store 1"] [source=create]
[2019/09/11 15:35:56.669 +08:00] [INFO] [cluster_info.go:567] ["leader changed"] [region-id=103200] [from=5] [to=1]
[2019/09/11 15:35:56.669 +08:00] [INFO] [operator_controller.go:99] ["operator finish"] [region-id=103200] [operator=""balance-leader (kind:leader,balance, region:103200(265,8), createAt:2019-09-11 15:35:56.662819298 +0800 CST m=+89743.042685497, startAt:2019-09-11 15:35:56.663188955 +0800 CST m=+89743.043055291, currentStep:1, steps:[transfer leader from store 5 to store 1]) finished""]
[2019/09/11 15:35:57.525 +08:00] [INFO] [grpc_service.go:703] ["updated gc safe point"] [safe-point=411090712229576704]

The above are the logs from the 3 PD nodes.
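When pd.log is long, it can help to summarize scheduling events per Region instead of reading the log line by line. The sketch below counts the event messages ("send schedule command", "operator timeout", "operator finish", and so on) for each region-id. It assumes the bracketed field layout seen in the PD leader's log above, with straight quotes as in the raw file; the function names are hypothetical, and the sample lines are shortened copies of lines from this thread with the operator field omitted.

```python
import re
from collections import Counter, defaultdict

# The first ["..."] field on a line is the event message.
MESSAGE = re.compile(r'\["([^"]+)"\]')
REGION = re.compile(r"\[region-id=(\d+)\]")


def summarize(lines):
    """Map region ID -> Counter of event messages seen for that Region."""
    per_region = defaultdict(Counter)
    for line in lines:
        msg = MESSAGE.search(line)
        region = REGION.search(line)
        if msg and region:
            per_region[int(region.group(1))][msg.group(1)] += 1
    return per_region


def timed_out_regions(per_region):
    """Return the sorted IDs of Regions with at least one operator timeout."""
    return sorted(r for r, c in per_region.items() if c["operator timeout"] > 0)


# Shortened sample lines (operator=... field left out for brevity).
sample = [
    '[2019/09/11 15:35:23.372 +08:00] [INFO] [operator_controller.go:386] '
    '["send schedule command"] [region-id=40508] '
    '[step="add learner peer 104023 on store 1"] [source="active push"]',
    '[2019/09/11 15:35:28.872 +08:00] [INFO] [operator_controller.go:386] '
    '["send schedule command"] [region-id=40508] '
    '[step="add learner peer 104023 on store 1"] [source="active push"]',
    '[2019/09/11 15:35:34.372 +08:00] [INFO] [operator_controller.go:107] '
    '["operator timeout"] [region-id=40508]',
    '[2019/09/11 15:35:56.669 +08:00] [INFO] [operator_controller.go:99] '
    '["operator finish"] [region-id=103200]',
]

summary = summarize(sample)
print(timed_out_regions(summary))  # for this sample: [40508]
```

A Region that shows repeated "send schedule command" entries followed by "operator timeout", like Region 40508 here, is one where PD keeps trying to add the replica but the target store does not finish in time.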

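Next, I adjusted the scheduling parameters:

» config set max-snapshot-count 16
Success!
» config set max-pending-peer-count 64
Success!
» config set replica-schedule-limit 16
Success!

After that, the PD log showed the following:

Store 1 on 3.14 is reported as Up, but its Region count has changed to 7.3k+:

The TiKV log on 3.14 shows:

About 5 minutes later, I checked the TiKV log on 3.14 again:

The errors had stopped.

On the monitoring dashboard: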

image

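Compared with before, the count has gone from 7.3k to 7.4k, which shows that replicas are being added successfully.

Please also check whether tikv.log on store 1 contains any errors.

The tikv.log on store 1:

Replica repair is slow on mechanical disks, and this can also cause operators to time out. For a test environment, we suggest switching the TiKV disks to the type recommended in the official documentation before testing.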
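To put a number on the timeout, the age of an operator at the moment it timed out can be computed from the "operator timeout" line itself, using the log timestamp and the createAt field. The following is a rough sketch: the function name is made up, the sample line is a shortened copy of the one in the PD log above, and it assumes both timestamps are in the same time zone (+08:00 in this thread).

```python
import re
from datetime import datetime

LOGGED_AT = re.compile(r"^\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3})")
CREATED_AT = re.compile(r"createAt:(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.(\d+)")


def operator_age_seconds(line):
    """Seconds between an operator's createAt and the time the line was logged.

    Returns None if either timestamp is missing. Both timestamps are assumed
    to be in the same time zone, so the offset is ignored.
    """
    logged = LOGGED_AT.search(line)
    created = CREATED_AT.search(line)
    if not logged or not created:
        return None
    t_logged = datetime.strptime(logged.group(1), "%Y/%m/%d %H:%M:%S.%f")
    # createAt may carry more than 6 fractional digits; strptime accepts at
    # most 6, so keep only the microsecond part.
    frac = created.group(2)[:6].ljust(6, "0")
    t_created = datetime.strptime(
        f"{created.group(1)}.{frac}", "%Y-%m-%d %H:%M:%S.%f"
    )
    return (t_logged - t_created).total_seconds()


# Shortened copy of the "operator timeout" line from the PD log above.
line = (
    '[2019/09/11 15:35:34.372 +08:00] [INFO] [operator_controller.go:107] '
    '["operator timeout"] [region-id=40508] '
    '[operator="make-up-replica (kind:region,replica, region:40508(194,6), '
    'createAt:2019-09-11 15:25:30.5709475 +0800 CST, currentStep:0) timeout"]'
)

age = operator_age_seconds(line)
print(f"operator ran for about {age / 60:.1f} minutes before timing out")
```

For the make-up-replica operator on Region 40508 this comes to roughly 10 minutes with currentStep still at 0, meaning the first step (adding the learner peer on store 1) never completed in that window.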