dm同步mysql数据到tidb,总是报错execute statement failed: commit: Error 9005: Region is unavailable

为提高效率,提问时请提供以下信息,问题描述清晰可优先响应。

  • 【TiDB 版本】:v4.0.5
  • 【DM 版本】:v1.0.6
  • 【问题描述】:
    使用DM同步mysql5.7的数据到tidb, 经常报Region is unavailable, 然后中断同步。数据量不多,才40G, 写入的TPS也很低,才几十。
    over view的监控如下, tomestone是我下线的5个tiflash, 因为join查询太慢了,都是返回tikv timeout, 不走tiflash, join查询走tikv是正常的


dm-work错误日志:
[2020/09/11 11:23:30.155 +08:00] [WARN] [syncer.go:626] [“something wrong with rollback global checkpoint”] [task=posbill_pro] [u nit=“binlog replication”] [“previous position”=“(mysql-bin|000001.000047, 534433696)”] [“current position”=“(mysql-bin|000001.000
047, 530622622)”]
[2020/09/11 11:23:30.155 +08:00] [INFO] [subtask.go:266] [“unit process returned”] [subtask=posbill_pro] [unit=Sync] [stage=Pause
d] [status=“{"totalEvents":602447,"totalTps":20005,"recentTps":318,"masterBinlog":"(mysql-bin.000047, 534449467)","mas
terBinlogGtid":"9bc13467-10c0-11ea-8000-4cd98f7a2c00:1-3","syncerBinlog":"(mysql-bin|000001.000047, 530622622)"}”]
[2020/09/11 11:23:30.155 +08:00] [ERROR] [subtask.go:285] [“unit process error”] [subtask=posbill_pro] [unit=Sync] [“error inform
ation”=“{"Type":1,"msg":"execute statement failed: commit: Error 9005: Region is unavailable","error":{"ErrCode":10006,
"ErrClass":1,"ErrLevel":3,"Message":"execute statement failed: commit: Error 9005: Region is unavailable","RawCause":"
Error 9005: Region is unavailable"}}”]
[2020/09/11 11:23:30.155 +08:00] [ERROR] [subtask.go:285] [“unit process error”] [subtask=posbill_pro] [unit=Sync] [“error inform
ation”=“{"Type":1,"msg":"execute statement failed: commit: Error 9005: Region is unavailable","error":{"ErrCode":10006,
"ErrClass":1,"ErrLevel":3,"Message":"execute statement failed: commit: Error 9005: Region is unavailable","RawCause":"
Error 9005: Region is unavailable"}}”]
[2020/09/11 11:23:30.155 +08:00] [ERROR] [subtask.go:285] [“unit process error”] [subtask=posbill_pro] [unit=Sync] [“error inform
ation”=“{"Type":1,"msg":"execute statement failed: commit: Error 9005: Region is unavailable","error":{"ErrCode":10006,
"ErrClass":1,"ErrLevel":3,"Message":"execute statement failed: commit: Error 9005: Region is unavailable","RawCause":"
Error 9005: Region is unavailable"}}”]
[2020/09/11 11:23:30.155 +08:00] [ERROR] [subtask.go:285] [“unit process error”] [subtask=posbill_pro] [unit=Sync] [“error inform
ation”=“{"Type":1,"msg":"execute statement failed: commit: Error 9005: Region is unavailable","error":{"ErrCode":10006,
"ErrClass":1,"ErrLevel":3,"Message":"execute statement failed: commit: Error 9005: Region is unavailable","RawCause":"
Error 9005: Region is unavailable"}}”]
[2020/09/11 11:23:33.736 +08:00] [WARN] [task_checker.go:358] [“backoff skip auto resume task”] [component=“task checker”] [task=
posbill_pro] [latestResumeTime=2020/09/11 11:22:48.736 +08:00] [duration=5m0s]
[2020/09/11 11:23:38.736 +08:00] [WARN] [task_checker.go:358] [“backoff skip auto resume task”] [component=“task checker”] [task=
posbill_pro] [latestResumeTime=2020/09/11 11:22:48.736 +08:00] [duration=5m0s]
[2020/09/11 11:23:38.747 +08:00] [INFO] [relay.go:568] [“flush meta finished”] [component=“relay log”] [meta="master-uuid = 9bc13
467-10c0-11ea-8000-4cd98f7a2c00.000001, relay-binlog = (mysql-bin.000047, 534450771), relay-binlog-gtid = "]
[2020/09/11 11:23:43.736 +08:00] [WARN] [task_checker.go:358] [“backoff skip auto resume task”] [component=“task checker”] [task=
posbill_pro] [latestResumeTime=2020/09/11 11:22:48.736 +08:00] [duration=5m0s]
(END)

看tikv的日志,一直在不停的增加与删除peer,这个集群才刚部署,目前只同步少量数据,也没有查询,怎么不停地增加与删除peer?

[2020/09/11 11:54:42.502 +08:00] [INFO] [apply.rs:1485] [“exec ConfChange”] [epoch=“conf_ver: 1691 version: 246”] [type=AddNode] [peer_id=189827] [region_id=189759]
[2020/09/11 11:54:42.502 +08:00] [INFO] [apply.rs:1540] [“add peer successfully”] [region=“id: 189759 start_key: 7480000000000000FF515F698000000000FF00000F0400000000FF000000000419A382FFEA25000000038000FF0000000000060380FF000000012BF6BD00FE end_key: 7480000000000000FF515F698000000000FF00000F0400000000FF000000000419A694FFDA5F000000038000FF0000000000050380FF0000000190FCE000FE region_epoch { conf_ver: 1691 version: 246 } peers { id: 189760 store_id: 20 } peers { id: 189827 store_id: 17 } peers { id: 189956 store_id: 6 is_learner: true }”] [peer=“id: 189956 store_id: 6”] [peer_id=189827] [region_id=189759]
[2020/09/11 11:54:42.504 +08:00] [INFO] [apply.rs:1160] [“execute admin command”] [command=“cmd_type: ChangePeer change_peer { change_type: RemoveNode peer { id: 189827 store_id: 17 } }”] [index=16] [term=6] [peer_id=189827] [region_id=189759]
[2020/09/11 11:54:42.504 +08:00] [INFO] [apply.rs:1485] [“exec ConfChange”] [epoch=“conf_ver: 1692 version: 246”] [type=RemoveNode] [peer_id=189827] [region_id=189759]
[2020/09/11 11:54:42.504 +08:00] [INFO] [apply.rs:1593] [“remove peer successfully”] [region=“id: 189759 start_key: 7480000000000000FF515F698000000000FF00000F0400000000FF000000000419A382FFEA25000000038000FF0000000000060380FF000000012BF6BD00FE end_key: 7480000000000000FF515F698000000000FF00000F0400000000FF000000000419A694FFDA5F000000038000FF0000000000050380FF0000000190FCE000FE region_epoch { conf_ver: 1692 version: 246 } peers { id: 189760 store_id: 20 } peers { id: 189827 store_id: 17 } peers { id: 189956 store_id: 6 }”] [peer=“id: 189827 store_id: 17”] [peer_id=189827] [region_id=189759]
[2020/09/11 11:54:42.504 +08:00] [INFO] [router.rs:265] ["[region 189759] shutdown mailbox"]
[2020/09/11 11:54:42.504 +08:00] [INFO] [peer.rs:1558] [“starts destroy”] [merged_by_target=false] [peer_id=189827] [region_id=189759]
[2020/09/11 11:54:42.504 +08:00] [INFO] [peer.rs:474] [“begin to destroy”] [peer_id=189827] [region_id=189759]
[2020/09/11 11:54:42.504 +08:00] [INFO] [peer_storage.rs:1417] [“finish clear peer meta”] [takes=36.287µs] [raft_logs=5] [raft_key=1] [apply_key=1] [meta_key=1] [region_id=189759]
[2020/09/11 11:54:42.504 +08:00] [INFO] [peer.rs:515] [“peer destroy itself”] [takes=110.456µs] [peer_id=189827] [region_id=189759]
[2020/09/11 11:54:42.504 +08:00] [INFO] [router.rs:265] ["[region 189759] shutdown mailbox"]
[2020/09/11 11:54:42.504 +08:00] [INFO] [region.rs:473] [“register deleting data in range”] [end_key=7A7480000000000000FF515F698000000000FF00000F0400000000FF000000000419A694FFDA5F000000038000FF0000000000050380FF0000000190FCE000FE] [start_key=7A7480000000000000FF515F698000000000FF00000F0400000000FF000000000419A382FFEA25000000038000FF0000000000060380FF000000012BF6BD00FE] [region_id=189759]
[2020/09/11 11:54:44.500 +08:00] [INFO] [peer.rs:900] [“failed to schedule peer tick”] [err=“sending on a disconnected channel”] [tick=RAFT] [peer_id=189827] [region_id=189759]


Region is unavailable 是比较著名的报错,需要查询一下 TiDB 和 TiKV log 确认报错的原因,具体可以看一下文档的处理思路。

https://docs.pingcap.com/zh/tidb/stable/tidb-troubleshooting-map

客户端报 Region is Unavailable 错误
1.1.1 Region is Unavailable 一般是由于 Region 在一段时间不可用(可能会遇到 TiKV server is busy;或者发送给 TiKV 的请求由于 not leader 或者 epoch not match 等原因被打回;又或者请求 TiKV 超时等),TiDB 内部会进行 backoff 重试。backoff 的时间超过一定阈值(默认 20s)后就会报错给客户端。如果 backoff 在阈值内,客户端对该错误无感知。

1.1.2 多台 TiKV 同时内存不足 (OOM),导致 Region 在一定时期内没有 Leader,见案例 case-991。

1.1.3 TiKV 报 TiKV server is busy 错误,超过 backoff 时间,参考4.3 客户端报 server is busy 错误。TiKV server is busy 属于内部流控机制,后续可能不计入 backoff 时间。

1.1.4 多台 TiKV 启动不了,导致 Region 没有 Leader。单台物理主机部署多个 TiKV 实例,一个物理机挂掉,由于 label 配置错误导致 Region 没有 Leader,见案例 case-228。

1.1.5 follower apply 落后,成为 Leader 之后把收到的请求以 epoch not match 理由打回,见案例 case-958(TiKV 内部需要优化该机制)。

没有内存不足,tikv也没有启动不了,都正常的。 如果是其它问题如何解决? 看了这个文档与case,除了升级版本,但我已经是最新的版本

这个问题很蛋痛,同步一会可以不会不可以, 中断的时间可能几分钟,也可能几个小时, 这样没法保证数据的同步,导致不敢上线生产。
在没有节点异常, 数据量很少,写入又非常低时, 不明白为何会报这样的错误。 麻烦帮忙看下怎么解决,解决不了只能下线了

麻烦一下 TiDB Cluster 的 TiKV-details、TiDB、PD 的 log ,我们详细看一下监控情况,再判断一下吧。另外 TiKV 和 TiDB log 里面没有异常的 ERROR 报错么 ?

我自己调整了几个参数,搞得有点乱,我已经重建了整个集群,问题再出现时我把监控与日志贴上来。

:call_me_hand:

DM同步中断的报警:
微信告警 9-11 23:05:06
DM_sync_process_exists_with_error
告警主题: DM_sync_process_exists_with_error
告警时间:2020-09-11 23:05:06
当前状态: firing
描述: DM sync process exists with error
详情:instance: 172.20.51.97:8266, task: posbill_pro, values: 1

DM与tikv当时的日志
dm-11-05.log (75.5 KB) tikv_11_05.log (21.9 KB)

当时的监控

tidblepro-tikv-details-2020-09-12-09_06_38_compressed.pdf (2.3 MB)

tidblepro-tikv-details-2020-09-12-09_11_34.pdf (4.7 MB)

已经贴了当时overview, pd, tikv-detail的监控, tikv与dm的日志。基本没什么负载, tikv也没报错。 tikv是6*2个节点, 6台机器, 普通SSD硬盘做了raid5. 麻烦帮忙看下

Hello~ 从监控看 TiKV 和 overview 的监控没有 block 的地方,dm-worker 的日志看,麻烦抓一下这个时间的TiDB log 看一下。

[error="[code=10006:class=database:scope=not-set:level=high] execute statement failed: commit: Error 9005: Region is unavailable"]

时间是
[2020/09/11 11:08:20.022 +08:00] [ERROR] [db.go:295] [“execute statements failed after retry”] [task=posbill_pro] [unit=“binlog replication”] [queries="[REPLACE INTO posbill.t_agent_withdraw_apply_order (F_id,F_bill_id,F_uid,F_agent_id,F_agent_name,F_withdraw_type,F_withdraw_auto_type,F_invoice_type,F_dk_type,F_apply_amount,F_allow_amount,F_state,F_audit_operator,F_top_agent_id,F_parent_agent_id,F_agent_invoice_id,F_remark,F_create_time,F_update_time,F_withdraw_apply_id,F_agent_class,F_account_version,F_request_id,F_sub_agent_class,F_account_type,F_agent_tax_id,F_is_new,F_invoice_mode,F_fixfee_amount,F_tax_point,F_busy_type,F_config_mode,F_tax_point_back,F_autopay_type) VALUES (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?) DELETE FROM posbill.t_agent_withdraw_apply_order WHERE F_id = ? LIMIT 1 REPLACE INTO posbill.t_agent_withdraw_apply_order (F_id,F_bill_id,F_uid,F_agent_id,F_agent_name,F_withdraw_type,F_withdraw_auto_type,F_invoice_type,F_dk_type,F_apply_amount,F_allow_amount,F_state,F_audit_operator,F_top_agent_id,`F_parent_…"] [arguments="[[13530651 16001350 8360872086 张桂铭 6063 0 0 0 418600 418600 0 8363712368 8369865933 2020-09-11 10:37:38 2020-09-11 10:37:38 606383608720867333967550 5 4 fb654cf64a3148b68994b772d11536 1 0 8360872086 1 0 0 0 0 1 0 0] [13530651] [13530651 16001350 8360872086 张桂铭 6063 0 0 0 418600 418600 1 system 8363712368 8369865933 system audit ok 2020-09-11 10:37:38 2020-09-11 10:37:38 606383608720867333967550 5 4 fb654cf64a3148b68994b772d11536 1 0 8360872086 1 0 0 0 0 1 0 0] [13530651] [13530651 16001350 8360872086 张桂铭 6063 0 0 0 418600 418600 1 system 8363712368 8369865933 system audit ok 2020-09-11 10:37:38 2020-09-11 10:37:38 606383608720867333967550 5 4 fb654cf64a3148b68994b772d11536 1 0 8360872086 1 0 0 0 0 1 0 1] [13530651] [13530651 202009110004704020 16001350 8360872086 张桂铭 6063 0 0 0 418600 418600 3 system 8363712368 8369865933 system audit ok 2020-09-11 10:37:38 2020-09-11 10:37:38 606383608720867333967550 5 4 fb654cf64a3148b68994b772d11536 1 0 8360872086 1…"] [error="[code=10006:class=database:scope=not-set:level=high] execute statement failed: commit: Error 9005: Region is unavailable"]