[TiDB DM] Errors in the worker log during replication

TiDBer_HJLsvyxd · May 25, 2022 06:34

execute sql failed by connection error
RawCause: context deadline exceeded"

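I am running into two problems:

Once a certain number of tasks have been started, it becomes very hard to submit new ones. The master log also shows context deadline exceeded.
One of the tasks that was already submitted has stopped replicating. The worker log shows context deadline exceeded as well.
Both problems look like they are caused by something related to database connections,
yet logging in to the database manually from the command line works fine. How can this be resolved?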

Meditator · May 25, 2022 07:49

What is binlog_format set to on the upstream MySQL?

TiDBer_HJLsvyxd · May 25, 2022 07:57

Meditator · May 25, 2022 08:05

github.com/pingcap/tiflow

bugfix: fix update statement execute error in safemode cause dm-worker panic

pingcap:master ← GMHDBJD:fixSafemodePanic

Opened 08:08AM - 21 Jan 22 UTC

GMHDBJD

+51 -6

### What problem does this PR solve?

Issue Number: close #4317

### What is changed and how it works?

- when update job split into multiple dmls(delete + replace), len(dmls)>len(jobs), we should determine the error dml is generated from which job
- when in multipleRows mode, we combine multiple jobs into one dml, if dml execute failed, we cannot determine which job it is so simply use the first job when meet error and `len(jobs)!=len(dmls)`, which not affect other feature.

### Release note

```release-note
Fix the issue that update statement execute error in safemode may cause DM-worker panic.
```

github.com/pingcap/tiflow

fix update statement execute error in safemode cause dm-worker panic(#4432)

pingcap:release-5.4 ← GMHDBJD:cherrypick4432

Opened 03:02PM - 23 Jan 22 UTC

GMHDBJD

+51 -6

This is a manually cherry-pick of #4432, because ticdc test is failed in master now.

### What problem does this PR solve?

Issue Number: close #4317

### What is changed and how it works?

- when update job split into multiple dmls(delete + replace), len(dmls)>len(jobs), we should determine the error dml is generated from which job
- when in multipleRows mode, we combine multiple jobs into one dml, if dml execute failed, we cannot determine which job it is

### Release note

```release-note
Fix the issue that update statement execute error in safemode may cause DM-worker panic.
```

Take a look at these two PRs. Could you be hitting this bug?

TiDBer_HJLsvyxd · May 25, 2022 08:24

The pages won't open.

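TiDBer_HJLsvyxd · May 25, 2022 08:30

Here is the setup:
I have one group of MySQL databases on 5 IPs and another group on 10 IPs,
15 sources in total.
The 5-IP group is sharded: each IP holds 10 database shards and 10 table shards, for 50 database shards and 50 table shards in total.
The 10-IP group is sharded as well: each IP holds 10 database shards and 10 table shards, for 100 database shards and 100 table shards in total.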

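I want to replicate these sharded databases and tables in real time using TiDB DM.
For the group with 50 database shards and 50 table shards, I configured 17 replication tasks, one configuration file per table.
For the group with 100 database shards and 100 table shards, I configured 9 replication tasks, one configuration file per table.
At 18:00 yesterday everything was fine. The DM monitoring dashboard showed 175 running tasks.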
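One way to reconcile the 175 running tasks on the dashboard with the 26 configured tasks is to assume the dashboard counts one subtask per task per source. That reading is an interpretation of the numbers above, not something the thread confirms, and the dictionary layout below is purely illustrative.

```python
# Assumption: every task runs one subtask on each source (IP) in its group.
groups = {
    "5-IP group": {"sources": 5, "tasks": 17},
    "10-IP group": {"sources": 10, "tasks": 9},
}


def subtask_count(groups):
    """Sum sources * tasks over all groups."""
    return sum(g["sources"] * g["tasks"] for g in groups.values())


total_tasks = sum(g["tasks"] for g in groups.values())
total_subtasks = subtask_count(groups)
print(total_tasks, total_subtasks)  # 26 175
```

Under this assumption, 17 × 5 + 9 × 10 = 175, which matches the figure reported on the dashboard.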

Then, around 1:00 this morning, I received an alert. On inspection, one particular task had fallen behind on binlog replication.

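After repeated attempts during the day, I found that once the tasks are stopped, starting them again is very difficult. Only about 60 tasks can be started; after that, the error "RawCause: context deadline exceeded" appears.
When I restarted the task that had fallen behind on binlog, the worker log also showed "RawCause: context deadline exceeded".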

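TiDBer_HJLsvyxd · May 25, 2022 08:33

We are currently in the testing phase with TiDB DM. If it runs stably for a month, we will adopt it officially.
I don't want to drop all the dm_meta tables and start over. I want to use the failures that show up during this run as a quick rehearsal of failure recovery.
After all, we plan to use DM to replicate business databases to TiDB in real time, and TiDB will then serve as the ODS layer of a real-time data warehouse. Because the pipeline is real time, latency requirements are strict. When nothing goes wrong it works really well, with replication latency under 1 second. But if a failure cannot be handled quickly, it has a large impact on the downstream real-time processing chain, and we would not feel safe relying on it.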

Billmay表妹 · May 27, 2022 07:53

Try opening them through a VPN.

buchuitoudegou · May 27, 2022 08:23

The log shows a connection error, which points to a database connection problem. Things to check:

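Whether the physical machine hosting each dm-worker can connect to TiDB.
On TiDB, check the max_connections variable and run show processlist to see how many connections the TiDB instance is using during replication. Running out of connections can also cause this error (a large number of tasks occupies a large number of connections).
Whether your upstream sharded tables have a primary key. Without one, DM joins all columns together as the condition so that it can locate exactly the row you modified. This can easily make execution on TiDB very slow, so that DM's replication progress falls far behind the upstream.
The issue mentioned above concerns an execution error when multiple DMLs are merged into one for replication (which makes it impossible to tell which DML failed). The underlying problem is still that the DMLs cannot be replicated to the downstream successfully.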
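To make the connection-count check above concrete, here is a minimal sketch that tallies rows shaped like the output of show processlist per client host and compares the total with a limit. The row dictionaries, sample hosts, and function name are illustrative assumptions; in practice the rows would come from a MySQL client connected to TiDB. Treating a limit of 0 as "no limit" is also an assumption to verify against your TiDB version.

```python
from collections import Counter


def summarize_connections(rows, max_connections):
    """Count connections per client host and report remaining headroom.

    rows: dicts with a "Host" key in "ip:port" form, as in show processlist.
    max_connections: configured limit; 0 is treated as "no limit" (assumption).
    """
    per_host = Counter(row["Host"].rsplit(":", 1)[0] for row in rows)
    total = sum(per_host.values())
    headroom = None if max_connections == 0 else max_connections - total
    return per_host, total, headroom


# Hypothetical sample: two dm-worker hosts holding three connections.
sample = [
    {"Id": 1, "User": "dm", "Host": "10.0.0.1:50001"},
    {"Id": 2, "User": "dm", "Host": "10.0.0.1:50002"},
    {"Id": 3, "User": "dm", "Host": "10.0.0.2:50003"},
]
per_host, total, headroom = summarize_connections(sample, max_connections=4)
print(per_host.most_common(), total, headroom)
```

A small or negative headroom while many tasks are running would be consistent with the connection exhaustion described above.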
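The point about tables without a primary key can be illustrated with a small sketch of how the condition that identifies a changed row grows when no key is available. This is a simplified illustration, not DM's actual code; the function name, column names, and sample row are hypothetical.

```python
def build_where(columns, row, key_columns=None):
    """Return (where_clause, args) that identify one row.

    With key_columns, only those columns are compared. Without them,
    every column is compared, mirroring the fallback described above.
    """
    names = key_columns if key_columns else columns
    parts, args = [], []
    for name in names:
        value = row[name]
        if value is None:
            # NULL cannot be matched with "=", so use IS NULL instead.
            parts.append(f"`{name}` IS NULL")
        else:
            parts.append(f"`{name}` = %s")
            args.append(value)
    return " AND ".join(parts), args


columns = ["id", "name", "city", "note"]
row = {"id": 7, "name": "a", "city": "x", "note": None}

with_key = build_where(columns, row, key_columns=["id"])
without_key = build_where(columns, row)
print(with_key)
print(without_key)
```

With a key, the condition compares a single indexed column. Without one, every column takes part in the comparison, and unless a suitable index exists the downstream typically has to examine many rows for each replicated change.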

P.S. Your logs contain business data. Please be careful when sharing them.

system · October 31, 2022 19:19

This topic was automatically closed 1 minute after the last reply. New replies are no longer allowed.