Data loss when Pump and Drainer replicate binlog to Kafka

To help us respond efficiently, please provide the following information; clearly described problems are prioritized.

  • 【TiDB version】: v3.0.2
  • 【Problem description】:
    While using TiDB Binlog to replicate binlog to Kafka, we found data loss. The pump log shows the following errors:
    [2020/06/30 22:02:41.650 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=MacBook-Pro.local:8249] ["gc ts"=417593389670203392] ["drainer checkpoint"=417268548745035843]
    [2020/06/30 22:02:41.650 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=tikv-03:8248] ["gc ts"=417593389670203392] ["drainer checkpoint"=417251783089848329]

The error appears roughly every 10 minutes. Any help would be appreciated.


System information
+-------------+-----------------------+
|     Host    |        Release        |
+-------------+-----------------------+
| tidb-qa-101 | 3.10.0-862.el7.x86_64 |
| tidb-qa-01  | 3.10.0-862.el7.x86_64 |
| tikv-qa-03  | 3.10.0-862.el7.x86_64 |
| tikv-qa-02  | 3.10.0-862.el7.x86_64 |
| tikv-qa-01  | 3.10.0-862.el7.x86_64 |
+-------------+-----------------------+
TiDB cluster information
+--------------------+--------------+------+----+------+
|    TiDB_version    | Clu_replicas | TiDB | PD | TiKV |
+--------------------+--------------+------+----+------+
| 5.7.25-TiDB-v3.0.9 |      3       |  2   | 1  |  3   |
+--------------------+--------------+------+----+------+
Cluster node information
+------------+-------------+
|  Node_IP   | Server_info |
+------------+-------------+
| instance_0 |     tikv    |
| instance_1 |     tikv    |
| instance_2 |     tikv    |
| instance_3 |   pd+tidb   |
| instance_4 |     tidb    |
+------------+-------------+
Capacity & region count
+---------------------+-----------------+--------------+
| Storage_capacity_GB | Storage_used_GB | Region_count |
+---------------------+-----------------+--------------+
|        590.21       |      97.45      |    15960     |
+---------------------+-----------------+--------------+
QPS
+---------+----------------+-----------------+
| Clu_QPS | Duration_99_MS | Duration_999_MS |
+---------+----------------+-----------------+
|   3.36  |     22.08      |       3.96      |
+---------+----------------+-----------------+
Hot region information
+---------+----------+-----------+
|  Store  | Hot_read | Hot_write |
+---------+----------+-----------+
| store-1 |    0     |     0     |
| store-2 |    1     |     0     |
| store-7 |    0     |     0     |
+---------+----------+-----------+
Disk latency information
+--------+------------+-------------+--------------+
| Device |  Instance  | Read_lat_MS | Write_lat_MS |
+--------+------------+-------------+--------------+
|  sr0   | instance_4 |     nan     |     nan      |
|  sr0   | instance_3 |     nan     |     nan      |
|  sr0   | instance_0 |     nan     |     nan      |
|  sr0   | instance_1 |     nan     |     nan      |
|  sr0   | instance_2 |     nan     |     nan      |
|  vda   | instance_4 |     nan     |     1.16     |
|  vda   | instance_3 |    12.96    |     1.00     |
|  vda   | instance_0 |     9.93    |     1.18     |
|  vda   | instance_1 |     nan     |     1.43     |
|  vda   | instance_2 |     nan     |     1.29     |
|  vdb   | instance_4 |     nan     |     1.00     |
|  vdb   | instance_3 |     nan     |     nan      |
|  vdb   | instance_0 |     nan     |     nan      |
|  vdb   | instance_1 |     nan     |     nan      |
|  vdc   | instance_0 |     nan     |     1.06     |
|  vdc   | instance_1 |     nan     |     0.96     |
|  vdc   | instance_2 |     nan     |     0.94     |
|  vdd   | instance_2 |     nan     |     1.58     |
+--------+------------+-------------+--------------+

Hi,

In the current drainer version, binlogs that pass pump's GC time without having been consumed by a drainer are purged. You can adjust pump's GC time; the default is 7 days. Please check whether it has been changed.
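For reference, the retention period is the `gc` option in pump's configuration file (value shown is the documented default; the comments are mine):

```toml
# pump.toml (excerpt)
# Number of days to retain binlogs locally before they are purged.
# In this version, binlogs older than this can be GC'd even if a
# drainer's checkpoint has not advanced past them.
gc = 7
```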

We later enabled debug logging and found that before each error, pump checks the drainer checkpoints, and the gc ts stays the same across checks within the same hour. Is that normal? Looking forward to your reply.
[2020/07/02 13:09:00.600 +08:00] [INFO] [server.go:544] ["server info tick"] [writeBinlogCount=8] [alivePullerCount=1] [MaxCommitTS=417770121449177089]
[2020/07/02 13:09:00.600 +08:00] [INFO] [server.go:525] ["use gc ts to detect drainer checkpoint"] ["gc ts"=417611419641446400]
[2020/07/02 13:09:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=likaideMacBook-Pro.local:8249] ["gc ts"=417630294009446400] ["drainer checkpoint"=417268548745035843]
[2020/07/02 13:09:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=sh-tjpm-aries-tikv-qa-03:8248] ["gc ts"=417630294009446400] ["drainer checkpoint"=417251783089848329]

[2020/07/02 13:19:00.600 +08:00] [INFO] [server.go:525] ["use gc ts to detect drainer checkpoint"] ["gc ts"=417611419641446400]
[2020/07/02 13:19:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=likaideMacBook-Pro.local:8249] ["gc ts"=417630294009446400] ["drainer checkpoint"=417268548745035843]
[2020/07/02 13:19:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=sh-tjpm-aries-tikv-qa-03:8248] ["gc ts"=417630294009446400] ["drainer checkpoint"=417251783089848329]

[2020/07/02 13:29:00.600 +08:00] [INFO] [server.go:544] ["server info tick"] [writeBinlogCount=8] [alivePullerCount=1] [MaxCommitTS=417770435720773633]
[2020/07/02 13:29:00.600 +08:00] [INFO] [server.go:525] ["use gc ts to detect drainer checkpoint"] ["gc ts"=417611419641446400]
[2020/07/02 13:29:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=likaideMacBook-Pro.local:8249] ["gc ts"=417630294009446400] ["drainer checkpoint"=417268548745035843]
[2020/07/02 13:29:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=sh-tjpm-aries-tikv-qa-03:8248] ["gc ts"=417630294009446400] ["drainer checkpoint"=417251783089848329]

[2020/07/02 13:39:00.600 +08:00] [INFO] [server.go:525] ["use gc ts to detect drainer checkpoint"] ["gc ts"=417611419641446400]
[2020/07/02 13:39:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=likaideMacBook-Pro.local:8249] ["gc ts"=417630294009446400] ["drainer checkpoint"=417268548745035843]
[2020/07/02 13:39:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=sh-tjpm-aries-tikv-qa-03:8248] ["gc ts"=417630294009446400] ["drainer checkpoint"=417251783089848329]
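For what it's worth, the `gc ts` and `drainer checkpoint` values in these logs are TiDB TSO timestamps: the high bits encode physical Unix time in milliseconds and the low 18 bits are a logical counter, so the gap can be decoded directly. A minimal sketch (the helper name is mine), using the values from the first error in this thread:

```python
# Decode the physical time from a TiDB TSO timestamp.
# TSO layout: physical Unix milliseconds in the high bits,
# an 18-bit logical counter in the low bits.
from datetime import datetime, timezone

LOGICAL_BITS = 18

def tso_to_datetime(tso: int) -> datetime:
    physical_ms = tso >> LOGICAL_BITS
    return datetime.fromtimestamp(physical_ms / 1000, tz=timezone.utc)

gc_ts = 417593389670203392        # pump gc ts from the first error
checkpoint = 417268548745035843   # drainer checkpoint from the first error

print(tso_to_datetime(gc_ts))      # ~2020-06-24 UTC
print(tso_to_datetime(checkpoint)) # ~2020-06-10 UTC, about two weeks older
```

The checkpoint sits roughly two weeks behind the GC point, which is why pump reports that some binlogs between the two have already been purged.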

In that case those binlogs will not be replicated downstream. May I ask whether the checkpoint in the pb file was ever modified manually?

No, it hasn't been modified; everything uses the default parameters. We are still confirming which business scenario loses data. In my own testing, sequentially executed INSERTs also trigger the error above, but the binlog data is still replicated to Kafka correctly.

OK. First check what pump's GC is set to and let us know.

We have confirmed the business scenario, and the binlog parsing seems to have a problem.

Data before executing the SQL:
MySQL [action]> select * from test where id=30441;
+-------+------+------+---------------------+
| id    | num  | name | updateTime          |
+-------+------+------+---------------------+
| 30441 | 3039 | NULL | 2020-07-03 10:28:46 |
+-------+------+------+---------------------+
1 row in set (0.00 sec)

The SQL executed:
INSERT IGNORE INTO test (id,num,name) values(30441,3038,"test")
ON DUPLICATE KEY UPDATE
num = IF(num > VALUES(num), VALUES(num),num);
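Side note on the statement's semantics: when the ON DUPLICATE KEY UPDATE branch fires, `IF(num > VALUES(num), VALUES(num), num)` keeps the smaller of the existing and incoming `num` values. A minimal sketch of that resolution logic (illustration only, not drainer code; the function name is mine):

```python
def resolve_num(existing_num: int, incoming_num: int) -> int:
    """Mirror of `num = IF(num > VALUES(num), VALUES(num), num)`:
    keep the smaller of the existing and incoming values."""
    return incoming_num if existing_num > incoming_num else existing_num

# Existing row has num=3039 and the statement supplies num=3038,
# so if the duplicate-key update applies, the row's num would become 3038.
print(resolve_num(3039, 3038))  # 3038
```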

Message received by Kafka:
{"inst":"tidb","db":"cdp-action","tbl":"test","ltbl":"","part":"","tpart":"","e":"UPDATE","data":{"id":"30441","name":"","num":"3039","updateTime":"2020-07-03 10:20:58"},"ltime":"417790250621861889","tid":"","sql":""}

The message received in Kafka contains the row as it was before the modification, not after.

Could you help confirm whether this is expected?

Hi,

The existing row is 30441, 3039.

Executing insert ignore with 30441, 3038 (primary-key conflict, so this SQL is ignored).

Kafka replicates the row with 3039.

That process seems consistent with the facts, doesn't it?

We checked the code. We had modified the source on our side, and the bug is there.
Thanks for helping with the investigation!

Hi,

If you've confirmed this was caused by a bug, please report it to us via a GitHub issue. Please also check your private messages.

We modified the drainer source to fit our own business; drainer itself works correctly. :stuck_out_tongue_closed_eyes:

Is this analysis correct?

OK, please check your private messages.

This topic was automatically closed 1 minute after the last reply. New replies are no longer allowed.