Data loss when Pump and Drainer replicate binlog to Kafka

To help us respond efficiently, please provide the following information; clearly described problems are prioritized.

  • 【TiDB version】: v3.0.2
  • 【Problem description】:
    While using TiDB Binlog to replicate binlog to Kafka, we found data loss. The pump log shows the following errors:
    [2020/06/30 22:02:41.650 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=MacBook-Pro.local:8249] ["gc ts"=417593389670203392] ["drainer checkpoint"=417268548745035843]
    [2020/06/30 22:02:41.650 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=tikv-03:8248] ["gc ts"=417593389670203392] ["drainer checkpoint"=417251783089848329]

The error appears roughly every 10 minutes. Any help would be appreciated.


System information
+-------------+-----------------------+
|     Host    |        Release        |
+-------------+-----------------------+
| tidb-qa-101 | 3.10.0-862.el7.x86_64 |
| tidb-qa-01  | 3.10.0-862.el7.x86_64 |
| tikv-qa-03  | 3.10.0-862.el7.x86_64 |
| tikv-qa-02  | 3.10.0-862.el7.x86_64 |
| tikv-qa-01  | 3.10.0-862.el7.x86_64 |
+-------------+-----------------------+
TiDB cluster information
+--------------------+--------------+------+----+------+
|    TiDB_version    | Clu_replicas | TiDB | PD | TiKV |
+--------------------+--------------+------+----+------+
| 5.7.25-TiDB-v3.0.9 |      3       |  2   | 1  |  3   |
+--------------------+--------------+------+----+------+
Cluster node information
+------------+-------------+
|  Node_IP   | Server_info |
+------------+-------------+
| instance_0 |     tikv    |
| instance_1 |     tikv    |
| instance_2 |     tikv    |
| instance_3 |   pd+tidb   |
| instance_4 |     tidb    |
+------------+-------------+
Capacity & region count
+---------------------+-----------------+--------------+
| Storage_capacity_GB | Storage_used_GB | Region_count |
+---------------------+-----------------+--------------+
|        590.21       |      97.45      |    15960     |
+---------------------+-----------------+--------------+
QPS
+---------+----------------+-----------------+
| Clu_QPS | Duration_99_MS | Duration_999_MS |
+---------+----------------+-----------------+
|   3.36  |     22.08      |       3.96      |
+---------+----------------+-----------------+
Hot region information
+---------+----------+-----------+
|  Store  | Hot_read | Hot_write |
+---------+----------+-----------+
| store-1 |    0     |     0     |
| store-2 |    1     |     0     |
| store-7 |    0     |     0     |
+---------+----------+-----------+
Disk latency information
+--------+------------+-------------+--------------+
| Device |  Instance  | Read_lat_MS | Write_lat_MS |
+--------+------------+-------------+--------------+
|  sr0   | instance_4 |     nan     |     nan      |
|  sr0   | instance_3 |     nan     |     nan      |
|  sr0   | instance_0 |     nan     |     nan      |
|  sr0   | instance_1 |     nan     |     nan      |
|  sr0   | instance_2 |     nan     |     nan      |
|  vda   | instance_4 |     nan     |     1.16     |
|  vda   | instance_3 |    12.96    |     1.00     |
|  vda   | instance_0 |     9.93    |     1.18     |
|  vda   | instance_1 |     nan     |     1.43     |
|  vda   | instance_2 |     nan     |     1.29     |
|  vdb   | instance_4 |     nan     |     1.00     |
|  vdb   | instance_3 |     nan     |     nan      |
|  vdb   | instance_0 |     nan     |     nan      |
|  vdb   | instance_1 |     nan     |     nan      |
|  vdc   | instance_0 |     nan     |     1.06     |
|  vdc   | instance_1 |     nan     |     0.96     |
|  vdc   | instance_2 |     nan     |     0.94     |
|  vdd   | instance_2 |     nan     |     1.58     |
+--------+------------+-------------+--------------+

Hi,

In the current drainer version, binlogs that pass pump's GC time without having been consumed by a drainer are purged. You can adjust pump's GC time; the default is 7 days. Please check whether it has been changed.
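For reference, the retention period is the `gc` option in pump's configuration file (value shown is the documented default; the comments are mine):

```toml
# pump.toml (excerpt)
# Number of days to retain binlogs locally before they are purged.
# In this version, binlogs older than this can be GC'd even if a
# drainer's checkpoint has not advanced past them.
gc = 7
```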

We later enabled debug logging and found that before each error, pump checks the drainer checkpoints, and the gc ts stays the same across checks within the same hour. Is that normal? Looking forward to your reply.
[2020/07/02 13:09:00.600 +08:00] [INFO] [server.go:544] ["server info tick"] [writeBinlogCount=8] [alivePullerCount=1] [MaxCommitTS=417770121449177089]
[2020/07/02 13:09:00.600 +08:00] [INFO] [server.go:525] ["use gc ts to detect drainer checkpoint"] ["gc ts"=417611419641446400]
[2020/07/02 13:09:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=likaideMacBook-Pro.local:8249] ["gc ts"=417630294009446400] ["drainer checkpoint"=417268548745035843]
[2020/07/02 13:09:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=sh-tjpm-aries-tikv-qa-03:8248] ["gc ts"=417630294009446400] ["drainer checkpoint"=417251783089848329]

[2020/07/02 13:19:00.600 +08:00] [INFO] [server.go:525] ["use gc ts to detect drainer checkpoint"] ["gc ts"=417611419641446400]
[2020/07/02 13:19:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=likaideMacBook-Pro.local:8249] ["gc ts"=417630294009446400] ["drainer checkpoint"=417268548745035843]
[2020/07/02 13:19:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=sh-tjpm-aries-tikv-qa-03:8248] ["gc ts"=417630294009446400] ["drainer checkpoint"=417251783089848329]

[2020/07/02 13:29:00.600 +08:00] [INFO] [server.go:544] ["server info tick"] [writeBinlogCount=8] [alivePullerCount=1] [MaxCommitTS=417770435720773633]
[2020/07/02 13:29:00.600 +08:00] [INFO] [server.go:525] ["use gc ts to detect drainer checkpoint"] ["gc ts"=417611419641446400]
[2020/07/02 13:29:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=likaideMacBook-Pro.local:8249] ["gc ts"=417630294009446400] ["drainer checkpoint"=417268548745035843]
[2020/07/02 13:29:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=sh-tjpm-aries-tikv-qa-03:8248] ["gc ts"=417630294009446400] ["drainer checkpoint"=417251783089848329]

[2020/07/02 13:39:00.600 +08:00] [INFO] [server.go:525] ["use gc ts to detect drainer checkpoint"] ["gc ts"=417611419641446400]
[2020/07/02 13:39:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=likaideMacBook-Pro.local:8249] ["gc ts"=417630294009446400] ["drainer checkpoint"=417268548745035843]
[2020/07/02 13:39:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=sh-tjpm-aries-tikv-qa-03:8248] ["gc ts"=417630294009446400] ["drainer checkpoint"=417251783089848329]
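For what it's worth, the `gc ts` and `drainer checkpoint` values in these logs are TiDB TSO timestamps: the high bits encode physical Unix time in milliseconds and the low 18 bits are a logical counter, so the gap can be decoded directly. A minimal sketch (the helper name is mine), using the values from the first error in this thread:

```python
# Decode the physical time from a TiDB TSO timestamp.
# TSO layout: physical Unix milliseconds in the high bits,
# an 18-bit logical counter in the low bits.
from datetime import datetime, timezone

LOGICAL_BITS = 18

def tso_to_datetime(tso: int) -> datetime:
    physical_ms = tso >> LOGICAL_BITS
    return datetime.fromtimestamp(physical_ms / 1000, tz=timezone.utc)

gc_ts = 417593389670203392        # pump gc ts from the first error
checkpoint = 417268548745035843   # drainer checkpoint from the first error

print(tso_to_datetime(gc_ts))      # ~2020-06-24 UTC
print(tso_to_datetime(checkpoint)) # ~2020-06-10 UTC, about two weeks older
```

The checkpoint sits roughly two weeks behind the GC point, which is why pump reports that some binlogs between the two have already been purged.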

In that case those binlogs will not be replicated downstream. May I ask whether the checkpoint in the pb file was ever modified manually?

No, it hasn't been modified; everything uses the default parameters. We are still confirming which business scenario loses data. In my own testing, sequentially executed INSERTs also trigger the error above, but the binlog data is still replicated to Kafka correctly.

OK. First check what pump's GC is set to and let us know.

We have confirmed the business scenario, and the binlog parsing seems to have a problem.

Data before executing the SQL:
MySQL [action]> select * from test where id=30441;
+-------+------+------+---------------------+
| id    | num  | name | updateTime          |
+-------+------+------+---------------------+
| 30441 | 3039 | NULL | 2020-07-03 10:28:46 |
+-------+------+------+---------------------+
1 row in set (0.00 sec)

The SQL executed:
INSERT IGNORE INTO test (id,num,name) values(30441,3038,"test")
ON DUPLICATE KEY UPDATE
num = IF(num > VALUES(num), VALUES(num),num);
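Side note on the statement's semantics: when the ON DUPLICATE KEY UPDATE branch fires, `IF(num > VALUES(num), VALUES(num), num)` keeps the smaller of the existing and incoming `num` values. A minimal sketch of that resolution logic (illustration only, not drainer code; the function name is mine):

```python
def resolve_num(existing_num: int, incoming_num: int) -> int:
    """Mirror of `num = IF(num > VALUES(num), VALUES(num), num)`:
    keep the smaller of the existing and incoming values."""
    return incoming_num if existing_num > incoming_num else existing_num

# Existing row has num=3039 and the statement supplies num=3038,
# so if the duplicate-key update applies, the row's num would become 3038.
print(resolve_num(3039, 3038))  # 3038
```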

Message received by Kafka:
{"inst":"tidb","db":"cdp-action","tbl":"test","ltbl":"","part":"","tpart":"","e":"UPDATE","data":{"id":"30441","name":"","num":"3039","updateTime":"2020-07-03 10:20:58"},"ltime":"417790250621861889","tid":"","sql":""}

The message received in Kafka contains the row as it was before the modification, not after.

Could you help confirm whether this is expected?

Hi,

The existing row is 30441, 3039.

Executing insert ignore with 30441, 3038 (primary-key conflict, so this SQL is ignored).

Kafka replicates the row with 3039.

That process seems consistent with the facts, doesn't it?

We checked the code. We had modified the source on our side, and the bug is there.
Thanks for helping with the investigation!

Hi,

If you've confirmed this was caused by a bug, please report it to us via a GitHub issue. Please also check your private messages.

We modified the drainer source to fit our own business; drainer itself works correctly. :stuck_out_tongue_closed_eyes:

Is this analysis correct?

OK, please check your private messages.

This topic was automatically closed 1 minute after the last reply. New replies are no longer allowed.