咸鱼超人
July 1, 2020 12:01
1
To improve efficiency, please provide the information below when asking a question; clearly described problems can be answered sooner.
【TiDB version】: v3.0.2
【Problem description】:
We use TiDB Binlog to sync the binlog to Kafka and found data loss. The pump log shows the following error:
[2020/06/30 22:02:41.650 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=MacBook-Pro.local:8249] ["gc ts"=417593389670203392] ["drainer checkpoint"=417268548745035843]
[2020/06/30 22:02:41.650 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=tikv-03:8248] ["gc ts"=417593389670203392] ["drainer checkpoint"=417251783089848329]
The error appears roughly every 10 minutes. Any advice?
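For reference, the `gc ts` and `drainer checkpoint` values in the log are TiDB TSO timestamps: the high bits carry a physical Unix time in milliseconds and the low 18 bits are a logical counter. A minimal sketch for decoding them to wall-clock time, to see how far the checkpoint lags behind the GC point (helper name is mine):

```python
from datetime import datetime, timezone

def tso_to_datetime(tso: int) -> datetime:
    # A TiDB TSO packs the physical time (ms since the Unix epoch)
    # in the high bits and an 18-bit logical counter in the low bits.
    physical_ms = tso >> 18
    return datetime.fromtimestamp(physical_ms / 1000, tz=timezone.utc)

# Decode the pump gc ts and the lagging drainer checkpoint from the log:
print(tso_to_datetime(417593389670203392))  # pump gc ts
print(tso_to_datetime(417268548745035843))  # drainer checkpoint
```

Comparing the two decoded times shows whether the checkpoint is older than the GC retention window.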
{'tidb_log_dir': '{{ deploy_dir }}/log', 'dummy': None, 'tidb_port': 4000, 'tidb_status_port': 10080, 'tidb_cert_dir': '{{ deploy_dir }}/conf/ssl'}
System information
+-------------+-----------------------+
| Host        | Release               |
+-------------+-----------------------+
| tidb-qa-101 | 3.10.0-862.el7.x86_64 |
| tidb-qa-01  | 3.10.0-862.el7.x86_64 |
| tikv-qa-03  | 3.10.0-862.el7.x86_64 |
| tikv-qa-02  | 3.10.0-862.el7.x86_64 |
| tikv-qa-01  | 3.10.0-862.el7.x86_64 |
+-------------+-----------------------+
TiDB cluster information
+--------------------+--------------+------+----+------+
| TiDB_version | Clu_replicas | TiDB | PD | TiKV |
+--------------------+--------------+------+----+------+
| 5.7.25-TiDB-v3.0.9 | 3 | 2 | 1 | 3 |
+--------------------+--------------+------+----+------+
Cluster node information
+------------+-------------+
| Node_IP | Server_info |
+------------+-------------+
| instance_0 | tikv |
| instance_1 | tikv |
| instance_2 | tikv |
| instance_3 | pd+tidb |
| instance_4 | tidb |
+------------+-------------+
Capacity & region count
+---------------------+-----------------+--------------+
| Storage_capacity_GB | Storage_used_GB | Region_count |
+---------------------+-----------------+--------------+
| 590.21 | 97.45 | 15960 |
+---------------------+-----------------+--------------+
QPS
+---------+----------------+-----------------+
| Clu_QPS | Duration_99_MS | Duration_999_MS |
+---------+----------------+-----------------+
| 3.36 | 22.08 | 3.96 |
+---------+----------------+-----------------+
Hot region information
+---------+----------+-----------+
| Store | Hot_read | Hot_write |
+---------+----------+-----------+
| store-1 | 0 | 0 |
| store-2 | 1 | 0 |
| store-7 | 0 | 0 |
+---------+----------+-----------+
Disk latency information
+--------+------------+-------------+--------------+
| Device | Instance | Read_lat_MS | Write_lat_MS |
+--------+------------+-------------+--------------+
| sr0 | instance_4 | nan | nan |
| sr0 | instance_3 | nan | nan |
| sr0 | instance_0 | nan | nan |
| sr0 | instance_1 | nan | nan |
| sr0 | instance_2 | nan | nan |
| vda | instance_4 | nan | 1.16 |
| vda | instance_3 | 12.96 | 1.00 |
| vda | instance_0 | 9.93 | 1.18 |
| vda | instance_1 | nan | 1.43 |
| vda | instance_2 | nan | 1.29 |
| vdb | instance_4 | nan | 1.00 |
| vdb | instance_3 | nan | nan |
| vdb | instance_0 | nan | nan |
| vdb | instance_1 | nan | nan |
| vdc | instance_0 | nan | 1.06 |
| vdc | instance_1 | nan | 0.96 |
| vdc | instance_2 | nan | 0.94 |
| vdd | instance_2 | nan | 1.58 |
+--------+------------+-------------+--------------+
来了老弟
July 1, 2020 13:26
2
Hello,
In the current drainer version, binlogs that have passed pump's GC time without being consumed by the drainer are purged. You can adjust pump's GC time; the default is 7 days. Please check whether it has been changed.
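For reference, a sketch of where that retention is usually set, assuming a default deployment layout: pump's configuration file (commonly `pump.toml`) has a `gc` option giving the retention in days, e.g.:

```toml
# pump.toml (sketch): keep binlog files for 7 days before they are GC'd
gc = 7
```

If the drainer's checkpoint falls behind this window, the binlogs it still needs will have been purged.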
咸鱼超人
July 2, 2020 05:46
3
I later enabled debug logging and found that before each error it checks the drainer's checkpoint, and the gc ts timestamp stays the same within the same hour. Is that normal? Looking forward to a reply.
[2020/07/02 13:09:00.600 +08:00] [INFO] [server.go:544] ["server info tick"] [writeBinlogCount=8] [alivePullerCount=1] [MaxCommitTS=417770121449177089]
[2020/07/02 13:09:00.600 +08:00] [INFO] [server.go:525] ["use gc ts to detect drainer checkpoint"] ["gc ts"=417611419641446400]
[2020/07/02 13:09:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=likaideMacBook-Pro.local:8249] ["gc ts"=417630294009446400] ["drainer checkpoint"=417268548745035843]
[2020/07/02 13:09:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=sh-tjpm-aries-tikv-qa-03:8248] ["gc ts"=417630294009446400] ["drainer checkpoint"=417251783089848329]
[2020/07/02 13:19:00.600 +08:00] [INFO] [server.go:525] ["use gc ts to detect drainer checkpoint"] ["gc ts"=417611419641446400]
[2020/07/02 13:19:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=likaideMacBook-Pro.local:8249] ["gc ts"=417630294009446400] ["drainer checkpoint"=417268548745035843]
[2020/07/02 13:19:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=sh-tjpm-aries-tikv-qa-03:8248] ["gc ts"=417630294009446400] ["drainer checkpoint"=417251783089848329]
[2020/07/02 13:29:00.600 +08:00] [INFO] [server.go:544] ["server info tick"] [writeBinlogCount=8] [alivePullerCount=1] [MaxCommitTS=417770435720773633]
[2020/07/02 13:29:00.600 +08:00] [INFO] [server.go:525] ["use gc ts to detect drainer checkpoint"] ["gc ts"=417611419641446400]
[2020/07/02 13:29:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=likaideMacBook-Pro.local:8249] ["gc ts"=417630294009446400] ["drainer checkpoint"=417268548745035843]
[2020/07/02 13:29:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=sh-tjpm-aries-tikv-qa-03:8248] ["gc ts"=417630294009446400] ["drainer checkpoint"=417251783089848329]
[2020/07/02 13:39:00.600 +08:00] [INFO] [server.go:525] ["use gc ts to detect drainer checkpoint"] ["gc ts"=417611419641446400]
[2020/07/02 13:39:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=likaideMacBook-Pro.local:8249] ["gc ts"=417630294009446400] ["drainer checkpoint"=417268548745035843]
[2020/07/02 13:39:00.601 +08:00] [ERROR] [server.go:616] ["drainer's checkpoint is older than pump gc ts, some binlogs are purged"] [drainer=sh-tjpm-aries-tikv-qa-03:8248] ["gc ts"=417630294009446400] ["drainer checkpoint"=417251783089848329]
来了老弟
July 2, 2020 08:15
4
In that case those binlogs will not be synced downstream. Have you manually modified the checkpoint in the pb file?
咸鱼超人
July 2, 2020 08:17
5
No, we haven't modified it; everything uses the default parameters. We are still confirming which business scenario causes the data loss. In my own test, inserting data sequentially also produces the errors above, but the binlog data is synced to Kafka normally.
来了老弟
July 2, 2020 08:33
6
OK, first check what pump's GC is set to and let us know.
咸鱼超人
July 3, 2020 02:36
7
We have confirmed the business scenario and found that the binlog parsing seems to have a problem.
Data before executing the SQL:
MySQL [action]> select * from test where id=30441;
+-------+------+------+---------------------+
| id    | num  | name | updateTime          |
+-------+------+------+---------------------+
| 30441 | 3039 | NULL | 2020-07-03 10:28:46 |
+-------+------+------+---------------------+
1 row in set (0.00 sec)
SQL executed:
INSERT IGNORE INTO test (id,num,name) values(30441,3038,"test")
ON DUPLICATE KEY UPDATE
num = IF(num > VALUES(num), VALUES(num),num);
Message received in Kafka:
{"inst":"tidb","db":"cdp-action","tbl":"test","ltbl":"","part":"","tpart":"","e":"UPDATE","data":{"id":"30441","name":"","num":"3039","updateTime":"2020-07-03 10:20:58"},"ltime":"417790250621861889","tid":"","sql":""}
The message received in Kafka contains the data from before the update, not after.
Could you help confirm whether this is expected?
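For what it's worth, the `ON DUPLICATE KEY UPDATE ... IF(...)` clause in the SQL above keeps the smaller `num` on a key conflict. A minimal Python sketch of that conditional, to make the expected resulting value explicit (function name is illustrative):

```python
def resolved_num(existing_num: int, incoming_num: int) -> int:
    # Mirrors: num = IF(num > VALUES(num), VALUES(num), num)
    # i.e. on a duplicate key, keep the smaller of the two values.
    return incoming_num if existing_num > incoming_num else existing_num

# The row already holds num=3039; the INSERT carries num=3038.
print(resolved_num(3039, 3038))  # → 3038
```

Under this reading, the row's `num` after the statement should be 3038, so the message fields can be checked against that expectation.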
来了老弟
July 3, 2020 02:44
8
hi,
The existing row is 30441, 3039.
Executing insert ignore with 30441, 3038 (primary-key conflict, so this SQL is ignored).
Kafka syncs the 3039 data.
That process seems to match the facts, doesn't it?
咸鱼超人
July 3, 2020 03:29
9
We checked the code. We had modified the source on our side, and the bug is in our change.
Thanks for helping troubleshoot!
来了老弟
July 3, 2020 03:31
10
hi,
If you've confirmed it's caused by a bug, you can report it to us in a GitHub issue. Please check your private messages.
咸鱼超人
July 3, 2020 03:31
11
We modified the drainer source code to adapt it to our own business; drainer itself works correctly.
system
Closed
October 31, 2022 19:06
13
This topic was automatically closed 1 minute after the last reply. New replies are no longer allowed.