Drainer exits abnormally: not found table id

To help resolve this faster, please provide the following information; a clear problem description gets answered sooner:
【TiDB deployment environment】
Production

【Overview】 Scenario + problem summary
No configuration changes were made; drainer suddenly reported an error and exited.

【Backup and data migration strategy】
pump – drainer – kafka

【Background】 Operations performed
No maintenance operations (dirty data in the database cannot be ruled out).

【Symptoms】 Application and database symptoms
Replication is interrupted.

【Problem】 Current issue
Drainer has already exited abnormally several times with the error not found table id. For some of the reported table ids a table can be found, for others it cannot.

When it cannot be found:
curl http://tidb-server01:10080/schema?table_id=49499
[schema:1146]Table which ID = 49499 does not exist.

No ERROR-level messages were found in the pump logs.

【Business impact】
None so far

【TiDB version】
v4.0.4

【Drainer log】

[2022/03/15 15:06:49.261 +08:00] [INFO] [syncer.go:257] ["write save point"] [ts=431837150463918090]
[2022/03/15 15:07:15.314 +08:00] [INFO] [syncer.go:257] ["write save point"] [ts=431837158328238094]
[2022/03/15 15:07:44.544 +08:00] [INFO] [syncer.go:257] ["write save point"] [ts=431837166205665281]
[2022/03/15 15:14:59.591 +08:00] [INFO] [client.go:772] ["[sarama] client/metadata fetching metadata for [tidb-binlog-pre] from broker kafka01:9092"]
[2022/03/15 15:15:08.412 +08:00] [INFO] [syncer.go:257] ["write save point"] [ts=431837284236001287]
[2022/03/15 15:15:08.537 +08:00] [INFO] [async_producer.go:1015] ["[sarama] Producer shutting down."]
[2022/03/15 15:15:08.537 +08:00] [INFO] [client.go:227] ["[sarama] Closing Client"]
[2022/03/15 15:15:08.537 +08:00] [INFO] [syncer.go:249] ["handleSuccess quit"]
[2022/03/15 15:15:08.537 +08:00] [INFO] [broker.go:253] ["[sarama] Closed connection to broker kafka02:9092"]
[2022/03/15 15:15:08.537 +08:00] [INFO] [async_producer.go:717] ["[sarama] producer/broker/1 input chan closed"]
[2022/03/15 15:15:08.537 +08:00] [INFO] [broker.go:253] ["[sarama] Closed connection to broker kafka01:9092"]
[2022/03/15 15:15:08.537 +08:00] [INFO] [async_producer.go:801] ["[sarama] producer/broker/1 shut down"]
[2022/03/15 15:15:08.537 +08:00] [ERROR] [server.go:289] ["syncer exited abnormal"] [error="filterTable failed: not found table id: 49499"] [errorVerbose="not found table id: 49499
github.com/pingcap/tidb-binlog/drainer.filterTable
    /Users/weizheng/go/src/github.com/letusgoing/tidb-binlog/drainer/syncer.go:514
github.com/pingcap/tidb-binlog/drainer.(*Syncer).run
    /Users/weizheng/go/src/github.com/letusgoing/tidb-binlog/drainer/syncer.go:368
github.com/pingcap/tidb-binlog/drainer.(*Syncer).Start
    /Users/weizheng/go/src/github.com/letusgoing/tidb-binlog/drainer/syncer.go:132
github.com/pingcap/tidb-binlog/drainer.(*Server).Start.func4
    /Users/weizheng/go/src/github.com/letusgoing/tidb-binlog/drainer/server.go:288
github.com/pingcap/tidb-binlog/drainer.(*taskGroup).start.func1
    /Users/weizheng/go/src/github.com/letusgoing/tidb-binlog/drainer/util.go:71
runtime.goexit
    /usr/local/go/src/runtime/asm_amd64.s:1374
filterTable failed"]
[2022/03/15 15:15:08.537 +08:00] [INFO] [util.go:68] [Exit] [name=syncer]
[2022/03/15 15:15:08.537 +08:00] [INFO] [server.go:451] ["begin to close drainer server"]
[2022/03/15 15:15:08.540 +08:00] [INFO] [server.go:416] ["has already update status"] [id=drainer:8249]
[2022/03/15 15:15:08.540 +08:00] [INFO] [server.go:455] ["commit status done"]
[2022/03/15 15:15:08.540 +08:00] [INFO] [pump.go:77] ["pump is closing"] [id=pump03:8250]
[2022/03/15 15:15:08.541 +08:00] [INFO] [pump.go:77] ["pump is closing"] [id=pump02:8250]
[2022/03/15 15:15:08.540 +08:00] [INFO] [util.go:68] [Exit] [name=heartbeat]
[2022/03/15 15:15:08.541 +08:00] [INFO] [pump.go:77] ["pump is closing"] [id=pump01:8250]
[2022/03/15 15:15:08.541 +08:00] [INFO] [collector.go:135] ["publishBinlogs quit"]
[2022/03/15 15:15:08.541 +08:00] [INFO] [util.go:68] [Exit] [name=collect]
[2022/03/15 15:15:08.546 +08:00] [INFO] [main.go:73] ["drainer exit"]

【Binlog is enabled on both tidb-server instances】

[binlog]
enable = true
ignore-error = true
write-timeout = "15s"

【Drainer configuration】

# WARNING: This file is auto-generated. Do not edit! All your modification will be overwritten!
# You can use 'tiup cluster edit-config' and 'tiup cluster reload' to update the configuration
# All configuration items you want to change can be added to:
# server_configs:
#   drainer:
#     aa.b1.c3: value
#     aa.b2.c4: value
[syncer]
db-type = "kafka"
ignore-schemas = "INFORMATION_SCHEMA,METRICS_SCHEMA,PERFORMANCE_SCHEMA,aaa,aliprod119,bak_marketing_0925,dusto_job,goinception,mysql,stress_testing,test,test33,test55,thisaliprod,thisisuat,tidb_loader,txmanager,uat_databank,uat_goods,uat_grid,uat_info,uat_integration,uat_job,uat_marketing,uat_member,uat_message,uat_nacos,uat_order,uat_pay,uat_product,uat_report,uat_saga,uat_security,uat_stock,uat_support,uat_zipkin,pre_databank,pre_info,pre_job,pre_message,pre_nacos,pre_nacos113,pre_product,pre_report,pre_security,pre_support,dev_member,dev_databank,dev_message,dev_info,dev_order,dev_pay,dev_marketing,dev_report,dev_stock,dev_grid,dev_support,dev_security,dev_goods"

[[syncer.ignore-table]]
db-name = "test"
tbl-name = "test"

[[syncer.ignore-table]]
db-name = "pre_member"
tbl-name = "member_info_temp_20190925"
[syncer.to]
kafka-addrs = "kafka01:9092,kafka02:9092,kafka03:9092"
kafka-max-message-size = 1610612736
kafka-max-messages = 1536
kafka-version = "1.0.2"
topic-name = "tidb-binlog-pre"

Check the filter conditions.

Do you mean the drainer's ignore configuration?

This can still cause problems, because a TiDB cluster has only one DDL owner; it is this owner that executes the DDL and writes its binlog. In your scenario the DDL owner may be assigned to a TiDB instance that does not have binlog enabled, so no binlog is produced for the DDL, and drainer then errors on every binlog entry related to that table.
Confirm whether the DDL owner is on a TiDB instance without binlog enabled, then try restarting drainer.
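One way to check this (a rough sketch; the host, port, and user below are placeholders for your environment, and the exact columns of ADMIN SHOW DDL may differ slightly between versions):

# Show the current DDL owner (look at the OWNER_ID / OWNER_ADDRESS columns)
mysql -h tidb-server01 -P 4000 -u root -p -e "ADMIN SHOW DDL\G"

# List every TiDB server registered in the cluster via the status port,
# then check whether the owner instance has [binlog] enable = true in its config
curl -s http://tidb-server01:10080/info/all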

Take a look and see whether this answer helps you.

Thanks. I had already found that post before asking here. I checked the configuration files of both tidb-server instances, and both contain:
[binlog]
enable = true
ignore-error = true
write-timeout = "15s"

Do I need to check any other configuration?

not found table id: 49499

You can check which table this table id corresponds to.

I cannot find a table for the id in this error:

[tidb@pre-central-01 ~]$ curl http://tidb-server01:10080/schema?table_id=49499
[schema:1146]Table which ID = 49499 does not exist.[tidb@pre-central-01 ~]$
[tidb@pre-central-01 ~]$
[tidb@pre-central-01 ~]$ curl http://tidb-server02:10080/schema?table_id=49499
[schema:1146]Table which ID = 49499 does not exist.[tidb@pre-central-01 ~]$

For other occurrences of the error the id sometimes does map to a table; the table id is different each time drainer exits with the error.
For example, the id from the latest occurrence can be found:

[tidb@pre-central-01 ~]$ curl http://tidb-server01:10080/schema?table_id=49534
{
 "id": 49534,
 "name": {
  "O": "dim_str_info",
  "L": "dim_str_info"
 },
 "charset": "utf8",
 "collate": "utf8_bin",
 "cols": [
  {
   "id": 1,
   "name": {
    "O": "str_code",
    "L": "str_code"
   },
   "offset": 0,
   "origin_default": null,
   "default": null,
   "default_bit": null,
   "default_is_expr": false,
   "generated_expr_string": "",
   "generated_stored": false,
   "dependences": null,
   "type": {
    "Tp": 15,
    "Flag": 4,
    "Flen": 30,
    "Decimal": 0,
    "Charset": "utf8",
    "Collate": "utf8_bin",
    "Elems": null
   },
   "state": 5,
   "comment": "门店编码",
   "hidden": false,
   "version": 2
  },
  {
   "id": 2,
   "name": {
    "O": "str_code_ls",
    "L": "str_code_ls"
   },
   "offset": 1,
   "origin_default": null,
   "default": null,

    ......
    ......
    (the rest of this JSON output is very long and is omitted here)

I noticed that the errors all involve recently created tables. How should I investigate the cause or fix this?

Right. Is this table covered by the filter conditions?

I have confirmed that the databases these tables belong to are in the ignore list.


Some of the tables can no longer be found by their id, so I cannot verify this for them.

I am now starting to worry that ignore-error will eventually trigger a critical error. According to the documentation that would require re-syncing the entire dataset, and with our data volume that is not something we can afford.

  1. Is the binlog component version 4.0.4 as well?
  2. From the error, it should come from here:
    https://github.com/pingcap/tidb-binlog/blob/v4.0.4/drainer/syncer.go#L514
    The table id obtained is 49499:
    https://github.com/pingcap/tipb/blob/master/go-binlog/binlog.pb.go#L125
  3. Could you upload the complete pump and drainer logs from before [2022/03/15 15:15:08.537 +08:00] (for example, the preceding 2 hours)? If you have tidb.log for the same period, please upload it as well.
  4. Please also upload the admin show ddl jobs output covering the 2 hours before [2022/03/15 15:15:08.537 +08:00] (one way to export it is sketched below).
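For item 4, a minimal sketch of one way to export the DDL history for upload (host, user, and the row limit 200 are placeholders; ADMIN SHOW DDL JOBS accepts an optional row count):

# Dump the most recent DDL jobs to a tab-separated file that can be attached here
mysql -h tidb-server01 -P 4000 -u root -p --batch -e "ADMIN SHOW DDL JOBS 200;" > ddl_jobs.tsv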

I have extracted and uploaded the logs.
The admin show ddl jobs output is in the CSV file.
The binlog component is version 4.0.4, installed with tiup.
tidb-log.zip (22.1 MB)

  1. From the DDL history, table id 49499 corresponds to:
    49502 uat_grid operating_store_group truncate table public 21391 49499 0 2022/3/15 15:20 2022/3/15 15:20 synced
  2. tidb.log also records entries related to 49499 (pulled out with a grep like the one sketched after this list):
    [2022/03/15 15:15:07.242 +08:00] [INFO] [domain.go:127] ["diff load InfoSchema success"] [usedSchemaVersion=38861] [neededSchemaVersion=38862] ["start time"=1.769535ms] [tblIDs="[49416,49499]"]
    [2022/03/15 15:20:30.467 +08:00] [INFO] [ddl_worker.go:584] ["[ddl] run DDL job"] [worker="worker 1, tp general"] [job="ID:49502, Type:truncate table, State:none, SchemaState:none, SchemaID:21391, TableID:49499, RowCount:0, ArgLen:0, start time: 2022-03-15 15:20:29.82 +0800 CST, Err:, ErrCount:0, SnapshotVersion:0"]
    [2022/03/15 15:20:30.476 +08:00] [INFO] [region_cache.go:828] ["switch region leader to specific leader due to kv return NotLeader"] [regionID=34] [currIdx=2] [leaderStoreID=1]
    [2022/03/15 15:20:30.484 +08:00] [INFO] [domain.go:127] ["diff load InfoSchema success"] [usedSchemaVersion=38862] [neededSchemaVersion=38863] ["start time"=1.616594ms] [tblIDs="[49499,49501]"]
    [2022/03/15 15:20:30.485 +08:00] [INFO] [schema_validator.go:246] ["the schema validator enqueue, queue is too long"] ["delta max count"=1024] ["remove schema version"=37235]
    [2022/03/15 15:20:30.522 +08:00] [INFO] [server.go:388] ["new connection"] [conn=86608235] [remoteAddr=172.16.16.213:44842]
    [2022/03/15 15:20:30.525 +08:00] [INFO] [set.go:207] ["set session var"] [conn=86608235] [name=character_set_results] [val=NULL]
    [2022/03/15 15:20:30.533 +08:00] [INFO] [ddl_worker.go:773] ["[ddl] wait latest schema version changed"] [worker="worker 1, tp general"] [ver=38863] ["take time"=53.724945ms] [job="ID:49502, Type:truncate table, State:done, SchemaState:public, SchemaID:21391, TableID:49499, RowCount:0, ArgLen:2, start time: 2022-03-15 15:20:29.82 +0800 CST, Err:, ErrCount:0, SnapshotVersion:0"]
    [2022/03/15 15:20:30.535 +08:00] [INFO] [delete_range.go:346] ["[ddl] insert into delete-range table"] [jobID=49502] [elementID=49499]
    [2022/03/15 15:20:30.539 +08:00] [INFO] [delete_range.go:105] ["[ddl] add job into delete-range table"] [jobID=49502] [jobType="truncate table"]
    [2022/03/15 15:20:30.540 +08:00] [INFO] [ddl_worker.go:367] ["[ddl] finish DDL job"] [worker="worker 1, tp general"] [job="ID:49502, Type:truncate table, State:synced, SchemaState:public, SchemaID:21391, TableID:49499, RowCount:0, ArgLen:2, start time: 2022-03-15 15:20:29.82 +0800 CST, Err:, ErrCount:0, SnapshotVersion:0"]
    [2022/03/15 15:20:30.767 +08:00] [INFO] [server.go:391] ["connection closed"] [conn=86607910]
  3. What does this table's schema look like? Was uat_grid.operating_store_group being frequently truncated, dropped, and re-created at the time?
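For reference, a grep is one way to pull the 49499-related lines out of the uploaded logs (a small sketch; the file names assume the contents of tidb-log.zip):

# Locate every log line that mentions table id 49499, with line numbers
grep -n '49499' tidb.log
grep -n '49499' drainer.log pump.log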

Table structure:


The operation was indeed a truncate, but it was only run once.

I also found that the other tables involved in the errors were being truncated as well. However, those tables had been truncated continuously for half a year before the first incident on March 14, and drainer never went down during that time.

This is another table from the most recent error:



Does every truncate change the table's table id? Could the problem be that drainer fails to find the table when it looks it up by the new table id?

Truncate does change the table id, but drainer updates the table id when it receives the truncate, so that is not the cause. From the logs it looks like drainer did not receive that truncate statement, so the table id was not updated and the subsequent DML failed. If the error keeps occurring, please enable debug logging to make the investigation easier.
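To see the id change directly, you can compare TIDB_TABLE_ID before and after a TRUNCATE on a throwaway table (a small sketch; test.t_truncate_check is a made-up name, it assumes this version exposes the TiDB-specific TIDB_TABLE_ID column in information_schema.tables, and host/user are placeholders):

# Create a throwaway table and record its table id
mysql -h tidb-server01 -P 4000 -u root -p -e "CREATE TABLE test.t_truncate_check (id INT);"
mysql -h tidb-server01 -P 4000 -u root -p -e "SELECT TIDB_TABLE_ID FROM information_schema.tables WHERE table_schema='test' AND table_name='t_truncate_check';"
# Truncate it, then read the table id again; it should now be a different (larger) id
mysql -h tidb-server01 -P 4000 -u root -p -e "TRUNCATE TABLE test.t_truncate_check;"
mysql -h tidb-server01 -P 4000 -u root -p -e "SELECT TIDB_TABLE_ID FROM information_schema.tables WHERE table_schema='test' AND table_name='t_truncate_check';"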

You can curl http://host:10080/info/all to check whether the DDL binlog failed to be written and was then skipped because of ignore-error.
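A sketch of that check (the servers_info / binlog_status field names are from memory and may differ by version; the host is a placeholder):

# Query every TiDB server's info, including its binlog write status, via the status API
curl -s http://tidb-server01:10080/info/all
# If jq is available, narrow the output down to the interesting fields, e.g.:
curl -s http://tidb-server01:10080/info/all | jq '.servers_info[] | {ddl_id, binlog_status}'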

After uploading the debug logs I tried restarting all the tidb-server components and tested the truncate operation several times; the error has not reappeared so far.

I plan to keep observing and testing for a few more days.

So far drainer has not gone into the down state again.
