drainer频繁重启,Error 8141: assertion failed

TiDB 使用环境】生产环境
【 TiDB 版本】5.1.4
【复现路径】
【遇到的问题:问题现象及影响】

  1. pump状态正常


    但是drainer日志上也有些 can’t read binlog from pump 信息。

  2. drainer频繁重启,运行一会就重启,错误日志上显示 Error 8141: assertion failed .

  1. drainer配置

  1. pump日志

日志文本:
[2024/11/05 14:23:03.819 +08:00] [INFO] [syncer.go:252] [“handleSuccess quit”]
[2024/11/05 14:23:03.819 +08:00] [ERROR] [syncer.go:488] [“Failed to close syncer”] [error=“Error 8141: assertion failed: key: 7480000000000001f05f69800000000000000201003800440
0440044ff0032003900460041ff0034003500460033ff0044004600380042ff0041004100310042ff0035003000460046ff0032003700320043ff0039003400450041ff0041003000390041ff0036003900380039ff00000
00000000000f7, assertion: Exist, start_ts: 453715632524886020, existing start ts: 0, existing commit ts: 0”] [errorVerbose=“Error 8141: assertion failed: key: 7480000000000001f
05f698000000000000002010038004400440044ff0032003900460041ff0034003500460033ff0044004600380042ff0041004100310042ff0035003000460046ff0032003700320043ff0039003400450041ff004100300
0390041ff0036003900380039ff0000000000000000f7, assertion: Exist, start_ts: 453715632524886020, existing start ts: 0, existing commit ts: 0\ngithub.com/pingcap/errors.AddStack\n
\t/nfs/cache/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.Trace\n\t/nfs/cache/mod/github.com/pingcap/errors@v0.1
1.5-0.20201126102027-b0a155152ca3/juju_adaptor.go:15\ngithub.com/pingcap/tidb-binlog/pkg/loader.(*tx).commit\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pin
gcap/tidb-binlog/pkg/loader/executor.go:128\ngithub.com/pingcap/tidb-binlog/pkg/loader.(*executor).singleExec\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pi
ngcap/tidb-binlog/pkg/loader/executor.go:382\ngithub.com/pingcap/tidb-binlog/pkg/loader.(*executor).singleExecRetry.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/g
ithub.com/pingcap/tidb-binlog/pkg/loader/executor.go:308\ngithub.com/pingcap/tidb-binlog/pkg/util.RetryContext\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/p
ingcap/tidb-binlog/pkg/util/util.go:186\ngithub.com/pingcap/tidb-binlog/pkg/loader.(*executor).singleExecRetry\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/p
ingcap/tidb-binlog/pkg/loader/executor.go:307\ngithub.com/pingcap/tidb-binlog/pkg/loader.(*loaderImpl).execByHash.func1\n\t/home/jenkins/agent/workspace/build-common/go/src/git
hub.com/pingcap/tidb-binlog/pkg/loader/load.go:434\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/nfs/cache/mod/golang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup
/errgroup.go:57\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371”]
[2024/11/05 14:23:03.819 +08:00] [ERROR] [server.go:291] [“syncer exited abnormal”] [error=“failed to add item: Error 8141: assertion failed: key: 7480000000000001f05f698000000
000000002010038004400440044ff0032003900460041ff0034003500460033ff0044004600380042ff0041004100310042ff0035003000460046ff0032003700320043ff0039003400450041ff0041003000390041ff003
6003900380039ff0000000000000000f7, assertion: Exist, start_ts: 453715632524886020, existing start ts: 0, existing commit ts: 0”] [errorVerbose=“Error 8141: assertion failed: ke
y: 7480000000000001f05f698000000000000002010038004400440044ff0032003900460041ff0034003500460033ff0044004600380042ff0041004100310042ff0035003000460046ff0032003700320043ff0039003
400450041ff0041003000390041ff0036003900380039ff0000000000000000f7, assertion: Exist, start_ts: 453715632524886020, existing start ts: 0, existing commit ts: 0\ngithub.com/pingc
ap/errors.AddStack\n\t/nfs/cache/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\ngithub.com/pingcap/errors.Trace\n\t/nfs/cache/mod/github.com
/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/juju_adaptor.go:15\ngithub.com/pingcap/tidb-binlog/pkg/loader.(*tx).commit\n\t/home/jenkins/agent/workspace/build-common/g
o/src/github.com/pingcap/tidb-binlog/pkg/loader/executor.go:128\ngithub.com/pingcap/tidb-binlog/pkg/loader.(*executor).singleExec\n\t/home/jenkins/agent/workspace/build-common/
go/src/github.com/pingcap/tidb-binlog/pkg/loader/executor.go:382\ngithub.com/pingcap/tidb-binlog/pkg/loader.(*executor).singleExecRetry.func1\n\t/home/jenkins/agent/workspace/b
uild-common/go/src/github.com/pingcap/tidb-binlog/pkg/loader/executor.go:308\ngithub.com/pingcap/tidb-binlog/pkg/util.RetryContext\n\t/home/jenkins/agent/workspace/build-common
/go/src/github.com/pingcap/tidb-binlog/pkg/util/util.go:186\ngithub.com/pingcap/tidb-binlog/pkg/loader.(*executor).singleExecRetry\n\t/home/jenkins/agent/workspace/build-common
/go/src/github.com/pingcap/tidb-binlog/pkg/loader/executor.go:307\ngithub.com/pingcap/tidb-binlog/pkg/loader.(*loaderImpl).execByHash.func1\n\t/home/jenkins/agent/workspace/bui
ld-common/go/src/github.com/pingcap/tidb-binlog/pkg/loader/load.go:434\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/nfs/cache/mod/golang.org/x/sync@v0.0.0-20201020160332-6
7f06af15bc9/errgroup/errgroup.go:57\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371\nfailed to add item”]

这个报错一般是主键冲突、或者索引不一致插入失败造成的

看日志报错,有启动时间超过pump的gc了

删除冲突的数据,然后重启一下
./binlogctl -cmd drainers --pd-urls=http://IP:2379 --state=online

几个问题吧:

  1. drainer’s checkpoint is older xxx 这个日志就字面意思,drainer 同步失败后长时间没有启动成功,pump gc 默认 7 天,如果改小了,drainer 失败时间或者同步 lag 很大,超过了 pump gc 时间,就会有这个错误,简单来说就是需要消费的数据没了,drainer 就拒绝启动了。
  2. Error 8141: assertion failed: key: 7480000000000001f05f698000000000000002010038004400440044ff0032003900460041ff0034003500460033ff0044004600380042ff0041004100310042ff0035003000460046ff0032003700320043ff0039003400450041ff0041003000390041ff0036003900380039ff0000000000000000f7 这个日志看起来是 Drainer 内部进行了一项断言,但断言失败了,表明遇到了不一致的状态。
    那么解析下这个 key 即可:
补充下 你这个报错日志是:(root@127.0.0.1) [(none)]>select tidb_decode_key('7480000000000001f05f698000000000000002010038004400440044ff0032003900460041ff0034003500460033ff0044004600380042ff0041004100310042ff0035003000460046ff0032003700320043ff0039003400450041ff0041003000390041ff0036003900380039ff0000000000000000f7');
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tidb_decode_key('7480000000000001f05f698000000000000002010038004400440044ff0032003900460041ff0034003500460033ff0044004600380042ff0041004100310042ff0035003000460046ff0032003700320043ff0039003400450041ff0041003000390041ff0036003900380039ff0000000000000000f7'                                                                      |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"index_id":2,"index_vals":"\u00008\u0000D\u0000D\u0000D\u00002\u00009\u0000F\u0000A\u00004\u00005\u0000F\u00003\u0000D\u0000F\u00008\u0000B\u0000A\u0000A\u00001\u0000B\u00005\u00000\u0000F\u0000F\u00002\u00007\u00002\u0000C\u00009\u00004\u0000E\u0000A\u0000A\u00000\u00009\u0000A\u00006\u00009\u00008\u00009","table_id":496} |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

去看下 table id 496 表,index id 是 2 的这个数据的行怎么回事。是不是有什么非法数据或者什么的。admin check 下这个表检查下。
3. drainer 同步也推荐有有效索引。
4. 版本都 EOL 了,快升级。

admin checkinformation_schema > select * from tables where tidb_table_id=496\G;
*************************** 1. row ***************************
TABLE_CATALOG: def
TABLE_SCHEMA: community
TABLE_NAME: sq_visit_user_deviceid
TABLE_TYPE: BASE TABLE
ENGINE: InnoDB
VERSION: 10
ROW_FORMAT: Compact
TABLE_ROWS: 3877302
AVG_ROW_LENGTH: 141
DATA_LENGTH: 549905324
MAX_DATA_LENGTH: 0
INDEX_LENGTH: 185889852
DATA_FREE: 0
AUTO_INCREMENT: 1
CREATE_TIME: 2022-09-29 16:13:24
UPDATE_TIME: NULL
CHECK_TIME: NULL
TABLE_COLLATION: utf8mb4_general_ci
CHECKSUM: NULL
CREATE_OPTIONS:
TABLE_COMMENT: 用户设备访问记录表
TIDB_TABLE_ID: 496
TIDB_ROW_ID_SHARDING_INFO: NOT_SHARDED
TIDB_PK_TYPE: NONCLUSTERED

show create table community.sq_visit_user_deviceid\G
*************************** 1. row ***************************
Table: sq_visit_user_deviceid
Create Table: CREATE TABLE sq_visit_user_deviceid (
id int(11) NOT NULL AUTO_INCREMENT COMMENT ‘自增id’,
deviceid varchar(200) COLLATE utf8mb4_general_ci NOT NULL COMMENT ‘设备id’,
first_time datetime NOT NULL COMMENT ‘首次访问时间’,
last_time datetime NOT NULL COMMENT ‘最后访问时间’,
count int(11) NOT NULL DEFAULT ‘0’ COMMENT ‘访问次数’,
first_tabid int(11) NOT NULL DEFAULT ‘0’ COMMENT ‘首次访问频道id’,
last_tabid int(11) NOT NULL DEFAULT ‘0’ COMMENT ‘最新访问频道id’,
first_userid int(11) NOT NULL DEFAULT ‘0’ COMMENT ‘首次访问用户id’,
last_userid int(11) NOT NULL DEFAULT ‘0’ COMMENT ‘最新访问用户id’,
first_pm varchar(100) COLLATE utf8mb4_general_ci NOT NULL DEFAULT ‘’ COMMENT ‘首次访问端类型’,
last_pm varchar(100) COLLATE utf8mb4_general_ci NOT NULL DEFAULT ‘’ COMMENT ‘最新访问端类型’,
last_id int(11) NOT NULL DEFAULT ‘0’ COMMENT ‘最新访问内容id’,
is_del tinyint(4) DEFAULT ‘0’ COMMENT ‘是否删除0否1是’,
created_stime datetime NOT NULL DEFAULT CURRENT_TIMESTAMP COMMENT ‘创建时间’,
modified_stime datetime NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP COMMENT ‘更新时间’,
UNIQUE KEY uniq_id (id),
UNIQUE KEY uniq_deviceid (deviceid)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci AUTO_INCREMENT=8010001 COMMENT=‘用户设备访问记录表’
1 row in set (0.00 sec)

https://docs.pingcap.com/zh/tidb/v5.4/sql-statement-admin-check-table-index#admin-check-tableindex
admin check table community.sq_visit_user_deviceid ; 表比较大,check会有风险不? 答:无风险。
检查源表还是目标表? 答:目标表

目标实例上tidb-server log:

2024/11/06 22:28:52.299 +08:00] [ERROR] [session.go:773] [“assertion failed”] [conn=2191605662] [session_alias=] [message=“[tikv:8141]assertion failed: key: 7480000000000001f05f698000000000000002010038004400440044ff0032003900460041ff0034003500460033ff0044004600380042ff0041004100310042ff0035003000460046ff0032003700320043ff0039003400450041ff0041003000390041ff0036003900380039ff0000000000000000f7, assertion: Exist, start_ts: 453745923130130444, existing start ts: 0, existing commit ts: 0”] [“mvcc history”=“{"key":"7480000000000001F05F698000000000000002010038004400440044FF0032003900460041FF0034003500460033FF0044004600380042FF0041004100310042FF0035003000460046FF0032003700320043FF0039003400450041FF0041003000390041FF0036003900380039FF0000000000000000F7","mvcc":{"info":{"writes":[{"type":3,"start_ts":453745922867986439,"commit_ts":453745922867986439}]}},"regionID":51698}”]
[2024/11/06 22:28:52.299 +08:00] [WARN] [session.go:946] [“can not retry txn”] [conn=2191605662] [session_alias=] [label=general] [error=“[tikv:8141]assertion failed: key: 7480000000000001f05f698000000000000002010038004400440044ff0032003900460041ff0034003500460033ff0044004600380042ff0041004100310042ff0035003000460046ff0032003700320043ff0039003400450041ff0041003000390041ff0036003900380039ff0000000000000000f7, assertion: Exist, start_ts: 453745923130130444, existing start ts: 0, existing commit ts: 0”] [IsBatchInsert=false] [IsPessimistic=false] [InRestrictedSQL=false] [tidb_retry_limit=10] [tidb_disable_txn_auto_retry=true]
[2024/11/06 22:28:52.299 +08:00] [WARN] [session.go:962] [“commit failed”] [conn=2191605662] [session_alias=] [“finished txn”=“Txn{state=invalid}”] [error=“[tikv:8141]assertion failed: key: 7480000000000001f05f698000000000000002010038004400440044ff0032003900460041ff0034003500460033ff0044004600380042ff0041004100310042ff0035003000460046ff0032003700320043ff0039003400450041ff0041003000390041ff0036003900380039ff0000000000000000f7, assertion: Exist, start_ts: 453745923130130444, existing start ts: 0, existing commit ts: 0”]
[2024/11/06 22:28:52.299 +08:00] [WARN] [session.go:2287] [“run statement failed”] [conn=2191605662] [session_alias=] [schemaVersion=75] [error=“previous statement: REPLACE INTO community.sq_visit_user_deviceid(count,created_stime,deviceid,first_pm,first_tabid,first_time,first_userid,id,is_del,last_id,last_pm,last_tabid,last_time,last_userid,modified_stime) VALUES(193144,‘2022-10-21 14:21:20’,‘8ddd29fa45f3df8baa1b50ff272c94eaa09a6989’,‘1’,49,‘2022-10-21 14:21:20’,0,895317,0,63034,‘1’,49,‘2024-11-06 13:40:23’,0,‘2024-11-06 13:40:23’): [tikv:8141]assertion failed: key: 7480000000000001f05f698000000000000002010038004400440044ff0032003900460041ff0034003500460033ff0044004600380042ff0041004100310042ff0035003000460046ff0032003700320043ff0039003400450041ff0041003000390041ff0036003900380039ff0000000000000000f7, assertion: Exist, start_ts: 453745923130130444, existing start ts: 0, existing commit ts: 0”] [session=“{\n "currDBName": "",\n "id": 2191605662,\n "status": 2,\n "strictMode": false,\n "user": {\n "Username": "drainer_user",\n "Hostname": "10.28.2.149",\n "CurrentUser": false,\n "AuthUsername": "drainer_user",\n "AuthHostname": "10.%",\n "AuthPlugin": "mysql_native_password"\n }\n}”]

尝试手动执行SQL出错:

数据很奇怪,表中显示有deviceid=‘8ddd29fa45f3df8baa1b50ff272c94eaa09a6989’,但是select 不出来。

用dumpling/tidb-lightning: 成功。

./bin/dumpling -h 10.xx -P 4000 -udbxx -plsxxx -B community -T community.sq_visit_user_deviceid --filetype sql -t 64 -o /data/dbatemp/tmp/community_sq_visit_user_deviceid_843 -r 200000 -F256MiB

tiup tidb-lightning:v7.5.3 -config tidb-lightning-commentnew.toml

定位下来就是下游有个奇奇怪怪的数据,修复下就好了。

此话题已在最后回复的 7 天后被自动关闭。不再允许新回复。