使用 br 从 v4.0.14 备份恢复到 v.6.5.8,并使用 cdc 同步,迁移后 gc 卡住

【 TiDB 使用环境】生产环境
【 TiDB 版本】
【遇到的问题:问题现象及影响】
使用 br 从 v4.0.14 升级到了 v6.5.8 ,并使用 cdc 进行实时同步,结果在切入流量后 gc 卡住,tidb server 的 gc 正常推进,但是监控上 kv 的 gc 时间一直停留 在较早的时间;通过尝试重启 gc leader 的 tidb server 以及切换 pd 后,扔没有恢复.
【资源配置】进入到 TiDB Dashboard -集群信息 (Cluster Info) -主机(Hosts) 截图此页面
【附件:截图/日志/监控】


image

[2024/07/20 03:19:07.682 +08:00] [INFO] [range_task.go:310] [“canceling range task because of error”] [name=resolve-locks-runner] [st
artKey=7480000000000011285f728000000001d2b53c] [endKey=7480000000000011285f728000000002eb4124] [error=“context canceled”]
[2024/07/20 03:19:07.682 +08:00] [INFO] [range_task.go:310] [“canceling range task because of error”] [name=resolve-locks-runner] [st
artKey=748000000000000f6e5f728000000000b8818c] [endKey=748000000000000f6e5f728000000000c599c4] [error=“context canceled”]
[2024/07/20 03:19:07.682 +08:00] [INFO] [range_task.go:310] [“canceling range task because of error”] [name=resolve-locks-runner] [st
artKey=7480000000000011285f6980000000000000030142786f326b385967ff4846424b356b4870ff615a6e626d375653ff6a35697339456168ff00000000000000
00f703800000000186183c] [endKey=7480000000000011285f7280000000009a5800] [error=“context canceled”]
[2024/07/20 03:19:07.682 +08:00] [INFO] [region_request.go:1285] [“send request failed, err: context canceled”] [req-ts=4511911237003
50976] [req-type=ScanLock] [region=“{ region id: 76937, ver: 11838, confVer: 124 }”] [region-is-valid=true] [retry-times=0] [replica-
read-type=leader] [replica-selector-state=accessKnownLeader] [stale-read=false] [replica-status=“peer: 77131, store: 1, isEpochStale:
false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable; peer: 77132, store: 4,
isEpochStale: false, attempts: 1, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable; peer: 77
130, store: 5, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reach
able; peer: 2352384, store: 7, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liven
ess-state: reachable; peer: 2352556, store: 11, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: reso
lved, store-liveness-state: reachable; peer: 2419708, store: 2314744, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch
: 0, store-state: resolved, store-liveness-state: reachable”] [total-backoff-ms=0] [total-backoff-times=0] [total-region-errors=]
[2024/07/20 03:19:07.682 +08:00] [INFO] [range_task.go:310] [“canceling range task because of error”] [name=resolve-locks-runner] [st
artKey=748000000000000f6e5f728000000000ac81ac] [endKey=748000000000000f6e5f728000000000b8818c] [error=“context canceled”]
[2024/07/20 03:19:07.682 +08:00] [INFO] [range_task.go:310] [“canceling range task because of error”] [name=resolve-locks-runner] [st
artKey=7480000000000011285f7280000000009a5800] [endKey=7480000000000011285f728000000001d2b53c] [error=“context canceled”]
[2024/07/20 03:19:07.682 +08:00] [INFO] [region_request.go:1285] [“send request failed, err: context canceled”] [req-ts=4511911237003
50976] [req-type=ScanLock] [region=“{ region id: 2484216, ver: 12517, confVer: 45886 }”] [region-is-valid=true] [retry-times=0] [repl
ica-read-type=leader] [replica-selector-state=accessKnownLeader] [stale-read=false] [replica-status=“peer: 2484217, store: 2314744, i
sEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable; peer: 2484
218, store: 10, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reac
hable; peer: 2484219, store: 6, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-live
ness-state: reachable; peer: 2484220, store: 248, isEpochStale: false, attempts: 1, replica-epoch: 0, store-epoch: 0, store-state: re
solved, store-liveness-state: reachable; peer: 2484221, store: 8, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0,
store-state: resolved, store-liveness-state: reachable; peer: 2484222, store: 5, isEpochStale: false, attempts: 0, replica-epoch: 0,
store-epoch: 0, store-state: resolved, store-liveness-state: reachable”] [total-backoff-ms=0] [total-backoff-times=0] [total-region-
errors=]
[2024/07/20 03:19:07.682 +08:00] [INFO] [range_task.go:310] [“canceling range task because of error”] [name=resolve-locks-runner] [st
artKey=7480000000000011285f698000000000000001038000000003f87c0d] [endKey=7480000000000011285f6980000000000000020419b3906d9a0000000380
000000069bd2e7] [error=“context canceled”]
[2024/07/20 03:19:07.682 +08:00] [INFO] [region_request.go:1285] [“send request failed, err: context canceled”] [req-ts=4511911237003
50976] [req-type=ScanLock] [region=“{ region id: 4791199, ver: 12370, confVer: 16960 }”] [region-is-valid=true] [retry-times=0] [repl
ica-read-type=leader] [replica-selector-state=accessKnownLeader] [stale-read=false] [replica-status=“peer: 4791200, store: 2314744, i
sEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reachable; peer: 4845
168, store: 5, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liveness-state: reach
able; peer: 4845233, store: 1, isEpochStale: false, attempts: 1, replica-epoch: 0, store-epoch: 0, store-state: resolved, store-liven
ess-state: reachable; peer: 4845316, store: 12, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0, store-state: reso
lved, store-liveness-state: reachable; peer: 4866911, store: 10, isEpochStale: false, attempts: 0, replica-epoch: 0, store-epoch: 0,
store-state: resolved, store-liveness-state: reachable; peer: 4870837, store: 7, isEpochStale: false, attempts: 0, replica-epoch: 0,
store-epoch: 0, store-state: resolved, store-liveness-state: reachable”] [total-backoff-ms=0] [total-backoff-times=0] [total-region-e
rrors=]
[2024/07/20 03:19:07.682 +08:00] [INFO] [range_task.go:310] [“canceling range task because of error”] [name=resolve-locks-runner] [st
artKey=748000000000000f6e5f728000000001156f98] [endKey=74800000000000108e5f72800000000008cff2] [error=“context canceled”]
[2024/07/20 03:19:07.682 +08:00] [INFO] [range_task.go:233] [“range task failed”] [name=resolve-locks-runner] [startKey=] [endKey=] [
“cost time”=1.788887891s] [error=“context canceled”]
[2024/07/20 03:19:07.682 +08:00] [ERROR] [gc_worker.go:1057] [“[gc worker] resolve locks failed”] [uuid=642e69e50b00080] [safePoint=4
51191123700350976] [error=“context canceled”]
[2024/07/20 03:19:07.682 +08:00] [ERROR] [gc_worker.go:628] [“[gc worker] resolve locks returns an error”] [uuid=642e69e50b00080] [er
ror=“context canceled”]
[2024/07/20 03:19:07.682 +08:00] [ERROR] [gc_worker.go:226] [“[gc worker] runGCJob”] [error=“context canceled”]

grep GC 的日志如下:
[2024/07/26 14:08:16.916 +08:00] [ERROR] [gc_worker.go:226] [“[gc worker] runGCJob”] [error=“unexpected resolve err: commit_ts_expired:<start_ts:450761024419921921 attempted_commit_ts:450761024642744333 key:"t\200\000\000\000\000\000\021(_i\200\000\000\000\000\000\000\002\004\031\263\2708\207\000\000\000\003\200\000\000\000\t%\320\227" min_commit_ts:450761024904626185 > “]
[2024/07/26 14:18:16.896 +08:00] [ERROR] [gc_worker.go:226] [”[gc worker] runGCJob”] [error=“scanRegion from PD failed, startKey: "t\x80\x00\x00\x00\x00\x00\x11-_i\x80\x00\x00\x00\x00\x00\x00\x02\x01GJjJuFQ1\xffwXddXtbf\xffpYUeB35M\xffRm7tLGzZ\xff\x00\x00\x00\x00\x00\x00\x00\x00\xf7", limit: ‘\u0080’, err: rpc error: code = Canceled desc = context canceled”]
[2024/07/26 14:28:16.920 +08:00] [ERROR] [gc_worker.go:226] [“[gc worker] runGCJob”] [error=“context canceled”]
[2024/07/26 14:38:16.897 +08:00] [ERROR] [gc_worker.go:226] [“[gc worker] runGCJob”] [error=“context canceled”]
[2024/07/26 14:48:16.909 +08:00] [ERROR] [gc_worker.go:226] [“[gc worker] runGCJob”] [error=“unexpected resolve err: commit_ts_expired:<start_ts:450761024419921921 attempted_commit_ts:450761024642744333 key:"t\200\000\000\000\000\000\021(_i\200\000\000\000\000\000\000\002\004\031\263\2708\207\000\000\000\003\200\000\000\000\t%\320\227" min_commit_ts:450761024904626185 > “]
[2024/07/26 14:58:16.896 +08:00] [ERROR] [gc_worker.go:226] [”[gc worker] runGCJob”] [error=“context canceled”]
[2024/07/26 15:08:16.908 +08:00] [ERROR] [gc_worker.go:226] [“[gc worker] runGCJob”] [error=“context canceled”]
[2024/07/26 15:18:16.928 +08:00] [ERROR] [gc_worker.go:226] [“[gc worker] runGCJob”] [error=“scanRegion from PD failed, startKey: "t\x80\x00\x00\x00\x00\x00\x11-_i\x80\x00\x00\x00\x00\x00\x00\x02\x013zDHA6dn\xffUTx2DTED\xffn3Dk1jKM\xffvPAGwHwZ\xff\x00\x00\x00\x00\x00\x00\x00\x00\xf7", limit: ‘\u0080’, err: rpc error: code = Canceled desc = context canceled”]
[2024/07/26 15:28:16.904 +08:00] [ERROR] [gc_worker.go:226] [“[gc worker] runGCJob”] [error=“context canceled”]
[2024/07/26 15:38:16.962 +08:00] [ERROR] [gc_worker.go:226] [“[gc worker] runGCJob”] [error=“context canceled”]
[2024/07/26 15:48:16.830 +08:00] [ERROR] [gc_worker.go:226] [“[gc worker] runGCJob”] [error=“context canceled”]
[2024/07/26 15:58:16.853 +08:00] [ERROR] [gc_worker.go:226] [“[gc worker] runGCJob”] [error=“context canceled”]
[2024/07/26 16:08:16.876 +08:00] [ERROR] [gc_worker.go:226] [“[gc worker] runGCJob”] [error=“context canceled”]
[2024/07/26 16:18:16.846 +08:00] [ERROR] [gc_worker.go:226] [“[gc worker] runGCJob”] [error=“context canceled”]
[2024/07/26 16:28:16.842 +08:00] [ERROR] [gc_worker.go:226] [“[gc worker] runGCJob”] [error=“context canceled”]
[2024/07/26 16:38:16.856 +08:00] [ERROR] [gc_worker.go:226] [“[gc worker] runGCJob”] [error=“context canceled”]
[2024/07/26 16:48:16.931 +08:00] [ERROR] [gc_worker.go:226] [“[gc worker] runGCJob”] [error=“context canceled”]
[2024/07/26 16:58:16.842 +08:00] [ERROR] [gc_worker.go:226] [“[gc worker] runGCJob”] [error=“context canceled”]
[2024/07/26 17:08:16.902 +08:00] [ERROR] [gc_worker.go:226] [“[gc worker] runGCJob”] [error=“context canceled”]
[2024/07/26 17:18:16.867 +08:00] [ERROR] [gc_worker.go:226] [“[gc worker] runGCJob”] [error=“scanRegion from PD failed, startKey: "t\x80\x00\x00\x00\x00\x00\x11-_i\x80\x00\x00\x00\x00\x00\x00\x01\x03\x80\x00\x00\x00\x82?\x87\xb9", limit: ‘\u0080’, err: rpc error: code = Canceled desc = context canceled”]
[2024/07/26 17:28:16.892 +08:00] [ERROR] [gc_worker.go:226] [“[gc worker] runGCJob”] [error=“unexpected resolve err: commit_ts_expired:<start_ts:450761024419921921 attempted_commit_ts:450761024642744333 key:"t\200\000\000\000\000\000\021(_i\200\000\000\000\000\000\000\002\004\031\263\2708\207\000\000\000\003\200\000\000\000\t%\320\227" min_commit_ts:450761024904626185 > “]
[2024/07/26 17:38:16.872 +08:00] [ERROR] [gc_worker.go:226] [”[gc worker] runGCJob”] [error=“context canceled”]
[2024/07/26 17:48:16.853 +08:00] [ERROR] [gc_worker.go:226] [“[gc worker] runGCJob”] [error=“context canceled”]
按照 gc 的步骤,应该是卡在了第二步 resolve locks 时失败了.

» region key --format=encode ‘t\200\000\000\000\000\000\021(_i\200\000\000\000\000\000\000\002\004\031\263\2708\207\000\000\000\003\200\000\000\000\t%\320\227’
{“id”:2211936,“start_key”:“6E00000000000000F8”,“end_key”:“7480000000000000FF185F728000000000FF0075BC0000000000FA”,“epoch”:{“conf_ver”:12298,“version”:194},“peers”:[{“role_name”:“Learner”,“is_learner”:true,“id”:2349021,“store_id”:2314744,“role”:1},{“role_name”:“Voter”,“id”:3582611,“store_id”:9},{“role_name”:“Voter”,“id”:5374913,“store_id”:6},{“role_name”:“Voter”,“id”:5382250,“store_id”:4},{“role_name”:“Voter”,“id”:5388561,“store_id”:8},{“role_name”:“Voter”,“id”:5389373,“store_id”:1}],“leader”:{“role_name”:“Voter”,“id”:5389373,“store_id”:1},“cpu_usage”:0,“written_bytes”:1137,“read_bytes”:7940318,“written_keys”:16,“read_keys”:113980,“approximate_size”:97,“approximate_keys”:562180}
» region 2211936
{“id”:2211936,“start_key”:“6E00000000000000F8”,“end_key”:“7480000000000000FF185F728000000000FF0075BC0000000000FA”,“epoch”:{“conf_ver”:12298,“version”:194},“peers”:[{“role_name”:“Learner”,“is_learner”:true,“id”:2349021,“store_id”:2314744,“role”:1},{“role_name”:“Voter”,“id”:3582611,“store_id”:9},{“role_name”:“Voter”,“id”:5374913,“store_id”:6},{“role_name”:“Voter”,“id”:5382250,“store_id”:4},{“role_name”:“Voter”,“id”:5388561,“store_id”:8},{“role_name”:“Voter”,“id”:5389373,“store_id”:1}],“leader”:{“role_name”:“Voter”,“id”:5389373,“store_id”:1},“cpu_usage”:0,“written_bytes”:1827,“read_bytes”:8432230,“written_keys”:27,“read_keys”:121356,“approximate_size”:97,“approximate_keys”:562180}
mysql> select TIDB_DECODE_KEY(‘6E00000000000000F8’);
±--------------------------------------+
| TIDB_DECODE_KEY(‘6E00000000000000F8’) |
±--------------------------------------+
| 6E00000000000000F8 |
±--------------------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> select TIDB_DECODE_KEY(‘7480000000000000FF185F728000000000FF0075BC0000000000FA’);
±--------------------------------------------------------------------------+
| TIDB_DECODE_KEY(‘7480000000000000FF185F728000000000FF0075BC0000000000FA’) |
±--------------------------------------------------------------------------+
| {“_tidb_rowid”:30140,“table_id”:“24”} |
±--------------------------------------------------------------------------+
1 row in set (0.00 sec)
mysql> select * from TABLES where TIDB_TABLE_ID = 24\G;
*************************** 1. row ***************************
TABLE_CATALOG: def
TABLE_SCHEMA: mysql
TABLE_NAME: stats_histograms
TABLE_TYPE: BASE TABLE
ENGINE: InnoDB
VERSION: 10
ROW_FORMAT: Compact
TABLE_ROWS: 0
AVG_ROW_LENGTH: 0
DATA_LENGTH: 0
MAX_DATA_LENGTH: 0
INDEX_LENGTH: 0
DATA_FREE: 0
AUTO_INCREMENT: NULL
CREATE_TIME: 2024-05-29 18:03:31
UPDATE_TIME: NULL
CHECK_TIME: NULL
TABLE_COLLATION: utf8mb4_bin
CHECKSUM: NULL
CREATE_OPTIONS:
TABLE_COMMENT:
TIDB_TABLE_ID: 24
TIDB_ROW_ID_SHARDING_INFO: NULL
TIDB_PK_TYPE: NONCLUSTERED
TIDB_PLACEMENT_POLICY_NAME: NULL

mysql> select count() from stats_histograms;
±---------+
| count(
) |
±---------+
| 20579 |
±---------+
看着是统计信息的表卡住了 GC

如果你已经确认 相关 key 是应该被删除的。可以考虑去对应 tikv 使用 tikv-ctl 执行 recover-mvcc 操作。

https://docs.pingcap.com/zh/tidb/stable/tikv-control#恢复损坏的-mvcc-数据