【TiDB 使用环境】生产环境 /测试/ Poc
【TiDB 版本】7.5.1 -->8.5.1
【操作系统】centos7.9
【部署方式】物理机 ssd
【集群数据量】1T+
【集群节点数】3
【问题复现路径】做过哪些操作出现的问题
步骤:br备份,br恢复,创建cdc同步。
参数文件:
case-sensitive = true
enable-sync-point = true
sync-point-interval = “10m”
sync-point-retention = “1h”
[consistent]
level = “eventual”
max-log-size = 64
flush-interval = 2000
storage = “s3://xxxxx/xxx/xxx?access-key=xxx&secret-access-key=xxxxx&endpoint=http://xxxxx:xxxx&force-path-style=true”
[filter]
rules = [‘.’]
【遇到的问题:问题现象及影响】
cdc的checkpoint_time不动
minio中没有生成redo的目录
下游确实有数据传递过去了
查看状态一直是:
{
“upstream_id”: 7341229539313770488,
“namespace”: “default”,
“id”: “dr-primary-to-secondary01”,
“state”: “normal”,
“checkpoint_tso”: 456542685585997837,
“checkpoint_time”: “2025-03-10 10:02:14.346”,
“error”: null
}
【资源配置】进入到 TiDB Dashboard -集群信息 (Cluster Info) -主机(Hosts) 截图此页面
【复制黏贴 ERROR 报错的日志】
没有error日志
[2025/03/10 17:20:11.974 +08:00] [INFO] [memory_control.go:170] [“send area feedback”] [area=0] [feedbackType=PauseArea] [memoryUsageRatio=0.8000649348832667] [lastTime=2025/03/10 17:20:11.974 +08:00] [now=2025/03/10 17:20:11.974 +08:00] [sinceLastTime=14.164µs]
[2025/03/10 17:20:11.975 +08:00] [INFO] [subscription_client.go:370] [“subscription client pause push region event”]
[2025/03/10 17:20:11.975 +08:00] [INFO] [subscription_client.go:370] [“subscription client pause push region event”]
[2025/03/10 17:20:11.974 +08:00] [INFO] [memory_control.go:170] [“send area feedback”] [area=0] [feedbackType=PauseArea] [memoryUsageRatio=0.8000309788621962] [lastTime=2025/03/10 17:20:11.742 +08:00] [now=2025/03/10 17:20:11.974 +08:00] [sinceLastTime=232.266209ms]
[2025/03/10 17:20:21.777 +08:00] [WARN] [barrier_event.go:389] [“barrier event is not resolved”] [changefeed=dr-primary-to-secondary01] [commitTs=456542807654400000] [isSyncPoint=true] [selected=false] [writerDispatcherAdvanced=false] [coverage=“reported count: 97, require count: 126”] [blocker=] [resend=null]
[2025/03/10 17:20:22.122 +08:00] [INFO] [memory_control.go:170] [“send area feedback”] [area=0] [feedbackType=ResumeArea] [memoryUsageRatio=0.49998086132109165] [lastTime=2025/03/10 17:20:11.974 +08:00] [now=2025/03/10 17:20:22.122 +08:00] [sinceLastTime=10.148007699s]
[2025/03/10 17:20:22.123 +08:00] [INFO] [subscription_client.go:374] [“subscription client resume push region event”]
[2025/03/10 17:20:22.409 +08:00] [INFO] [memory_control.go:170] [“send area feedback”] [area=0] [feedbackType=PauseArea] [memoryUsageRatio=0.8000111323781312] [lastTime=2025/03/10 17:20:22.122 +08:00] [now=2025/03/10 17:20:22.409 +08:00] [sinceLastTime=286.344869ms]
[2025/03/10 17:20:22.409 +08:00] [INFO] [subscription_client.go:370] [“subscription client pause push region event”]
[2025/03/10 17:20:23.257 +08:00] [INFO] [barrier.go:104] [“handle block status”] [from=8123ac50-2b55-4c2b-98d7-ad490cee9a49] [changefeed=dr-primary-to-secondary01] [detail=“changefeedID:<high:10757762319605898412 low:2326884390507815090 name:"dr-primary-to-secondary01" namespace:"default" > blockStatuses:<ID:<high:9940413477000645800 low:11116363142267610296 > state:<IsBlocked:true BlockTs:456542807654400000 BlockTables:<> IsSyncPoint:true stage:WAITING > > “]
[2025/03/10 17:20:31.795 +08:00] [WARN] [barrier_event.go:389] [“barrier event is not resolved”] [changefeed=dr-primary-to-secondary01] [commitTs=456542807654400000] [isSyncPoint=true] [selected=false] [writerDispatcherAdvanced=false] [coverage=“reported count: 98, require count: 126”] [blocker=] [resend=null]
[2025/03/10 17:20:32.296 +08:00] [WARN] [persist_storage.go:629] [“schema store upper bound not changed”]
[2025/03/10 17:20:32.296 +08:00] [INFO] [subscription_client.go:896] [“subscription client initializes a region too slow”] [subscriptionID=213] [slowRegion=”{"RegionID":176405,"ResolvedTs":456542685585997837,"Initialized":false,"Created":"2025-03-10T17:10:30.363416021+08:00"}”]
[2025/03/10 17:20:32.296 +08:00] [INFO] [subscription_client.go:890] [“subscription client finds a initialized slow region”] [subscriptionID=214] [slowRegion=“{"RegionID":71178,"ResolvedTs":456542685585997837,"Initialized":true,"Created":"2025-03-10T17:10:30.365205175+08:00"}”]
[2025/03/10 17:20:32.296 +08:00] [INFO] [subscription_client.go:896] [“subscription client initializes a region too slow”] [subscriptionID=248] [slowRegion=“{"RegionID":368212,"ResolvedTs":456542685585997837,"Initialized":false,"Created":"2025-03-10T17:10:30.373028137+08:00"}”]
[2025/03/10 17:20:32.296 +08:00] [INFO] [subscription_client.go:896] [“subscription client initializes a region too slow”] [subscriptionID=234] [slowRegion=“{"RegionID":15344,"ResolvedTs":456542685585997837,"Initialized":false,"Created":"2025-03-10T17:10:30.378009154+08:00"}”]
[2025/03/10 17:20:32.296 +08:00] [INFO] [subscription_client.go:896] [“subscription client initializes a region too slow”] [subscriptionID=250] [slowRegion=“{"RegionID":174861,"ResolvedTs":456542685585997837,"Initialized":false,"Created":"2025-03-10T17:10:30.376891867+08:00"}”]
[2025/03/10 17:20:32.297 +08:00] [INFO] [subscription_client.go:896] [“subscription client initializes a region too slow”] [subscriptionID=215] [slowRegion=“{"RegionID":193959,"ResolvedTs":456542685585997837,"Initialized":false,"Created":"2025-03-10T17:10:30.364352516+08:00"}”]
[2025/03/10 17:20:32.297 +08:00] [INFO] [subscription_client.go:896] [“subscription client initializes a region too slow”] [subscriptionID=251] [slowRegion=“{"RegionID":391987,"ResolvedTs":456542685585997837,"Initialized":false,"Created":"2025-03-10T17:10:30.377727806+08:00"}”]
[2025/03/10 17:20:32.297 +08:00] [INFO] [subscription_client.go:896] [“subscription client initializes a region too slow”] [subscriptionID=195] [slowRegion=“{"RegionID":272278,"ResolvedTs":456542685585997837,"Initialized":false,"Created":"2025-03-10T17:10:30.371053317+08:00"}”]
[2025/03/10 17:20:32.297 +08:00] [INFO] [subscription_client.go:896] [“subscription client initializes a region too slow”] [subscriptionID=228] [slowRegion=“{"RegionID":84592,"ResolvedTs":456542685585997837,"Initialized":false,"Created":"2025-03-10T17:10:30.37248193+08:00"}”]
[2025/03/10 17:20:34.474 +08:00] [INFO] [memory_control.go:170] [“send area feedback”] [area=0] [feedbackType=ResumeArea] [memoryUsageRatio=0.49998885905370116] [lastTime=2025/03/10 17:20:22.409 +08:00] [now=2025/03/10 17:20:34.474 +08:00] [sinceLastTime=12.065380093s]
[2025/03/10 17:20:34.474 +08:00] [INFO] [subscription_client.go:374] [“subscription client resume push region event”]
[2025/03/10 17:20:34.687 +08:00] [INFO] [memory_control.go:170] [“send area feedback”] [area=0] [feedbackType=PauseArea] [memoryUsageRatio=0.8000185913406312] [lastTime=2025/03/10 17:20:34.474 +08:00] [now=2025/03/10 17:20:34.687 +08:00] [sinceLastTime=212.468179ms]
[2025/03/10 17:20:34.687 +08:00] [INFO] [subscription_client.go:370] [“subscription client pause push region event”]
[2025/03/10 17:20:36.144 +08:00] [INFO] [barrier.go:104] [“handle block status”] [from=8123ac50-2b55-4c2b-98d7-ad490cee9a49] [changefeed=dr-primary-to-secondary01] [detail=“changefeedID:<high:10757762319605898412 low:2326884390507815090 name:"dr-primary-to-secondary01" namespace:"default" > blockStatuses:<ID:<high:1669873259515829414 low:12920738461852206235 > state:<IsBlocked:true BlockTs:456542807654400000 BlockTables:<> IsSyncPoint:true stage:WAITING > > “]
[2025/03/10 17:20:39.835 +08:00] [INFO] [event_broker.go:296] [“dispatcher not reset”] [dispatcher=22528473606462451472789850978229157272]
[2025/03/10 17:20:39.835 +08:00] [INFO] [event_broker.go:296] [“dispatcher not reset”] [dispatcher=1457668694207027609617189039599822753153]
[2025/03/10 17:20:39.835 +08:00] [INFO] [event_broker.go:296] [“dispatcher not reset”] [dispatcher=175323360393191024648123132422732144572]
[2025/03/10 17:20:39.835 +08:00] [INFO] [event_broker.go:296] [“dispatcher not reset”] [dispatcher=275652892467983688610168177270215909772]
[2025/03/10 17:20:39.835 +08:00] [INFO] [event_broker.go:296] [“dispatcher not reset”] [dispatcher=28304401040268565262088751604957783977]
[2025/03/10 17:20:39.835 +08:00] [INFO] [event_broker.go:296] [“dispatcher not reset”] [dispatcher=59289268468832123832544025193965788841]
[2025/03/10 17:20:39.835 +08:00] [INFO] [event_broker.go:296] [“dispatcher not reset”] [dispatcher=17475356784239890405049997649317844629]
[2025/03/10 17:20:39.835 +08:00] [INFO] [event_broker.go:296] [“dispatcher not reset”] [dispatcher=1630669953256818195812326599776038917805]
[2025/03/10 17:20:39.835 +08:00] [INFO] [event_broker.go:296] [“dispatcher not reset”] [dispatcher=254013200885796626015218476286061537155]
[2025/03/10 17:20:41.821 +08:00] [WARN] [barrier_event.go:389] [“barrier event is not resolved”] [changefeed=dr-primary-to-secondary01] [commitTs=456542807654400000] [isSyncPoint=true] [selected=false] [writerDispatcherAdvanced=false] [coverage=“reported count: 99, require count: 126”] [blocker=] [resend=null]
另一个cdc节点:
[2025/03/10 17:20:39.626 +08:00] [INFO] [event_broker.go:296] [“dispatcher not reset”] [dispatcher=1205528990152177670813197871101761603502]
[2025/03/10 17:20:39.627 +08:00] [INFO] [event_broker.go:296] [“dispatcher not reset”] [dispatcher=65052286244549306972331951690095457714]
[2025/03/10 17:20:39.627 +08:00] [INFO] [event_broker.go:296] [“dispatcher not reset”] [dispatcher=16738318289506599387216421412660991624]
[2025/03/10 17:20:39.627 +08:00] [INFO] [event_broker.go:296] [“dispatcher not reset”] [dispatcher=1573072170493609159611937881084465636023]
[2025/03/10 17:20:39.627 +08:00] [INFO] [event_broker.go:296] [“dispatcher not reset”] [dispatcher=35525719219999459041924221432410357145]
[2025/03/10 17:20:39.627 +08:00] [INFO] [event_broker.go:296] [“dispatcher not reset”] [dispatcher=52787506978246609169164611392497813644]
[2025/03/10 17:20:39.627 +08:00] [INFO] [event_broker.go:296] [“dispatcher not reset”] [dispatcher=1363850695836208087015914595636941033378]
[2025/03/10 17:20:39.627 +08:00] [INFO] [event_broker.go:296] [“dispatcher not reset”] [dispatcher=1781879750263766863116163294317368610490]
[2025/03/10 17:20:39.627 +08:00] [INFO] [event_broker.go:296] [“dispatcher not reset”] [dispatcher=1551374324658779742612247721248658418352]
[2025/03/10 17:20:39.627 +08:00] [INFO] [event_broker.go:296] [“dispatcher not reset”] [dispatcher=14357671523075185223831909820931895483]
[2025/03/10 17:20:39.627 +08:00] [INFO] [event_broker.go:296] [“dispatcher not reset”] [dispatcher=86674293875750262082231992748045631384]
[2025/03/10 17:20:39.627 +08:00] [INFO] [event_broker.go:296] [“dispatcher not reset”] [dispatcher=1493531126164504400215689151957502018990]
[2025/03/10 17:20:52.219 +08:00] [INFO] [subscription_client.go:896] [“subscription client initializes a region too slow”] [subscriptionID=215] [slowRegion=”{"RegionID":70064,"ResolvedTs":456542685585997837,"Initialized":false,"Created":"2025-03-10T17:10:30.76871845+08:00"}”]
[2025/03/10 17:20:52.220 +08:00] [INFO] [subscription_client.go:896] [“subscription client initializes a region too slow”] [subscriptionID=248] [slowRegion=“{"RegionID":106141,"ResolvedTs":456542685585997837,"Initialized":false,"Created":"2025-03-10T17:10:30.373578683+08:00"}”]
[2025/03/10 17:20:52.220 +08:00] [INFO] [subscription_client.go:896] [“subscription client initializes a region too slow”] [subscriptionID=242] [slowRegion=“{"RegionID":262351,"ResolvedTs":456542685585997837,"Initialized":false,"Created":"2025-03-10T17:10:30.362115175+08:00"}”]
[2025/03/10 17:20:52.220 +08:00] [INFO] [subscription_client.go:896] [“subscription client initializes a region too slow”] [subscriptionID=244] [slowRegion=“{"RegionID":51045,"ResolvedTs":456542685585997837,"Initialized":false,"Created":"2025-03-10T17:10:30.362200798+08:00"}”]
[2025/03/10 17:20:52.220 +08:00] [INFO] [subscription_client.go:896] [“subscription client initializes a region too slow”] [subscriptionID=228] [slowRegion=“{"RegionID":344980,"ResolvedTs":456542685585997837,"Initialized":false,"Created":"2025-03-10T17:10:30.76784983+08:00"}”]
【其他附件:截图/日志/监控】
看下监控的 resolved ts 和 checkpoint ts 呢
你这个监控是老版本的,新版本有更多的监控可以看
看下监控的 dataflow 面板和 sink 的 memory quota 呢
又试了一下从7.5.1到7.5.1,没patch到最新版之前,正常同步,redo日志正常产生,cdc checkpoint也正常推进。patch以后,redo不产生,checkpint不推进了,但是下游还是有数据同步过去。cdc日志有许多类似报错在刷[2025/03/12 11:20:27.652 +08:00] [WARN] [barrier_event.go:389] [“barrier event is not resolved”] [changefeed=dr-primary-to-secondary] [commitTs=456583544832000000] [isSyncPoint=true] [selected=false] [writerDispatcherAdvanced=false] [coverage=“reported count: 125, require count: 126”] [blocker=] [resend=null]
想问一下具体用的是哪个 commit 的 ticdc呢?昨天刚修复了一下 syncpoint 遇到的问题,跟这个日志看起来比较类似。可以看看用一下最新的 binary 看看还有没有这个问题。
早上下载的,解压看cdc文件最后修改时间为03-11 09:53
- 目前新架构暂时不支持 redo,会忽略 redo 相关的配置,redo 的支持会在 GA 之前完成;
- checkpoint 不推进的问题是昨天修复的,方便的话可以用最新的 nightly 版本尝试一下,或者提供一下 cdc 的 commit hash,我们看下是否包含了对应的修复?
用的就已经是最新的nightly版本了,解压出来直接就是一个文件,怎么看commit hash;明天再下载一下最新的版本patch上去试试
可以把 ./cdc version 的输出贴出来
Release Version: v9.0.0-alpha-143-gfe0e19a
Git Commit Hash: fe0e19ac8286a77e02b51707cbaa82a364f56e27
Git Branch: HEAD
UTC Build Time: 2025-03-11 01:51:50
Go Version: go1.23.7
Failpoint Build: false
这个版本没有包含对应的修复,后续有空可以用最新的 nightly 再试一下
Release Version: v9.0.0-beta.1.pre-157-gfe8123d
Git Commit Hash: fe8123d644770fdae3810ddd0c3fcbc94eeb744f
Git Branch: HEAD
UTC Build Time: 2025-03-13 04:36:30
Go Version: go1.23.7
Failpoint Build: false
这个版本checkpoint往前推进了一些,然后又卡主了
{
“upstream_id”: 7341229539313770488,
“namespace”: “default”,
“id”: “dr-primary-to-secondary”,
“state”: “normal”,
“checkpoint_tso”: 456615002111999999,
“checkpoint_time”: “2025-03-13 14:39:59.999”,
“error”: null
}
可以提供一下 cdc new arch 页面的监控吗?包括 Summary,Lag Summary,Dataflow,Lag analyze
- 麻烦用这个命令抓一下 cdc 的 goroutine:curl -X GET http://<cdc_address>:<cdc_port>/debug/pprof/goroutine?debug=2 > goroutine.log
- 方便再提供一下 cdc 和 tikv 的日志吗?如果 tikv 日志太多的话,可以先提供一下 cdc 的日志,后面我们再判断需要提供哪几个节点的 tikv 日志。(另外尽量先不要重启 tikv,如果要重启的话,可以先打开 tikv debug 日志并维持 15 分钟左右然后保存一下这部分日志)
- 另外有一些监控通过截图看不太清楚,能否用 clinic 收集一下监控?https://docs.pingcap.com/zh/tidb/stable/clinic-introduction/
Downloads.7z (34.1 KB)
Download URL: Clinic Service