TiCDC reports ErrSnapshotLostByGC during a BR full backup

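wfxxh · July 21, 2023 03:03

I think this is simply a bug.
Timeline of events:

At 14:28:56 on July 9, 2023, I started the TiCDC task; the corresponding TSO is 442730843081539585.
From the start of the TiCDC task until 15:06:19 on July 20, 2023, replication was normal and the resolved ts kept advancing.
At 14:39:09 on July 20, 2023, a full BR backup was started; it ran until 15:42:23 the same day.
At around 15:08 on July 20, 2023, I received an alert that the TiCDC task had stopped advancing. The task status is shown in the first screenshot, and the checkpoint-ts is the TSO from when I first started the TiCDC task, namely 442730843081539585.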
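
The TSO quoted above can be checked against the stated start time. This is a minimal sketch, assuming the usual PD TSO layout (physical milliseconds in the high bits, an 18-bit logical counter in the low bits) and assuming the wall-clock times in this thread are UTC+8. The helper name `decode_tso` is made up for illustration.

```python
from datetime import datetime, timedelta, timezone

LOGICAL_BITS = 18  # low 18 bits of a PD TSO hold the logical counter

def decode_tso(tso: int) -> tuple:
    """Split a TSO into (physical milliseconds since epoch, logical counter)."""
    return tso >> LOGICAL_BITS, tso & ((1 << LOGICAL_BITS) - 1)

CST = timezone(timedelta(hours=8))  # assumption: times in the thread are UTC+8

physical_ms, logical = decode_tso(442730843081539585)
# Integer arithmetic avoids float rounding when attaching the milliseconds.
started = datetime.fromtimestamp(physical_ms // 1000, tz=CST) + timedelta(
    milliseconds=physical_ms % 1000
)
print(started.isoformat(timespec="milliseconds"))
# → 2023-07-09T14:28:56.511+08:00
```

The decoded time matches the 14:28:56 start reported for the task, so the checkpoint-ts in the error really is the original start TSO.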

wfxxh · July 21, 2023 03:08

Friend, I have said this many times already: my TiCDC task was advancing normally right up until it failed yesterday.

wfxxh · July 21, 2023 03:24

Complete TiCDC logs:
Link: Baidu Netdisk (extraction code: w9cg)

裤衩儿飞上天 · July 21, 2023 07:01

Can the problem be reproduced?

wfxxh · July 21, 2023 07:13

This is a production environment; we cannot run full exports frequently just for testing.

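dba-kit · July 21, 2023 07:17

Judging by the symptoms, it looks like TiCDC did not advance its checkpoint during the BR backup, and once gc_life_time (1h) was exceeded, the change logs TiCDC needed had already been GC'd, so the task failed.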

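裤衩儿飞上天 · July 21, 2023 07:18

I'm not sure this is actually related to the BR backup. Wait for the next backup and see whether it reproduces.
If convenient, could you upload yesterday's PD logs?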

wfxxh · July 21, 2023 07:19

I started the export at 14:28, and TiCDC did not break until 15:06; during that period the resolved ts kept advancing.

裤衩儿飞上天 · July 21, 2023 07:19

Also, what is your gc-ttl set to?

dba-kit · July 21, 2023 07:22

https://docs.pingcap.com/zh/tidb/stable/ticdc-faq#ticdc-gc-safepoint-的完整行为是什么
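According to the official documentation, TiCDC's default gc-ttl is 24 hours, so in theory, even if the changefeed has a problem, it should still block GC.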
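
To put numbers on this reasoning, here is a rough sketch that compares the age of the reported checkpoint-ts with the two retention windows mentioned in the thread. It assumes the usual 18-bit logical suffix in a TSO and UTC+8 timestamps, and it takes the alert time of about 15:08 on July 20 from the first post. TiCDC itself compares checkpoint-ts against the GC safepoint, so elapsed time is only a proxy here, and `tso_to_datetime` is a hypothetical helper.

```python
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=8))  # assumption: thread timestamps are UTC+8

def tso_to_datetime(tso: int) -> datetime:
    """Convert a TSO to wall-clock time, dropping the 18-bit logical counter."""
    physical_ms = tso >> 18
    return datetime.fromtimestamp(physical_ms // 1000, tz=CST) + timedelta(
        milliseconds=physical_ms % 1000
    )

checkpoint = tso_to_datetime(442730843081539585)       # checkpoint-ts from the report
alert_time = datetime(2023, 7, 20, 15, 8, tzinfo=CST)  # approximate alert time

gc_life_time = timedelta(hours=1)  # value quoted in the thread
gc_ttl = timedelta(hours=24)       # TiCDC default per the FAQ

age = alert_time - checkpoint
print(f"checkpoint age: {age}")
print("older than gc_life_time:", age > gc_life_time)
print("older than gc-ttl:", age > gc_ttl)
```

Both comparisons come out true for these values, so a changefeed whose checkpoint really sat at that TSO would fall well outside either window.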

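裤衩儿飞上天 · July 21, 2023 07:24

Right. The backup certainly did not take more than 24 hours, so it seems to have little to do with BR, unless he changed the default gc-ttl.
Another odd point is that the checkpoint in his error is the changefeed's very first checkpoint.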

wfxxh · July 21, 2023 07:25

I did not specify it; it is the default 24 hours.

wfxxh · July 21, 2023 07:29

In the PD logs, the safepoint kept advancing right up until the error.

dba-kit · July 21, 2023 07:29

Could you post a screenshot of the checkpoint lag monitoring?

wfxxh · July 21, 2023 07:32

The gap in the middle is the period when replication was down.

wfxxh · July 21, 2023 07:33

This is the PD log from 14:00 to 16:00, which covers the time of the error.

pd-14-15.log (2.2 MB)

dba-kit · July 21, 2023 07:46

"error\":{\"time\":\"2023-07-10T17:42:50.616674411+08:00\",\"addr\":\"10.1.3.123:8300\",\"code\":\"CDC:ErrReachMaxTry\",\"message\":\"[CDC:ErrReachMaxTry]reach maximum try: 10, error: rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster: rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster\"}

The original error message is [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster. Did you recreate the changefeed at any point in between?

wfxxh · July 21, 2023 07:47

Not in between. I only deleted and recreated the changefeed after the error.

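dba-kit · July 21, 2023 07:48

Is it possible that when you recreated the changefeed, you copied the old command without changing the value of the --start-ts parameter? You could check history to see the command you actually ran.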

裤衩儿飞上天 · July 21, 2023 07:50

Check the full status of the changefeed.