Changefeed not receiving requests from TiCDC

To get help more efficiently, please provide the following information; a clear problem description will be resolved faster:
【TiDB environment】
Test environment, v5.0.4
Architecture: TiDB → TiCDC → Kafka.

【Overview】 Scenario + problem summary
Due to some issues, the changefeed was recreated.

The new changefeed uses the same table filter as the old one.

【Backup and data migration strategy】

【Background】 Operations performed

  1. Removed the old changefeed: tiup ctl cdc remove (a fuller command sketch follows this list)
  2. Created a new changefeed: changefeed-id=kafka-canal-json3
[tidb@ log]$ tiup ctl:v5.0.4 cdc changefeed list
Starting component `ctl`: /home/tidb/.tiup/components/ctl/v5.0.4/ctl cdc changefeed list
[
  {
    "id": "kafka-canal-json3",
    "summary": {
      "state": "normal",
      "tso": 432040937796927489,
      "checkpoint": "2022-03-24 15:03:04.052",
      "error": null
    }
  }
]
  3. Simulated a DML operation on TiDB:
    MySQL [sbtest]> insert into sbtest1 values(1,2,'c','c');
    Query OK, 1 row affected (0.00 sec)
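For reference, the cleanup in step 1 was presumably done with something along these lines; the old changefeed ID used here is an assumption (the GC error later in this thread mentions kafka-canal-json2):

# Sketch only: remove the old changefeed before recreating it.
# "kafka-canal-json2" is an assumed ID; substitute the real old changefeed ID.
tiup ctl:v5.0.4 cdc changefeed remove \
    --pd=http://127.0.0.1:2379 \
    --changefeed-id=kafka-canal-json2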

【Symptoms】 Application and database symptoms

  1. The new changefeed did not receive the transaction messages. Below is an excerpt from the changefeed status (the query command is sketched after the excerpt):
    "create-time": "2022-03-24T15:03:04.079313501+08:00",
    "start-ts": 432040937796927489,
    "target-ts": 0,
    "admin-job-type": 0,
    "sort-engine": "unified",
"filter": {
        "rules": [
          "sbtest.sbtest1"
        ],
        "ignore-txn-start-ts": []
      },
      "mounter": {
        "worker-num": 8
      },
      "sink": {
        "dispatchers": [
          {
            "matcher": [
              "sbtest.sbtest1"
            ],
            "dispatcher": "default"
          }
        ],
  "status": {
    "resolved-ts": 0,
    "checkpoint-ts": 432040937796927489,
    "admin-job-type": 0

【Problem】 Current issues
Question 1: The logs below show that TiCDC is issuing requests, so why does the changefeed not receive any message requests? (See the region-leader check sketch after the log excerpt.)

[2022/03/24 15:03:19.988 +08:00] [INFO] [client.go:779] ["**start new request**"] [request="{\"header\":{\"cluster_id\":7009090993806118458,\"ticdc_version\":\"5.0.4\"},\"region_id\":98161,\"region_epoch\":{\"conf_ver\":5,\"version\":51},\"checkpoint_ts\":432040941912588289,\"start_key\":\"bURETEpvYkH/ZP9kSWR4TGn/c3T/AAAAAAD/AAAA9wAAAAD/AAAAbAAAAAD7\",\"end_key\":\"bURETEpvYkH/ZP9kSWR4TGn/c3T/AAAAAAD/AAAA9wAAAAD/AAAAbQAAAAD7\",\"request_id\":4344,\"Request\":null}"] [addr=localhost:20160]
[2022/03/24 15:03:20.233 +08:00] [INFO] [client.go:1258] ["region state entry will be replaced because received message of newer requestID"] [regionID=98161] [oldRequestID=4342] [requestID=4344] [addr=localhost:20160]
[2022/03/24 15:09:31.319 +08:00] [INFO] [client.go:959] ["EventFeed disconnected"] [regionID=98161] [requestID=4343] [span="[6d44444c4a6f624cff69ff737400000000ff0000f90000000000ff00006c0000000000fa, 6d44444c4a6f624cff69ff737400000000ff0000f90000000000ff00006d0000000000fa)"] [checkpoint=432041039194226689] [error="[CDC:ErrEventFeedEventError]not_leader:<region_id:98161 > "]
[2022/03/24 15:09:31.319 +08:00] [INFO] [client.go:959] ["EventFeed disconnected"] [regionID=98161] [requestID=4344] [span="[6d44444c4a6f6241ff64ff644964784c69ff7374ff0000000000ff000000f700000000ff0000006c00000000fb, 6d44444c4a6f6241ff64ff644964784c69ff7374ff0000000000ff000000f700000000ff0000006d00000000fb)"] [checkpoint=432041039194226689] [error="[CDC:ErrEventFeedEventError]not_leader:<region_id:98161 > "]
[2022/03/24 15:09:31.319 +08:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=22] [regionID=98161] [startKey=6d44444c4a6f6241ff64ff644964784c69ff7374ff0000000000ff000000f700000000ff0000006c00000000fb] [endKey=6d44444c4a6f6241ff64ff644964784c69ff7374ff0000000000ff000000f700000000ff0000006d00000000fb] [checkpointTs=432041039194226689]
[2022/03/24 15:09:31.319 +08:00] [INFO] [region_cache.go:958] ["switch region peer to next due to NotLeader with NULL leader"] [currIdx=0] [regionID=98161]
[2022/03/24 15:09:31.319 +08:00] [INFO] [region_range_lock.go:370] ["unlocked range"] [lockID=21] [regionID=98161] [startKey=6d44444c4a6f624cff69ff737400000000ff0000f90000000000ff00006c0000000000fa] [endKey=6d44444c4a6f624cff69ff737400000000ff0000f90000000000ff00006d0000000000fa] [checkpointTs=432041039194226689]
[2022/03/24 15:09:31.319 +08:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=22] [regionID=98161] [startKey=6d44444c4a6f6241ff64ff644964784c69ff7374ff0000000000ff000000f700000000ff0000006c00000000fb] [endKey=6d44444c4a6f6241ff64ff644964784c69ff7374ff0000000000ff000000f700000000ff0000006d00000000fb] [checkpointTs=432041039194226689]
[2022/03/24 15:09:31.319 +08:00] [INFO] [region_cache.go:958] ["switch region peer to next due to NotLeader with NULL leader"] [currIdx=0] [regionID=98161]
[2022/03/24 15:09:31.319 +08:00] [INFO] [region_range_lock.go:218] ["range locked"] [lockID=21] [regionID=98161] [startKey=6d44444c4a6f624cff69ff737400000000ff0000f90000000000ff00006c0000000000fa] [endKey=6d44444c4a6f624cff69ff737400000000ff0000f90000000000ff00006d0000000000fa] [checkpointTs=432041039194226689]
[2022/03/24 15:09:31.319 +08:00] [INFO] [client.go:779] ["**start new request**"] [request="{\"header\":{\"cluster_id\":7009090993806118458,\"ticdc_version\":\"5.0.4\"},\"region_id\":98161,\"region_epoch\":{\"conf_ver\":5,\"version\":51},\"checkpoint_ts\":432041039194226689,\"start_key\":\"bURETEpvYkH/ZP9kSWR4TGn/c3T/AAAAAAD/AAAA9wAAAAD/AAAAbAAAAAD7\",\"end_key\":\"bURETEpvYkH/ZP9kSWR4TGn/c3T/AAAAAAD/AAAA9wAAAAD/AAAAbQAAAAD7\",\"request_id\":4345,\"Request\":null}"] [addr=localhost:20160]
[2022/03/24 15:09:31.319 +08:00] [INFO] [client.go:779] ["**start new request**"] [request="{\"header\":{\"cluster_id\":7009090993806118458,\"ticdc_version\":\"5.0.4\"},\"region_id\":98161,\"region_epoch\":{\"conf_ver\":5,\"version\":51},\"checkpoint_ts\":432041039194226689,\"start_key\":\"bURETEpvYkz/af9zdAAAAAD/AAD5AAAAAAD/AABsAAAAAAD6\",\"end_key\":\"bURETEpvYkz/af9zdAAAAAD/AAD5AAAAAAD/AABtAAAAAAD6\",\"request_id\":4346,\"Request\":null}"] [addr=localhost:20160]
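Since the EventFeed disconnects above report not_leader for region 98161, one way to check where that region's leader currently is (a sketch, assuming pd-ctl is reachable through tiup ctl against the same PD address):

# Sketch only: inspect region 98161 to see its current leader and peers.
tiup ctl:v5.0.4 pd -u http://127.0.0.1:2379 region 98161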

Question 2: Since the changefeed received no requests, why is it still reported that "changefeed checkpoint is lagging too much, so it will be stopped."?
About one hour after the changefeed was created, a GC-related error was reported; gc-ttl = 3600 (1 hour).
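For context, gc-ttl is a TiCDC server-side setting, and the GC safepoint it protects can be inspected from PD. The following is a sketch only; whether gc-ttl was set on the server command line or in the TiUP topology is an assumption:

# Sketch only: gc-ttl can be passed when starting the TiCDC server
# (3600 seconds = 1 hour, matching the value mentioned above).
cdc server --pd=http://127.0.0.1:2379 --gc-ttl=3600

# Sketch only: list the service GC safepoints registered in PD (including
# the one held by TiCDC), if your pd-ctl version supports this subcommand.
tiup ctl:v5.0.4 pd -u http://127.0.0.1:2379 service-gc-safepoint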

【Business impact】

【TiDB version】
Test environment, v5.0.4

【Attachments】

  • Relevant logs, configuration files, and Grafana monitoring (https://metricstool.pingcap.com/)
  • TiUP Cluster Display output
  • TiUP Cluster Edit config output
  • TiDB-Overview monitoring
  • Grafana monitoring for the relevant modules (BR, TiDB Binlog, TiCDC, etc., if applicable)
  • Logs of the relevant modules (covering one hour before and after the issue)

For performance-tuning or troubleshooting questions, please download and run the diagnostic script, then select all of the terminal output and paste it here.


Please post the configuration used to create the changefeed.
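For reference, based on the filter rules and dispatcher settings visible in the status output earlier in the thread, the kafka-canal-json-changefeed.toml presumably looks roughly like the sketch below; this is an assumption, not the poster's actual file:

# Sketch only: a guess at the changefeed config file, reconstructed from the
# filter/dispatcher fields shown above; adjust to match the real file.
cat <<'EOF' > kafka-canal-json-changefeed.toml
[filter]
rules = ['sbtest.sbtest1']

[sink]
dispatchers = [
    {matcher = ['sbtest.sbtest1'], dispatcher = "default"},
]
EOF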


Please check two parameters in the TiCDC startup configuration; their values appear among the parameter settings printed in the TiCDC startup log.

Parameter: auto-create-topic
Parameter: partition-num
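One quick way to look for those two settings in the TiCDC log (a sketch; the log file name and path depend on your deployment):

# Sketch only: search the TiCDC log for the two Kafka sink parameters.
# The log path is an assumption; adjust it to your deployment.
grep -iE "auto-create-topic|partition[-_]num" cdc.log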


The changefeed creation command:

tiup ctl:v5.0.4 cdc changefeed create --pd=http://127.0.0.1:2379 --changefeed-id="kafka-canal-json3" --sink-uri="kafka://127.0.0.1:9092/kafka-canal-json-test2-topic?protocol=canal-json&partition-num=1&max-message-bytes=67108864&replication-factor=2" --config kafka-canal-json-changefeed.toml

The topic kafka-canal-json-test2-topic is visible in Kafka.
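The topic was presumably verified with something like the following (a sketch, assuming the Kafka CLI tools are available and the broker listens on 127.0.0.1:9092):

# Sketch only: confirm that the topic exists on the downstream Kafka.
kafka-topics.sh --bootstrap-server 127.0.0.1:9092 --list | grep kafka-canal-json-test2-topic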


Please check the TiCDC startup log for the two parameters I mentioned.


The changefeed-related startup log contains: partition_num=1

I could not find the other parameter. cdc.log (56.2 KB)
That parameter was not specified; the official docs say it defaults to true.

[2022/03/23 15:31:22.779 +08:00] [WARN] [owner.go:1739] ["changefeed checkpoint is lagging too much, so it will be stopped."] [changefeed=kafka-canal-json2] [error="[CDC:ErrSnapshotLostByGC]fail to create or maintain changefeed due to snapshot loss caused by GC. checkpoint-ts 432017789629759499 is earlier than GC safepoint at 432017789629759499"] [errorVerbose="[CDC:ErrSnapshotLostByGC]fail to create or maintain changefeed due to snapshot loss caused by GC. checkpoint-ts 432017789629759499 is earlier than GC safepoint at 432017789629759499
github.com/pingcap/errors.AddStack
    github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174
github.com/pingcap/errors.(*Error).GenWithStackByArgs
    github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/normalize.go:156
github.com/pingcap/ticdc/cdc.(*Owner).handleStaleChangeFeed
    github.com/pingcap/ticdc@/cdc/owner.go:1738
github.com/pingcap/ticdc/cdc.(*Owner).flushChangeFeedInfos
    github.com/pingcap/ticdc@/cdc/owner.go:826
github.com/pingcap/ticdc/cdc.(*Owner).run
    github.com/pingcap/ticdc@/cdc/owner.go:1445
github.com/pingcap/ticdc/cdc.(*Owner).Run
    github.com/pingcap/ticdc@/cdc/owner.go:1308
github.com/pingcap/ticdc/cdc.(*Server).campaignOwnerLoop
    github.com/pingcap/ticdc@/cdc/server.go:233
github.com/pingcap/ticdc/cdc.(*Server).run.func3
    github.com/pingcap/ticdc@/cdc/server.go:334
golang.org/x/sync/errgroup.(*Group).Go.func1
    golang.org/x/sync@v0.0.0-20201020160332-67f06af15bc9/errgroup/errgroup.go:57
runtime.goexit
    runtime/asm_amd64.s:1357"]

Take a look at this post and adjust the start checkpoint-ts setting: TiCDC error: CDC:ErrSnapshotLostByGC
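The usual remedy described there is to recreate the changefeed with a start-ts that is not older than the GC safepoint. A minimal sketch, where <new-start-ts> is a placeholder for a TSO newer than the safepoint:

# Sketch only: recreate the changefeed from a TSO that has not been GC'd.
# <new-start-ts> is a placeholder value; pick a TSO newer than the GC safepoint.
tiup ctl:v5.0.4 cdc changefeed create \
    --pd=http://127.0.0.1:2379 \
    --changefeed-id="kafka-canal-json3" \
    --start-ts=<new-start-ts> \
    --sink-uri="kafka://127.0.0.1:9092/kafka-canal-json-test2-topic?protocol=canal-json&partition-num=1&max-message-bytes=67108864&replication-factor=2" \
    --config kafka-canal-json-changefeed.toml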

The log you provided does not include the TiCDC startup log. When the TiCDC service is brought up, it logs the startup information, which includes TiCDC's default and configured parameters. Please look for it again and post the startup log so we can take a look together.


Hi, there are three cdc-server instances in total. Here are the first few lines of one cdc.log; I'm not sure whether I'm looking in the right place.
cdc_1_tmp.log (11.4 KB)


Has this topic already been created downstream? Is there any error in the Kafka log?


Confirmed it has already been created. The downstream Flink can find the topic.
Kafka does not appear to report any errors:


Is only one topic used for replication to the downstream Kafka? If so, add auto-create-topic=false to the sink URI in the changefeed configuration and see whether that works.

  1. Only one topic is used downstream.

  2. With auto-create-topic = false, the newly created changefeed did not create the topic in the downstream Kafka.

Please take a look at the detailed changefeed log:
changefeed.log (7.8 KB)
  3. The Kafka topic list does not contain the latest topic: kafka-canal-json-test3-topic

The topic is created automatically only when auto-create-topic = true. If it is set to false, you need to create the topic manually and set the partition-num configuration accordingly.

tiup ctl:v5.0.4 cdc changefeed create --pd=http://127.0.0.1:2379 --changefeed-id="kafka-canal-json4" --sink-uri="kafka://localhost:9092/kafka-canal-json-test3-topic?protocol=canal-json&auto-create-topic=false&partition-num=1&max-message-bytes=67108864&replication-factor=2" --config kafka-canal-json-changefeed.toml
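If the topic has to be created manually (auto-create-topic=false), something along these lines would do it; the kafka-topics.sh path and the --bootstrap-server flag assume a reasonably recent Kafka CLI:

# Sketch only: manually create the topic with the partition count and
# replication factor that the changefeed sink-uri expects.
kafka-topics.sh --bootstrap-server localhost:9092 --create \
    --topic kafka-canal-json-test3-topic --partitions 1 --replication-factor 2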

Did it succeed in receiving data after that?

Judging by the changefeed's query status, it is receiving data.

Was auto-create-topic the cause?

Keep observing for a while; it should be related to this parameter.

With auto-create-topic enabled (the default), TiCDC determines the partition count by itself; when it is disabled, you need to set the partition-num value manually.
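In other words, the two sink-uri variants used in this thread differ roughly as shown below (values taken from the commands above):

# Default: auto-create-topic is on, TiCDC creates the topic itself.
--sink-uri="kafka://127.0.0.1:9092/kafka-canal-json-test2-topic?protocol=canal-json&partition-num=1&max-message-bytes=67108864&replication-factor=2"
# Manual mode: the topic must already exist, and partition-num must match it.
--sink-uri="kafka://localhost:9092/kafka-canal-json-test3-topic?protocol=canal-json&auto-create-topic=false&partition-num=1&max-message-bytes=67108864&replication-factor=2"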

You can see that the changefeed receives messages now:

delete from sbtest1 limit 1;
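One way to confirm the messages on the Kafka side (a sketch, assuming the Kafka CLI tools are available on the broker host):

# Sketch only: watch the topic for the canal-json messages emitted by the changefeed.
kafka-console-consumer.sh --bootstrap-server 127.0.0.1:9092 \
    --topic kafka-canal-json-test3-topic --from-beginning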

Would this be considered a bug?

Not really. With the default auto-create-topic enabled, partition-num does not need to be configured. Does the documentation say that partition-num must be configured in the sink-uri when using canal-json? :thinking:

https://docs.pingcap.com/zh/tidb/v5.0/manage-ticdc#sink-uri-配置-kafka
Following the steps there:
The demo sink-uri looks like this; auto-create-topic is not specified:

--sink-uri="kafka://127.0.0.1:9092/topic-name?kafka-version=2.4.0&partition-num=6&max-message-bytes=67108864&replication-factor=1"
  • Our developers require partition-num=1 :joy:
  • The downstream is ClickHouse, and message ordering must be preserved.

Question: can specifying partition-num=1 cause the changefeed to receive no requests?

The parameters in the sink-uri only control Kafka-related behavior; setting partition-num=1 there is fine.

From the information available so far, the problem is not in the sink-uri configuration. --sink-uri="kafka://127.0.0.1:9092/kafka-canal-json-test2-topic?protocol=canal-json&partition-num=1&max-message-bytes=67108864&replication-factor=2" is a reasonable configuration; it only needs to satisfy the following (a verification sketch follows the list):

  1. Because replication-factor = 2, the downstream Kafka cluster must have at least 2 brokers.
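A quick way to verify this is to describe the topic on the downstream Kafka (a sketch, assuming the Kafka CLI tools are available):

# Sketch only: the output should show ReplicationFactor: 2, which implies
# at least 2 brokers in the downstream Kafka cluster.
kafka-topics.sh --bootstrap-server 127.0.0.1:9092 --describe --topic kafka-canal-json-test2-topic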