TiCDC task stalled, unable to replicate data

Bug report
Two issues with TiCDC
【TiDB version】7.1.0
【Impact】TiCDC replication does not advance

Issue 1: The replication task reports errors, leaving some tables not replicated (5 tables in total, 1 of which does not replicate).
The TSO was found not to be advancing; it is still a TSO from several months ago (the task state is normal and no alert fired, though in theory one should have).

[2025/03/21 16:51:02.061 +08:00] [WARN] [metrics_collector.go:96] ["Get Kafka brokers failed, use historical brokers to collect kafka broker level metrics"] [namespace=default] [changefeed=ticdc-topic] [role=processor] [duration=157.575642ms] [error="[CDC:ErrReachMaxTry]reach maximum try: 3, error: kafka: tried to use a client that was closed: kafka: tried to use a client that was closed"] [errorVerbose="[CDC:ErrReachMaxTry]reach maximum try: 3, error: kafka: tried to use a client that was closed: kafka: tried to use a client that was closed\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20221009092201-b66cddb77c32/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20221009092201-b66cddb77c32/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/retry.run\n\tgithub.com/pingcap/tiflow/pkg/retry/retry_with_opt.go:69\ngithub.com/pingcap/tiflow/pkg/retry.Do\n\tgithub.com/pingcap/tiflow/pkg/retry/retry_with_opt.go:34\ngithub.com/pingcap/tiflow/pkg/sink/kafka.(*saramaAdminClient).queryClusterWithRetry\n\tgithub.com/pingcap/tiflow/pkg/sink/kafka/admin.go:92\ngithub.com/pingcap/tiflow/pkg/sink/kafka.(*saramaAdminClient).GetAllBrokers\n\tgithub.com/pingcap/tiflow/pkg/sink/kafka/admin.go:130\ngithub.com/pingcap/tiflow/pkg/sink/kafka.(*saramaMetricsCollector).updateBrokers\n\tgithub.com/pingcap/tiflow/pkg/sink/kafka/metrics_collector.go:94\ngithub.com/pingcap/tiflow/pkg/sink/kafka.(*saramaMetricsCollector).Run\n\tgithub.com/pingcap/tiflow/pkg/sink/kafka/metrics_collector.go:85\nruntime.goexit\n\truntime/asm_amd64.s:1598"]

Official solution:
https://docs.pingcap.com/zh/tidb/stable/troubleshoot-ticdc/#使用-ticdc-同步消息到-kafka-时报错-kafka-client-has-run-out-of-available-brokers-to-talk-to-eof该如何处理

Issue 2: Following the official solution, the Kafka version could not be recognized; the Kafka version (2.12-2.2.1) has been confirmed to be correct.

-- sink configuration
protocol=canal-json&max-message-bytes=67108864&replication-factor=2&partition-num=8&kafka-version=2.12

Error: [CDC:ErrKafkaInvalidVersion]invalid kafka version: invalid version `2.12`

Is this Kafka error the only error the replication task reports? A wrong kafka-version is just one possibility; check the downstream Kafka logs for more information.

Take a look at the upstream and downstream logs.

Check the logs first; there is too little information to go on at the moment.

5 tables are replicated in total, and data from the other 4 tables can be consumed.

There are quite a few retry log entries reading kafka: tried to use a client that was closed, and nothing else.

[2025/03/20 09:47:42.013 +08:00] [WARN] [metrics_collector.go:96] ["Get Kafka brokers failed, use historical brokers to collect kafka broker level metrics"] [namespace=default] [changefeed=ticdc-topic-prod-xxxxx] [role=processor] [duration=109.199919ms] [error="[CDC:ErrReachMaxTry]reach maximum try: 3, error: kafka: tried to use a client that was closed: kafka: tried to use a client that was closed"] [errorVerbose="[CDC:ErrReachMaxTry]reach maximum try: 3, error: kafka: tried to use a client that was closed: kafka: tried to use a client that was closed\ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20221009092201-b66cddb77c32/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20221009092201-b66cddb77c32/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/retry.run\n\tgithub.com/pingcap/tiflow/pkg/retry/retry_with_opt.go:69\ngithub.com/pingcap/tiflow/pkg/retry.Do\n\tgithub.com/pingcap/tiflow/pkg/retry/retry_with_opt.go:34\ngithub.com/pingcap/tiflow/pkg/sink/kafka.(*saramaAdminClient).queryClusterWithRetry\n\tgithub.com/pingcap/tiflow/pkg/sink/kafka/admin.go:92\ngithub.com/pingcap/tiflow/pkg/sink/kafka.(*saramaAdminClient).GetAllBrokers\n\tgithub.com/pingcap/tiflow/pkg/sink/kafka/admin.go:130\ngithub.com/pingcap/tiflow/pkg/sink/kafka.(*saramaMetricsCollector).updateBrokers\n\tgithub.com/pingcap/tiflow/pkg/sink/kafka/metrics_collector.go:94\ngithub.com/pingcap/tiflow/pkg/sink/kafka.(*saramaMetricsCollector).Run\n\tgithub.com/pingcap/tiflow/pkg/sink/kafka/metrics_collector.go:85\nruntime.goexit\n\truntime/asm_amd64.s:1598"]

[2025/03/20 09:47:45.717 +08:00] [INFO] [region_worker.go:197] ["single region event feed disconnected"] [namespace=default] [changefeed=ticdc-topic-prod-xxxxx] [regionID=59643528] [requestID=102196] [span={table_id:0,start_key:7480000000000001ff155f7280000000d0ff7e4d520000000000fa,end_key:7480000000000001ff155f7280000000d0ff7fdd200000000000fa}] [resolvedTs=456768950223765539] [error="[CDC:ErrEventFeedEventError]eventfeed returns event error: not_leader:<region_id:59643528 leader:<id:59644469 store_id:55700677 role:IncomingVoter > > "] [errorVerbose="[CDC:ErrEventFeedEventError]eventfeed returns event error: not_leader:<region_id:59643528 leader:<id:59644469 store_id:55700677 role:IncomingVoter > > \ngithub.com/pingcap/errors.AddStack\n\tgithub.com/pingcap/errors@v0.11.5-0.20221009092201-b66cddb77c32/errors.go:174\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\tgithub.com/pingcap/errors@v0.11.5-0.20221009092201-b66cddb77c32/normalize.go:164\ngithub.com/pingcap/tiflow/pkg/errors.WrapError\n\tgithub.com/pingcap/tiflow/pkg/errors/helper.go:34\ngithub.com/pingcap/tiflow/cdc/kv.(*regionWorker).processEvent\n\tgithub.com/pingcap/tiflow/cdc/kv/region_worker.go:377\ngithub.com/pingcap/tiflow/cdc/kv.(*regionWorker).eventHandler\n\tgithub.com/pingcap/tiflow/cdc/kv/region_worker.go:512\ngithub.com/pingcap/tiflow/cdc/kv.(*regionWorker).run.func4\n\tgithub.com/pingcap/tiflow/cdc/kv/region_worker.go:603\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\tgolang.org/x/sync@v0.1.0/errgroup/errgroup.go:75\nruntime.goexit\n\truntime/asm_amd64.s:1598"]

kafka-version should be set to 2.2.1: in the package name 2.12-2.2.1, the 2.12 is the Scala version and 2.2.1 is the actual Kafka version. Also, your error does not look like an EOF; that official solution addresses the EOF problem.
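That is, with your parameters the sink options would become (same as your configuration, only the kafka-version value corrected):

protocol=canal-json&max-message-bytes=67108864&replication-factor=2&partition-num=8&kafka-version=2.2.1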

Run cdc cli changefeed query to get the complete changefeed configuration and post it here.
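For example (the server address and changefeed ID below are placeholders; adjust them to your deployment):

cdc cli changefeed query --server=http://127.0.0.1:8300 --changefeed-id=ticdc-topic-xxx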

The original task (too many alerts) was cleaned up; this one has the same configuration.

{
  "upstream_id": 6965866413841334309,
  "namespace": "default",
  "id": "ticdc-topic-xxx",
  "sink_uri": "kafka://kafka-ip:9092/ticdc-topic-xxx?protocol=canal-json\u0026max-message-bytes=67108864\u0026replication-factor=2\u0026partition-num=8",
  "config": {
    "memory_quota": 1073741824,
    "case_sensitive": true,
    "enable_old_value": true,
    "force_replicate": false,
    "ignore_ineligible_table": false,
    "check_gc_safe_point": true,
    "enable_sync_point": false,
    "bdr_mode": false,
    "sync_point_interval": 600000000000,
    "sync_point_retention": 86400000000000,
    "filter": {
      "rules": [
        "db1.t1",  -- 这个数据没同步
            "db1.t2",
            "db1.t3",
            "db1.t4",
            "db1.t5"
      ],
      "event_filters": null
    },
    "mounter": {
      "worker_num": 4
    },
    "sink": {
      "protocol": "canal-json",
      "schema_registry": "",
      "csv": {
        "delimiter": ",",
        "quote": "\"",
        "null": "\\N",
        "include_commit_ts": false
      },
      "dispatchers": [
        {
          "matcher": [
            "db1.t1",  -- 这个数据没同步
            "db1.t2",
            "db1.t3",
            "db1.t4",
            "db1.t5"
          ],
          "partition": "index-value",
          "topic": ""
        }
      ],
      "column_selectors": null,
      "transaction_atomicity": "",
      "encoder_concurrency": 16,
      "terminator": "\r\n",
      "date_separator": "day",
      "enable_partition_separator": true,
      "file_index_digit": 0,
      "enable_kafka_sink_v2": false,
      "only_output_updated_columns": null
    },
    "consistent": {
      "level": "none",
      "max_log_size": 64,
      "flush_interval": 2000,
      "storage": "",
      "use_file_backend": false
    },
    "scheduler": {
      "enable_table_across_nodes": false,
      "region_threshold": 100000,
      "write_key_threshold": 0
    },
    "integrity": {
      "integrity_check_level": "none",
      "corruption_handle_level": "warn"
    }
  },
  "create_time": "2025-03-24 15:04:52.611",
  "start_ts": 456864528712597520,
  "resolved_ts": 456906359397941262,
  "target_ts": 0,
  "checkpoint_tso": 456906359358619686,
  "checkpoint_time": "2025-03-26 11:23:59.692",
  "state": "normal",
  "error": null,
  "error_history": null,
  "creator_version": "v7.1.0",
  "task_status": [
    {
      "capture_id": "1abb1849-b1af-48d2-b0df-8615dd878dfa",
      "table_ids": [
        243,
        275,
        283
      ],
      "table_operations": null
    },
    {
      "capture_id": "3a33aaf6-3f59-4b0c-8a53-fd66a63deec7",
      "table_ids": [
        277,
        451
      ],
      "table_operations": null
    }
  ]
}
  1. By default, TiCDC only replicates tables that have a valid index: https://docs.pingcap.com/zh/tidb/stable/ticdc-overview/#有效索引 (a quick check is sketched after this list)
  2. Check the monitoring dashboards, focusing on the checkpoint-ts and resolved-ts metrics; see the TiCDC monitoring metrics analysis guide
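For point 1, a quick check from any MySQL client is to list the table's indexes and confirm there is a primary key or a NOT NULL unique index (connection details below are placeholders):

# Show the indexes of the non-replicating table; TiCDC needs a primary key
# or a NOT NULL unique index for the table to be eligible.
mysql -h <tidb-host> -P 4000 -u <user> -p -e "SHOW INDEX FROM db1.t1;"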

1. It has a primary key
[screenshot]

2.
[screenshot]


How can your checkpoint time be September 2024?! :scream:

You read that right: it is stuck, with a checkpoint lag of 24 weeks.

More importantly, no alert was raised.

Then it has probably exceeded the GC window already; the task is beyond recovery at this point. Delete it and create a new one. For monitoring, I have not used anything beyond writing my own shell script to send alerts, mainly by parsing the output of curl -X GET http://127.0.0.1:8300/api/v2/changefeeds?state=normal. The docs are at https://docs.pingcap.com/zh/tidb/stable/ticdc-open-api-v2/#查询同步任务列表 and a sketch of such a script is below. Hope this helps.
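A minimal sketch of such a script, assuming jq and GNU date are available, and that the v2 list response carries id and checkpoint_time per changefeed in an items array (as in the query output above); the threshold and alert action are placeholders:

#!/bin/bash
# Poll the TiCDC v2 API and alert when a "normal" changefeed's checkpoint
# lags wall-clock time by more than the threshold.
CDC_API="http://127.0.0.1:8300/api/v2/changefeeds?state=normal"
MAX_LAG_SECONDS=600   # alert threshold, e.g. 10 minutes

now=$(date +%s)
curl -s -X GET "$CDC_API" | jq -r '.items[] | "\(.id)\t\(.checkpoint_time)"' |
while IFS=$'\t' read -r id checkpoint_time; do
    # checkpoint_time looks like "2025-03-26 11:23:59.692";
    # strip the milliseconds so GNU date can parse it
    cp=$(date -d "${checkpoint_time%.*}" +%s)
    lag=$((now - cp))
    if [ "$lag" -gt "$MAX_LAG_SECONDS" ]; then
        echo "ALERT: changefeed $id checkpoint lag ${lag}s (state=normal but not advancing)"
        # hook in your alert channel here (webhook, mail, etc.)
    fi
done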

1. It may well have passed GC; the task has been recreated.

2. Besides this, I already collect a normal-state metric myself, but not checkpoint-lag; it seems worth adding.

Looking at this task's sink_uri, kafka-version is not specified at all.

It has exceeded the GC window; recreate the task.

1. After specifying kafka-version, updating the task reported an error;

2. Recreated without specifying the Kafka version, and it runs normally.
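For reference: changefeed update requires the task to be paused first, which is a common cause of update errors. The flow would look roughly like this (server address is a placeholder; the URI is taken from the config above, with kafka-version added):

cdc cli changefeed pause --server=http://127.0.0.1:8300 -c ticdc-topic-xxx
cdc cli changefeed update --server=http://127.0.0.1:8300 -c ticdc-topic-xxx \
    --sink-uri="kafka://kafka-ip:9092/ticdc-topic-xxx?protocol=canal-json&max-message-bytes=67108864&replication-factor=2&partition-num=8&kafka-version=2.2.1"
cdc cli changefeed resume --server=http://127.0.0.1:8300 -c ticdc-topic-xxx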