TiCDC mounter_unmarshal_and_mount_time metric keeps firing an alert

Culbr · June 21, 2021, 04:47

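[Background]
We replicate the binlog of some tables in one TiDB database to a downstream Kafka cluster. The configuration file is as follows: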

case-sensitive = true

enable-old-value = true

[filter]
rules = ['workorder_index.workorder', 'workorder_index.workflow', 'workorder_index.workflow_reply']

[mounter]
worker-num = 16

[sink]
dispatchers = [
    {matcher = ['workorder_index.*'], dispatcher = "rowid"}
]
protocol = "canal-json"

[cyclic-replication]
enable = false

Launch command:

tiup ctl cdc changefeed create \
--pd=http://[PD_IP]:2379 \
--sink-uri="kafka://[KAFKA_IP]:9092/ticdc_canal_workorder_index?kafka-version=2.2.1&partition-num=20&replication-factor=3&protocol=canal-json&max-message-bytes=10485760" \
--changefeed-id="workorder-index" \
--config cdc_workorder_index.toml
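
The sink URI above carries the Kafka settings as query parameters, so it can be checked mechanically before creating the changefeed. The following is only a minimal sketch using Python's standard library; the helper name `parse_sink_uri` and the host `kafka-host` are hypothetical (the real address is redacted as [KAFKA_IP] in the post), and this is not how TiCDC itself parses the URI.

```python
from urllib.parse import urlsplit, parse_qs


def parse_sink_uri(uri):
    """Split a kafka:// sink URI into topic, endpoint, and query parameters."""
    parts = urlsplit(uri)
    # parse_qs returns a list per key; each key appears once here, so keep the first value.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "topic": parts.path.lstrip("/"),
        "params": params,
    }


# "kafka-host" is a placeholder standing in for the redacted [KAFKA_IP].
SINK_URI = (
    "kafka://kafka-host:9092/ticdc_canal_workorder_index"
    "?kafka-version=2.2.1&partition-num=20&replication-factor=3"
    "&protocol=canal-json&max-message-bytes=10485760"
)

info = parse_sink_uri(SINK_URI)
max_bytes = int(info["params"]["max-message-bytes"])
print(info["topic"], info["params"]["protocol"], max_bytes // (1024 * 1024), "MiB")
```

In this thread the `protocol` parameter in the URI and the `protocol` key in the `[sink]` section of the config file are both `canal-json`, so the two sources agree.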

[Problem]
The binlog replicates normally, but the following alert fires continuously and has not stopped since the task was started.

[1] Firing
Labels
alertname = ticdc_mounter_unmarshal_and_mount_time_more_than_1s
capture = [CDC_IP]:8300
changefeed = workorder-index
cluster = sht-tidb-cluster-pro
env = ENV_LABELS_ENV
expr = histogram_quantile(0.9, rate(ticdc_mounter_unmarshal_and_mount_bucket[1m])) * 1000 > 1000
instance = [CDC_IP]:8300
job = ticdc
level = warning
monitor = prometheus
Annotations
description = cluster: ENV_LABELS_ENV, instance: [CDC_IP]:8300, values: 4600
summary = cdc_mounter unmarshal and mount time more than 1s
value = 4600
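
To make the alert expression easier to read: `histogram_quantile(0.9, ...)` estimates the 90th percentile from cumulative histogram buckets, and the rule multiplies the result (in seconds) by 1000 and compares it with 1000 ms. Below is a minimal sketch of that estimation using linear interpolation inside the bucket, in the spirit of the Prometheus function. The bucket boundaries and counts are made-up illustration values, not data from this cluster, and the real rule feeds per-second rates rather than raw counts into the same calculation.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper) or count == prev_count:
                # Rank falls in the +Inf bucket (or an empty one):
                # fall back to the last finite lower bound.
                return prev_bound
            # Linear interpolation between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return prev_bound


# Hypothetical cumulative buckets, upper bounds in seconds.
example_buckets = [(0.5, 50), (1.0, 60), (2.0, 70), (5.0, 100), (math.inf, 100)]

p90_seconds = histogram_quantile(0.9, example_buckets)  # 4.0 for these sample buckets
p90_ms = p90_seconds * 1000
alert_fires = p90_ms > 1000
print(p90_ms, alert_fires)
```

Because the estimate is interpolated within a bucket, a reported value such as 4600 ms only says that the 90th percentile landed in a bucket spanning that value; it is not an exact measurement.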

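The cluster has 3 TiCDC instances running 3 CDC tasks. The other two tasks are normal, and the task with the problem is the one with the smallest data volume.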

[TiDB version]
4.0.10

[Attachments]
Grafana monitoring charts:

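yilong · June 21, 2021, 12:56

The documentation describes these panels as follows:

Mounter unmarshal duration: a histogram of the time TiCDC nodes take to decode data changes
Mounter unmarshal duration percentile: the time (P95, P99, and P99.9) TiCDC takes to decode data changes within one second

Could you share the hardware configuration of these 3 TiCDC servers and their load? Please check CPU, memory, and I/O usage. Thanks.
Has the latency on this node always been this high?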

Culbr · June 22, 2021, 07:40

Hi,

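All 3 CDC machines have 16 cores, 64 GB of memory, and a 500 GB SSD, and the load shown by top and iotop is quite low. Also, if I restart the CDC task that has the problem, the latency drops for a while (to the millisecond level, the same as the other CDC nodes). After several hours it goes back to what the chart above shows.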

yilong · June 22, 2021, 11:33

Could you please upload ticdc.log? Thanks.

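Culbr · June 23, 2021, 08:15

This is today's log file from the CDC node that has the problem. According to the monitoring, the largest unmarshal durations are concentrated between 8:00 and 12:00 in the morning (yesterday was about the same), but the log does not seem to contain any obvious error. I also checked the DM task for the corresponding database, and its traffic is low (a dozen or so QPS). Please help locate the problem. Thank you very much.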

cdc.log.0623 (577.1 KB)

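yilong · June 23, 2021, 08:20

Please run tiup cluster display and share the topology information. Thanks.
Please share tikv.log for this time range. Thanks.
Could you open the second panel and check the p999 latency?
Please share the complete detail-tikv and ticdc monitoring for this time range.
[FAQ] Exporting and importing Grafana Metrics pages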

Forrestshao · September 14, 2021, 06:32

May I ask whether this alert was ever resolved? I am seeing the same problem on my side.

Billmay表妹 · November 1, 2021, 05:41

If you are running into the same problem, please open a new post describing your issue, and the technical staff will help you with it.

system · October 31, 2022, 19:20

This topic was automatically closed 1 minute after the last reply. New replies are no longer allowed.