Why did the ticdc_processor_checkpoint_ts metric disappear?

Holland · October 16, 2024 06:34

[TiDB environment] Production
[TiDB version] 7.1.2, 7.5.1
[Steps to reproduce] What operations led to the problem

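[Problem: symptoms and impact]

Why did a TiCDC Prometheus monitoring metric suddenly disappear?

The ticdc_processor_checkpoint_ts metric could still be queried in version 6.5.1. After upgrading to 7.1.2 and 7.5.1, it is gone.

Below is the ticdc.rules.yml alert rules file from 7.5.1. It no longer contains ticdc_processor_checkpoint_ts. Our previous alert was based on that metric, and because it vanished after the upgrade, CDC replication was stalled for 2 days without any alert firing. However, the official documentation still lists the rule: https://docs.pingcap.com/zh/tidb/v7.5/ticdc-alert-rules#cdc_processor_checkpoint_tso_no_change_for_1m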

groups:
- name: alert.rules
  rules:
  # server related alter rules
  - alert: cdc_multiple_owners
    expr: sum(rate(ticdc_owner_ownership_counter[30s])) >= 2
    for: 1m
    labels:
      env: 8394e088-e799-4e24-aaec-d39c6a75ee38
      level: warning
      expr: sum(rate(ticdc_owner_ownership_counter[30s])) >= 2
    annotations:
      description: 'cluster: 8394e088-e799-4e24-aaec-d39c6a75ee38, instance: {{ $labels.instance }}, values: {{ $value }}'
      value: '{{ $value }}'
      summary: cdc cluster has multiple owners

  - alert: cdc_no_owner
    expr: sum(rate(ticdc_owner_ownership_counter[30s])) < 0.5
    for: 10m
    labels:
      env: 8394e088-e799-4e24-aaec-d39c6a75ee38
      level: warning
      expr: sum(rate(ticdc_owner_ownership_counter[30s])) < 0.5
    annotations:
      description: 'cluster: 8394e088-e799-4e24-aaec-d39c6a75ee38, instance: {{ $labels.instance }}, values: {{ $value }}'
      value: '{{ $value }}'
      summary: cdc cluster has no owner for more than 10 minutes

  # changefeed related alter rules
  - alert: ticdc_changefeed_failed
    expr: (max_over_time(ticdc_owner_status[1m]) == 2) > 0
    for: 1m
    labels:
      env: 8394e088-e799-4e24-aaec-d39c6a75ee38
      level: critical
      expr: (max_over_time(ticdc_owner_status[1m]) == 2) > 0
    annotations:
      description: 'cluster: 8394e088-e799-4e24-aaec-d39c6a75ee38, instance: {{ $labels.instance }}, values: {{ $value }}'
      value: '{{ $value }}'
      summary: cdc changefeed failed, it can not be automatically resumed

  - alert: cdc_checkpoint_high_delay
    expr: ticdc_owner_checkpoint_ts_lag > 600
    for: 1m
    labels:
      env: 8394e088-e799-4e24-aaec-d39c6a75ee38
      level: critical
      expr: ticdc_owner_checkpoint_ts_lag > 600
    annotations:
      description: 'cluster: 8394e088-e799-4e24-aaec-d39c6a75ee38, instance: {{ $labels.instance }}, values: {{ $value }}'
      value: '{{ $value }}'
      summary: cdc owner checkpoint delay more than 10 minutes

  - alert: cdc_resolvedts_high_delay
    expr: ticdc_owner_resolved_ts_lag > 300
    for: 1m
    labels:
      env: 8394e088-e799-4e24-aaec-d39c6a75ee38
      level: critical
      expr: ticdc_owner_resolved_ts_lag > 300
    annotations:
      description: 'cluster: 8394e088-e799-4e24-aaec-d39c6a75ee38, instance: {{ $labels.instance }}, values: {{ $value }}'
      value: '{{ $value }}'
      summary: cdc owner resolved ts delay more than 5 minutes

  - alert: ticdc_sink_execution_error
    expr: changes(ticdc_sink_execution_error[1m]) > 0
    for: 1m
    labels:
      env: 8394e088-e799-4e24-aaec-d39c6a75ee38
      level: warning
      expr: changes(ticdc_sink_execution_error[1m]) > 0
    annotations:
      description: 'cluster: 8394e088-e799-4e24-aaec-d39c6a75ee38, instance: {{ $labels.instance }}, values: {{ $value }}'
      value: '{{ $value }}'
      summary: cdc sink execution meets errors

  - alert: ticdc_processor_exit_with_error_count
    expr: changes(ticdc_processor_exit_with_error_count[1m]) > 0
    for: 1m
    labels:
      env: 8394e088-e799-4e24-aaec-d39c6a75ee38
      level: warning
      expr: changes(ticdc_processor_exit_with_error_count[1m]) > 0
    annotations:
      description: 'cluster: 8394e088-e799-4e24-aaec-d39c6a75ee38, instance: {{ $labels.instance }}, values: {{ $value }}'
      value: '{{ $value }}'
      summary: cdc processor exits with error

  - alert: ticdc_changefeed_meet_error
    expr: (max_over_time(ticdc_owner_status[1m]) == 1) > 0
    for: 1m
    labels:
      env: 8394e088-e799-4e24-aaec-d39c6a75ee38
      level: warning
      expr: (max_over_time(ticdc_owner_status[1m]) == 1) > 0
    annotations:
      description: 'cluster: 8394e088-e799-4e24-aaec-d39c6a75ee38, instance: {{ $labels.instance }}, values: {{ $value }}'
      value: '{{ $value }}'
      summary: cdc changefeed meet error

  # tikv related alter rules
  - alert: tikv_cdc_min_resolved_ts_no_change_for_1m
    expr: changes(tikv_cdc_min_resolved_ts[1m]) < 1 and ON (instance) tikv_cdc_region_resolve_status{status="resolved"} > 0 and ON (instance) tikv_cdc_captured_region_total > 0
    for: 1m
    labels:
      env: 8394e088-e799-4e24-aaec-d39c6a75ee38
      level: warning
      expr: changes(tikv_cdc_min_resolved_ts[1m]) < 1 and ON (instance) tikv_cdc_region_resolve_status{status="resolved"} > 0
    annotations:
      description: 'cluster: 8394e088-e799-4e24-aaec-d39c6a75ee38, instance: {{ $labels.instance }}, values: {{ $value }}'
      value: '{{ $labels.instance }}'
      summary: tikv cdc min resolved ts no change for 1m

  - alert: tikv_cdc_scan_duration_seconds_more_than_10min
    expr: histogram_quantile(0.9, rate(tikv_cdc_scan_duration_seconds_bucket{}[1m])) > 600
    for: 1m
    labels:
      env: 8394e088-e799-4e24-aaec-d39c6a75ee38
      level: warning
      expr: histogram_quantile(0.9, rate(tikv_cdc_scan_duration_seconds_bucket{}[1m])) > 600
    annotations:
      description: 'cluster: 8394e088-e799-4e24-aaec-d39c6a75ee38, instance: {{ $labels.instance }}, values: {{ $value }}'
      value: '{{ $value }}'
      summary: tikv cdc scan duration seconds more than 10 min

  - alert: ticdc_memory_abnormal
    expr: go_memstats_heap_alloc_bytes{job="ticdc"} > 1e+10
    for: 1m
    labels:
      env: 8394e088-e799-4e24-aaec-d39c6a75ee38
      level: warning
      expr: go_memstats_heap_alloc_bytes{job="ticdc"} > 1e+10
    annotations:
      description: 'cluster: 8394e088-e799-4e24-aaec-d39c6a75ee38, instance: {{ $labels.instance }}, values:{{ $value }}'
      value: '{{ $value }}'
      summary: TiCDC heap memory usage is over 10 GB

[Resource configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: screenshots/logs/monitoring]

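kkpeter · October 16, 2024 07:16

I checked, and it does seem to be gone. This metric is quite useful; we use it in production to determine whether replication tasks are healthy.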

小王同学Plus · October 16, 2024 07:24

Right. I checked, and ticdc_processor_checkpoint_ts was removed in "processor,scheduler(ticdc): clean up unused method and metrics" by overvenus (Pull Request #8049, pingcap/tiflow on GitHub). Even earlier, it had already been replaced by ticdc_owner_checkpoint_ts in "metrics: replace processor checkpoint with owner checkpoint" by overvenus (Pull Request #7051, pingcap/tiflow on GitHub). The documentation was probably not updated in time; consider submitting a PR to the docs.

You can base the alert on the ticdc_owner_checkpoint_ts metric instead.

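Holland · October 16, 2024 09:20

So how should the new one be used? Simply swapping ticdc_owner_checkpoint_ts into the old expression doesn't seem to work: for every replication task, the expression never drops below 1.

Right up until the task ended abnormally, the expression never dropped below 1.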

小龙虾爱大龙虾 · October 16, 2024 15:58

That is because the expression only evaluates to < 1 when CDC is stuck and the checkpoint is not advancing.
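
For reference, a minimal sketch of what such a rule could look like once the metric name is swapped. It assumes the documented cdc_processor_checkpoint_tso_no_change_for_1m rule used the form changes(<metric>[1m]) < 1; the alert name, the `level`, the summary text, and the `your-cluster-id` placeholder are illustrative and not taken from any shipped ticdc.rules.yml.

```yaml
groups:
- name: alert.rules
  rules:
  # Hypothetical rule: the alert name and summary are illustrative only.
  # changes() counts how many times the gauge value changed in the window,
  # so "< 1" matches only while the series exists and its value stays flat.
  - alert: cdc_owner_checkpoint_ts_no_change_for_1m
    expr: changes(ticdc_owner_checkpoint_ts[1m]) < 1
    for: 1m
    labels:
      env: your-cluster-id   # placeholder, replace with your cluster label
      level: warning
      expr: changes(ticdc_owner_checkpoint_ts[1m]) < 1
    annotations:
      description: 'cluster: your-cluster-id, instance: {{ $labels.instance }}, values: {{ $value }}'
      value: '{{ $value }}'
      summary: cdc owner checkpoint ts no change for 1m
```

A rule of this shape only catches a checkpoint that has stopped advancing. Changefeed state changes are covered by separate rules in the file pasted above, such as ticdc_changefeed_failed, which is based on ticdc_owner_status.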

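Holland · October 17, 2024 06:29

The task shown in the screenshot failed while replicating to Kafka and was interrupted with an error. In that kind of failure, does the expression also never go < 1?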

小龙虾爱大龙虾 · October 17, 2024 08:34

I hadn't noticed that. If this alert doesn't fire, some other alert should. Check whether any alerts fired on your side.

Holland · October 17, 2024 08:38

Yes, there are others.

system · October 24, 2024 08:38

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.