tidb v3.0.5 TiDB 集群启动,必发生一次 TiKV_GC_can_not_work 告警

版本

TiDB:v3.0.5

告警规则配置

 16   - alert: TiKV_GC_can_not_work
 17     expr: sum(increase(tikv_gcworker_gc_tasks_vec{task="gc"}[1d])) < 1
 18     for: 1m
 19     labels:
 20       env: ENV_LABELS_ENV
 21       level: emergency
 22       expr: sum(increase(tikv_gcworker_gc_tasks_vec{task="gc"}[1d])) < 1
 23     annotations:
 24       description: 'cluster: ENV_LABELS_ENV, instance: {{ $labels.instance }}, values:{{ $value }}'
 25       value: '{{ $value }}'
 26       summary: TiKV GC can not work

现象
拉起新集群 必定 产生一次 [TiKV_GC_can_not_work 告警]

持续时间大概一次GC的间隔时间10min

排查结果
https://pingcap.com/docs-cn/stable/alert-rules/#tikv_gc_can_not_work 基本确认跟文档中描述一致,GC正常

疑问

是什么原因造成这种现象的?数据采集抖动?

你好,

辛苦上传下出错 tikv 的 tikv.log 看下是否有更加详细的报错,

{"log":"[2020/06/17 14:21:29.654 +08:00] [INFO] [mod.rs:334] [\"starting working thread\"] [worker=split-check]\
","stream":"stderr","time":"2020-06-17T06:21:29.654235548Z"}
{"log":"[2020/06/17 14:21:29.733 +08:00] [INFO] [mod.rs:334] [\"starting working thread\"] [worker=snapshot-worker]\
","stream":"stderr","time":"2020-06-17T06:21:29.733120358Z"}
{"log":"[2020/06/17 14:21:29.763 +08:00] [INFO] [mod.rs:334] [\"starting working thread\"] [worker=raft-gc-worker]\
","stream":"stderr","time":"2020-06-17T06:21:29.763360495Z"}
{"log":"[2020/06/17 14:21:29.793 +08:00] [INFO] [mod.rs:334] [\"starting working thread\"] [worker=compact-worker]\
","stream":"stderr","time":"2020-06-17T06:21:29.793921818Z"}
{"log":"[2020/06/17 14:21:29.823 +08:00] [INFO] [future.rs:131] [\"starting working thread\"] [worker=pd-worker]\
","stream":"stderr","time":"2020-06-17T06:21:29.823980845Z"}
{"log":"[2020/06/17 14:21:29.825 +08:00] [INFO] [mod.rs:334] [\"starting working thread\"] [worker=consistency-check]\
","stream":"stderr","time":"2020-06-17T06:21:29.82544529Z"}
{"log":"[2020/06/17 14:21:29.825 +08:00] [INFO] [mod.rs:334] [\"starting working thread\"] [worker=cleanup-sst]\
","stream":"stderr","time":"2020-06-17T06:21:29.825575964Z"}
{"log":"[2020/06/17 14:21:29.826 +08:00] [WARN] [store.rs:1116] [\"set thread priority for raftstore failed\"] [error=\"Os { code: 13, kind: PermissionDenied, message: \\\"Permission denied\\\" }\"]\
","stream":"stderr","time":"2020-06-17T06:21:29.82680576Z"}
{"log":"[2020/06/17 14:21:29.826 +08:00] [INFO] [node.rs:160] [\"put store to PD\"] [store=\"id: 4 address: \\\"---------------\\\" labels { key: \\\"host\\\" value: \\\"------------\\\" } labels { key: \\\"zone\\\" value: \\\"---------\\\" } labels { key: \\\"rack\\\" value: \\\"-----------\\\" } version: \\\"3.0.5\\\"\"]\
","stream":"stderr","time":"2020-06-17T06:21:29.826875336Z"}
{"log":"[2020/06/17 14:21:29.830 +08:00] [INFO] [gc_worker.rs:759] [\"gc-manager is initializing\"]\
","stream":"stderr","time":"2020-06-17T06:21:29.830428768Z"}
{"log":"[2020/06/17 14:21:29.833 +08:00] [INFO] [gc_worker.rs:762] [\"gc-manager started\"] [safe_point=0]\
","stream":"stderr","time":"2020-06-17T06:21:29.833401259Z"}
{"log":"[2020/06/17 14:21:29.836 +08:00] [INFO] [gc_worker.rs:727] [\"gc-manager is started\"]\
","stream":"stderr","time":"2020-06-17T06:21:29.836176451Z"}
{"log":"[2020/06/17 14:21:29.836 +08:00] [INFO] [future.rs:131] [\"starting working thread\"] [worker=waiter-manager]\
","stream":"stderr","time":"2020-06-17T06:21:29.836233513Z"}
{"log":"[2020/06/17 14:21:29.837 +08:00] [INFO] [future.rs:131] [\"starting working thread\"] [worker=deadlock-detector]\
","stream":"stderr","time":"2020-06-17T06:21:29.837347672Z"}
{"log":"[2020/06/17 14:21:29.841 +08:00] [INFO] [mod.rs:334] [\"starting working thread\"] [worker=snap-handler]\
","stream":"stderr","time":"2020-06-17T06:21:29.841157353Z"}
{"log":"[2020/06/17 14:21:29.842 +08:00] [INFO] [tcp_server_posix.cc:308] [\"Failed to add :: listener, the environment may not support IPv6: {\\\"created\\\":\\\"@1592374889.841624139\\\",\\\"description\\\":\\\"OS Error\\\",\\\"errno\\\":97,\\\"file\\\":\\\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.4.5/grpc/src/core/lib/iomgr/socket_utils_common_posix.cc\\\",\\\"file_line\\\":379,\\\"os_error\\\":\\\"Address family not supported by protocol\\\",\\\"syscall\\\":\\\"socket\\\",\\\"target_address\\\":\\\"[::]:10020\\\"}\"]\
","stream":"stderr","time":"2020-06-17T06:21:29.842498883Z"}
{"log":"[2020/06/17 14:21:29.842 +08:00] [INFO] [server.rs:199] [\"listening on addr\"] [addr=0.0.0.0:10020]\
","stream":"stderr","time":"2020-06-17T06:21:29.842552048Z"}
{"log":"[2020/06/17 14:21:29.851 +08:00] [INFO] [server.rs:217] [\"TiKV is ready to serve\"]\
","stream":"stderr","time":"2020-06-17T06:21:29.851974519Z"}
{"log":"[2020/06/17 14:21:29.855 +08:00] [INFO] [status_server.rs:29] [\"Status server started\"]\
","stream":"stderr","time":"2020-06-17T06:21:29.855670061Z"}
{"log":"[2020/06/17 14:21:39.640 +08:00] [INFO] [util.rs:73] [\"heartbeat receiver is refreshed\"]\
","stream":"stderr","time":"2020-06-17T06:21:39.640119334Z"}
{"log":"[2020/06/17 14:21:41.421 +08:00] [INFO] [kv.rs:705] [\"batch_raft RPC is called, new gRPC stream established\"]\
","stream":"stderr","time":"2020-06-17T06:21:41.421228704Z"}


....


{"log":"[2020/06/17 14:24:46.051 +08:00] [INFO] [raft.rs:858] [\"[region 212] 214 received MsgRequestVoteResponse from 215 at term 7\"]\
","stream":"stderr","time":"2020-06-17T06:24:46.051915123Z"}
{"log":"[2020/06/17 14:24:46.051 +08:00] [INFO] [raft.rs:1587] [\"[region 212] 214 [quorum:2] has received 2 MsgRequestVoteResponse votes and 0 vote rejections\"]\
","stream":"stderr","time":"2020-06-17T06:24:46.05196144Z"}
{"log":"[2020/06/17 14:24:46.051 +08:00] [INFO] [raft.rs:793] [\"[region 212] 214 became leader at term 7\"]\
","stream":"stderr","time":"2020-06-17T06:24:46.051996949Z"}
{"log":"[2020/06/17 14:32:36.663 +08:00] [INFO] [process.rs:188] [\"get snapshot failed\"] [err=\"Request(message: \\\"peer is not leader for region 64, leader may Some(id: 67 store_id: 6)\\\" not_leader { region_id: 64 leader { id: 67 store_id: 6 } })\"] [cid=212]\
","stream":"stderr","time":"2020-06-17T06:32:36.663726006Z"}
{"log":"[2020/06/17 14:33:22.702 +08:00] [ERROR] [endpoint.rs:454] [error-response] [err=\"region message: \\\"peer is not leader for region 64, leader may Some(id: 67 store_id: 6)\\\" not_leader { region_id: 64 leader { id: 67 store_id: 6 } }\"]\
","stream":"stderr","time":"2020-06-17T06:33:22.702301308Z"}
{"log":"[2020/06/17 14:33:22.702 +08:00] [ERROR] [endpoint.rs:454] [error-response] [err=\"region message: \\\"peer is not leader for region 64, leader may Some(id: 67 store_id: 6)\\\" not_leader { region_id: 64 leader { id: 67 store_id: 6 } }\"]\
","stream":"stderr","time":"2020-06-17T06:33:22.702355495Z"}
{"log":"[2020/06/17 14:34:29.884 +08:00] [INFO] [gc_worker.rs:861] [\"gc_worker: start auto gc\"] [safe_point=417431540617445376]\
","stream":"stderr","time":"2020-06-17T06:34:29.884202216Z"}
{"log":"[2020/06/17 14:34:29.894 +08:00] [INFO] [gc_worker.rs:901] [\"gc_worker: finished auto gc\"] [processed_regions=18]\
","stream":"stderr","time":"2020-06-17T06:34:29.894234743Z"}
{"log":"[2020/06/17 14:44:16.694 +08:00] [INFO] [gc_worker.rs:300] [\"unsafe destroy range started\"] [end_key=7480000000000000FF7000000000000000F8] [start_key=7480000000000000FF6F00000000000000F8]\
","stream":"stderr","time":"2020-06-17T06:44:16.694475106Z"}
{"log":"[2020/06/17 14:44:16.694 +08:00] [INFO] [gc_worker.rs:336] [\"unsafe destroy range finished deleting files in range\"] [cost_time=13.578µs] [end_key=7480000000000000FF7000000000000000F8] [start_key=7480000000000000FF6F00000000000000F8]\
","stream":"stderr","time":"2020-06-17T06:44:16.69449603Z"}
{"log":"[2020/06/17 14:44:16.694 +08:00] [INFO] [gc_worker.rs:374] [\"unsafe destroy range finished cleaning up all\"] [cost_time=35.777µs] [end_key=7480000000000000FF7000000000000000F8] [start_key=7480000000000000FF6F00000000000000F8]\
","stream":"stderr","time":"2020-06-17T06:44:16.694528217Z"}
{"log":"[2020/06/17 14:44:16.722 +08:00] [INFO] [gc_worker.rs:300] [\"unsafe destroy range started\"] [end_key=7480000000000000FF2F5F698000000000FF0000020000000000FA] [start_key=7480000000000000FF2F5F698000000000FF0000010000000000FA]\
","stream":"stderr","time":"2020-06-17T06:44:16.722764879Z"}
{"log":"[2020/06/17 14:44:16.722 +08:00] [INFO] [gc_worker.rs:336] [\"unsafe destroy range finished deleting files in range\"] [cost_time=11.159µs] [end_key=7480000000000000FF2F5F698000000000FF0000020000000000FA] [start_key=7480000000000000FF2F5F698000000000FF0000010000000000FA]\
","stream":"stderr","time":"2020-06-17T06:44:16.722788978Z"}
{"log":"[2020/06/17 14:44:16.722 +08:00] [INFO] [gc_worker.rs:374] [\"unsafe destroy range finished cleaning up all\"] [cost_time=83.252µs] [end_key=7480000000000000FF2F5F698000000000FF0000020000000000FA] [start_key=7480000000000000FF2F5F698000000000FF0000010000000000FA]\
","stream":"stderr","time":"2020-06-17T06:44:16.722870583Z"}
{"log":"[2020/06/17 14:44:29.932 +08:00] [INFO] [gc_worker.rs:861] [\"gc_worker: start auto gc\"] [safe_point=417431697903845376]\
","stream":"stderr","time":"2020-06-17T06:44:29.932298251Z"}
{"log":"[2020/06/17 14:44:29.960 +08:00] [INFO] [gc_worker.rs:901] [\"gc_worker: finished auto gc\"] [processed_regions=18]\
","stream":"stderr","time":"2020-06-17T06:44:29.961003509Z"}
{"log":"[2020/06/17 14:54:29.999 +08:00] [INFO] [gc_worker.rs:861] [\"gc_worker: start auto gc\"] [safe_point=417431855190245376]\
","stream":"stderr","time":"2020-06-17T06:54:29.999832466Z"}
{"log":"[2020/06/17 14:54:30.011 +08:00] [INFO] [gc_worker.rs:901] [\"gc_worker: finished auto gc\"] [processed_regions=18]\
","stream":"stderr","time":"2020-06-17T06:54:30.011217412Z"}
{"log":"[2020/06/17 15:04:30.048 +08:00] [INFO] [gc_worker.rs:861] [\"gc_worker: start auto gc\"] [safe_point=417432012476645376]\
","stream":"stderr","time":"2020-06-17T07:04:30.048993304Z"}
{"log":"[2020/06/17 15:04:30.057 +08:00] [INFO] [gc_worker.rs:901] [\"gc_worker: finished auto gc\"] [processed_regions=18]\
","stream":"stderr","time":"2020-06-17T07:04:30.05714544Z"}
{"log":"[2020/06/17 15:14:30.093 +08:00] [INFO] [gc_worker.rs:861] [\"gc_worker: start auto gc\"] [safe_point=417432169763045376]\
","stream":"stderr","time":"2020-06-17T07:14:30.09379631Z"}
{"log":"[2020/06/17 15:14:30.102 +08:00] [INFO] [gc_worker.rs:901] [\"gc_worker: finished auto gc\"] [processed_regions=18]\
","stream":"stderr","time":"2020-06-17T07:14:30.102481257Z"}
{"log":"[2020/06/17 15:24:30.139 +08:00] [INFO] [gc_worker.rs:861] [\"gc_worker: start auto gc\"] [safe_point=417432327049445376]\
","stream":"stderr","time":"2020-06-17T07:24:30.139635593Z"}
{"log":"[2020/06/17 15:24:30.150 +08:00] [INFO] [gc_worker.rs:901] [\"gc_worker: finished auto gc\"] [processed_regions=18]\
","stream":"stderr","time":"2020-06-17T07:24:30.150440727Z"}


告警时间段为 20200617/14:35 – 20200617/14:44

另外kv节点上日志近似,给出了一个kv节点上的日志

此问题是已知问题,在3.0 引入分布式 gc 会有误报警问题,可以升级到 tidb 3.x 最新版,或者 4.x 。gc worker 工作是正常的。

解决方案是:将此报警规则改成 sum(increase(tikv_gcworker_gc_tasks_vec{task="gc"}[1d])) 即可

此话题已在最后回复的 1 分钟后被自动关闭。不再允许新回复。