Pushgateway performs poorly, so pushing metrics often fails

TiKV frequently logs the following error:

[2020/06/02 10:42:26.010 +08:00] [ERROR] [mod.rs:64] ["fail to push metrics"] [err="Msg("error sending request for url (http://192.168.1.73** :9091/metrics/job/tikv_6106820/instance/tikv95): operation timed out")"]

Pushgateway version:

pushgateway, version 1.2.0 (branch: HEAD, revision: b7e0167e9574f4f88404dde9653ee1d3c940f2eb)
  build user:       root@0e823ccfff84
  build date:       20200311-18:51:01
  go version:       go1.13.8

I watched Pushgateway's CPU usage with perf top, and the load appears to be related to checkMetricConsistency:

  16.53%  pushgateway       [.] github.com/prometheus/client_golang/prometheus/internal.metricSorter.Less
   7.93%  pushgateway       [.] github.com/prometheus/client_golang/prometheus.checkMetricConsistency
   7.24%  pushgateway       [.] compress/flate.(*compressor).deflate
   6.19%  pushgateway       [.] runtime.scanobject
   5.51%  pushgateway       [.] compress/flate.(*compressor).findMatch
   4.11%  pushgateway       [.] github.com/cespare/xxhash/v2.(*Digest).Write
   3.61%  pushgateway       [.] runtime.memmove
   3.35%  pushgateway       [.] runtime.greyobject
   2.88%  pushgateway       [.] cmpbody

Pushgateway itself logs the following:

level=info ts=2020-06-02T02:55:22.551Z caller=diskmetricstore.go:165 msg="metric families inconsistent help strings" err="Metric families have inconsistent help strings. The latter will have priority. This is bad. Fix your pushed metrics!" new="name:\"go_gc_duration_seconds\" help:\"A summary of the GC invocation durations.\" type:SUMMARY metric:<label:<name:\"instance\" value:\"pd80\" > label:<name:\"job\" value:\"pd80\" > summary:<sample_count:639 sample_sum:0.060894421 quantile:<quantile:0 value:3.5733e-05 > quantile:<quantile:0.25 value:7.6505e-05 > quantile:<quantile:0.5 value:9.0162e-05 > quantile:<quantile:0.75 value:0.000106062 > quantile:<quantile:1 value:0.00209099 > > > " old="name:\"go_gc_duration_seconds\" help:\"A summary of the pause duration of garbage collection cycles.\" type:SUMMARY metric:<label:<name:\"instance\" value:\"nm-new_4000\" > label:<name:\"job\" value:\"tidb\" > summary:<sample_count:291387 sample_sum:36.927654353 quantile:<quantile:0 value:3.0795e-05 > quantile:<quantile:0.25 value:4.4548e-05 > quantile:<quantile:0.5 value:5.6141e-05 > quantile:<quantile:0.75 value:7.1326e-05 > quantile:<quantile:1 value:0.000237697 > > > metric:<label:<name:\"instance\" value:\"pd82\" > label:<name:\"job\" value:\"pd82\" > summary:<sample_count:639 sample_sum:0.187962472 quantile:<quantile:0 value:7.6433e-05 > quantile:<quantile:0.25 value:0.000125852 > quantile:<quantile:0.5 value:0.000149738 > quantile:<quantile:0.75 value:0.000175919 > quantile:<quantile:1 value:0.095060158 > > > metric:<label:<name:\"instance\" value:\"nana02_4000\" > label:<name:\"job\" value:\"tidb\" > summary:<sample_count:2715 sample_sum:1.140953048 quantile:<quantile:0 value:7.8814e-05 > quantile:<quantile:0.25 value:0.000139369 > quantile:<quantile:0.5 value:0.000182838 > quantile:<quantile:0.75 value:0.000232283 > quantile:<quantile:1 value:0.025967204 > > > metric:<label:<name:\"instance\" value:\"pd86\" > label:<name:\"job\" value:\"pd86\" > summary:<sample_count:2523 
sample_sum:0.422274003 quantile:<quantile:0 value:5.3267e-05 > quantile:<quantile:0.25 value:0.000149183 > quantile:<quantile:0.5 value:0.000170963 > quantile:<quantile:0.75 value:0.000203436 > quantile:<quantile:1 value:0.000526604 > > > metric:<label:<name:\"instance\" value:\"pd81\" > label:<name:\"job\" value:\"pd81\" > summary:<sample_count:640 sample_sum:0.067000798 quantile:<quantile:0 value:6.7459e-05 > quantile:<quantile:0.25 value:8.9384e-05 > quantile:<quantile:0.5 value:0.000107208 > quantile:<quantile:0.75 value:0.000124537 > quantile:<quantile:1 value:0.000462857 > > > metric:<label:<name:\"instance\" value:\"pd83\" > label:<name:\"job\" value:\"pd83\" > summary:<sample_count:637 sample_sum:0.056335101 quantile:<quantile:0 value:4.0852e-05 > quantile:<quantile:0.25 value:7.788e-05 > quantile:<quantile:0.5 value:9.1012e-05 > quantile:<quantile:0.75 value:0.000103816 > quantile:<quantile:1 value:0.00039727 > > > metric:<label:<name:\"instance\" value:\"pd84\" > label:<name:\"job\" value:\"pd84\" > summary:<sample_count:638 sample_sum:0.083071867 quantile:<quantile:0 value:7.3848e-05 > quantile:<quantile:0.25 value:9.716e-05 > quantile:<quantile:0.5 value:0.000116701 > quantile:<quantile:0.75 value:0.000141285 > quantile:<quantile:1 value:0.003605592 > > > metric:<label:<name:\"instance\" value:\"pd85\" > label:<name:\"job\" value:\"pd85\" > summary:<sample_count:656 sample_sum:0.078868245 quantile:<quantile:0 value:4.968e-05 > quantile:<quantile:0.25 value:9.6586e-05 > quantile:<quantile:0.5 value:0.000115319 > quantile:<quantile:0.75 value:0.000131384 > quantile:<quantile:1 value:0.000301089 > > > "
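
For context, the message above is reported when pushes that share a metric name carry different help strings. Here go_gc_duration_seconds arrives with "A summary of the GC invocation durations." from pd80 and with "A summary of the pause duration of garbage collection cycles." from the tidb job, which may mean the two processes were built against different Prometheus client library versions. The following is only a minimal sketch of that kind of check, not Pushgateway's actual implementation. The function name, the tuple layout, and the example_requests_total metric are hypothetical, and the sketch does not show whether this check is what causes the push timeouts.

```python
from collections import defaultdict


def find_inconsistent_help(pushes):
    """Return {metric_name: sorted help strings} for names with more than one help string.

    `pushes` is a list of (job, instance, metric_name, help_text) tuples.
    This layout is a hypothetical simplification of a pushed metric family.
    """
    help_by_name = defaultdict(set)
    for _job, _instance, name, help_text in pushes:
        help_by_name[name].add(help_text)
    # Keep only the metric names that were pushed with conflicting help text.
    return {
        name: sorted(helps)
        for name, helps in help_by_name.items()
        if len(helps) > 1
    }


# The first two entries use the help strings from the log above.
# The third entry is an invented metric with a single, consistent help string.
pushes = [
    ("tidb", "nm-new_4000", "go_gc_duration_seconds",
     "A summary of the pause duration of garbage collection cycles."),
    ("pd80", "pd80", "go_gc_duration_seconds",
     "A summary of the GC invocation durations."),
    ("tidb", "nm-new_4000", "example_requests_total",
     "Hypothetical counter used only for this example."),
]

for name, helps in find_inconsistent_help(pushes).items():
    print(name, helps)
```

Running the sketch flags only go_gc_duration_seconds, because it is the one name pushed with two different help strings.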

These metrics are all pushed from the PD servers. Is this caused by an incompatible Pushgateway version?

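Hello,

Please provide the following:

  1. The TiDB version
  2. How the cluster was deployed

In TiDB v2.1.3 and later, Prometheus can pull metrics directly from TiKV, so the Pushgateway component has been removed.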
