TiFlash replicates data from TiKV very slowly

Version: 4.0.8

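Data import on the TiFlash node is very slow, roughly 2 Regions per minute. Does our configuration need tuning? We have already adjusted it following the documentation, but that does not seem to have helped.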

Statements executed:

ALTER TABLE summary_register SET TIFLASH REPLICA 1;
ALTER TABLE summary_register_day SET TIFLASH REPLICA 1;

Log excerpts taken roughly 5 minutes apart:

2020-11-25 13:16:48,898  TiFlashManager: report to tidb: id 16593, region_count 85, flash_region_count 54
2020-11-25 13:16:53,822  TiFlashManager: report to tidb: id 16016, region_count 4850, flash_region_count 1839
2020-11-25 13:16:53,844  TiFlashManager: report to tidb: id 16593, region_count 85, flash_region_count 54
2020-11-25 13:16:58,824  TiFlashManager: report to tidb: id 16593, region_count 85, flash_region_count 54
2020-11-25 13:16:58,857  TiFlashManager: report to tidb: id 16016, region_count 4850, flash_region_count 1839
2020-11-25 13:17:03,827  TiFlashManager: report to tidb: id 16016, region_count 4850, flash_region_count 1839
2020-11-25 13:17:03,849  TiFlashManager: report to tidb: id 16593, region_count 85, flash_region_count 54
2020-11-25 13:17:08,842  TiFlashManager: report to tidb: id 16016, region_count 4850, flash_region_count 1839
2020-11-25 13:17:08,862  TiFlashManager: report to tidb: id 16593, region_count 85, flash_region_count 54
2020-11-25 13:17:14,032  TiFlashManager: report to tidb: id 16593, region_count 85, flash_region_count 54
2020-11-25 13:17:14,066  TiFlashManager: report to tidb: id 16016, region_count 4850, flash_region_count 1839
2020-11-25 13:17:18,808  TiFlashManager: report to tidb: id 16593, region_count 85, flash_region_count 54
2020-11-25 13:17:18,842  TiFlashManager: report to tidb: id 16016, region_count 4850, flash_region_count 1839

2020-11-25 13:20:33,846 TiFlashManager: report to tidb: id 16593, region_count 85, flash_region_count 60
2020-11-25 13:20:38,800 TiFlashManager: report to tidb: id 16016, region_count 4850, flash_region_count 1849
2020-11-25 13:20:38,822 TiFlashManager: report to tidb: id 16593, region_count 85, flash_region_count 60
2020-11-25 13:20:43,832 TiFlashManager: report to tidb: id 16016, region_count 4850, flash_region_count 1849
2020-11-25 13:20:43,855 TiFlashManager: report to tidb: id 16593, region_count 85, flash_region_count 60
2020-11-25 13:20:48,877 TiFlashManager: report to tidb: id 16016, region_count 4850, flash_region_count 1849
2020-11-25 13:20:48,899 TiFlashManager: report to tidb: id 16593, region_count 85, flash_region_count 60
2020-11-25 13:20:53,831 TiFlashManager: report to tidb: id 16016, region_count 4850, flash_region_count 1849
2020-11-25 13:20:53,853 TiFlashManager: report to tidb: id 16593, region_count 85, flash_region_count 60
2020-11-25 13:20:58,806 TiFlashManager: report to tidb: id 16593, region_count 85, flash_region_count 60
2020-11-25 13:20:58,842 TiFlashManager: report to tidb: id 16016, region_count 4850, flash_region_count 1849
2020-11-25 13:21:03,846 TiFlashManager: report to tidb: id 16016, region_count 4850, flash_region_count 1849
2020-11-25 13:21:03,869 TiFlashManager: report to tidb: id 16593, region_count 85, flash_region_count 60
2020-11-25 13:21:08,844 TiFlashManager: report to tidb: id 16016, region_count 4850, flash_region_count 1849
2020-11-25 13:21:08,866 TiFlashManager: report to tidb: id 16593, region_count 85, flash_region_count 60
2020-11-25 13:21:13,828 TiFlashManager: report to tidb: id 16016, region_count 4850, flash_region_count 1849
2020-11-25 13:21:13,849 TiFlashManager: report to tidb: id 16593, region_count 85, flash_region_count 60
2020-11-25 13:21:18,819 TiFlashManager: report to tidb: id 16016, region_count 4850, flash_region_count 1849
2020-11-25 13:21:18,840 TiFlashManager: report to tidb: id 16593, region_count 85, flash_region_count 60
2020-11-25 13:21:23,833 TiFlashManager: report to tidb: id 16016, region_count 4850, flash_region_count 1849
2020-11-25 13:21:23,856 TiFlashManager: report to tidb: id 16593, region_count 85, flash_region_count 60
2020-11-25 13:21:28,829 TiFlashManager: report to tidb: id 16016, region_count 4850, flash_region_count 1849
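
The replication rate can be read off the flash_region_count values in these reports. Below is a minimal Python sketch that does the arithmetic on four lines copied from the excerpts above. The function name replication_rates and the regular expression are illustrative only and are not part of any TiDB tool.

```python
import re
from datetime import datetime

# First and last report per table, copied from the log excerpts above.
LOG = """\
2020-11-25 13:16:48,898  TiFlashManager: report to tidb: id 16593, region_count 85, flash_region_count 54
2020-11-25 13:16:53,822  TiFlashManager: report to tidb: id 16016, region_count 4850, flash_region_count 1839
2020-11-25 13:20:33,846 TiFlashManager: report to tidb: id 16593, region_count 85, flash_region_count 60
2020-11-25 13:20:38,800 TiFlashManager: report to tidb: id 16016, region_count 4850, flash_region_count 1849
"""

LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),(\d{3})\s+"
    r"TiFlashManager: report to tidb: "
    r"id (\d+), region_count (\d+), flash_region_count (\d+)"
)


def replication_rates(log_text):
    """Return {table_id: regions_per_minute} between the first and last report."""
    first, last = {}, {}
    for line in log_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that are not TiFlashManager reports
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        ts = ts.replace(microsecond=int(m.group(2)) * 1000)
        table_id, done = int(m.group(3)), int(m.group(5))
        first.setdefault(table_id, (ts, done))
        last[table_id] = (ts, done)

    rates = {}
    for table_id, (t0, n0) in first.items():
        t1, n1 = last[table_id]
        minutes = (t1 - t0).total_seconds() / 60
        if minutes > 0:
            rates[table_id] = (n1 - n0) / minutes
    return rates


rates = replication_rates(LOG)
for table_id, rate in sorted(rates.items()):
    print(f"table {table_id}: {rate:.2f} regions/min")
```

On these samples the result is roughly 2.7 Regions per minute for table 16016 and 1.6 for table 16593, in line with the rate described above.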

Configuration (output of config show all):

{
  "client-urls": "http://0.0.0.0:2379",
  "peer-urls": "http://10.155.111.139:2380",
  "advertise-client-urls": "http://10.155.111.139:2379",
  "advertise-peer-urls": "http://10.155.111.139:2380",
  "name": "pd_10.155.111.139",
  "data-dir": "/home/tidb/deploy/data.pd",
  "force-new-cluster": false,
  "enable-grpc-gateway": true,
  "initial-cluster": "pd_10.155.111.139=http://10.155.111.139:2380,pd_52-136=http://10.155.111.136:2380,pd_10.155.111.236=http://10.155.111.236:2380",
  "initial-cluster-state": "new",
  "initial-cluster-token": "pd-cluster",
  "join": "",
  "lease": 3,
  "log": {
    "level": "info",
    "format": "text",
    "disable-timestamp": false,
    "file": {
      "filename": "/home/tidb/deploy/log/pd.log",
      "max-size": 300,
      "max-days": 0,
      "max-backups": 0
    },
    "development": false,
    "disable-caller": false,
    "disable-stacktrace": false,
    "disable-error-verbose": true,
    "sampling": null
  },
  "tso-save-interval": "3s",
  "metric": {
    "job": "pd_10.155.111.139",
    "address": "",
    "interval": "15s"
  },
  "schedule": {
    "max-snapshot-count": 3,
    "max-pending-peer-count": 64,
    "max-merge-region-size": 20,
    "max-merge-region-keys": 200000,
    "split-merge-interval": "1h0m0s",
    "enable-one-way-merge": "false",
    "enable-cross-table-merge": "false",
    "patrol-region-interval": "100ms",
    "max-store-down-time": "1h0m0s",
    "leader-schedule-limit": 1,
    "leader-schedule-policy": "count",
    "region-schedule-limit": 1024,
    "replica-schedule-limit": 4,
    "merge-schedule-limit": 8,
    "hot-region-schedule-limit": 2,
    "hot-region-cache-hits-threshold": 3,
    "store-limit": {
      "20847613": {
        "add-peer": 30,
        "remove-peer": 30
      },
      "20847615": {
        "add-peer": 30,
        "remove-peer": 30
      },
      "22290001": {
        "add-peer": 30,
        "remove-peer": 30
      },
      "23684702": {
        "add-peer": 4096,
        "remove-peer": 4096
      }
    },
    "tolerant-size-ratio": 5,
    "low-space-ratio": 0.8,
    "high-space-ratio": 0.6,
    "scheduler-max-waiting-operator": 3,
    "enable-remove-down-replica": "true",
    "enable-replace-offline-replica": "true",
    "enable-make-up-replica": "true",
    "enable-remove-extra-replica": "true",
    "enable-location-replacement": "true",
    "enable-debug-metrics": "false",
    "schedulers-v2": [
      {
        "type": "balance-region",
        "args": null,
        "disable": false,
        "args-payload": ""
      },
      {
        "type": "balance-leader",
        "args": null,
        "disable": false,
        "args-payload": ""
      },
      {
        "type": "hot-region",
        "args": null,
        "disable": false,
        "args-payload": ""
      },
      {
        "type": "label",
        "args": null,
        "disable": false,
        "args-payload": ""
      }
    ],
    "schedulers-payload": {
      "balance-hot-region-scheduler": null,
      "balance-leader-scheduler": {
        "name": "balance-leader-scheduler",
        "ranges": [
          {
            "end-key": "",
            "start-key": ""
          }
        ]
      },
      "balance-region-scheduler": {
        "name": "balance-region-scheduler",
        "ranges": [
          {
            "end-key": "",
            "start-key": ""
          }
        ]
      },
      "label-scheduler": {
        "name": "label-scheduler",
        "ranges": [
          {
            "end-key": "",
            "start-key": ""
          }
        ]
      }
    },
    "store-limit-mode": "manual"
  },
  "replication": {
    "max-replicas": 3,
    "location-labels": "",
    "strictly-match-label": "false",
    "enable-placement-rules": "true"
  },
  "pd-server": {
    "use-region-storage": "true",
    "max-gap-reset-ts": "24h0m0s",
    "key-type": "table",
    "runtime-services": "",
    "metric-storage": "http://10.155.111.136:9090",
    "dashboard-address": "http://10.155.111.136:2379",
    "trace-region-flow": "true"
  },
  "cluster-version": "4.0.8",
  "quota-backend-bytes": "8GiB",
  "auto-compaction-mode": "periodic",
  "auto-compaction-retention-v2": "1h",
  "TickInterval": "500ms",
  "ElectionInterval": "3s",
  "PreVote": true,
  "security": {
    "cacert-path": "",
    "cert-path": "",
    "key-path": "",
    "cert-allowed-cn": null
  },
  "label-property": {},
  "WarningMsgs": [
    "disable-telemetry in conf/pd.toml is deprecated, use enable-telemetry instead"
  ],
  "DisableStrictReconfigCheck": false,
  "HeartbeatStreamBindInterval": "1m0s",
  "LeaderPriorityCheckInterval": "1m0s",
  "dashboard": {
    "tidb-cacert-path": "",
    "tidb-cert-path": "",
    "tidb-key-path": "",
    "public-path-prefix": "/dashboard",
    "internal-proxy": false,
    "enable-telemetry": true,
    "enable-experimental": false
  },
  "replication-mode": {
    "replication-mode": "majority",
    "dr-auto-sync": {
      "label-key": "",
      "primary": "",
      "dr": "",
      "primary-replicas": 0,
      "dr-replicas": 0,
      "wait-store-timeout": "1m0s",
      "wait-sync-timeout": "1m0s"
    }
  }
}

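Parameter tuning for slow replication. There are currently two main directions:

  1. Speed up PD scheduling, so that PD schedules Region replicas to TiFlash faster. Configure these PD parameters:
    region-schedule-limit
    replica-schedule-limit
    and
    store limit (use store-balance-rate if the setting needs to persist): https://docs.pingcap.com/zh/tidb/stable/configure-store-limit

  2. Increase how fast TiFlash can consume the data. For now this mainly means adding threads for TiFlash snapshot handling. Configure tiflash-proxy:
    server_configs:
      tiflash-learner:
        raftstore.snap-handle-pool-size: 4 (the default is 2)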

These were already set yesterday:
store limit 23684702 4096
config set max-pending-peer-count 64
config set region-schedule-limit 1024

raftstore.snap-handle-pool-size : 12

Adjusted just now:
config set replica-schedule-limit 64
(it was 4 before)

There is still no noticeable change.

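In that case, please export the TiFlash monitoring data so we can take a look.

Steps to export the monitoring data:

  1. Open the monitoring dashboard and select the time range
  2. Open the Grafana dashboard (press d and then E to expand the Panels in all Rows; wait a while for the page to finish loading)
  3. Use the tool at https://metricstool.pingcap.com/ to export the Grafana data as a snapshot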

test-cluster-TiFlash-Summary_2020-11-26T04_00_29.697Z.json (2.6 MB)

The TiFlash monitoring does not seem to show anything particularly abnormal. Could you also export the PD dashboard, please?

test-cluster-PD_2020-11-26T06_13_44.217Z.json (4.2 MB)

Have you found anything?

Could you please run config show and post the current parameter settings? Thanks.

It is already in the original post, in the collapsed section.

Hello, could you show the current placement rules? You can view them with pd-ctl config placement-rule show.

» config placement-rule show
[
  {
    "group_id": "pd",
    "id": "default",
    "start_key": "",
    "end_key": "",
    "role": "voter",
    "count": 3
  },
  {
    "group_id": "tiflash",
    "id": "table-16016-r",
    "override": true,
    "start_key": "748000000000003EFF905F720000000000FA",
    "end_key": "748000000000003EFF9100000000000000F8",
    "role": "learner",
    "count": 1,
    "label_constraints": [
      {
        "key": "engine",
        "op": "in",
        "values": [
          "tiflash"
        ]
      }
    ]
  },
  {
    "group_id": "tiflash",
    "id": "table-16593-r",
    "override": true,
    "start_key": "7480000000000040FFD15F720000000000FA",
    "end_key": "7480000000000040FFD200000000000000F8",
    "role": "learner",
    "count": 1,
    "label_constraints": [
      {
        "key": "engine",
        "op": "in",
        "values": [
          "tiflash"
        ]
      }
    ]
  },
  {
    "group_id": "tiflash",
    "id": "table-20063-r",
    "override": true,
    "start_key": "748000000000004EFF5F5F720000000000FA",
    "end_key": "748000000000004EFF6000000000000000F8",
    "role": "learner",
    "count": 1,
    "label_constraints": [
      {
        "key": "engine",
        "op": "in",
        "values": [
          "tiflash"
        ]
      }
    ]
  },
  {
    "group_id": "tiflash",
    "id": "table-22481-r",
    "override": true,
    "start_key": "7480000000000057FFD15F720000000000FA",
    "end_key": "7480000000000057FFD200000000000000F8",
    "role": "learner",
    "count": 1,
    "label_constraints": [
      {
        "key": "engine",
        "op": "in",
        "values": [
          "tiflash"
        ]
      }
    ]
  }
]
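
The start_key and end_key values above are hex strings of encoded TiKV keys, and each tiflash rule should cover the row range of the table named in its id. Below is a minimal Python sketch for checking that. It assumes the usual TiDB key layout: the bytes are stored in 8-byte groups, each followed by a marker byte, and the raw key is t + an int64 table ID with the sign bit flipped + _r. The helper names decode_memcomparable and table_id_from_key are hypothetical, not pd-ctl features.

```python
def decode_memcomparable(encoded: bytes) -> bytes:
    """Undo the 8-byte-group + marker-byte encoding (assumed TiDB layout)."""
    if len(encoded) % 9 != 0:
        raise ValueError("encoded key length must be a multiple of 9")
    raw = bytearray()
    for i in range(0, len(encoded), 9):
        group, marker = encoded[i:i + 8], encoded[i + 8]
        pad = 0xFF - marker  # number of padding bytes in this group
        if pad > 8:
            raise ValueError("invalid marker byte")
        raw += group[:8 - pad]
        if pad > 0:
            break  # a padded group is always the last one
    return bytes(raw)


def table_id_from_key(hex_key: str) -> int:
    """Extract the table ID from a hex-encoded table key."""
    raw = decode_memcomparable(bytes.fromhex(hex_key))
    if len(raw) < 9 or raw[:1] != b"t":
        raise ValueError("not a table key")
    # The int64 is big-endian with the sign bit flipped so that it sorts correctly.
    value = int.from_bytes(raw[1:9], "big") ^ (1 << 63)
    return value - (1 << 64) if value >= (1 << 63) else value


# Keys copied from the rule with id "table-16016-r" above.
start_key = "748000000000003EFF905F720000000000FA"
end_key = "748000000000003EFF9100000000000000F8"

print(table_id_from_key(start_key))  # table ID at the start of the range
print(table_id_from_key(end_key))    # table ID at the (exclusive) end of the range
```

Under these assumptions the start key decodes to the row prefix of table 16016 and the end key to the prefix of the next table ID, so the rule spans that table's row data.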

You can try pd-ctl config set patrol-region-interval 10ms to speed up Region scanning and see whether it helps. For the meaning of each option, see https://docs.pingcap.com/zh/tidb/stable/pd-control#config-show--set-option-value--placement-rules
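
A rough way to see why this setting matters is to estimate how long one full pass over all Regions takes. The Python sketch below assumes that PD checks a fixed batch of at most 128 Regions per patrol tick. That batch size is an assumption about PD internals, not something shown in this thread, and the Region total is a hypothetical input because the cluster-wide count is not given here.

```python
import math

# Assumption: PD inspects at most this many Regions per patrol tick.
ASSUMED_SCAN_LIMIT = 128


def full_patrol_seconds(total_regions, interval_ms, scan_limit=ASSUMED_SCAN_LIMIT):
    """Estimate the time for one full pass over all Regions, in seconds."""
    ticks = math.ceil(total_regions / scan_limit)
    return ticks * interval_ms / 1000.0


# Hypothetical cluster size, for illustration only.
total_regions = 1_000_000
for interval_ms in (100, 10):
    seconds = full_patrol_seconds(total_regions, interval_ms)
    print(f"patrol-region-interval={interval_ms}ms -> about {seconds:.0f}s per full pass")
```

In this model the pass time scales linearly with the interval, so going from 100ms to 10ms shortens each pass tenfold. Regions that still lack a TiFlash replica are therefore noticed sooner.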


That worked. It is much faster now, at about 10 Regions per minute.

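Also, the monitoring shows a very large number of empty Regions. You could raise the Region merge parameters a little; once the empty Regions are merged, replication speed may improve as well.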

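Which dashboard panel shows the empty Regions? After applying the settings below, the number of Regions has gone down, but the result is not quite what I hoped for.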
config set max-merge-region-size 64
config set enable-cross-table-merge true

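The PD monitoring page has an empty-region metric, and it should show more than 4k empty Regions. Consider raising the Region merge parameters and enabling cross-table Region merge. For the specific parameters, see https://docs.pingcap.com/zh/tidb/dev/massive-regions-best-practices#海量-region-集群调优最佳实践 and, for cross-table merge, https://docs.pingcap.com/zh/tidb/dev/tidb-troubleshooting-map#51-pd-调度问题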

