Some TiKV nodes in the cluster keep frantically connecting to a TiFlash node in Tombstone state

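The cluster was upgraded from 4.0.0-rc.1 to 4.0.0, and then from 4.0.0 to 4.0.4.
For now this error does not affect normal service, and OLTP writes appear unaffected, but we do not dare add a TiFlash node for the time being.
After a while, an otherwise normal TiKV node may show as Down, yet it keeps serving requests normally.
After a rolling restart with tiup cluster reload tidb-cluster, the monitoring looks normal for a while.
The one node that is not connecting to the Tombstone TiFlash never gets its region leader count up.
In some cases the region heartbeat also goes up to 10 seconds :scream: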


image






image

  1. Did TiKV start connecting to the Tombstone TiFlash node only after the upgrade to 4.0.4?
  2. Could you share the store output from pd-ctl? Thanks.

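1. It may already have started during the upgrade from 4.0 rc to 4.0.0, but TiKV is no longer connecting to the Tombstone TiFlash node now.
2. The store output is shown in the screenshot below.
image
3. Two TiKV nodes keep going into Down state after the upgrade.

4. There are many abnormal peers, probably left over from taking TiFlash offline, but the TiFlash directory has already been deleted.

Region leaders have been unbalanced ever since the upgrade, and CPU on the two abnormal nodes often spikes.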

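  1. It looks like two TiKV nodes are currently Down. Do you have to bring them back up manually each time?
  2. Please share the TiKV logs, ideally covering one full period from startup to Down. Thanks.
  3. Please share the Overview and TiKV-Details monitoring dashboards covering the Down period. Thanks.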

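(1) Install this Chrome extension: https://chrome.google.com/webstore/detail/full-page-screen-capture/fdpohaocaechififmbbbbbknoalclacl

(2) With the mouse focus on the Dashboard, press ? to show all shortcuts. Press d and then E to expand the Panels of all Rows, then wait for the page to finish loading.

(3) Use the full-page-screen-capture extension to take and save the screenshot.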

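1. The TiKV service is still running normally; only the monitoring status shows Down. After each tiup reload, the status is normal for a while.
2. Logs and monitoring
Link: https://pan.baidu.com/s/1HpR30LGPU4_W5Jw20LKeOw
Extraction code: 2eda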

The Down status may be caused by peers being added or removed, because many regions still include TiFlash replicas.
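
One way to check this suspicion is to scan the region list for peers that sit on stores no longer in the cluster. The following is only a minimal sketch: it assumes the JSON has the shape of a pd-ctl region dump (a "regions" array whose entries carry "id" and "peers" with "store_id"), and the sample region IDs and the store ID 999 are made up for illustration.

```python
def regions_with_orphan_peers(region_output: dict, live_store_ids: set) -> dict:
    """Map region id -> store ids of peers that are not on a live store."""
    result = {}
    for region in region_output.get("regions", []):
        orphan = [
            peer["store_id"]
            for peer in region.get("peers", [])
            if peer["store_id"] not in live_store_ids
        ]
        if orphan:
            result[region["id"]] = orphan
    return result


# Hypothetical sample data, not taken from the cluster in this thread.
sample_regions = {
    "count": 2,
    "regions": [
        {"id": 1001, "peers": [{"id": 1, "store_id": 2},
                               {"id": 2, "store_id": 90},
                               {"id": 3, "store_id": 91}]},
        {"id": 1002, "peers": [{"id": 4, "store_id": 2},
                               {"id": 5, "store_id": 90},
                               {"id": 6, "store_id": 91},
                               {"id": 7, "store_id": 999}]},  # 999: removed store
    ],
}

orphans = regions_with_orphan_peers(sample_regions, live_store_ids={2, 90, 91})
print(orphans)
```

If the result is non-empty, those regions still reference a store that has been taken offline, which would be consistent with PD still scheduling peer removals.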

About an hour after the reload, one node already shows Disconnected, but it is still serving normally.

Could you check your placement rules? Perhaps they have not been cleaned up. Thanks.

Is it the command below? It is in the first screenshot; the result has already been cleared to null.
curl http://<pd_ip>:<pd_port>/pd/api/v1/config/rules/group/tiflash
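
The same check can be scripted. This is a rough sketch that only wraps the URL from the curl command above; the function names are made up for illustration, and treating an empty list the same as null is an assumption.

```python
import json
from urllib.request import urlopen


def rules_url(pd_ip: str, pd_port: int, group: str = "tiflash") -> str:
    """Build the same PD HTTP API URL that the curl command uses."""
    return f"http://{pd_ip}:{pd_port}/pd/api/v1/config/rules/group/{group}"


def fetch_rules(pd_ip: str, pd_port: int, group: str = "tiflash",
                timeout: float = 5.0):
    """Fetch and decode the rule group; a JSON null body becomes None."""
    with urlopen(rules_url(pd_ip, pd_port, group), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


def group_is_empty(rules) -> bool:
    """Assumption: both null and an empty list mean no rules are left."""
    return not rules


# Example (needs a reachable PD, so it is left commented out):
# print(group_is_empty(fetch_rules("<pd_ip>", 2379)))
```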

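Hello,
Resolved
By adjusting the store limit and running tiup reload intermittently, the abnormal peers were cleared, and no TiKV node is in Down state anymore.
The new TiFlash node has also rejoined the cluster and finished syncing its data.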

Still open
1. store limit still shows the TiFlash nodes that are already Tombstone
image
2. Leaders still cannot be balanced

Please share the current output of store and config show all. Thanks.

{
  "count": 4,
  "stores": [
    {
      "store": {
        "id": 90,
        "address": "172.16.117.177:20171",
        "labels": [
          {
            "key": "host",
            "value": "tikv2"
          }
        ],
        "version": "4.0.4",
        "status_address": "172.16.117.177:20181",
        "git_hash": "28e3d44b00700137de4fa933066ab83e5f8306cf",
        "start_timestamp": 1596467638,
        "deploy_path": "/data/soft/tidb/1/deploy/bin",
        "last_heartbeat": 1596680697577131902,
        "state_name": "Up"
      },
      "status": {
        "capacity": "1TiB",
        "available": "490.1GiB",
        "used_size": "533.9GiB",
        "leader_count": 9404,
        "leader_weight": 1,
        "leader_score": 9404,
        "leader_size": 1362937,
        "region_count": 14412,
        "region_weight": 1,
        "region_score": 2092097,
        "region_size": 2092097,
        "start_ts": "2020-08-03T23:13:58+08:00",
        "last_heartbeat_ts": "2020-08-06T10:24:57.577131902+08:00",
        "uptime": "59h10m59.577131902s"
      }
    },
    {
      "store": {
        "id": 37876005,
        "address": "172.16.117.77:3930",
        "labels": [
          {
            "key": "engine",
            "value": "tiflash"
          }
        ],
        "version": "v4.0.4",
        "peer_address": "172.16.117.77:20170",
        "status_address": "172.16.117.77:20292",
        "git_hash": "bfa9128f59cf800e129152f06b12480ad78adafd",
        "start_timestamp": 1596468769,
        "deploy_path": "/data/soft/tidb/tiflash/bin/tiflash",
        "last_heartbeat": 1596680700848855667,
        "state_name": "Up"
      },
      "status": {
        "capacity": "3.161TiB",
        "available": "2.591TiB",
        "used_size": "329.5GiB",
        "leader_count": 0,
        "leader_weight": 1,
        "leader_score": 0,
        "leader_size": 0,
        "region_count": 9458,
        "region_weight": 1,
        "region_score": 1460101,
        "region_size": 1460101,
        "start_ts": "2020-08-03T23:32:49+08:00",
        "last_heartbeat_ts": "2020-08-06T10:25:00.848855667+08:00",
        "uptime": "58h52m11.848855667s"
      }
    },
    {
      "store": {
        "id": 2,
        "address": "172.16.116.153:20171",
        "labels": [
          {
            "key": "host",
            "value": "tikv4"
          }
        ],
        "version": "4.0.4",
        "status_address": "172.16.116.153:20181",
        "git_hash": "28e3d44b00700137de4fa933066ab83e5f8306cf",
        "start_timestamp": 1596468242,
        "deploy_path": "/data/soft/tidb/1/deploy/bin",
        "last_heartbeat": 1596680696114682216,
        "state_name": "Up"
      },
      "status": {
        "capacity": "1TiB",
        "available": "490GiB",
        "used_size": "534GiB",
        "leader_count": 496,
        "leader_weight": 1,
        "leader_score": 496,
        "leader_size": 80093,
        "region_count": 14412,
        "region_weight": 1,
        "region_score": 2092097,
        "region_size": 2092097,
        "start_ts": "2020-08-03T23:24:02+08:00",
        "last_heartbeat_ts": "2020-08-06T10:24:56.114682216+08:00",
        "uptime": "59h0m54.114682216s"
      }
    },
    {
      "store": {
        "id": 91,
        "address": "172.16.116.213:20171",
        "labels": [
          {
            "key": "host",
            "value": "tikv3"
          }
        ],
        "version": "4.0.4",
        "status_address": "172.16.116.213:20181",
        "git_hash": "28e3d44b00700137de4fa933066ab83e5f8306cf",
        "start_timestamp": 1596467965,
        "deploy_path": "/date/soft/tidb/1/deploy/bin",
        "last_heartbeat": 1596680695999438827,
        "state_name": "Up"
      },
      "status": {
        "capacity": "1TiB",
        "available": "491.3GiB",
        "used_size": "532.7GiB",
        "leader_count": 4512,
        "leader_weight": 1,
        "leader_score": 4512,
        "leader_size": 649067,
        "region_count": 14412,
        "region_weight": 1,
        "region_score": 2092097,
        "region_size": 2092097,
        "start_ts": "2020-08-03T23:19:25+08:00",
        "last_heartbeat_ts": "2020-08-06T10:24:55.999438827+08:00",
        "uptime": "59h5m30.999438827s"
      }
    }
  ]
}
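
To put a number on the leader imbalance in this store output, a small script can compare leader_count across the TiKV stores. This is only a sketch: the sample is trimmed to the fields it reads (values copied from the output above), and the function names are made up for illustration.

```python
def tikv_leader_counts(store_output: dict) -> dict:
    """Map store id -> leader_count, skipping stores labeled engine=tiflash."""
    counts = {}
    for item in store_output["stores"]:
        labels = item["store"].get("labels", [])
        is_tiflash = any(
            label["key"] == "engine" and label["value"] == "tiflash"
            for label in labels
        )
        if is_tiflash:
            continue  # the TiFlash store shows leader_count 0 in the output
        counts[item["store"]["id"]] = item["status"].get("leader_count", 0)
    return counts


def leader_skew(counts: dict) -> float:
    """Ratio of the busiest store's leader count to the least busy one."""
    lowest = min(counts.values())
    return max(counts.values()) / lowest if lowest else float("inf")


sample = {"stores": [
    {"store": {"id": 90, "labels": [{"key": "host", "value": "tikv2"}]},
     "status": {"leader_count": 9404}},
    {"store": {"id": 37876005, "labels": [{"key": "engine", "value": "tiflash"}]},
     "status": {"leader_count": 0}},
    {"store": {"id": 2, "labels": [{"key": "host", "value": "tikv4"}]},
     "status": {"leader_count": 496}},
    {"store": {"id": 91, "labels": [{"key": "host", "value": "tikv3"}]},
     "status": {"leader_count": 4512}},
]}

counts = tikv_leader_counts(sample)
print(counts, sum(counts.values()), leader_skew(counts))
```

With these numbers, store 90 holds 9404 of the 14412 leaders, while an even split across the three TiKV stores would be 4804 each.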

{
  "client-urls": "http://0.0.0.0:2379",
  "peer-urls": "http://172.16.117.177:2380",
  "advertise-client-urls": "http://172.16.117.177:2379",
  "advertise-peer-urls": "http://172.16.117.177:2380",
  "name": "pd_zj177",
  "data-dir": "/data/soft/tidb/deploy/data.pd",
  "force-new-cluster": false,
  "enable-grpc-gateway": true,
  "initial-cluster": "pd_zj177=http://172.16.117.177:2380,pd_zj153=http://172.16.116.153:2380,pd_zj77=http://172.16.117.77:2380",
  "initial-cluster-state": "new",
  "join": "",
  "lease": 3,
  "log": {
    "level": "info",
    "format": "text",
    "disable-timestamp": false,
    "file": {
      "filename": "/data/soft/tidb/deploy/log/pd.log",
      "max-size": 300,
      "max-days": 0,
      "max-backups": 0
    },
    "development": false,
    "disable-caller": false,
    "disable-stacktrace": false,
    "disable-error-verbose": true,
    "sampling": null
  },
  "tso-save-interval": "3s",
  "metric": {
    "job": "pd_zj177",
    "address": "",
    "interval": "15s"
  },
  "schedule": {
    "max-snapshot-count": 3,
    "max-pending-peer-count": 16,
    "max-merge-region-size": 20,
    "max-merge-region-keys": 200000,
    "split-merge-interval": "1h0m0s",
    "enable-one-way-merge": "false",
    "enable-cross-table-merge": "false",
    "patrol-region-interval": "100ms",
    "max-store-down-time": "2h0m0s",
    "leader-schedule-limit": 4,
    "leader-schedule-policy": "count",
    "region-schedule-limit": 2048,
    "replica-schedule-limit": 64,
    "merge-schedule-limit": 8,
    "hot-region-schedule-limit": 4,
    "hot-region-cache-hits-threshold": 3,
    "store-limit": {
      "2": {
        "add-peer": 15,
        "remove-peer": 15
      },
      "35578126": {
        "add-peer": 30,
        "remove-peer": 30
      },
      "37869367": {
        "add-peer": 15,
        "remove-peer": 15
      },
      "37876005": {
        "add-peer": 30,
        "remove-peer": 30
      },
      "631028": {
        "add-peer": 15,
        "remove-peer": 15
      },
      "90": {
        "add-peer": 15,
        "remove-peer": 15
      },
      "91": {
        "add-peer": 15,
        "remove-peer": 15
      }
    },
    "tolerant-size-ratio": 0,
    "low-space-ratio": 0.8,
    "high-space-ratio": 0.6,
    "scheduler-max-waiting-operator": 3,
    "enable-remove-down-replica": "true",
    "enable-replace-offline-replica": "true",
    "enable-make-up-replica": "true",
    "enable-remove-extra-replica": "true",
    "enable-location-replacement": "true",
    "enable-debug-metrics": "false",
    "schedulers-v2": [
      {
        "type": "balance-region",
        "args": null,
        "disable": false,
        "args-payload": ""
      },
      {
        "type": "balance-leader",
        "args": null,
        "disable": false,
        "args-payload": ""
      },
      {
        "type": "hot-region",
        "args": null,
        "disable": false,
        "args-payload": ""
      },
      {
        "type": "label",
        "args": null,
        "disable": false,
        "args-payload": ""
      }
    ],
    "schedulers-payload": {
      "balance-hot-region-scheduler": null,
      "balance-leader-scheduler": {
        "name": "balance-leader-scheduler",
        "ranges": [
          {
            "end-key": "",
            "start-key": ""
          }
        ]
      },
      "balance-region-scheduler": {
        "name": "balance-region-scheduler",
        "ranges": [
          {
            "end-key": "",
            "start-key": ""
          }
        ]
      },
      "label-scheduler": {
        "name": "label-scheduler",
        "ranges": [
          {
            "end-key": "",
            "start-key": ""
          }
        ]
      }
    },
    "store-limit-mode": "manual"
  },
  "replication": {
    "max-replicas": 3,
    "location-labels": "zone,rack,host",
    "strictly-match-label": "false",
    "enable-placement-rules": "true"
  },
  "pd-server": {
    "use-region-storage": "true",
    "max-gap-reset-ts": "24h0m0s",
    "key-type": "table",
    "runtime-services": "",
    "metric-storage": "http://172.16.116.213:9090",
    "dashboard-address": "http://172.16.117.77:2379"
  },
  "cluster-version": "4.0.4",
  "quota-backend-bytes": "8GiB",
  "auto-compaction-mode": "periodic",
  "auto-compaction-retention-v2": "1h",
  "TickInterval": "500ms",
  "ElectionInterval": "3s",
  "PreVote": true,
  "security": {
    "cacert-path": "",
    "cert-path": "",
    "key-path": "",
    "cert-allowed-cn": null
  },
  "label-property": {},
  "WarningMsgs": null,
  "DisableStrictReconfigCheck": false,
  "HeartbeatStreamBindInterval": "1m0s",
  "LeaderPriorityCheckInterval": "1m0s",
  "dashboard": {
    "tidb-cacert-path": "",
    "tidb-cert-path": "",
    "tidb-key-path": "",
    "public-path-prefix": "",
    "internal-proxy": false,
    "enable-telemetry": true
  },
  "replication-mode": {
    "replication-mode": "majority",
    "dr-auto-sync": {
      "label-key": "",
      "primary": "",
      "dr": "",
      "primary-replicas": 0,
      "dr-replicas": 0,
      "wait-store-timeout": "1m0s",
      "wait-sync-timeout": "1m0s"
    }
  }
}
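
The leftover store limit entries can be spotted by comparing the store-limit keys in this config with the store IDs that the store command reports. The sketch below does that on trimmed copies of the two outputs; the function name is made up for illustration, and it assumes that any ID missing from the store list belongs to a store that has been removed.

```python
def stale_store_limit_ids(config: dict, store_output: dict) -> list:
    """Return store-limit keys that have no matching store, sorted numerically."""
    limits = config["schedule"]["store-limit"]
    known = {str(item["store"]["id"]) for item in store_output["stores"]}
    return sorted((sid for sid in limits if sid not in known), key=int)


# Trimmed to the fields the function reads; IDs copied from the outputs above.
peer_limit_15 = {"add-peer": 15, "remove-peer": 15}
peer_limit_30 = {"add-peer": 30, "remove-peer": 30}
sample_config = {"schedule": {"store-limit": {
    "2": peer_limit_15,
    "35578126": peer_limit_30,
    "37869367": peer_limit_15,
    "37876005": peer_limit_30,
    "631028": peer_limit_15,
    "90": peer_limit_15,
    "91": peer_limit_15,
}}}
sample_stores = {"stores": [
    {"store": {"id": 90}},
    {"store": {"id": 37876005}},
    {"store": {"id": 2}},
    {"store": {"id": 91}},
]}

stale = stale_store_limit_ids(sample_config, sample_stores)
print(stale)
```

For the data in this thread, three store-limit entries (631028, 35578126, 37869367) have no matching store in the store output.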

The balancing is far too slow.

Hello, could you export a complete set of monitoring for the PD leader?

Sorry, I was busy moving desks until just now.
screencapture-172-16-116-213-3000-d-Q6RuHYIWk-tidb-cluster-pd-2020-08-06-19_38_32.rar (4.0 MB)

Is this a production environment? Could you use v4.0.2, or v4.0.5 (expected to be released next week)?

Yes, it is production. Is there a problem with v4.0.4? :sob:

Yes, 4.0.3 introduced an issue that may prevent balance leader from working properly; it will be fixed in 4.0.5. You are using TiFlash, correct?