Production issue, please expedite: TiFlash is down and will not restart

To get help faster, please provide the following information; a clearly described problem is resolved more quickly:

[TiDB Version]
v4.0.11

[Problem Description]
TiFlash crashed suddenly and will not come back up; its log keeps repeating the same error (full logs attached below).

Logs attached:
tiflash_log.zip (2.2 MB)

Please upload the TiFlash error log and the TiFlash log as files.

Files attached:
tiflash_error.zip (41.1 KB) tiflash_log.zip (2.2 MB)

  1. pd-ctl -u http://10.18.253.83:2379 config show | grep enable-placement-rules
     Run this to check whether placement rules (enable-placement-rules) are enabled.
  2. tiup cluster display <cluster-name> to check the PD status (see the example below).
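For reference, on a tiup-managed cluster both checks look roughly like this (mycluster is a placeholder for the actual cluster name):

tiup ctl:v4.0.11 pd -u http://10.18.253.83:2379 config show | grep enable-placement-rules
tiup cluster display mycluster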


Yes, it is enabled.

Here is what I ran:

[tidb@ip-10-18-253-5 ~]$ tiup ctl:v4.0.11 pd -u http://10.18.253.83:2379 config show
Starting component ctl: /home/tidb/.tiup/components/ctl/v4.0.11/ctl pd -u http://10.18.253.83:2379 config show
{
  "replication": {
    "enable-placement-rules": "true",
    "location-labels": "",
    "max-replicas": 3,
    "strictly-match-label": "false"
  },
  "schedule": {
    "enable-cross-table-merge": "false",
    "enable-debug-metrics": "false",
    "enable-location-replacement": "true",
    "enable-make-up-replica": "true",
    "enable-one-way-merge": "false",
    "enable-remove-down-replica": "true",
    "enable-remove-extra-replica": "true",
    "enable-replace-offline-replica": "true",
    "high-space-ratio": 0.7,
    "hot-region-cache-hits-threshold": 3,
    "hot-region-schedule-limit": 4,
    "leader-schedule-limit": 4,
    "leader-schedule-policy": "count",
    "low-space-ratio": 0.8,
    "max-merge-region-keys": 200000,
    "max-merge-region-size": 20,
    "max-pending-peer-count": 16,
    "max-snapshot-count": 3,
    "max-store-down-time": "30m0s",
    "merge-schedule-limit": 8,
    "patrol-region-interval": "100ms",
    "region-schedule-limit": 2048,
    "replica-schedule-limit": 64,
    "scheduler-max-waiting-operator": 5,
    "split-merge-interval": "1h0m0s",
    "store-limit-mode": "manual",
    "tolerant-size-ratio": 0
  }
}
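Since enable-placement-rules is on, the TiFlash replica rules themselves can also be dumped for inspection; a possible follow-up, assuming pd-ctl's placement-rules subcommand in v4.0 (TiFlash table replicas normally appear under the tiflash rule group):

tiup ctl:v4.0.11 pd -u http://10.18.253.83:2379 config placement-rules show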


[screenshot: tiup cluster display output]
All of these settings are defaults that were never changed; TiFlash just went down suddenly today.

Please provide the log of the PD at 10.18.253.83:2379.

pd_log.zip (4.9 MB)

Can you roughly determine now what caused TiFlash to go down and fail to restart?

The PD log contains: [2021/05/23 05:23:22.432 +00:00] [WARN] [grpclog.go:60] ["transport: http2Server.HandleStreams failed to read frame: read tcp 10.18.253.83:2379->10.18.253.69:38752: read: connection reset by peer"]
Please check whether there is a network problem.

ping works fine. telnet fails because the TiFlash instances on .69 and .73 are down, and port 38752 on .69 is not in use either.

Presumably, after TiKV finishes writing it replicates the data to TiFlash and then hits this connection error; since the TiFlash services on .69 and .73 are themselves down, the connection naturally fails.

From the TiFlash node, telnet to 10.18.253.83:2379.
Has anything changed on the network?
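For example, the usual connectivity checks from the TiFlash host would be:

ping -c 3 10.18.253.83
telnet 10.18.253.83 2379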

[screenshots: ping and telnet results]

Both ping and telnet work fine.

Please also provide the TiFlash cluster manager log.

tiflash_cluster_manager.log — if the defaults were not changed, it is in the TiFlash log directory.
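If the deploy directory is not known offhand, something like the following should locate it (the /data prefix is only an assumption, based on the PD deploy path shown later in this thread):

find /data -name tiflash_cluster_manager.log 2>/dev/null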

tiflash_manager.zip (1.6 KB)

On the TiFlash node, run: curl http://10.18.253.12:2379/pd/api/v1/members

[tidb@ip-10-18-253-73 log]$ curl http://10.18.253.12:2379/pd/api/v1/members
{
  "header": {
    "cluster_id": 6938337866423754123
  },
  "members": [
    {
      "name": "pd-10.18.253.83-2379",
      "member_id": 98094734768936214,
      "peer_urls": [
        "http://10.18.253.83:2380"
      ],
      "client_urls": [
        "http://10.18.253.83:2379"
      ],
      "deploy_path": "/data/tidb-deploy/pd-2379/bin",
      "binary_version": "v4.0.11",
      "git_hash": "96efd6f402361f2f84e39ea5b3a7a6f6b10acfb1"
    },
    {
      "name": "pd-10.18.253.106-2379",
      "member_id": 10368929817371469603,
      "peer_urls": [
        "http://10.18.253.106:2380"
      ],
      "client_urls": [
        "http://10.18.253.106:2379"
      ],
      "deploy_path": "/data/tidb-deploy/pd-2379/bin",
      "binary_version": "v4.0.11",
      "git_hash": "96efd6f402361f2f84e39ea5b3a7a6f6b10acfb1"
    },
    {
      "name": "pd-10.18.253.12-2379",
      "member_id": 11416212041977913084,
      "peer_urls": [
        "http://10.18.253.12:2380"
      ],
      "client_urls": [
        "http://10.18.253.12:2379"
      ],
      "deploy_path": "/data/tidb-deploy/pd-2379/bin",
      "binary_version": "v4.0.11",
      "git_hash": "96efd6f402361f2f84e39ea5b3a7a6f6b10acfb1"
    }
  ],
  "leader": {
    "name": "pd-10.18.253.83-2379",
    "member_id": 98094734768936214,
    "peer_urls": [
      "http://10.18.253.83:2380"
    ],
    "client_urls": [
      "http://10.18.253.83:2379"
    ]
  },
  "etcd_leader": {
    "name": "pd-10.18.253.83-2379",
    "member_id": 98094734768936214,
    "peer_urls": [
      "http://10.18.253.83:2380"
    ],
    "client_urls": [
      "http://10.18.253.83:2379"
    ],
    "deploy_path": "/data/tidb-deploy/pd-2379/bin",
    "binary_version": "v4.0.11",
    "git_hash": "96efd6f402361f2f84e39ea5b3a7a6f6b10acfb1"
  }
}
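The member list itself looks healthy: three PD members, with both the PD leader and the etcd leader on 10.18.253.83. As an additional check from the same node, PD's health endpoint can also be queried (standard PD API, suggested here as a follow-up):

curl http://10.18.253.12:2379/pd/api/v1/health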

We are already investigating this internally.
In the meantime, please check the tiflash section in tiup cluster edit-config and share how its parameters are configured.
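For comparison, a typical tiflash_servers block in the tiup topology looks like the sketch below; every port and path shown is just the tiup default, not necessarily this cluster's actual value:

tiflash_servers:
  - host: 10.18.253.69
    tcp_port: 9000
    http_port: 8123
    flash_service_port: 3930
    flash_proxy_port: 20170
    flash_proxy_status_port: 20292
    metrics_port: 8234
    deploy_dir: /data/tidb-deploy/tiflash-9000
    data_dir: /data/tidb-data/tiflash-9000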