TiDB Server down while PD is online; cannot connect to PD: fail to load safepoint from pd

Lawrence · June 21, 2021, 01:37

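To help us work efficiently, please provide the following information. A clearly described problem gets resolved faster:

[Overview] Scenario and problem summary
Production environment. The TiDB server cannot connect to PD, keeps reporting errors, and appears hung. Logs are attached:
tidb_stderr.log (90.5 KB) tidb.log (1.4 MB)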

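Lucien-卢西恩 · June 21, 2021, 01:41

Are all TiDB servers unable to connect to PD, or only a single one?
Please post the TiDB and PD logs from around the time of the error so we can investigate further.
Please also provide the information required by the posting guidelines (version, business impact, symptoms, and related logs). If an item does not apply, write "none".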

Lawrence · June 21, 2021, 01:43

None of them can connect. The logs on all three TiDB nodes keep printing the error above.

Lawrence · June 21, 2021, 01:44

I'll look for the PD logs.

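Lawrence · June 21, 2021, 01:49

These are the TiDB logs. The timestamps are off by 8 hours; the errors started at 7:30 this morning.

These are the PD logs, which also show errors starting at 7:30.
pd_stderr.log (209.2 KB) pd.log (512.8 KB)

Please treat this as urgent; this is a production environment.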
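
The 8-hour difference mentioned above is consistent with the logs being stamped in UTC (the PD entries quoted in this thread carry a +00:00 offset) while local time is UTC+8. A minimal sketch of the conversion, assuming the local zone really is UTC+8; the function name log_time_to_local is illustrative and not part of any TiDB tooling:

```python
from datetime import datetime, timedelta, timezone

# Timestamp layout as it appears in the log entries quoted in this thread.
LOG_TS_FORMAT = "%Y/%m/%d %H:%M:%S.%f %z"

# Assumption: the operators' local time zone is UTC+8.
LOCAL_TZ = timezone(timedelta(hours=8))


def log_time_to_local(ts: str) -> datetime:
    """Parse a log timestamp and convert it to the assumed local zone."""
    return datetime.strptime(ts, LOG_TS_FORMAT).astimezone(LOCAL_TZ)


print(log_time_to_local("2021/06/17 09:16:55.499 +00:00").isoformat())
```

Under the same assumption, a local 07:30 on June 21 corresponds to 23:30 on June 20 in the log timestamps.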

qizheng · June 21, 2021, 01:59

From the TiDB nodes, use telnet or nc to check whether the PD port is reachable.
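
If telnet or nc is not installed on the TiDB node, the same TCP reachability check can be sketched in Python. This is only an illustration: port_is_open is a made-up helper name, and 2379 is used as the fallback because it is PD's default client port; substitute the port your cluster actually uses.

```python
import socket
import sys


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, unreachable host, and timeouts.
        return False


if __name__ == "__main__":
    # Usage: python check_port.py <pd-host> [port]
    if len(sys.argv) >= 2:
        host = sys.argv[1]
        # Assumption: fall back to 2379, PD's default client port.
        port = int(sys.argv[2]) if len(sys.argv) > 2 else 2379
        print(f"{host}:{port} reachable: {port_is_open(host, port)}")
```

A successful connection only shows that the port accepts TCP connections; it says nothing about whether the process or the host behind it is responsive.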

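Lucien-卢西恩 · June 21, 2021, 02:03

Judging from the logs, this looks like a network problem. Please check the network port status as suggested in the previous reply.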

[2021/06/17 09:16:55.499 +00:00] [INFO] [dynamic_config_manager.go:178] ["Load dynamic config from etcd"] [json="{\"keyvisual\":{\"auto_collection_disabled\":false,\"policy\":\"db\",\"policy_kv_separator\":\"\"},\"profiling\":{\"auto_collection_targets\":null,\"auto_collection_duration_secs\":0,\"auto_collection_interval_secs\":0}}"]
[2021/06/17 09:16:55.499 +00:00] [WARN] [tidb.go:74] ["Alive of TiDB has expired, maybe local time in different hosts are not synchronized"] [key=/topology/tidb/10.18.251.163:4000/ttl] [value=1623919953638100465]
[2021/06/17 09:16:55.499 +00:00] [WARN] [tidb.go:74] ["Alive of TiDB has expired, maybe local time in different hosts are not synchronized"] [key=/topology/tidb/10.18.251.204:4000/ttl] [value=1623919953637421720]
[2021/06/17 09:16:55.499 +00:00] [WARN] [tidb.go:74] ["Alive of TiDB has expired, maybe local time in different hosts are not synchronized"] [key=/topology/tidb/10.18.251.77:4000/ttl] [value=1623919953639318400]
[2021/06/17 09:16:55.503 +00:00] [INFO] [dynamic_config_manager.go:199] ["Save dynamic config to etcd"] [json="{\"keyvisual\":{\"auto_collection_disabled\":false,\"policy\":\"db\",\"policy_kv_separator\":\"\"},\"profiling\":{\"auto_collection_targets\":null,\"auto_collection_duration_secs\":0,\"auto_collection_interval_secs\":0}}"]
[2021/06/17 09:16:55.510 +00:00] [INFO] [manager.go:85] ["Key visual service is started"]
[2021/06/17 09:16:57.498 +00:00] [WARN] [proxy.go:189] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=10.18.251.77:10080] [interval=2s] [error="dial tcp 10.18.251.77:10080: connect: connection refused"]
[2021/06/17 09:16:57.498 +00:00] [WARN] [proxy.go:189] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=10.18.251.77:4000] [interval=2s] [error="dial tcp 10.18.251.77:4000: connect: connection refused"]
[2021/06/17 09:16:57.498 +00:00] [WARN] [proxy.go:189] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=10.18.251.204:4000] [interval=2s] [error="dial tcp 10.18.251.204:4000: connect: connection refused"]
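
The ttl values in the "Alive of TiDB has expired" warnings above are 19-digit integers, which is what a Unix timestamp in nanoseconds looks like. Treating them that way is an assumption, not something the log states. Under that assumption, a small sketch (ttl_value_to_utc is an illustrative name) decodes them so they can be compared with the time of the warning:

```python
from datetime import datetime, timezone


def ttl_value_to_utc(value: int) -> datetime:
    """Interpret value as Unix nanoseconds (an assumption) and return UTC time."""
    seconds, nanos = divmod(value, 1_000_000_000)
    # datetime only keeps microseconds, so the last three digits are dropped.
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(
        microsecond=nanos // 1000
    )


# The three values from the warnings quoted above.
for value in (1623919953638100465, 1623919953637421720, 1623919953639318400):
    print(ttl_value_to_utc(value).isoformat())
```

All three decode to about 08:52:33 UTC on June 17, 2021, roughly 24 minutes before the warning was logged at 09:16:55.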

Lawrence · June 21, 2021, 02:07

All of them are fine.

Lawrence · June 21, 2021, 02:07

See the screenshot below; no problems.

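qizheng · June 21, 2021, 02:19

The fail to load safepoint from pd error first appeared on June 17, and it is intermittent rather than continuous. Check whether the network is stable, for example whether the Node Exporter or Blackbox Exporter dashboards show network-related errors or abnormal ping latency.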
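
One way to see how intermittent the error is would be to count its occurrences per hour in tidb.log. The sketch below assumes each entry starts with a bracketed timestamp in the same layout as the PD entries quoted earlier in this thread; count_by_hour is an illustrative helper, not an existing tool.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the leading "[YYYY/MM/DD HH:" of a log entry (assumed layout).
LINE_RE = re.compile(r"^\[(\d{4}/\d{2}/\d{2}) (\d{2}):")


def count_by_hour(lines: Iterable[str], needle: str) -> Counter:
    """Count lines containing needle, grouped by the date and hour they carry."""
    counts: Counter = Counter()
    for line in lines:
        if needle not in line:
            continue
        match = LINE_RE.match(line)
        if match:
            counts[f"{match.group(1)} {match.group(2)}:00"] += 1
    return counts


# Example usage against a local copy of the log:
# with open("tidb.log", encoding="utf-8", errors="replace") as f:
#     hits = count_by_hour(f, "fail to load safepoint from pd")
#     for hour, n in sorted(hits.items()):
#         print(hour, n)
```

Hours with no hits between hours with many hits would match the intermittent pattern described above.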

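Lawrence · June 21, 2021, 02:37

It looks like one machine accepts telnet connections, but after logging in, commands such as sudo cannot be executed. Something is wrong with it; we will look into it. Thanks.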

yilong · June 21, 2021, 12:59

By "one machine", do you mean the machine hosting the PD leader? Has this problem been resolved? Thanks.

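Lawrence · June 21, 2021, 13:03

Yes, it was a PD leader (a TiDB server is deployed on the same machine). A large SQL query exhausted the TiDB server's memory and the machine became extremely sluggish. Rebooting the machine fixed it. One question remains: I set oom-action to cancel and mem-quota-query to 28G, but the query exceeded 28G and was not canceled. Why is that? A single large SQL query still caused the TiDB server to OOM.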

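yilong · June 22, 2021, 02:24

How much memory does your TiDB server have? If several large SQL queries run concurrently, each one may stay below 28G and therefore not be canceled, while together they use up all the memory.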
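
The arithmetic behind this explanation can be sketched as follows. The 28G quota comes from this thread and the 30.8G host size is reported in the next reply; the two concurrent queries at 16G each are purely hypothetical numbers chosen for illustration, and summarize is a made-up helper.

```python
from typing import Dict, List


def summarize(usages_gib: List[float], quota_gib: float, host_gib: float) -> Dict:
    """Compare per-query usage with the quota, and total usage with host memory."""
    over_quota = [u for u in usages_gib if u > quota_gib]
    total_gib = sum(usages_gib)
    return {
        "over_quota": over_quota,  # queries a per-query quota would cancel
        "total_gib": total_gib,
        "exceeds_host": total_gib > host_gib,
    }


# Hypothetical workload: two concurrent queries using 16G each.
result = summarize([16.0, 16.0], quota_gib=28.0, host_gib=30.8)
print(result)
```

In this hypothetical case no query crosses the per-query quota, so nothing is canceled, yet the combined usage exceeds the host's memory. Note also that mem-quota-query is expressed in bytes, so it is worth confirming that the configured number really corresponds to 28G.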

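Lawrence · June 22, 2021, 02:34

The TiDB server has 30.8G. I set two parameters: one cancels a single SQL query once it uses more than 28G, and the other cancels once memory usage on the TiDB server machine exceeds 28G. Neither takes effect. Whether a single SQL query exceeds 28G or the TiDB server as a whole exceeds 28G, the result is the same: the TiDB server hits OOM and restarts.

See the settings below. They look correct, but there was only a warning and then a restart.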

yilong · June 22, 2021, 11:40

Could you upload the execution plan of this large SQL query (explain analyze sql)? Thanks.

Lawrence · June 22, 2021, 12:00

It is the SQL in the first post of this thread. The run that took 50s was in the test environment; in production it fails outright.

yilong · June 26, 2021, 04:47

Hmm, this one should be resolved by now as well, right?

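Lawrence · June 28, 2021, 01:04

It's resolved. Could you also take a look at these two threads?
The one where oom-action has no effect: v4.0.13 oom-action has no effect
The other one: when we use TiSpark with multiple applications writing, the indexes and the actual data become inconsistent. This still has no good fix, and production requires manual repair, which is not acceptable: Query error: inconsistent extra index PRIMARY, handle 4975703 not found in table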

小王同学 · June 28, 2021, 02:28

If this one is resolved, please follow up on the other issues in their own threads.