TiDB database frequently becomes unreachable, then reconnects on its own after 10+ minutes

To help us respond faster, please provide as much background as possible; clearly described questions get priority. Please try to include the following:

  • OS & kernel version: CentOS 7.5

  • TiDB version: v2.1.8

  • Disk type: SSD

  • Cluster topology: node 1: control machine + PD1; node 2: TiKV1 + PD2; node 3: TiKV2 + TiDB1; node 4: TiKV3 + TiDB2

  • Data size & region count & replica count: about 20 GB of data

  • Problem description: the TiDB database frequently becomes unreachable, then becomes connectable again on its own after 10+ minutes. Server performance metrics look normal when checked, and the network is 10 GbE. While connections are failing, the TiDB log contains a large number of errors (attachments: 20190902tidb.log (1.5 MB), 20190902pd.log (79.0 KB)), and the PD log contains a few errors.


1. The TiDB log shows the following error:

[2019/09/02 11:06:24.867 +08:00] [ERROR] [session.go:1025] ["active transaction fail"] [error="[tikv:9001]PD server timeout[try again later]"] [errorVerbose="[tikv:9001]PD server timeout[try again later]

2. Then check the PD node log:

2019/09/02 11:06:32.833 log.go:84: [warning] etcdserver: [server is likely overloaded]

2019/09/02 11:06:33.805 log.go:84: [warning] rafthttp: [closed an existing TCP streaming connection with peer 1271ec2a36ae23ca (stream Message writer)]

2019/09/02 11:06:33.805 log.go:88: [info] rafthttp: [established a TCP streaming connection with peer 1271ec2a36ae23ca (stream Message writer)]

2019/09/02 11:06:33.807 log.go:84: [warning] rafthttp: [closed an existing TCP streaming connection with peer 1271ec2a36ae23ca (stream MsgApp v2 writer)]

2019/09/02 11:06:33.807 log.go:88: [info] rafthttp: [established a TCP streaming connection with peer 1271ec2a36ae23ca (stream MsgApp v2 writer)]

2019/09/02 11:06:33.956 log.go:84: [warning] etcdserver: [read-only range request "key:\"/tidb/store/gcworker/saved_safe_point\" " took too long (4.270724089s) to execute]

2019/09/02 11:06:33.960 util.go:214: [warning] txn runs too slow, resp: &{cluster_id:5826383677316915514 member_id:8230488343191936483 revision:4003011 raft_term:253 true [response_put:<header:<revision:4003011 > > ]}, err: , cost: 3.241257913s

2019/09/02 11:06:33.961 tso.go:137: [warning] clock offset: 3.241515268s, prev: 2019-09-02 11:06:30.719518173 +0800 CST m=+2080575.748806204, now: 2019-09-02 11:06:33.961033441 +0800 CST m=+2080578.990321518

2019/09/02 11:06:40.337 log.go:84: [warning] rafthttp: [the clock difference against peer 1271ec2a36ae23ca is too high [59.933040521s > 1s]]

2019/09/02 11:07:01.471 tso.go:137: [warning] clock offset: 501.587848ms, prev: 2019-09-02 11:07:00.969478327 +0800 CST m=+2080605.998766381, now: 2019-09-02 11:07:01.471066175 +0800 CST m=+2080606.500354199

2019/09/02 11:07:10.338 log.go:84: [warning] rafthttp: [the clock difference against peer 1271ec2a36ae23ca is too high [59.932763341s > 1s]]

2019/09/02 11:07:32.566 log.go:84: [warning] wal: [sync duration of 1.596602305s, expected less than 1s]

2019/09/02 11:07:32.566 log.go:84: [warning] etcdserver: [failed to send out heartbeat on time (exceeded the 500ms timeout for 610.615821ms)]

2019/09/02 11:07:32.566 log.go:84: [warning] etcdserver: [server is likely overloaded]

2019/09/02 11:07:33.995 cluster_info.go:495: [info] [region 2567] Leader changed from store {1} to {6}

2019/09/02 11:07:34.957 log.go:84: [warning] rafthttp: [lost the TCP streaming connection with peer 1271ec2a36ae23ca (stream Message reader)]

2019/09/02 11:07:34.957 log.go:82: [error] rafthttp: [failed to read 1271ec2a36ae23ca on stream Message (read tcp 10.50.70.16:49744->10.50.70.17:2380: i/o timeout)]

The PD log shows several problems:

1. The server clocks differ between nodes; we recommend synchronizing them.

2. The server is overloaded and there are I/O timeouts. Check the load on the machine hosting PD (CPU usage and so on), and also check whether other processes, such as Prometheus, are running on the PD machine. From the current logs, high load on PD is causing TiDB's TSO requests to time out.
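The clock-offset warnings quoted above can be pulled out of the PD log programmatically to see how often and how badly the clocks drift. A minimal sketch, assuming the `tso.go` warning format shown in the excerpt; the regex, the 0.5 s threshold, and the function name are illustrative, not a TiDB-provided tool:

```python
import re

# Matches PD tso.go warnings such as:
#   2019/09/02 11:06:33.961 tso.go:137: [warning] clock offset: 3.241515268s, prev: ...
OFFSET_RE = re.compile(r"clock offset: ([\d.]+)(ms|s),")

def large_offsets(log_lines, threshold_s=0.5):
    """Return (time-of-day, offset in seconds) for offsets >= threshold_s."""
    hits = []
    for line in log_lines:
        m = OFFSET_RE.search(line)
        if not m:
            continue
        value, unit = float(m.group(1)), m.group(2)
        seconds = value / 1000 if unit == "ms" else value
        if seconds >= threshold_s:
            # Field 1 of the log line is the time of day.
            hits.append((line.split(" ", 2)[1], seconds))
    return hits

# Two lines taken from the PD log excerpt above.
sample = [
    "2019/09/02 11:06:33.961 tso.go:137: [warning] clock offset: 3.241515268s, prev: ...",
    "2019/09/02 11:07:01.471 tso.go:137: [warning] clock offset: 501.587848ms, prev: ...",
]
for when, offset in large_offsets(sample):
    print(f"{when}: offset {offset:.3f}s")
```

Run against the full pd.log, a cluster whose NTP/chrony sync is healthy should produce no hits at all; repeated hits in the multi-second range, like the 3.24 s one here, point at broken time synchronization on one or more nodes.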

We did check the load: both what we saw on the server at the time and what the monitoring showed were normal.

The server clocks were synchronized on September 2. The problem occurred again at 11:00 on September 3: tidblog.txt (213.9 KB). Server metrics were normal at the time: https://asktug.com/uploads/default/original/2X/9/971031b256052a8aea24e53de428b3f532dbecb0.mp4

The errors in the new log are still PD server timeout. We recommend deploying PD on a dedicated machine, and checking memory, CPU, and I/O in the monitoring.
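Besides CPU and memory, the `wal: sync duration of 1.596602305s, expected less than 1s` warning in the PD log suggests slow fsync on the disk holding PD's data directory; etcd (embedded in PD) fsyncs its WAL on every write, so sustained fsync latency anywhere near 1 s will stall PD. A quick sanity check is to time fsync on that disk. A minimal sketch, a rough stand-in for a proper `fio` run; the payload size and iteration count are arbitrary:

```python
import os
import tempfile
import time

def fsync_latency(iterations=50, payload=b"x" * 4096):
    """Time write+fsync of a 4 KiB payload; returns (avg_ms, max_ms).

    Run this with the current directory on the same filesystem as
    PD's data directory to measure the disk PD actually uses.
    """
    latencies = []
    with tempfile.NamedTemporaryFile(dir=".") as f:
        for _ in range(iterations):
            start = time.perf_counter()
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # force the write to stable storage
            latencies.append((time.perf_counter() - start) * 1000)
    return sum(latencies) / len(latencies), max(latencies)

avg_ms, max_ms = fsync_latency()
print(f"avg fsync: {avg_ms:.2f} ms, max: {max_ms:.2f} ms")
```

On a healthy local SSD the average should be in the low single-digit milliseconds; values in the hundreds of milliseconds or worse would be consistent with the WAL sync and missed-heartbeat warnings above, and support moving PD to a dedicated machine with its own disk.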