TiDB 5.0.3 cluster suddenly went down, cause unknown

tidb 5.0.3
At around 12:00 on January 21, CPU on every machine in the cluster suddenly dropped to 0 and the database became unreachable. The cluster status all showed UP. With no other option I restarted, and found the TiFlash nodes starting abnormally: of the 4 independently deployed nodes, two would not start at all and the other two kept restarting. I bought two new machines, scaled out two new TiFlash nodes, and took the two that would not start offline. The two that kept restarting did so at random times; after about two hours they stopped restarting.
It's very strange and I don't know the cause; it just suddenly went down. The monitoring shows the load was very steady, and noon is break time when fewer people use the system, yet that's exactly when it failed. Please help take a look; tell me what you need and I'll provide it. I was in a hurry to recover the two dead nodes, so I forcibly took them offline and did not get the logs from them. At that point the whole cluster had stopped responding, which makes it feel more like a problem in the TiDB connection layer. I'm still learning TiDB and I'm not sure where to start troubleshooting.
:joy: Got complaints from the business side again ~ this is so hard ~~
tidb-rpt-TiFlash-Summary_2022-01-22T16_16_31.017Z.json (10.1 MB)
tidb-rpt-Overview_2022-01-22T16_13_47.632Z.json (10.0 MB)

tidb-2022-01-21T12-43-01.223_tidb1.log.gz (19.5 MB)
tiflash.log.101.gz (18.3 MB)
tiflash.log.102.gz (17.1 MB)

Follow-up logs below. Cluster topology: 3 PD, 3 TiKV, 3 TiDB, 4 TiFlash, all independently deployed.
Logs from the 3 TiDB and 3 PD nodes during the incident window:
pd_1.log.gz (21.6 MB)
pd_2.log.gz (11.9 MB)
pd_3.log.gz (21.2 MB)

tidb-2022-01-21T12-43-01.223_节点1上.log.gz (19.5 MB)
tidb-2022-01-21T13-50-11.198_节点1下.log.gz (12.0 MB)
tidb-2022-01-21T13-51-43.215_节点2.log.gz (11.4 MB)

tidb-2022-01-21T13-15-33.728_节点3.log.gz (19.1 MB)


Can you log in to the Dashboard? You can restrict the log search to the TiFlash instances and search for keywords such as welcome, out of memory, region unavailable, and so on.
Just from the graphs, CPU and memory usage on 128 and 180 spiked suddenly. Did those hosts go down?
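
If the Dashboard won't open, scanning the TiFlash log files on the nodes directly works as well. A minimal Go sketch (the log path is a placeholder; point it at the actual tiflash.log on the affected node):

package main

import (
    "bufio"
    "fmt"
    "os"
    "strings"
)

func main() {
    // Keywords suggested above; "Welcome" marks each (re)start of the process.
    keywords := []string{"Welcome", "out of memory", "region unavailable"}
    f, err := os.Open("/data1/tidb-rpt/tiflash/log/tiflash.log") // hypothetical path
    if err != nil {
        panic(err)
    }
    defer f.Close()

    sc := bufio.NewScanner(f)
    sc.Buffer(make([]byte, 1024*1024), 1024*1024) // TiFlash log lines can be long
    for sc.Scan() {
        line := sc.Text()
        for _, kw := range keywords {
            if strings.Contains(line, kw) {
                fmt.Println(line)
                break
            }
        }
    }
}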


Under normal business load, CPU usage is around 50%. CPU usage was normal before the incident; it just suddenly dropped, and I don't know why.
I could log in to the Dashboard, but the monitoring graphs would not load. Logging in to TiDB with the mysql command line got no response.


1. First, an ERROR:

[2022/01/21 12:05:11.481 +08:00] [ERROR] [select_result.go:343] ["invalid cop task execution summaries length"] [conn=345625] [expected=4] [received=0]

Found a similar note for 5.0.6:

  • Fix the issue that logs related to invalid cop task execution summaries length appear when running MPP queries #1791

2. Next, these start appearing from 12:15 (on Grafana, memory seems to drop to 0 around 13:18):

[2022/01/21 12:15:55.122 +08:00] [INFO] [server.go:569] [kill] [connID=346981] [query=true]
Looking at the source, this appears to be killing sessions?

// Kill implements the SessionManager interface.
func (s *Server) Kill(connectionID uint64, query bool) {
    logutil.BgLogger().Info("kill", zap.Uint64("connID", connectionID), zap.Bool("query", query))
    metrics.ServerEventCounter.WithLabelValues(metrics.EventKill).Inc()

    s.rwlock.RLock()
    defer s.rwlock.RUnlock()
    conn, ok := s.clients[connectionID]
    if !ok {
        return
    }

    if !query {
        // Mark the client connection status as WaitShutdown, when clientConn.Run detect
        // this, it will end the dispatch loop and exit.
        atomic.StoreInt32(&conn.status, connStatusWaitShutdown)
    }
    // query=true kills only the current statement and keeps the connection alive
    // (KILL QUERY), which matches the [query=true] in the log line above.
    killConn(conn)
}

3. A large number of slow queries.
4.

[2022/01/21 12:17:32.929 +08:00] [WARN] [domain.go:402] ["loading schema takes a long time"] ["take time"=6m43.451259607s]

Looking at the source, the schema/lease is force-refreshed whether or not a cached copy exists.
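
For reference, the pattern implied there, as a hedged sketch (not the actual domain.go code; function names and the demo values are mine):

package main

import (
    "log"
    "time"
)

// Sketch only: the schema is refreshed on every lease tick regardless of any cache,
// and a warning is printed when a single reload takes longer than the lease,
// matching the "loading schema takes a long time" message above.
func loadSchemaInLoop(lease time.Duration, rounds int, reload func() error) {
    ticker := time.NewTicker(lease)
    defer ticker.Stop()
    for i := 0; i < rounds; i++ {
        <-ticker.C
        start := time.Now()
        err := reload() // forced refresh, cache or not
        if took := time.Since(start); took > lease {
            log.Printf("loading schema takes a long time, take time=%s", took)
        }
        if err != nil {
            log.Printf("reload schema failed: %v", err)
        }
    }
}

func main() {
    loadSchemaInLoop(time.Second, 2, func() error {
        time.Sleep(3 * time.Second) // simulate a reload blocked by slow TiKV/PD responses
        return nil
    })
}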

Waiting for an expert to explain :joy:


Also, could you provide the tidb.log and pd.log from 2022-01-21 12:00:00 to 2022-01-21 13:15:00? I'd like to see how it finally went down.


Logs uploaded; I appended them to the original post.


I went through the logs and have a few questions.

First, the PD logs:

[2022/01/21 13:05:13.983 +08:00] [INFO] [operator_controller.go:626] ["send schedule command"] [region-id=32594676] [step="remove peer on store 5"] [source="active push"]
[2022/01/21 13:05:14.987 +08:00] [WARN] [heartbeat_streams.go:138] ["send keepalive message fail, store maybe disconnected"] [target-store-id=11585778] [error=EOF]
[2022/01/21 13:05:14.987 +08:00] [WARN] [heartbeat_streams.go:138] ["send keepalive message fail, store maybe disconnected"] [target-store-id=5] [error=EOF]
[2022/01/21 13:05:14.987 +08:00] [WARN] [heartbeat_streams.go:138] ["send keepalive message fail, store maybe disconnected"] [target-store-id=4] [error=EOF]
[2022/01/21 13:05:16.097 +08:00] [INFO] [main.go:125] ["Got signal to exit"] [signal=terminated]
[2022/01/21 13:05:16.097 +08:00] [INFO] [server.go:526] ["server is closed, exit metrics loop"]
[2022/01/21 13:05:16.097 +08:00] [INFO] [server.go:551] ["server is closed, exist encryption key manager loop"]
[2022/01/21 13:05:16.097 +08:00] [INFO] [coordinator.go:737] ["scheduler has been stopped"] [scheduler-name=balance-hot-region-scheduler] [error="context canceled"]
[2022/01/21 13:05:16.097 +08:00] [INFO] [lease.go:142] ["stop lease keep alive worker"] [purpose="pd leader election"]

They show that stores 4, 5, and 11585778 lost their connections at the same time.
Immediately afterwards, etcd inside PD went through a re-election (and a restart as well?):

[2022/01/21 13:05:16.100 +08:00] [INFO] [raft.go:713] ["1ae90ad07220de9a became candidate at term 27"]
...
[2022/01/21 13:05:17.092 +08:00] [INFO] [leadership.go:120] ["write leaderData to leaderPath ok"] [leaderPath=/pd/6971708166954465753/leader] [purpose="pd leader election"]
[2022/01/21 13:05:17.092 +08:00] [INFO] [server.go:1231] ["campaign pd leader ok"] [campaign-pd-leader-name=pd-10.102.0.82-2379]
[2022/01/21 13:05:17.092 +08:00] [INFO] [server.go:1238] ["initializing the global TSO allocator"]
...
[2022/01/21 13:05:17.120 +08:00] [INFO] [server.go:1301] ["server is closed"]
[2022/01/21 13:05:17.121 +08:00] [INFO] [cluster.go:317] ["metrics are reset"]
[2022/01/21 13:05:17.121 +08:00] [INFO] [server.go:111] ["region syncer has been stopped"]
[2022/01/21 13:05:17.122 +08:00] [INFO] [cluster.go:319] ["background jobs has been stopped"]
[2022/01/21 13:05:17.122 +08:00] [INFO] [cluster.go:367] ["raftcluster is stopped"]
[2022/01/21 13:05:17.122 +08:00] [INFO] [tso.go:378] ["reset the timestamp in memory"]
[2022/01/21 13:05:17.123 +08:00] [INFO] [server.go:1163] ["server is closed, return pd leader loop"]
...
[2022/01/21 13:06:47.898 +08:00] [INFO] [util.go:41] ["Welcome to Placement Driver (PD)"]
[2022/01/21 13:06:47.898 +08:00] [INFO] [util.go:42] [PD] [release-version=v5.0.3]

The PD leader changed from 3 to 1 and reconnected to stores 1, 4, 5, and 11585778.

[2022/01/21 13:07:38.957 +08:00] [INFO] [util.go:77] ["load cluster version"] [cluster-version=5.0.3]
[2022/01/21 13:07:38.957 +08:00] [INFO] [server.go:1282] ["PD cluster leader is ready to serve"] [pd-leader-name=pd-10.102.0.82-2379]
[2022/01/21 13:07:38.959 +08:00] [INFO] [grpc_service.go:356] ["put store ok"] [store="id:4 address:\"10.102.0.137:20160\" version:\"5.0.3\" status_address:\"10.102.0.137:20180\" git_hash:\"63b63edfbb9bbf8aeb875aad28c59f082eeb55d4\" start_timestamp:1642741614 deploy_path:\"/data1/tidb-rpt/tikv/bin\" "]
[2022/01/21 13:07:38.959 +08:00] [INFO] [util.go:77] ["load cluster version"] [cluster-version=5.0.3]
[2022/01/21 13:07:39.110 +08:00] [INFO] [grpc_service.go:356] ["put store ok"] [store="id:1 address:\"10.102.0.227:20160\" version:\"5.0.3\" status_address:\"10.102.0.227:20180\" git_hash:\"63b63edfbb9bbf8aeb875aad28c59f082eeb55d4\" start_timestamp:1642741615 deploy_path:\"/data1/tidb-rpt/tikv/bin\" "]
[2022/01/21 13:07:39.110 +08:00] [INFO] [util.go:77] ["load cluster version"] [cluster-version=5.0.3]
[2022/01/21 13:07:39.154 +08:00] [INFO] [grpc_service.go:356] ["put store ok"] [store="id:5 address:\"10.102.0.186:20160\" version:\"5.0.3\" status_address:\"10.102.0.186:20180\" git_hash:\"63b63edfbb9bbf8aeb875aad28c59f082eeb55d4\" start_timestamp:1642741615 deploy_path:\"/data1/tidb-rpt/tikv/bin\" "]
[2022/01/21 13:07:39.154 +08:00] [INFO] [util.go:77] ["load cluster version"] [cluster-version=5.0.3]
[2022/01/21 13:07:39.906 +08:00] [INFO] [grpc_service.go:356] ["put store ok"] [store="id:11585778 address:\"10.102.0.193:20160\" version:\"5.0.3\" status_address:\"10.102.0.193:20180\" git_hash:\"63b63edfbb9bbf8aeb875aad28c59f082eeb55d4\" start_timestamp:1642741614 deploy_path:\"/data1/tidb-rpt/tikv/bin\" "]
[2022/01/21 13:07:39.906 +08:00] [INFO] [util.go:77] ["load cluster version"] [cluster-version=5.0.3]

But it could not connect to tidb-server?

[2022/01/21 13:07:51.908 +08:00] [WARN] [forwarder.go:118] ["Unable to resolve connection address since no alive TiDB instance"]
[2022/01/21 13:07:51.908 +08:00] [ERROR] [tidb_requests.go:74] ["fail to send schema request to TiDB"] [error=error.tidb.no_alive_tidb]
[2022/01/21 13:07:57.906 +08:00] [WARN] [proxy.go:189] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=10.102.0.71:4000] [interval=2s] [error="dial tcp 10.102.0.71:4000: connect: connection refused"]
[2022/01/21 13:07:57.906 +08:00] [WARN] [proxy.go:189] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=10.102.0.187:4000] [interval=2s] [error="dial tcp 10.102.0.187:4000: connect: connection refused"]
[2022/01/21 13:07:57.906 +08:00] [WARN] [proxy.go:189] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=10.102.0.86:10080] [interval=2s] [error="dial tcp 10.102.0.86:10080: connect: connection refused"]
[2022/01/21 13:07:57.906 +08:00] [WARN] [proxy.go:189] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=10.102.0.187:10080] [interval=2s] [error="dial tcp 10.102.0.187:10080: connect: connection refused"]
[2022/01/21 13:07:57.906 +08:00] [WARN] [proxy.go:189] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=10.102.0.71:10080] [interval=2s] [error="dial tcp 10.102.0.71:10080: connect: connection refused"]
[2022/01/21 13:07:57.906 +08:00] [WARN] [proxy.go:189] ["fail to recv activity from remote, stay inactive and wait to next checking round"] [remote=10.102.0.86:4000] [interval=2s] [error="dial tcp 10.102.0.86:4000: connect: connection refused"]

And these are the TiDB logs:

[2022/01/21 13:04:34.448 +08:00] [INFO] [batch_coprocessor.go:561] ["stream unknown error"] [error="rpc error: code = Canceled desc = context canceled"]
[2022/01/21 13:04:34.448 +08:00] [INFO] [conn.go:812] ["command dispatched failed"] [conn=349305] [connInfo="id:349305, addr:10.1.39.27:43098 status:10, collation:utf8_general_ci, user:dolphin_rpt_r"] [command=Query] [status="inTxn:0, autocommit:1"] [sql=] [txn_mode=PESSIMISTIC] [err="[tikv:13][FLASH:Coprocessor:Internal] TiFlash server is terminating"]
[2022/01/21 13:04:34.474 +08:00] [INFO] [batch_coprocessor.go:561] ["stream unknown error"] [error="rpc error: code = Canceled desc = context canceled"]
[2022/01/21 13:04:34.475 +08:00] [INFO] [conn.go:812] ["command dispatched failed"] [conn=349413] [connInfo="id:349413, addr:10.1.39.23:50846 status:10, collation:utf8_general_ci, user:dolphin_rpt_r"] [command=Query] [status="inTxn:0, autocommit:1"] [sql="] [txn_mode=PESSIMISTIC] [err="[tikv:13][FLASH:Coprocessor:Internal] TiFlash server is terminating"]
[2022/01/21 13:04:34.736 +08:00] [ERROR] [terror.go:291] ["encountered error"] [error="write tcp 10.102.0.86:4000->10.1.39.21:45696: write: connection timed out"] [stack="github.com/pingcap/parser/terror.Log\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/pkg/mod/github.com/pingcap/parser@v0.0.0-20210623034316-5ee95ed0081f/terror/terror.go:291\
github.com/pingcap/tidb/server.(*clientConn).Run\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/conn.go:821\
github.com/pingcap/tidb/server.(*Server).onConn\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/server.go:477"]
[2022/01/21 13:04:34.761 +08:00] [INFO] [conn.go:812] ["command dispatched failed"] [conn=349411] [connInfo="id:349411, addr:10.1.37.96:50642 status:10, collation:utf8_general_ci, user:dolphin_rpt_r"]  [txn_mode=PESSIMISTIC] [err="[tikv:13][FLASH:Coprocessor:Internal] TiFlash server is terminating"]
[2022/01/21 13:04:34.761 +08:00] [ERROR] [terror.go:291] ["encountered error"] [error="write tcp 10.102.0.86:4000->10.1.37.96:50642: write: connection timed out"] [stack="github.com/pingcap/parser/terror.Log\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/pkg/mod/github.com/pingcap/parser@v0.0.0-20210623034316-5ee95ed0081f/terror/terror.go:291\
github.com/pingcap/tidb/server.(*clientConn).Run\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/conn.go:821\
github.com/pingcap/tidb/server.(*Server).onConn\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/server.go:477"]

From the logs it looks like a network problem between PD and tidb-server? And also between PD 3 and the stores.
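
One thing worth noting: "connection refused" usually means the host was reachable but nothing was listening on the port, i.e. the tidb-server process itself was down at that moment, whereas a pure network partition typically shows up as a dial timeout. A quick Go sketch to probe the addresses from the log lines above (addresses copied from those lines; adjust as needed):

package main

import (
    "fmt"
    "net"
    "time"
)

func main() {
    // 4000/10080 are the tidb-server service/status ports, 2379 is PD.
    targets := []string{
        "10.102.0.71:4000", "10.102.0.71:10080",
        "10.102.0.86:4000", "10.102.0.86:10080",
        "10.102.0.187:4000", "10.102.0.187:10080",
        "10.102.0.82:2379",
    }
    for _, addr := range targets {
        conn, err := net.DialTimeout("tcp", addr, 2*time.Second)
        if err != nil {
            fmt.Printf("%-22s unreachable: %v\n", addr, err)
            continue
        }
        conn.Close()
        fmt.Printf("%-22s ok\n", addr)
    }
}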


As for TiFlash, the logs show it was terminated externally:

[2022/01/21 13:03:36.846 +08:00] [INFO] [<unknown>] ["Application: Received termination signal (Terminated)"] [thread_id=242142338]
[2022/01/21 13:03:36.853 +08:00] [INFO] [<unknown>] ["Application: Set tmt context terminated"] [thread_id=1]
[2022/01/21 13:03:36.853 +08:00] [INFO] [<unknown>] ["Application: wait tiflash proxy to stop all services"] [thread_id=1]
......
[2022/01/21 13:03:37.253 +08:00] [INFO] [<unknown>] ["Application: all services in tiflash proxy are stopped"] [thread_id=1]
......
[2022/01/21 13:03:37.447 +08:00] [DEBUG] [<unknown>] ["Application: Received termination signal."] [thread_id=1]
[2022/01/21 13:03:37.447 +08:00] [DEBUG] [<unknown>] ["Application: Waiting for current connections to close."] [thread_id=1]
[2022/01/21 13:03:37.572 +08:00] [DEBUG] [<unknown>] ["Application: Closed all listening sockets."] [thread_id=1]
[2022/01/21 13:03:37.572 +08:00] [DEBUG] [<unknown>] ["Application: Closed connections."] [thread_id=1]
[2022/01/21 13:03:40.005 +08:00] [INFO] [<unknown>] ["Application: Begin to shut down flash grpc server"] [thread_id=1]
......
[2022/01/21 13:03:40.008 +08:00] [INFO] [<unknown>] ["Application: Shut down flash grpc server"] [thread_id=1]
[2022/01/21 13:03:40.008 +08:00] [INFO] [<unknown>] ["Application: Begin to shut down flash service"] [thread_id=1]
[2022/01/21 13:03:40.031 +08:00] [INFO] [<unknown>] ["Application: Shut down flash service"] [thread_id=1]
[2022/01/21 13:03:40.031 +08:00] [INFO] [<unknown>] ["Application: Shutting down storages."] [thread_id=1]
......
[2022/01/21 13:03:49.066 +08:00] [DEBUG] [<unknown>] ["Application: Destroyed global context."] [thread_id=1]
[2022/01/21 13:03:49.067 +08:00] [INFO] [<unknown>] ["Application: let tiflash proxy shutdown"] [thread_id=1]
[2022/01/21 13:03:49.067 +08:00] [INFO] [<unknown>] ["Application: wait for tiflash proxy thread to join"] [thread_id=1]
[2022/01/21 13:03:49.392 +08:00] [INFO] [<unknown>] ["Application: tiflash proxy thread is joined"] [thread_id=1]
[2022/01/21 13:03:49.393 +08:00] [INFO] [<unknown>] ["Application: shutting down"] [thread_id=1]
[2022/01/21 13:03:49.393 +08:00] [DEBUG] [<unknown>] ["Application: Uninitializing subsystem: Logging Subsystem"] [thread_id=1]
[2022/01/21 13:03:49.393 +08:00] [INFO] [<unknown>] ["BaseDaemon: Stop SignalListener thread"] [thread_id=242142338]

Then it was brought back up 4 minutes later:

[2022/01/21 13:07:59.488 +08:00] [INFO] [<unknown>] ["Starting daemon with revision 54381"] [thread_id=1]
[2022/01/21 13:07:59.488 +08:00] [INFO] [<unknown>] ["TiFlash build info: TiFlash\
Release Version: v5.0.3\
Edition:         Community\
Git Commit Hash: 0194cb4b59438d8d46fc05a4b1abd85eeb69972f\
Git Branch:      heads/refs/tags/v5.0.3\
UTC Build Time:  2021-07-01 06:53:53\
Profile:         RELWITHDEBINFO\
"] [thread_id=1]
[2022/01/21 13:07:59.488 +08:00] [INFO] [<unknown>] ["Application: starting up"] [thread_id=1]
[2022/01/21 13:07:59.490 +08:00] [INFO] [<unknown>] ["Application: Got jemalloc version: 5.2.1-0-gea6b3e973b477b8061e0076bb257dbd7f3faa756"] [thread_id=1]
[2022/01/21 13:07:59.490 +08:00] [INFO] [<unknown>] ["Application: Not found environment variable MALLOC_CONF"] [thread_id=1]
[2022/01/21 13:07:59.490 +08:00] [INFO] [<unknown>] ["Application: Got jemalloc config: opt.background_thread 0, opt.max_background_threads 4"] [thread_id=1]
[2022/01/21 13:07:59.490 +08:00] [INFO] [<unknown>] ["Application: Try to use background_thread of jemalloc to handle purging asynchronously"] [thread_id=1]
[2022/01/21 13:07:59.490 +08:00] [INFO] [<unknown>] ["Application: Set jemalloc.max_background_threads 1"] [thread_id=1]
[2022/01/21 13:07:59.490 +08:00] [INFO] [<unknown>] ["Application: Set jemalloc.background_thread 1"] [thread_id=1]
[2022/01/21 13:07:59.494 +08:00] [INFO] [<unknown>] ["Application: start raft store proxy"] [thread_id=1]
[2022/01/21 13:07:59.494 +08:00] [INFO] [<unknown>] ["Application: wait for tiflash proxy initializing"] [thread_id=1]
[2022/01/21 13:07:59.694 +08:00] [INFO] [<unknown>] ["Application: tiflash proxy is initialized"] [thread_id=1]

Presumably some protection mechanism was triggered and caused the restart? Waiting for an expert to explain.
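
One data point supporting that: "Received termination signal (Terminated)" means TiFlash received an external SIGTERM (from systemd, a supervisor script, or a manual kill) and shut down gracefully. A crash or a kernel OOM kill (SIGKILL) cannot be caught and would leave no such shutdown sequence. A tiny Go sketch of the same signal-handling pattern:

package main

import (
    "fmt"
    "os"
    "os/signal"
    "syscall"
)

func main() {
    sigCh := make(chan os.Signal, 1)
    // SIGTERM is what `kill <pid>`, systemd stop, or a supervisor restart sends;
    // SIGKILL (e.g. the kernel OOM killer) cannot be caught and leaves no log like this.
    signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)

    sig := <-sigCh // run the program and send it SIGTERM to see the message
    fmt.Printf("Received termination signal (%v), shutting down gracefully\n", sig)
    // ...flush state, close listeners, etc., as in the TiFlash shutdown sequence above.
}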


Please export the monitoring again: tidb-server and tikv-details (this issue is quite serious, so I suggest looking at the monitoring first). The monitoring information is a bit thin, but judging from this metric, the cluster was already having problems before the anomaly.


tidb:
tidb-rpt-TiDB_2022-01-25T05_11_33.817Z.json (13.2 MB)

tikv-details
tidb-rpt-TiKV-Details_2022-01-25T04_58_51.930Z.json (74.0 MB)

Could you help take a look? Thanks.

From the monitoring it looks like TiKV also went down? There are no TiKV logs :joy:, could you upload a copy?

tikv.log.2022-01-21-13_54_48.218398745.gz (20.3 MB)
Because the cluster was almost completely unresponsive, I restarted it at 2022-01-21 13:06.

Then this effectively splits into two problems:
1. Before the restart at 2022-01-21 13:05:16, the TiDB cluster could not be connected to.
2. After the cluster restart at 2022-01-21 13:07:39, the TiFlash status was abnormal.

For the first one, the logs show a large number of kill query actions; that needs an expert to explain.
For the second one, could you upload the logs of the TiFlash nodes listed below (the time ranges are in parentheses)?

10.102.0.128 (kept restarting from 13:10 to 15:38, then went down)
(status screenshot)
10.102.0.44 (was briefly online twice after the restart, up until 14:58; 14:30 - 15:30)
(status screenshot)
10.102.0.183 (kept restarting after the restart; 13:00 - 16:00)
(status screenshot)
10.102.0.215 (kept restarting after the restart; 13:00 - 16:00)
(status screenshot)

From the monitoring, load on the TiDB and TiKV nodes is not high, so it should not be a TiDB or TiKV problem. In the TiDB logs, expensive_query entries increase noticeably from around 12:07. Some queries were probably processed slowly by TiFlash, so other sessions could not obtain a token. Once the number of requesting sessions exceeds the token-limit setting (which is 9 according to the logs), subsequent queries queue up, no query can finish, and QPS drops to zero.
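
A simplified model of that mechanism (a sketch, not TiDB's actual implementation): token-limit acts like a counting semaphore on sessions executing commands, so if 9 sessions are all stuck on slow TiFlash queries, every other session blocks waiting for a token and QPS drops to zero:

package main

import (
    "fmt"
    "time"
)

func main() {
    const tokenLimit = 9 // matches the value seen in the logs
    tokens := make(chan struct{}, tokenLimit)

    runQuery := func(id int, d time.Duration) {
        tokens <- struct{}{} // blocks once tokenLimit queries are in flight
        defer func() { <-tokens }()
        time.Sleep(d)
        fmt.Printf("query %d finished after %v\n", id, d)
    }

    // 9 slow TiFlash-bound queries occupy every token...
    for i := 0; i < tokenLimit; i++ {
        go runQuery(i, time.Hour)
    }
    time.Sleep(100 * time.Millisecond)

    // ...so even a fast query now queues indefinitely and overall QPS stays at zero.
    done := make(chan struct{})
    go func() { runQuery(100, time.Millisecond); close(done) }()
    select {
    case <-done:
        fmt.Println("fast query got a token")
    case <-time.After(2 * time.Second):
        fmt.Println("fast query still waiting for a token -> QPS drops to zero")
    }
}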

Hi, do you mean the TiFlash nodes were under too much pressure to keep up? During the incident TiFlash CPU looked quite stable, staying around 50% with no sudden spike. There are currently 4 nodes, each 88 cores / 384 GB RAM, independently deployed.
TiDB QPS is about 30. Could this be caused by big queries? Normally the 99.9% duration is within 8s. There are no more than 10 full-table queries per day that run longer than 1 minute, for example full-table scans of 10-30 million rows.
Would scaling out TiFlash by 1-2 nodes, for example, improve processing capacity and fix this?

You could consider increasing token-limit. You can also add some flow control and traffic separation at the application layer, for example routing the big SQL statements that go to TiFlash to only some of the TiDB instances, so the other TiDB instances are not affected. The TiFlash replica count can also be set to 2, so that a single TiFlash node going down does not make the data unavailable.
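
A minimal sketch of that application-layer separation (the DSNs, instance addresses, and the "is analytical" flag are placeholders, not anything from this cluster): keep two connection pools and send the heavy TiFlash-bound SQL to a dedicated TiDB instance so OLTP traffic elsewhere is not starved of tokens.

package main

import (
    "database/sql"
    "log"

    _ "github.com/go-sql-driver/mysql" // TiDB speaks the MySQL protocol
)

type router struct {
    oltp *sql.DB // regular TiDB instances behind the load balancer
    olap *sql.DB // the TiDB instance reserved for big TiFlash queries
}

// pick returns the pool a statement should run on; how a statement gets flagged
// as analytical is up to the application (an explicit tag, a statement registry, ...).
func (r *router) pick(isAnalytical bool) *sql.DB {
    if isAnalytical {
        return r.olap
    }
    return r.oltp
}

func main() {
    oltp, err := sql.Open("mysql", "app:pw@tcp(tidb-lb:4000)/rpt")
    if err != nil {
        log.Fatal(err)
    }
    olap, err := sql.Open("mysql", "app:pw@tcp(tidb-olap:4000)/rpt")
    if err != nil {
        log.Fatal(err)
    }
    r := &router{oltp: oltp, olap: olap}

    // A 30-million-row full-table scan goes to the dedicated instance...
    _ = r.pick(true)
    // ...while ordinary report lookups keep using the shared pool.
    _ = r.pick(false)
}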

I still think the problem was with TiFlash :joy:
If it were just about load, the morning of the 22nd ran under the same pressure.

此话题已在最后回复的 1 分钟后被自动关闭。不再允许新回复。