Two-node TiDB cluster: one node cannot be logged in to

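To improve efficiency, please provide the following information. A clearly described problem gets resolved faster:
[TiDB environment]
[Summary] One TiDB node cannot be logged in to
[Background] Two TiDB nodes, each with read/write monitoring in place. This morning one of the nodes could not be logged in to.
[Symptom] Logging in to the faulty node with the MySQL client hangs, and Ctrl+C does not exit.
[Business impact] The two nodes were load balanced, so every application connection routed to the faulty node reports errors.
[TiDB version] v4.0.8
[Attachments]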

  1. TiUP Cluster Display information
    Cluster status shows as normal

  2. TiUP Cluster Edit Config information

  3. TiDB-Overview monitoring

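Please add the background, the current state of the environment, and what you expect.
Without enough information, how can anyone help you? :sweat: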

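Background? Do you mean how the problem occurred? I don't know. It happened in the early hours, and in the morning I saw an alert saying the production cluster was intermittently unable to read or write.

After investigating, I found that of the two TiDB nodes, one cannot be connected to (the connection hangs, as described above), yet display shows the cluster status as Up.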
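The symptom here (process listed as Up, port accepting connections, client hanging) can be told apart from a plain "port closed" failure with a probe that waits for the server's first protocol packet under a timeout. The sketch below is a minimal illustration, not an official TiDB tool. It assumes only that a MySQL-protocol server sends its initial packet (4-byte header plus payload) right after the TCP connection is established; the function name and the three result strings are made up for this example.

```python
import socket


def _read_exact(sock, n):
    """Read exactly n bytes, or return None if the peer closes first."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            return None
        buf += chunk
    return buf


def probe_mysql_handshake(host, port, timeout=3.0):
    """Classify a MySQL-protocol endpoint (hypothetical helper).

    Returns:
      "connect_failed": TCP connection could not be established
      "no_greeting":    TCP connected, but no initial packet arrived in time
      "ok":             the server sent the start of its initial packet
    """
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "connect_failed"
    with sock:
        sock.settimeout(timeout)
        try:
            # 3-byte payload length + 1-byte sequence id, then the first
            # payload byte (protocol version, or 0xFF for an error packet).
            head = _read_exact(sock, 5)
        except OSError:  # includes socket.timeout
            return "no_greeting"
    return "ok" if head is not None else "no_greeting"


if __name__ == "__main__":
    # Address taken from the thread; adjust for your own cluster.
    print(probe_mysql_handshake("172.29.1.63", 4000))
```

A result of "no_greeting" would match what is described above: the node still accepts TCP connections but never answers, which a process-level status check does not detect.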

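I just tried again: this node still cannot be connected to, and the cluster status still shows Up. I have circled the faulty node in the screenshot below.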

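Which machine did you circle, and is it working normally now?
If not, are there any recent logs? Could you look through the error messages yourself first and then upload them?
Also, have you checked the usage of the various resources?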

ID Role Host Ports OS/Arch Status Data Dir Deploy Dir
172.29.1.63:4000 tidb 172.29.1.63 4000/10080 linux/x86_64 Up - /data/tidb-deploy/tidb-4000

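This is the node that failed. It is the one circled in the screenshot above.

I checked the server resources at the time: disk, CPU, and memory were all normal, and many scheduled monitoring tasks were waiting for a connection.

The logs from the faulty TiDB node were posted above. Here they are again: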
[2021/09/12 01:38:33.305 +08:00] [INFO] [row_container.go:506] ["memory exceeds quota, spill to disk now."] [consumed=1138187897] [quota=1073741824]
[2021/09/12 01:38:36.740 +08:00] [INFO] [row_container.go:506] ["memory exceeds quota, spill to disk now."] [consumed=1129307453] [quota=1073741824]
[2021/09/12 01:38:42.584 +08:00] [INFO] [row_container.go:506] ["memory exceeds quota, spill to disk now."] [consumed=1129213225] [quota=1073741824]
[2021/09/12 01:39:02.583 +08:00] [ERROR] [terror.go:272] ["encountered error"] [error="write tcp 172.29.1.63:4000->172.29.1.84:47684: write: broken pipe"] [stack="github.com/pingcap/parser/terror.Log\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/pkg/mod/github.com/pingcap/parser@v0.0.0-20201022083903-fbe80b0c40bb/terror/terror.go:272\ngithub.com/pingcap/tidb/server.(*packetIO).writePacket\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/server/packetio.go:159\ngithub.com/pingcap/tidb/server.(*clientConn).writePacket\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/server/conn.go:338\ngithub.com/pingcap/tidb/server.(*clientConn).writeChunks\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/server/conn.go:1529\ngithub.com/pingcap/tidb/server.(*clientConn).writeResultset\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/server/conn.go:1460\ngithub.com/pingcap/tidb/server.(*clientConn).handleQuery\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/server/conn.go:1368\ngithub.com/pingcap/tidb/server.(*clientConn).dispatch\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/server/conn.go:985\ngithub.com/pingcap/tidb/server.(*clientConn).Run\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/server/conn.go:772\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/server/server.go:421"]
[2021/09/12 01:39:03.625 +08:00] [INFO] [conn.go:787] ["command dispatched failed"] [conn=36119] [connInfo="id:36119, addr:172.29.1.84:47684 status:11, collation:utf8mb4_general_ci, user:root"] [command=Query] [status="inTxn:1, autocommit:1"] [sql="SELECT * FROM db_name . device_ts ORDER BY _tidb_rowid "] [txn_mode=PESSIMISTIC] [err="connection was bad\ngithub.com/pingcap/errors.AddStack\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20200917111840-a15ef68f753d/errors.go:174\ngithub.com/pingcap/errors.Trace\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20200917111840-a15ef68f753d/juju_adaptor.go:15\ngithub.com/pingcap/tidb/server.(*packetIO).writePacket\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/server/packetio.go:160\ngithub.com/pingcap/tidb/server.(*clientConn).writePacket\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/server/conn.go:338\ngithub.com/pingcap/tidb/server.(*clientConn).writeChunks\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/server/conn.go:1529\ngithub.com/pingcap/tidb/server.(*clientConn).writeResultset\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/server/conn.go:1460\ngithub.com/pingcap/tidb/server.(*clientConn).handleQuery\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/server/conn.go:1368\ngithub.com/pingcap/tidb/server.(*clientConn).dispatch\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/server/conn.go:985\ngithub.com/pingcap/tidb/server.(*clientConn).Run\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/server/conn.go:772\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/server/server.go:421\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1357"]
[2021/09/12 01:39:03.626 +08:00] [ERROR] [terror.go:272] ["encountered error"] [error="write tcp 172.29.1.63:4000->172.29.1.84:47684: write: broken pipe"] [stack="github.com/pingcap/parser/terror.Log\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/pkg/mod/github.com/pingcap/parser@v0.0.0-20201022083903-fbe80b0c40bb/terror/terror.go:272\ngithub.com/pingcap/tidb/server.(*packetIO).writePacket\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/server/packetio.go:159\ngithub.com/pingcap/tidb/server.(*clientConn).writePacket\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/server/conn.go:338\ngithub.com/pingcap/tidb/server.(*clientConn).writeError\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/server/conn.go:1100\ngithub.com/pingcap/tidb/server.(*clientConn).Run\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/server/conn.go:795\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/server/server.go:421"]
[2021/09/12 01:39:03.626 +08:00] [ERROR] [terror.go:272] ["encountered error"] [error="connection was bad"] [stack="github.com/pingcap/parser/terror.Log\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/pkg/mod/github.com/pingcap/parser@v0.0.0-20201022083903-fbe80b0c40bb/terror/terror.go:272\ngithub.com/pingcap/tidb/server.(*clientConn).Run\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/server/conn.go:796\ngithub.com/pingcap/tidb/server.(*Server).onConn\n\t/home/jenkins/agent/workspace/tidb_v4.0.8/go/src/github.com/pingcap/tidb/server/server.go:421"]

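Given what was running at that time (a backup runs in the early hours), my guess is that the backup exhausted memory on the TiDB node and triggered some fault. Can you confirm whether that is the cause, and what recovery options are there?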

The simplest fix is to restart TiDB. That should do it.

From the logs, memory was used up, TiDB started spilling to disk, and then it hung.
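To see how far over the quota each spill was, the `consumed` and `quota` fields can be pulled out of the "spill to disk" lines. The sketch below is only an illustration of reading the log lines posted above. The function name is hypothetical, and it assumes the two fields appear next to each other as `[consumed=N] [quota=N]`, as they do in this thread.

```python
import re

# Matches the adjacent fields as they appear in the posted log lines.
SPILL_RE = re.compile(r"\[consumed=(\d+)\]\s*\[quota=(\d+)\]")


def parse_spill_events(lines):
    """Return (consumed_bytes, quota_bytes, consumed/quota) for each spill line."""
    events = []
    for line in lines:
        if "spill to disk" not in line:
            continue
        match = SPILL_RE.search(line)
        if match:
            consumed = int(match.group(1))
            quota = int(match.group(2))
            events.append((consumed, quota, consumed / quota))
    return events


if __name__ == "__main__":
    sample = [
        '[2021/09/12 01:38:33.305 +08:00] [INFO] [row_container.go:506] '
        '["memory exceeds quota, spill to disk now."] '
        '[consumed=1138187897] [quota=1073741824]',
    ]
    for consumed, quota, ratio in parse_spill_events(sample):
        print(f"consumed={consumed} quota={quota} ratio={ratio:.3f}")
```

The quota in these lines, 1073741824 bytes, is exactly 1 GiB, and each logged `consumed` value is slightly above it.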

You could also check the slow queries.

What kind of backup?
If it is BR, it does not affect TiDB. If it is Dumpling, then you are in trouble…

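...It is Dumpling.
Restarting the service on the faulty node would probably work. I held off so that the official team could investigate more easily, and to see whether there is a more convenient way to recover.

So the official recommendation is not to use Dumpling for backup and export? This tool has been replaced many times: mydumper and loader in 2.0, then Dumpling and loader in 4.0.

BR does not affect TiDB, but I don't know whether it affects PD. As I see it, a TiDB node failure is much simpler to handle than a PD failure. For the TiDB side, I can bring up a separate TiDB node just for Dumpling to use.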

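No. Dumpling suits cases where the data volume is small and a format conversion is needed, because what it exports is SQL or CSV.

The data is read from TiKV and then processed by TiDB, which explains why the TiDB node ran into trouble.

So the key is to control the export volume and make avoiding business impact the priority.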
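One way to control the export volume is to limit Dumpling's concurrency and have it split tables by row count. The sketch below only assembles a command line and does not run anything. It uses Dumpling's `-h`, `-P`, `-u`, `-o`, `-t` (threads), and `-r` (rows per chunk) options. The helper name, host name, output directory, and default values are assumptions for illustration, and whether these settings avoid the spill seen in this cluster has not been verified.

```python
import shlex


def build_dumpling_cmd(host, port, user, out_dir, threads=4, rows_per_chunk=200000):
    """Assemble a Dumpling command line (hypothetical helper, values are examples).

    -t limits the number of concurrent export threads.
    -r splits each table into chunks of roughly this many rows.
    """
    if threads < 1 or rows_per_chunk < 1:
        raise ValueError("threads and rows_per_chunk must be positive")
    return [
        "dumpling",
        "-h", host,
        "-P", str(port),
        "-u", user,
        "-o", out_dir,
        "-t", str(threads),
        "-r", str(rows_per_chunk),
    ]


def to_shell(cmd):
    """Render the argument list as a shell-safe string for review."""
    return " ".join(shlex.quote(arg) for arg in cmd)


if __name__ == "__main__":
    # "tidb-backup-node" and "/data/backup" are placeholders, not from the thread.
    print(to_shell(build_dumpling_cmd("tidb-backup-node", 4000, "root", "/data/backup")))
```

Pointing the host at a TiDB node that serves no business traffic keeps any memory pressure from the export away from the load-balanced nodes.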

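A BR backup does not affect PD much, but there is currently one drawback: the checksum step affects PD and TiKV. A later version will probably fix this, so you will need to wait for that.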

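OK, I will bring up a separate TiDB node for backups first.
I will restart the faulty node for now.

When backing up with Dumpling, an error about insufficient disk space in /tmp is reported. Could you please take a look: "Dumpling backup reports insufficient /tmp space"

That is a separate topic~ I will look at it later, haha :nerd_face: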
