3 TiKV nodes; one suddenly went down, and 3 attempts to start it all failed

【TiDB Environment】
【Overview】: 3 TiKV nodes; one suddenly went down, and 3 attempts to start it all failed
【Background】: No operations had been performed; it just went down suddenly
【Symptom】: TiDB can still be connected to; only one TiKV is down
【Problem】:
【Business Impact】:
【TiDB Version】:
【Attachments】:


{
	"error": true,
	"message": "error.api.other: record not found",
	"code": "error.api.other",
	"full_text": "error.api.other: record not found\
 at github.com/pingcap/tidb-dashboard/pkg/apiserver/utils.NewAPIError()\
\t/nfs/cache/mod/github.com/pingcap/tidb-dashboard@v0.0.0-20210826074103-29034af68525/pkg/apiserver/utils/error.go:67\
 at github.com/pingcap/tidb-dashboard/pkg/apiserver/utils.MWHandleErrors.func1()\
\t/nfs/cache/mod/github.com/pingcap/tidb-dashboard@v0.0.0-20210826074103-29034af68525/pkg/apiserver/utils/error.go:96\
 at github.com/gin-gonic/gin.(*Context).Next()\
\t/nfs/cache/mod/github.com/gin-gonic/gin@v1.5.0/context.go:147\
 at github.com/gin-contrib/gzip.Gzip.func2()\
\t/nfs/cache/mod/github.com/gin-contrib/gzip@v0.0.1/gzip.go:47\
 at github.com/gin-gonic/gin.(*Context).Next()\
\t/nfs/cache/mod/github.com/gin-gonic/gin@v1.5.0/context.go:147\
 at github.com/gin-gonic/gin.RecoveryWithWriter.func1()\
\t/nfs/cache/mod/github.com/gin-gonic/gin@v1.5.0/recovery.go:83\
 at github.com/gin-gonic/gin.(*Context).Next()\
\t/nfs/cache/mod/github.com/gin-gonic/gin@v1.5.0/context.go:147\
 at github.com/gin-gonic/gin.(*Engine).handleHTTPRequest()\
\t/nfs/cache/mod/github.com/gin-gonic/gin@v1.5.0/gin.go:403\
 at github.com/gin-gonic/gin.(*Engine).ServeHTTP()\
\t/nfs/cache/mod/github.com/gin-gonic/gin@v1.5.0/gin.go:364\
 at github.com/pingcap/tidb-dashboard/pkg/apiserver.(*Service).handler()\
\t/nfs/cache/mod/github.com/pingcap/tidb-dashboard@v0.0.0-20210826074103-29034af68525/pkg/apiserver/apiserver.go:208\
 at net/http.HandlerFunc.ServeHTTP()\
\t/usr/local/go/src/net/http/server.go:2069\
 at github.com/pingcap/tidb-dashboard/pkg/utils.(*ServiceStatus).NewStatusAwareHandler.func1()\
\t/nfs/cache/mod/github.com/pingcap/tidb-dashboard@v0.0.0-20210826074103-29034af68525/pkg/utils/service_status.go:79\
 at net/http.HandlerFunc.ServeHTTP()\
\t/usr/local/go/src/net/http/server.go:2069\
 at net/http.(*ServeMux).ServeHTTP()\
\t/usr/local/go/src/net/http/server.go:2448\
 at go.etcd.io/etcd/embed.(*accessController).ServeHTTP()\
\t/nfs/cache/mod/go.etcd.io/etcd@v0.5.0-alpha.5.0.20191023171146-3cf2f69b5738/embed/serve.go:359\
 at net/http.serverHandler.ServeHTTP()\
\t/usr/local/go/src/net/http/server.go:2887\
 at net/http.(*conn).serve()\
\t/usr/local/go/src/net/http/server.go:1952\
 at runtime.goexit()\
\t/usr/local/go/src/runtime/asm_amd64.s:1371"
}

Please upload the problem logs from the crashed TiKV instance, covering the 10 minutes before it went down.

Hi, this TiKV seems to have been down since before 11:22 on November 4. I've uploaded all the logs; please help take a look at what the problem is.

tikv_stderr.log (1 byte) tikv.log (2.2 MB) tikv.log.2021-11-02-14_41_21.777108889.log (498.2 KB) tikv.log.2021-11-03-14_41_23.223933191.log (233.7 KB) tikv.log.2021-11-04-14_41_31.790983859.log (2.4 MB) tikv.log.2021-11-10-16_53_43.217172911.log (2.1 MB) tikv.log.2021-11-15-15_42_53.766639200.log (22.4 KB) tikv.log.2021-11-16-15_43_01.586074849.log (2.3 MB) tikv.log.2021-11-17-15_43_03.580820283.log (2.3 MB) tikv.log.2021-11-18-15_43_16.906116324.log (2.3 MB) tikv.log.2021-11-19-15_43_29.340019826.log (2.3 MB) tikv.log.2021-11-20-15_43_29.574089180.log (2.3 MB) tikv.log.2021-11-21-15_43_41.887345523.log (2.3 MB)

Confirm the disk space on this TiKV node and whether it can be written to.
Also, at what times did you make the 3 start attempts?
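For a quick check on both points, something along these lines should do (the data directory /data/tidb-data/tikv-20160 and the deploy user tidb are assumptions; adjust to your topology):

# free space on the filesystem that holds the TiKV data directory
df -h /data/tidb-data/tikv-20160

# confirm the deploy user can actually create files in it
sudo -u tidb touch /data/tidb-data/tikv-20160/.write_test && echo writable
sudo -u tidb rm -f /data/tidb-data/tikv-20160/.write_test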

Write permission is fine: [screenshot]

The 3 restart attempts were between 15:20 and 16:00 on November 19: [screenshot]

Check the remaining filesystem space on the host where the TiKV that won't start is located (df -h).

There is plenty of free space:

I looked at the TiKV log; it seems initialization fails because a MANIFEST file cannot be found.

[2021/11/22 14:14:34.260 +08:00] [FATAL] [server.rs:1249] ["failed to create kv engine: Storage Engine IO error: No such file or directory While opening a file for sequentially reading: /data/tidb-data/tikv-20160/db/MANIFEST-044150: No such file or directory"]

Check whether this file (/data/tidb-data/tikv-20160/db/MANIFEST-044150) exists on the other two healthy TiKV nodes.
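A rough way to look (paths taken from the error above; note that each node's RocksDB keeps its own MANIFEST, so the numbering will differ between nodes):

# on each healthy TiKV host, list the RocksDB MANIFEST files and the CURRENT pointer
ls -l /data/tidb-data/tikv-20160/db/MANIFEST-* /data/tidb-data/tikv-20160/db/CURRENT

# CURRENT records which MANIFEST the local RocksDB instance is using
cat /data/tidb-data/tikv-20160/db/CURRENT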

If it's urgent, just delete this TiKV node, and after it has gone offline add it back in again.
Reference: https://docs.pingcap.com/zh/tidb/stable/scale-tidb-using-tiup#缩容-tidbpdtikv-节点
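Roughly, the tiup flow would be as follows (cluster name and node address are placeholders; wait until the store shows as Tombstone before scaling it back out):

# remove the broken TiKV from the cluster
tiup cluster scale-in <cluster-name> --node <tikv-214-ip>:20160

# watch its status until the store becomes Tombstone and disappears
tiup cluster display <cluster-name>

# then add it back with a scale-out topology file
tiup cluster scale-out <cluster-name> scale-out-tikv.yaml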

Thanks a lot. Why would this file go missing? Is there any particular cause? If this ever happens in production, it will be a real problem. Can I copy the MANIFEST file from another TiKV node onto the disk of the 214 node?

Is reinstalling this node the only option? Can the MANIFEST file from another TiKV node be copied to the 214 node's disk?

The MANIFEST file belongs to RocksDB (I'm not very familiar with it, so I'm not sure whether it contains something like a header checksum).
In theory, if the file has been deleted, TiDB will automatically replenish the replicas, but for the implementation details you'd probably have to ask the development folks.
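From PD's side you can sanity-check that replicas are indeed being replenished (pd-ctl via tiup ctl; the version number and PD address below are placeholders):

# confirm the replica count (normally max-replicas = 3)
tiup ctl:<version> pd -u http://<pd-ip>:2379 config show replication

# list regions that are currently short of a replica
tiup ctl:<version> pd -u http://<pd-ip>:2379 region check miss-peer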

Thanks :handshake:, I'll reinstall this TiKV node. As for exactly why the file was lost, I'll leave that for now.

No need to reinstall; just kick the TiKV out of the cluster and add it back in, and TiKV will automatically replenish the replicas :joy:

I'll give it a try.

Awkward... another TiKV node stopped by itself at 4 a.m. this morning :sweat_smile:


The log says:
[2021/11/23 08:12:30.308 +08:00] [ERROR] [server.rs:1030] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]
[2021/11/23 08:12:30.764 +08:00] [FATAL] [server.rs:1231] ["failed to create raft engine: Storage Engine IO error: No space left on deviceWhile appending to file: /data/tidb-data/tikv-20160/raft/376319.sst: No space left on device"]

:joy: I just searched around and found that the trick of using curl to force a TiKV into Tombstone no longer works on version 5 and above.
TiKV repair reference:
https://asktug.com/t/topic/400
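On v5 and later, a commonly used alternative is to go through pd-ctl and let PD move the store to Tombstone itself (version, PD address and store ID are placeholders):

# find the store ID of the down TiKV
tiup ctl:<version> pd -u http://<pd-ip>:2379 store

# offline the store; once its regions are replenished elsewhere it becomes Tombstone
tiup ctl:<version> pd -u http://<pd-ip>:2379 store delete <store-id>

# finally clean up the Tombstone record
tiup ctl:<version> pd -u http://<pd-ip>:2379 store remove-tombstone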

It's reporting that there's no space left; check the remaining filesystem space.

failed to create raft engine: Storage Engine IO error: No space left on deviceWhile appending to file: /data/tidb-data/tikv-20160/raft/376319.sst: No space left on device
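To see what is actually eating the disk on that host, something like this helps (the data directory is assumed from the log above):

# overall usage of the filesystem holding the TiKV data
df -h /data

# size per TiKV subdirectory (kv RocksDB, raft engine, snapshots, ...)
du -sh /data/tidb-data/tikv-20160/*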

Could it be that 214 going down caused all its data to be migrated to 216?
[screenshot]

Probably. You can look at the historical data in Grafana to see how much space each node's host used back when all 3 nodes were up, and estimate from that.
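Besides Grafana, PD's own view of per-store space is a quick cross-check (version and PD address are placeholders):

# capacity, available space and region size per store, as reported to PD
tiup ctl:<version> pd -u http://<pd-ip>:2379 store | grep -E '"address"|"capacity"|"available"|"region_size"'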