To help resolve this faster, please provide the following information; a clear problem description gets a quicker answer:
【TiDB Version】v4.0.11
【Problem Description】
Additional note: the cluster was upgraded to this version with an offline tiup upgrade.
Today, after powering on the virtual machine, I found that one TiKV node would not start (the other service nodes on the same server start normally). The error messages in the TiKV log are as follows:
[2021/05/13 04:32:42.905 -04:00] [ERROR] [server.rs:854] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]
[2021/05/13 04:32:47.925 -04:00] [FATAL] [lib.rs:465] ["called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: \"No such file or directory\" }"] [backtrace="stack backtrace:\
0: tikv_util::set_panic_hook::{{closure}}
at /home/jenkins/agent/workspace/build_tikv_multi_branch_v5.0.1/tikv/components/tikv_util/src/lib.rs:464
1: std::panicking::rust_panic_with_hook
at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35//library/std/src/panicking.rs:595
2: std::panicking::begin_panic_handler::{{closure}}
at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35//library/std/src/panicking.rs:497
3: std::sys_common::backtrace::__rust_end_short_backtrace
at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35//library/std/src/sys_common/backtrace.rs:141
4: rust_begin_unwind
at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35//library/std/src/panicking.rs:493
5: core::panicking::panic_fmt
at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35//library/core/src/panicking.rs:92
6: core::option::expect_none_failed
at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35//library/core/src/option.rs:1266
7: core::result::Result<T,E>::unwrap
at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35/library/core/src/result.rs:969
cmd::server::TiKVServer::init_fs
at /home/jenkins/agent/workspace/build_tikv_multi_branch_v5.0.1/tikv/cmd/src/server.rs:373
cmd::server::run_tikv
at /home/jenkins/agent/workspace/build_tikv_multi_branch_v5.0.1/tikv/cmd/src/server.rs:133
8: tikv_server::main
at /home/jenkins/agent/workspace/build_tikv_multi_branch_v5.0.1/tikv/cmd/src/bin/tikv-server.rs:181
9: core::ops::function::FnOnce::call_once
at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35/library/core/src/ops/function.rs:227
std::sys_common::backtrace::__rust_begin_short_backtrace
at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35/library/std/src/sys_common/backtrace.rs:125
10: main
11: __libc_start_main
12:
"] [location=cmd/src/server.rs:385] [thread_name=main]
The log keeps cycling through the TiKV startup messages and this error.
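For what it's worth, the "failed to init io snooper" line is only an ERROR; the process actually exits on the FATAL panic from cmd::server::TiKVServer::init_fs (location=cmd/src/server.rs:385), i.e. an OS-level "No such file or directory" while initializing the file system / data directory. A minimal check sketch, assuming the data directory /data1/deploy/data shown in the cluster display below and a tiup-generated config at /data1/deploy/conf/tikv.toml (that config path is an assumption; adjust to your deployment):

ls -ld /data1/deploy/data                            # does the TiKV data directory still exist and is it accessible?
df -h /data1/deploy/data                             # free space on the mount point the data directory lives on
grep -i reserve-space /data1/deploy/conf/tikv.toml   # storage.reserve-space, if configured; TiKV keeps this much space reserved on the data disk at startup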
Searching asktug for "IO snooper is not started due to not compiling with BCC", most of the answers point to a disk problem.
So I checked the virtual machine's disk:
1. Checked whether the disk can be read and written normally. Result: it can.
2. Checked whether the available disk space was too small. Disk usage was at 77%, which seemed acceptable, but I decided to try deleting some large files anyway.
After deleting a few GB of files I ran into something strange: according to df -h the used space did not go down but up, with Use% rising from 77% to 98%. Once the files were deleted, the TiKV node came back up on its own.
I then checked the size of every directory under the root and summed the space each one occupies; the total matches the used space reported by df -h. My guess is that before the deletion the VM's disk had somehow been "over-used", which kept TiKV from starting. Why the disk could be "over-used" is still under investigation. (A sketch of the commands behind these checks follows the df output below.)
Before deleting the files:
Filesystem               Type       Size  Used  Avail Use% Mounted on
devtmpfs                 devtmpfs   1.4G     0   1.4G   0% /dev
tmpfs                    tmpfs      1.4G     0   1.4G   0% /dev/shm
tmpfs                    tmpfs      1.4G  9.5M   1.4G   1% /run
tmpfs                    tmpfs      1.4G     0   1.4G   0% /sys/fs/cgroup
/dev/mapper/centos-root  xfs         17G   14G   4.0G  77% /
/dev/sda1                xfs       1014M  150M   865M  15% /boot
tmpfs                    tmpfs      283M     0   283M   0% /run/user/0
tmpfs                    tmpfs      283M     0   283M   0% /run/user/1000
After deleting the files:
Filesystem               Size  Used  Avail Use% Mounted on
devtmpfs                 1.4G     0   1.4G   0% /dev
tmpfs                    1.4G     0   1.4G   0% /dev/shm
tmpfs                    1.4G  9.5M   1.4G   1% /run
tmpfs                    1.4G     0   1.4G   0% /sys/fs/cgroup
/dev/mapper/centos-root   17G   17G   473M  98% /
/dev/sda1               1014M  150M   865M  15% /boot
tmpfs                    283M     0   283M   0% /run/user/0
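A sketch of the commands behind the checks above, for anyone who wants to reproduce them (the lsof check is an assumption about one common cause of a du/df mismatch, namely files deleted while still held open by a process, and may or may not be what happened here; the test file name is arbitrary):

dd if=/dev/zero of=/data1/dd_test bs=1M count=100 oflag=direct && rm -f /data1/dd_test   # step 1: simple write test on the data disk
df -h /                                   # filesystem-level usage, as in the outputs above
du -xsh /* 2>/dev/null | sort -h          # per-directory usage on the root filesystem; summing these is what was compared against df
lsof +L1 | head                           # files deleted but still held open: they count in df but not in du

If lsof shows nothing relevant, the gap more likely comes from the filesystem or storage layer itself, which matches the "still under investigation" note above.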
Cluster information:
[tidb@zk data1]$ tiup cluster display test-cluster
Found cluster newer version:
The latest version: v1.4.3
Local installed version: v1.3.2
Update current component: tiup update cluster
Update all components: tiup update --all
Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.3.2/tiup-cluster display test-cluster
Cluster type: tidb
Cluster name: test-cluster
Cluster version: v4.0.11
SSH type: builtin
Dashboard URL: http://192.168.159.135:2379/dashboard
ID Role Host Ports OS/Arch Status Data Dir Deploy Dir
-- ---- ---- ----- ------- ------ -------- ----------
192.168.159.139:9093 alertmanager 192.168.159.139 9093/9094 linux/x86_64 Up /data1/deploy/data.alertmanager /data1/deploy
192.168.159.136:8249 drainer 192.168.159.136 8249 linux/x86_64 Up data/drainer-8249 /data1/deploy
192.168.159.139:3000 grafana 192.168.159.139 3000 linux/x86_64 Up - /data1/deploy
192.168.159.134:2379 pd 192.168.159.134 2379/2380 linux/x86_64 Up /data1/deploy/data.pd /data1/deploy
192.168.159.135:2379 pd 192.168.159.135 2379/2380 linux/x86_64 Up|L|UI /data1/deploy/data.pd /data1/deploy
192.168.159.139:2379 pd 192.168.159.139 2379/2380 linux/x86_64 Up /data1/deploy/data.pd /data1/deploy
192.168.159.139:9090 prometheus 192.168.159.139 9090 linux/x86_64 Up /data1/deploy/prometheus2.0.0.data.metrics /data1/deploy
192.168.159.135:8250 pump 192.168.159.135 8250 linux/x86_64 Up /data1/deploy/pump/data.pump /data1/deploy/pump
192.168.159.136:8250 pump 192.168.159.136 8250 linux/x86_64 Up /data1/deploy/pump/data.pump /data1/deploy/pump
192.168.159.139:4000 tidb 192.168.159.139 4000/10080 linux/x86_64 Up - /data1/deploy
192.168.159.134:20160 tikv 192.168.159.134 20160/20180 linux/x86_64 Up /data1/deploy/data /data1/deploy
192.168.159.135:20160 tikv 192.168.159.135 20160/20180 linux/x86_64 Up /data1/deploy/data /data1/deploy
192.168.159.139:20160 tikv 192.168.159.139 20160/20180 linux/x86_64 Up /data1/deploy/data /data1/deploy
For performance tuning or troubleshooting questions, please download and run the diagnostic script, then select all of the terminal output, copy it, and paste it here.