A TiKV node fails to start with the error: failed to init io snooper


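【TiDB version】v4.0.11

【Problem description】

Additional note: the cluster was upgraded to this version offline with tiup.

Today I started the virtual machine and found that one TiKV node could not start normally (the other service nodes on the same server started fine). The TiKV log shows the following errors: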

[2021/05/13 04:32:42.905 -04:00] [ERROR] [server.rs:854] ["failed to init io snooper"] [err_code=KV:Unknown] [err="\"IO snooper is not started due to not compiling with BCC\""]

[2021/05/13 04:32:47.925 -04:00] [FATAL] [lib.rs:465] ["called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: \"No such file or directory\" }"] [backtrace="stack backtrace:\
   0: tikv_util::set_panic_hook::{{closure}}\
             at /home/jenkins/agent/workspace/build_tikv_multi_branch_v5.0.1/tikv/components/tikv_util/src/lib.rs:464\
   1: std::panicking::rust_panic_with_hook\
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35//library/std/src/panicking.rs:595\
   2: std::panicking::begin_panic_handler::{{closure}}\
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35//library/std/src/panicking.rs:497\
   3: std::sys_common::backtrace::__rust_end_short_backtrace\
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35//library/std/src/sys_common/backtrace.rs:141\
   4: rust_begin_unwind\
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35//library/std/src/panicking.rs:493\
   5: core::panicking::panic_fmt\
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35//library/core/src/panicking.rs:92\
   6: core::option::expect_none_failed\
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35//library/core/src/option.rs:1266\
   7: core::result::Result<T,E>::unwrap\
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35/library/core/src/result.rs:969\
      cmd::server::TiKVServer<ER>::init_fs\
             at /home/jenkins/agent/workspace/build_tikv_multi_branch_v5.0.1/tikv/cmd/src/server.rs:373\
      cmd::server::run_tikv\
             at /home/jenkins/agent/workspace/build_tikv_multi_branch_v5.0.1/tikv/cmd/src/server.rs:133\
   8: tikv_server::main\
             at /home/jenkins/agent/workspace/build_tikv_multi_branch_v5.0.1/tikv/cmd/src/bin/tikv-server.rs:181\
   9: core::ops::function::FnOnce::call_once\
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35/library/core/src/ops/function.rs:227\
      std::sys_common::backtrace::__rust_begin_short_backtrace\
             at /rustc/bc39d4d9c514e5fdb40a5782e6ca08924f979c35/library/std/src/sys_common/backtrace.rs:125\
  10: main\
  11: __libc_start_main\
  12: <unknown>\
"] [location=cmd/src/server.rs:385] [thread_name=main]

The log keeps repeating the TiKV startup messages followed by these errors.

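Searching AskTUG for “IO snooper is not started due to not compiling with BCC”, I found that most answers point to a disk problem.

So I checked the virtual machine's disk:

1. Checked whether the disk can be read and written normally. Result: it can.

2. Checked whether free disk space was too low. Disk usage was 77%, which seemed acceptable, but I still tried deleting some large files.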

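After deleting several GB of files I noticed something odd: according to df -h, used space did not shrink. Instead, Use% rose from 77% to 98%. After the files were deleted, the TiKV node was brought up automatically.

I then checked the size of every directory under the root directory and summed them; the total matches the usage shown by df -h. My guess is that before the deletion the VM's disk was somehow “over-used”, which kept TiKV from starting. Why the disk would be “over-used” is still under investigation.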

Before deleting the files:
    Filesystem              Type      Size  Used Avail Use% Mounted on
    devtmpfs                devtmpfs  1.4G     0  1.4G   0% /dev
    tmpfs                   tmpfs     1.4G     0  1.4G   0% /dev/shm
    tmpfs                   tmpfs     1.4G  9.5M  1.4G   1% /run
    tmpfs                   tmpfs     1.4G     0  1.4G   0% /sys/fs/cgroup
    /dev/mapper/centos-root xfs        17G   14G  4.0G  77% /
    /dev/sda1               xfs      1014M  150M  865M  15% /boot
    tmpfs                   tmpfs     283M     0  283M   0% /run/user/0
    tmpfs                   tmpfs     283M     0  283M   0% /run/user/1000

After deleting the files:

    Filesystem               Size  Used Avail Use% Mounted on
    devtmpfs                 1.4G     0  1.4G   0% /dev
    tmpfs                    1.4G     0  1.4G   0% /dev/shm
    tmpfs                    1.4G  9.5M  1.4G   1% /run
    tmpfs                    1.4G     0  1.4G   0% /sys/fs/cgroup
    /dev/mapper/centos-root   17G   17G  473M  98% /
    /dev/sda1               1014M  150M  865M  15% /boot
    tmpfs                    283M     0  283M   0% /run/user/0

Cluster information:

    [tidb@zk data1]$ tiup cluster display test-cluster
    Found cluster newer version:

            The latest version:         v1.4.3
            Local installed version:    v1.3.2
            Update current component:   tiup update cluster
            Update all components:      tiup update --all

    Starting component `cluster`: /home/tidb/.tiup/components/cluster/v1.3.2/tiup-cluster display test-cluster
    Cluster type:       tidb
    Cluster name:       test-cluster
    Cluster version:    v4.0.11
    SSH type:           builtin
    Dashboard URL:      http://192.168.159.135:2379/dashboard
    ID                     Role          Host             Ports        OS/Arch       Status   Data Dir                                    Deploy Dir
    --                     ----          ----             -----        -------       ------   --------                                    ----------
    192.168.159.139:9093   alertmanager  192.168.159.139  9093/9094    linux/x86_64  Up       /data1/deploy/data.alertmanager             /data1/deploy
    192.168.159.136:8249   drainer       192.168.159.136  8249         linux/x86_64  Up       data/drainer-8249                           /data1/deploy
    192.168.159.139:3000   grafana       192.168.159.139  3000         linux/x86_64  Up       -                                           /data1/deploy
    192.168.159.134:2379   pd            192.168.159.134  2379/2380    linux/x86_64  Up       /data1/deploy/data.pd                       /data1/deploy
    192.168.159.135:2379   pd            192.168.159.135  2379/2380    linux/x86_64  Up|L|UI  /data1/deploy/data.pd                       /data1/deploy
    192.168.159.139:2379   pd            192.168.159.139  2379/2380    linux/x86_64  Up       /data1/deploy/data.pd                       /data1/deploy
    192.168.159.139:9090   prometheus    192.168.159.139  9090         linux/x86_64  Up       /data1/deploy/prometheus2.0.0.data.metrics  /data1/deploy
    192.168.159.135:8250   pump          192.168.159.135  8250         linux/x86_64  Up       /data1/deploy/pump/data.pump                /data1/deploy/pump
    192.168.159.136:8250   pump          192.168.159.136  8250         linux/x86_64  Up       /data1/deploy/pump/data.pump                /data1/deploy/pump
    192.168.159.139:4000   tidb          192.168.159.139  4000/10080   linux/x86_64  Up       -                                           /data1/deploy
    192.168.159.134:20160  tikv          192.168.159.134  20160/20180  linux/x86_64  Up       /data1/deploy/data                          /data1/deploy
    192.168.159.135:20160  tikv          192.168.159.135  20160/20180  linux/x86_64  Up       /data1/deploy/data                          /data1/deploy
    192.168.159.139:20160  tikv          192.168.159.139  20160/20180  linux/x86_64  Up       /data1/deploy/data                          /data1/deploy


  1. By default, reserve-space takes up 5 GB, which is why the node cannot start. If this is only a test environment, you can set this parameter to 0.

    https://docs.pingcap.com/zh/tidb/stable/tikv-configuration-file#reserve-space

  2. For a production environment, we recommend expanding the disk and installing according to the standard configuration:
    https://docs.pingcap.com/zh/tidb/stable/hardware-and-software-requirements
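
The first point implies a simple precondition: the filesystem holding the TiKV data directory needs at least reserve-space bytes free. The df output in the question shows 4.0G available before the deletion, which is below 5 GB. A minimal Python sketch of that check follows. The function name has_room_for_reserve is hypothetical, the 5 GiB default is taken from this reply rather than from TiKV's source, and this is not how TiKV itself performs the check.

```python
import shutil

GIB = 1024 ** 3


def has_room_for_reserve(data_dir, reserve_bytes=5 * GIB):
    """Return True if the filesystem holding data_dir has at least
    reserve_bytes available to the calling user."""
    return shutil.disk_usage(data_dir).free >= reserve_bytes
```

For the cluster in the question, the directory to pass would be /data1/deploy/data on the affected host. A result of False would match the state before the files were deleted.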


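It is a test environment; thanks for the reply!

Understood: before the files were deleted, the disk did not have enough free space for TiKV to start, so TiKV could not start. After a few GB of data files were deleted, TiKV created the temporary file space_placeholder_file (which takes up 5 GB of disk space) and the service was brought up automatically. That is why disk usage appeared to go up rather than down the second time I checked.
Thank you very much for the guidance!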
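
To confirm this explanation on a node, one could look for the placeholder file and subtract its size from the used space. The Python sketch below assumes the file sits directly in the TiKV data directory (/data1/deploy/data in the cluster listing above); the function names are hypothetical.

```python
import os

PLACEHOLDER = "space_placeholder_file"  # file name as given in this thread


def placeholder_bytes(data_dir):
    """Return the size of the placeholder file in data_dir, or 0 if absent."""
    try:
        return os.stat(os.path.join(data_dir, PLACEHOLDER)).st_size
    except FileNotFoundError:
        return 0


def used_without_placeholder(used_bytes, data_dir):
    """Subtract the placeholder size from a df-style used-bytes figure."""
    return used_bytes - placeholder_bytes(data_dir)
```

With the rounded numbers from the df output, 17G used minus the 5 GB placeholder leaves about 12G, which is below the 14G shown before the deletion and consistent with a few GB having been freed.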

