TiKV node fails to start

To improve efficiency, please provide the following information; a clearly described problem gets resolved faster:

[TiDB Version]
v4.0.11
[Problem Description]

The TiKV node fails to start.

  • Cluster topology
global:
  user: tidb
  ssh_port: 22
  ssh_type: builtin
  deploy_dir: /tidb-deploy
  data_dir: /tidb-data
  os: linux
  arch: amd64
monitored:
  node_exporter_port: 9100
  blackbox_exporter_port: 9115
  deploy_dir: /tidb-deploy/monitor-9100
  data_dir: /tidb-data/monitor-9100
  log_dir: /tidb-deploy/monitor-9100/log
server_configs:
  tidb:
    alter-primary-key: false
    binlog.enable: true
    binlog.ignore-error: true
    enable-telemetry: false
    log.enable-slow-log: true
    log.file.max-backups: 7
    log.file.max-days: 7
    log.slow-threshold: 200
    prepared-plan-cache.enabled: true
    tikv-client.copr-cache.enable: true
  tikv: {}
  pd: {}
  tiflash: {}
  tiflash-learner: {}
  pump: {}
  drainer: {}
  cdc: {}
tidb_servers:
- host: 172.16.12.171
  ssh_port: 22
  port: 4000
  status_port: 10080
  deploy_dir: /tidb-deploy/tidb-4000
  arch: amd64
  os: linux
- host: 172.16.12.213
  ssh_port: 22
  port: 4000
  status_port: 10080
  deploy_dir: /tidb-deploy/tidb-4000
  arch: amd64
  os: linux
- host: 172.16.12.208
  ssh_port: 22
  port: 4000
  status_port: 10080
  deploy_dir: /tidb-deploy/tidb-4000
  arch: amd64
  os: linux
tikv_servers:
- host: 172.16.12.190
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /tidb-deploy/tikv-20160
  data_dir: /tidb-data/tikv-20160
  arch: amd64
  os: linux
- host: 172.16.12.176
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /tidb-deploy/tikv-20160
  data_dir: /tidb-data/tikv-20160
  arch: amd64
  os: linux
- host: 172.16.12.138
  ssh_port: 22
  port: 20160
  status_port: 20180
  deploy_dir: /tidb-deploy/tikv-20160
  data_dir: /tidb-data/tikv-20160
  arch: amd64
  os: linux
tiflash_servers: []
pd_servers:
- host: 172.16.12.128
  ssh_port: 22
  name: pd-172.16.12.128-2379
  client_port: 2379
  peer_port: 2380
  deploy_dir: /tidb-deploy/pd-2379
  data_dir: /tidb-data/pd-2379
  arch: amd64
  os: linux
- host: 172.16.12.217
  ssh_port: 22
  name: pd-172.16.12.217-2379
  client_port: 2379
  peer_port: 2380
  deploy_dir: /tidb-deploy/pd-2379
  data_dir: /tidb-data/pd-2379
  arch: amd64
  os: linux
- host: 172.16.12.150
  ssh_port: 22
  name: pd-172.16.12.150-2379
  client_port: 2379
  peer_port: 2380
  deploy_dir: /tidb-deploy/pd-2379
  data_dir: /tidb-data/pd-2379
  arch: amd64
  os: linux
pump_servers:
- host: 172.16.12.123
  ssh_port: 22
  port: 8250
  deploy_dir: /tidb-deploy/pump-8250
  data_dir: /tidb-data/pump-8250
  arch: amd64
  os: linux
- host: 172.16.12.142
  ssh_port: 22
  port: 8250
  deploy_dir: /tidb-deploy/pump-8250
  data_dir: /tidb-data/pump-8250
  arch: amd64
  os: linux
- host: 172.16.12.161
  ssh_port: 22
  port: 8250
  deploy_dir: /tidb-deploy/pump-8250
  data_dir: /tidb-data/pump-8250
  arch: amd64
  os: linux
drainer_servers:
- host: 172.16.12.216
  ssh_port: 22
  port: 8249
  deploy_dir: /tidb-deploy/drainer-8249
  data_dir: /tidb-data/drainer-8249
  config:
    syncer.db-type: tidb
    syncer.to.host: 172.16.12.171
    syncer.to.password: 
    syncer.to.port: 4000
    syncer.to.user: root
  arch: amd64
  os: linux
monitoring_servers:
- host: 172.16.12.216
  ssh_port: 22
  port: 9090
  deploy_dir: /tidb-deploy/prometheus-9090
  data_dir: /tidb-data/prometheus-9090
  arch: amd64
  os: linux
grafana_servers:
- host: 172.16.12.216
  ssh_port: 22
  port: 3000
  deploy_dir: /tidb-deploy/grafana-3000
  arch: amd64
  os: linux
  username: admin
  password: admin
alertmanager_servers:
- host: 172.16.12.216
  ssh_port: 22
  web_port: 9093
  cluster_port: 9094
  deploy_dir: /tidb-deploy/alertmanager-9093
  data_dir: /tidb-data/alertmanager-9093
  arch: amd64
  os: linux
  • Current cluster status
tiup cluster display daddylab-tidb-cluster
Found cluster newer version:

    The latest version:         v1.3.5
    Local installed version:    v1.3.4
    Update current component:   tiup update cluster
    Update all components:      tiup update --all

Starting component `cluster`: /root/.tiup/components/cluster/v1.3.4/tiup-cluster display daddylab-tidb-cluster
Cluster type:       tidb
Cluster name:       daddylab-tidb-cluster
Cluster version:    v4.0.11
SSH type:           builtin
Dashboard URL:      http://172.16.12.128:2379/dashboard
ID                   Role          Host           Ports        OS/Arch       Status  Data Dir                      Deploy Dir
--                   ----          ----           -----        -------       ------  --------                      ----------
172.16.12.216:9093   alertmanager  172.16.12.216  9093/9094    linux/x86_64  Up      /tidb-data/alertmanager-9093  /tidb-deploy/alertmanager-9093
172.16.12.216:8249   drainer       172.16.12.216  8249         linux/x86_64  Up      /tidb-data/drainer-8249       /tidb-deploy/drainer-8249
172.16.12.216:3000   grafana       172.16.12.216  3000         linux/x86_64  Up      -                             /tidb-deploy/grafana-3000
172.16.12.128:2379   pd            172.16.12.128  2379/2380    linux/x86_64  Up|UI   /tidb-data/pd-2379            /tidb-deploy/pd-2379
172.16.12.150:2379   pd            172.16.12.150  2379/2380    linux/x86_64  Up|L    /tidb-data/pd-2379            /tidb-deploy/pd-2379
172.16.12.217:2379   pd            172.16.12.217  2379/2380    linux/x86_64  Up      /tidb-data/pd-2379            /tidb-deploy/pd-2379
172.16.12.216:9090   prometheus    172.16.12.216  9090         linux/x86_64  Up      /tidb-data/prometheus-9090    /tidb-deploy/prometheus-9090
172.16.12.123:8250   pump          172.16.12.123  8250         linux/x86_64  Up      /tidb-data/pump-8250          /tidb-deploy/pump-8250
172.16.12.142:8250   pump          172.16.12.142  8250         linux/x86_64  Up      /tidb-data/pump-8250          /tidb-deploy/pump-8250
172.16.12.161:8250   pump          172.16.12.161  8250         linux/x86_64  Up      /tidb-data/pump-8250          /tidb-deploy/pump-8250
172.16.12.171:4000   tidb          172.16.12.171  4000/10080   linux/x86_64  Up      -                             /tidb-deploy/tidb-4000
172.16.12.208:4000   tidb          172.16.12.208  4000/10080   linux/x86_64  Up      -                             /tidb-deploy/tidb-4000
172.16.12.213:4000   tidb          172.16.12.213  4000/10080   linux/x86_64  Up      -                             /tidb-deploy/tidb-4000
172.16.12.138:20160  tikv          172.16.12.138  20160/20180  linux/x86_64  Up      /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
172.16.12.176:20160  tikv          172.16.12.176  20160/20180  linux/x86_64  Down    /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
172.16.12.190:20160  tikv          172.16.12.190  20160/20180  linux/x86_64  Up      /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
Total nodes: 16
  • Error log from TiKV node 172.16.12.176:20160
[2021/03/12 17:03:18.465 +08:00] [FATAL] [server.rs:303] ["panic_mark_file /tidb-data/tikv-20160/panic_mark_file exists, there must be something wrong with the db."]

Repair attempts so far and their results

  • Deleted /tidb-data/tikv-20160/panic_mark_file -------- still fails to start
  • Ran on the Down node:

[root@db-cluster-tikv2 ~]# ./tikv-ctl --db /tidb-data/tikv-20160/db/ bad-regions
all regions are healthy

  • Ran on the Up nodes:

[root@db-cluster-tikv1 ~]# ./tikv-ctl --db /tidb-data/tikv-20160/db/ bad-regions
thread 'main' panicked at 'called Result::unwrap() on an Err value: RocksDb("IO error: While lock file: /tidb-data/tikv-20160/db/LOCK: Resource temporarily unavailable")', src/libcore/result.rs:1188:5
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace.

[root@db-cluster-tikv3 ~]# ./tikv-ctl --db /tidb-data/tikv-20160/db/ bad-regions
thread 'main' panicked at 'called Result::unwrap() on an Err value: RocksDb("IO error: While lock file: /tidb-data/tikv-20160/db/LOCK: Resource temporarily unavailable")', src/libcore/result.rs:1188:5
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace.
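The errors on the Up nodes are expected: `tikv-ctl --db` runs in local mode, opening RocksDB directly, so it cannot acquire /tidb-data/tikv-20160/db/LOCK while the tikv-server process still holds it. A minimal sketch of inspecting an Up node (the node ID here is illustrative; stop the instance first so the lock is released, then restart it):

# stop only this TiKV instance so RocksDB releases the LOCK file
tiup cluster stop daddylab-tidb-cluster -N 172.16.12.190:20160
# local mode can now open the db directory
./tikv-ctl --db /tidb-data/tikv-20160/db/ bad-regions
# bring the instance back up
tiup cluster start daddylab-tidb-cluster -N 172.16.12.190:20160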

Please help!!!


Please provide the complete tikv.log file for review.
If the file is too large, compress it and upload it to Baidu Netdisk.

What operations were performed before the problem occurred?

It may have been caused by several failed tidb-lightning data imports whose failure causes I never dealt with.
Thank you very much. The log is attached:

tikv.log.xz (624.0 KB)
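Regarding the failed tidb-lightning imports mentioned above: a failed import normally leaves checkpoints behind that should be cleared before retrying or abandoning the run, otherwise later imports can misbehave. A sketch, assuming tidb-lightning-ctl is available (the config path is hypothetical):

# wipe the checkpoints of all tables whose import failed
tidb-lightning-ctl --config tidb-lightning.toml --checkpoint-error-destroy=all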

The down store has been deleted via pd-ctl.
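For reference, that removal usually goes like the following sketch (the store ID 4 is hypothetical; read the real one from the `store` listing first):

# find the ID of the down store (172.16.12.176:20160)
pd-ctl -u http://172.16.12.128:2379 store
# mark it for removal; once its regions are migrated away, the store becomes Tombstone
pd-ctl -u http://172.16.12.128:2379 store delete 4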
Current cluster status: the formerly Down node is now Tombstone.

tiup cluster display  daddylab-tidb-cluster
Found cluster newer version:

    The latest version:         v1.3.5
    Local installed version:    v1.3.4
    Update current component:   tiup update cluster
    Update all components:      tiup update --all

Starting component `cluster`: /root/.tiup/components/cluster/v1.3.4/tiup-cluster display daddylab-tidb-cluster
Cluster type:       tidb
Cluster name:       daddylab-tidb-cluster
Cluster version:    v4.0.11
SSH type:           builtin
Dashboard URL:      http://172.16.12.128:2379/dashboard
ID                   Role          Host           Ports        OS/Arch       Status     Data Dir                      Deploy Dir
--                   ----          ----           -----        -------       ------     --------                      ----------
172.16.12.216:9093   alertmanager  172.16.12.216  9093/9094    linux/x86_64  inactive   /tidb-data/alertmanager-9093  /tidb-deploy/alertmanager-9093
172.16.12.216:8249   drainer       172.16.12.216  8249         linux/x86_64  Up         /tidb-data/drainer-8249       /tidb-deploy/drainer-8249
172.16.12.216:3000   grafana       172.16.12.216  3000         linux/x86_64  inactive   -                             /tidb-deploy/grafana-3000
172.16.12.128:2379   pd            172.16.12.128  2379/2380    linux/x86_64  Up|L|UI    /tidb-data/pd-2379            /tidb-deploy/pd-2379
172.16.12.150:2379   pd            172.16.12.150  2379/2380    linux/x86_64  Up         /tidb-data/pd-2379            /tidb-deploy/pd-2379
172.16.12.217:2379   pd            172.16.12.217  2379/2380    linux/x86_64  Up         /tidb-data/pd-2379            /tidb-deploy/pd-2379
172.16.12.216:9090   prometheus    172.16.12.216  9090         linux/x86_64  inactive   /tidb-data/prometheus-9090    /tidb-deploy/prometheus-9090
172.16.12.123:8250   pump          172.16.12.123  8250         linux/x86_64  Up         /tidb-data/pump-8250          /tidb-deploy/pump-8250
172.16.12.142:8250   pump          172.16.12.142  8250         linux/x86_64  Up         /tidb-data/pump-8250          /tidb-deploy/pump-8250
172.16.12.161:8250   pump          172.16.12.161  8250         linux/x86_64  Up         /tidb-data/pump-8250          /tidb-deploy/pump-8250
172.16.12.171:4000   tidb          172.16.12.171  4000/10080   linux/x86_64  Up         -                             /tidb-deploy/tidb-4000
172.16.12.208:4000   tidb          172.16.12.208  4000/10080   linux/x86_64  Up         -                             /tidb-deploy/tidb-4000
172.16.12.213:4000   tidb          172.16.12.213  4000/10080   linux/x86_64  Up         -                             /tidb-deploy/tidb-4000
172.16.12.138:20160  tikv          172.16.12.138  20160/20180  linux/x86_64  Up         /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
172.16.12.176:20160  tikv          172.16.12.176  20160/20180  linux/x86_64  Tombstone  /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
172.16.12.190:20160  tikv          172.16.12.190  20160/20180  linux/x86_64  Up         /tidb-data/tikv-20160         /tidb-deploy/tikv-20160

Could you provide the faulty node's /var/log/messages log and the output of `dmesg -T` for review?
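A sketch of collecting these on the faulty node (the output file names are arbitrary):

dmesg -T > dmesg.log
cp /var/log/messages messages.log
tar cJf node-logs.tar.xz dmesg.log messages.log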

After some messy fiddling, two store nodes have now been taken offline.

tiup cluster display  daddylab-tidb-cluster
Found cluster newer version:

    The latest version:         v1.3.5
    Local installed version:    v1.3.4
    Update current component:   tiup update cluster
    Update all components:      tiup update --all

Starting component `cluster`: /root/.tiup/components/cluster/v1.3.4/tiup-cluster display daddylab-tidb-cluster
Cluster type:       tidb
Cluster name:       daddylab-tidb-cluster
Cluster version:    v4.0.11
SSH type:           builtin
Dashboard URL:      http://172.16.12.128:2379/dashboard
ID                   Role          Host           Ports        OS/Arch       Status     Data Dir                      Deploy Dir
--                   ----          ----           -----        -------       ------     --------                      ----------
172.16.12.216:9093   alertmanager  172.16.12.216  9093/9094    linux/x86_64  inactive   /tidb-data/alertmanager-9093  /tidb-deploy/alertmanager-9093
172.16.12.216:8249   drainer       172.16.12.216  8249         linux/x86_64  Down       /tidb-data/drainer-8249       /tidb-deploy/drainer-8249
172.16.12.216:3000   grafana       172.16.12.216  3000         linux/x86_64  inactive   -                             /tidb-deploy/grafana-3000
172.16.12.128:2379   pd            172.16.12.128  2379/2380    linux/x86_64  Up|UI      /tidb-data/pd-2379            /tidb-deploy/pd-2379
172.16.12.150:2379   pd            172.16.12.150  2379/2380    linux/x86_64  Up         /tidb-data/pd-2379            /tidb-deploy/pd-2379
172.16.12.217:2379   pd            172.16.12.217  2379/2380    linux/x86_64  Up|L       /tidb-data/pd-2379            /tidb-deploy/pd-2379
172.16.12.216:9090   prometheus    172.16.12.216  9090         linux/x86_64  inactive   /tidb-data/prometheus-9090    /tidb-deploy/prometheus-9090
172.16.12.123:8250   pump          172.16.12.123  8250         linux/x86_64  Down       /tidb-data/pump-8250          /tidb-deploy/pump-8250
172.16.12.142:8250   pump          172.16.12.142  8250         linux/x86_64  Down       /tidb-data/pump-8250          /tidb-deploy/pump-8250
172.16.12.161:8250   pump          172.16.12.161  8250         linux/x86_64  Down       /tidb-data/pump-8250          /tidb-deploy/pump-8250
172.16.12.171:4000   tidb          172.16.12.171  4000/10080   linux/x86_64  Down       -                             /tidb-deploy/tidb-4000
172.16.12.208:4000   tidb          172.16.12.208  4000/10080   linux/x86_64  Down       -                             /tidb-deploy/tidb-4000
172.16.12.213:4000   tidb          172.16.12.213  4000/10080   linux/x86_64  Down       -                             /tidb-deploy/tidb-4000
172.16.12.138:20160  tikv          172.16.12.138  20160/20180  linux/x86_64  Up         /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
172.16.12.176:20160  tikv          172.16.12.176  20160/20180  linux/x86_64  Tombstone  /tidb-data/tikv-20160         /tidb-deploy/tikv-20160
172.16.12.190:20160  tikv          172.16.12.190  20160/20180  linux/x86_64  Offline    /tidb-data/tikv-20160         /tidb-deploy/tikv-20160

dir.tar.xz (121.0 KB)

The cluster has been rebuilt :joy:

:joy: Even with only one replica left it can still be recovered. We ran a test on this before; you can refer to it next time.
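For reference, the test mentioned here is presumably the usual multi-replica-loss recovery, where tikv-ctl rewrites region metadata on the surviving node to drop the failed stores. A rough, last-resort sketch (the failed store IDs 1,2 are hypothetical, and this can lose data):

# on the surviving node, with tikv-server stopped
./tikv-ctl --db /tidb-data/tikv-20160/db unsafe-recover remove-fail-stores -s 1,2 --all-regions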

Thank you very much…

OK. We will first see whether we can work out the root cause from the logs already provided, and we will share our findings with you if we pinpoint the problem.

thx

:handshake:

@YouCD Have you previously operated on the cluster with the tikv-ctl tool?
We found the following entries in dmesg.log:

[Fri Mar 12 17:05:26 2021] rocksdb:low2[6616]: segfault at 2b ip 000055e2fe607eaa sp 00007fde74afadf0 error 4 in tikv-ctl[55e2fd8d7000+1165000]
[Fri Mar 12 18:17:51 2021] rocksdb:low0[21826]: segfault at 2b ip 000055dbcbefeeaa sp 00007f07b2cfcdf0 error 4 in tikv-ctl[55dbcb1ce000+1165000]

Our guess is that tikv-ctl was used to manipulate the RocksDB data manually. However, addr2line cannot resolve 55e2fd8d7000 to a source line in tikv-ctl@v4.0.11, so a tikv-ctl of some version other than v4.0.11 may have been used.
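For anyone repeating this analysis: tikv-ctl is a position-independent executable, so the instruction pointer from dmesg has to be rebased against the mapping base before addr2line can resolve it. A sketch using the first segfault line above, assuming a binary with debug symbols:

# ip 0x55e2fe607eaa - mapping base 0x55e2fd8d7000 = offset 0xd30eaa
addr2line -f -C -e tikv-ctl 0xd30eaa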

That's possible. I currently have two clusters: the original one runs a fairly old version, and a new cluster was deployed recently, so there may well be a version mismatch.

OK. If the problem occurs again, please preserve the scene so we can investigate further.

Hi, may I ask what the motivation was for manipulating the RocksDB data with tikv-ctl?

I was trying to fix the TiKV startup failure caused by the data import.

This type of failure has occurred again.

[2021/04/22 13:32:57.946 +08:00] [FATAL] [server.rs:303] ["panic_mark_file /tidb-data/tikv-20160/panic_mark_file exists, there must be something wrong with the db."]

Logs from that node:
log.xz (1.2 MB)

Please also grab the /var/log/messages entries.
We need everything after 12:00 on April 21.
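A sketch of extracting just that window, assuming the node runs systemd-journald (otherwise filter /var/log/messages by timestamp by hand):

journalctl --since "2021-04-21 12:00:00" > messages-since-0421.log
xz messages-since-0421.log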

FYI
messages.xz (432.8 KB)