TiKV service fails to start on all nodes

To help us respond efficiently, please provide the following information when asking a question; clearly described issues can be prioritized.

  • [TiDB version]: V3.0.1
  • [Problem description]: the TiKV service fails to start on all nodes

TiDB could not be connected to. After a restart, the TiKV service fails to start on all nodes, with the following errors.
None of the 3 TiKV nodes can come up, and the errors differ:
One node reports [FATAL] [server.rs:153] ["failed to create raft engine: RocksDb Corruption: missing start of fragmented record(2)"],
and the other two nodes both report [FATAL] [server.rs:176] ["failed to create kv engine: RocksDb Corruption: SST file is ahead of WALs"]

The TiKV log and dmesg output are attached: dmesg.txt (511.8 KB), tikv.log (14.2 KB)
Please help take a look.
Losing some data is acceptable, and a forced repair is also fine. Thanks.

OK, one moment, we'll take a look at the logs.

Hello,

Please help confirm the following:

  1. Were any xxx.log files manually deleted from the RocksDB data directory?
  2. Please provide a listing of the RocksDB data directory (ls -lst {tikv_data_dir}/db), and upload the LOG* and MANIFEST files.

1. No xxx.log files were manually deleted from the RocksDB data directory.
2. The files are attached: LOG (31.5 KB), MANIFEST-295055 (457 bytes)

Hello,

An update to the reply above:

  1. Please provide a listing of the RocksDB data directory (ls -lst {tikv_data_dir}/db), and upload the LOG* and MANIFEST files.
  2. Please also provide a listing of the Raft engine directory (ls -lst {tikv_data_dir}/raft), and upload its LOG* and MANIFEST files (a collection sketch follows below).
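
For reference, a rough collection sketch in shell, assuming the default deploy path /home/tidb/deploy/data that appears in the pwd output later in this thread; adjust tikv_data_dir to your environment:

tikv_data_dir=/home/tidb/deploy/data   # assumed path; adjust to your deployment
ls -lst "$tikv_data_dir/db"   > /tmp/db_listing.txt
ls -lst "$tikv_data_dir/raft" > /tmp/raft_listing.txt
# pack the current LOG plus the MANIFEST/CURRENT files of each engine for upload
(cd "$tikv_data_dir/db"   && tar czf /tmp/db_meta.tar.gz   LOG CURRENT MANIFEST-*)
(cd "$tikv_data_dir/raft" && tar czf /tmp/raft_meta.tar.gz LOG CURRENT MANIFEST-*)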

1. There are too many LOG files in the db directory, so I am uploading just one; please take a look at that first.
[tidb@tidb1 db]$ pwd
/home/tidb/deploy/data/db
[tidb@tidb1 db]$ ls |wc -l
98585
[tidb@tidb1 db]$ ls LOG* |wc -l
97991
[tidb@tidb1 db]$
2. raft:
[tidb@tidb1 db]$ cd ../raft/
[tidb@tidb1 raft]$ ls
001387.sst 001390.sst 295362.log LOCK LOG.old.1585902207402994 LOG.old.1585902255406264 LOG.old.1585902303401970 OPTIONS-295361
001388.sst 001391.sst CURRENT LOG LOG.old.1585902223445667 LOG.old.1585902271401946 LOG.old.1585902319403570 OPTIONS-295364
001389.sst 001393.sst IDENTITY LOG.old.1585902191402080 LOG.old.1585902239410418 LOG.old.1585902287403617 MANIFEST-295361
[tidb@tidb1 raft]$
Attachments: LOG (31.5 KB), LOG.old.1585308877709511 (53.0 KB), MANIFEST-000006 (851.6 KB), MANIFEST-295055 (457 bytes)

Try setting wal-recovery-mode = 3 in the TiKV [rocksdb] and [raftdb] sections and see whether TiKV can come up, for example as sketched below.
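
A minimal config sketch: the two settings go into the TiKV configuration file (for a tidb-ansible managed deployment the change usually has to be made in the ansible-side TiKV config and rolled out, rather than edited only on the node); value 3 corresponds to RocksDB's skip-any-corrupted-records WAL recovery mode:

[rocksdb]
wal-recovery-mode = 3

[raftdb]
wal-recovery-mode = 3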

The LOG from March 16 and tikv.log are attached: LOG.old.rar (226.5 KB), tikv.log.2020-03-16-00_11_14 (35.3 KB)

Thanks for the reply.

Please also upload the TiKV logs from the 17th; the development team is investigating.

1. Set wal-recovery-mode = 3 in the [rocksdb] and [raftdb] sections on all 3 TiKV nodes.
2. Rebooted the operating systems of the 3 servers; the OS on 10.156.1.203 failed to come back up.
3. Commented out the PD and TiKV entries for 10.156.1.203 in inventory.ini; starting the cluster failed.
4. Rebooted 10.156.1.203 a few more times and its OS came up, but restarting the cluster still failed, with the following errors.

10.156.1.202 and 10.156.1.204 report:
[2020/04/07 16:49:37.968 +08:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=10.156.1.202:2379]
[2020/04/07 16:49:37.972 +08:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=10.156.1.203:2379]
[2020/04/07 16:49:37.973 +08:00] [INFO] [util.rs:357] ["PD failed to respond"] [err="Grpc(RpcFailure(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))"] [endpoints=10.156.1.203:2379]
[2020/04/07 16:49:37.973 +08:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=10.156.1.204:2379]
[2020/04/07 16:49:37.975 +08:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=http://10.156.1.203:2379]
[2020/04/07 16:49:37.975 +08:00] [ERROR] [util.rs:444] ["connect failed"] [err="Grpc(RpcFailure(RpcStatus { status: Unavailable, details: Some("Connect Failed") }))"] [endpoints=http://10.156.1.203:2379]
[2020/04/07 16:49:37.975 +08:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=http://10.156.1.204:2379]
[2020/04/07 16:49:37.976 +08:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=http://10.156.1.202:2379]
[2020/04/07 16:49:37.978 +08:00] [INFO] [util.rs:456] ["connected to PD leader"] [endpoints=http://10.156.1.202:2379]
[2020/04/07 16:49:37.978 +08:00] [INFO] [util.rs:385] ["all PD endpoints are consistent"] [endpoints="["10.156.1.202:2379", "10.156.1.203:2379", "10.156.1.204:2379"]"]
[2020/04/07 16:49:37.979 +08:00] [INFO] [server.rs:81] ["connect to PD cluster"] [cluster_id=6748076547420207434]

10.156.1.203 reports:
[2020/04/07 11:19:31.523 +08:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=10.156.1.202:2379]
[2020/04/07 11:19:31.527 +08:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=10.156.1.203:2379]
[2020/04/07 11:19:31.529 +08:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=10.156.1.204:2379]
[2020/04/07 11:19:31.531 +08:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=http://10.156.1.202:2379]
[2020/04/07 11:19:31.532 +08:00] [INFO] [util.rs:397] ["connecting to PD endpoint"] [endpoints=http://10.156.1.203:2379]
[2020/04/07 11:19:31.534 +08:00] [INFO] [util.rs:456] ["connected to PD leader"] [endpoints=http://10.156.1.203:2379]
[2020/04/07 11:19:31.534 +08:00] [INFO] [util.rs:385] ["all PD endpoints are consistent"] [endpoints="["10.156.1.202:2379", "10.156.1.203:2379", "10.156.1.204:2379"]"]
[2020/04/07 11:19:31.535 +08:00] [INFO] [server.rs:81] ["connect to PD cluster"] [cluster_id=6748076547420207434]
[2020/04/07 11:19:31.542 +08:00] [FATAL] [server.rs:153] ["failed to create raft engine: RocksDb Corruption: missing start of fragmented record(2)"]

5. The TiKV logs from the 17th are at the attached link: https://pan.baidu.com/s/1zMcdx5P78mPqbB0QFSPTCQ (extraction code: 7ay9)

Thanks!

One more note: 10.156.1.202 has no TiKV logs after the 16th; 10.156.1.203 and 10.156.1.204 do have logs after the 16th.

The PD on 10.156.1.203 reports the following errors:
[2020/04/08 10:43:17.981 +08:00] [INFO] [server.go:145] ["start embed etcd"]
[2020/04/08 10:43:17.981 +08:00] [INFO] [systime_mon.go:25] ["start system time monitor"]
[2020/04/08 10:43:17.981 +08:00] [INFO] [etcd.go:117] ["configuring peer listeners"] [listen-peer-urls="[http://10.156.1.203:2380]"]
[2020/04/08 10:43:17.982 +08:00] [INFO] [etcd.go:127] ["configuring client listeners"] [listen-client-urls="[http://10.156.1.203:2379]"]
[2020/04/08 10:43:17.982 +08:00] [INFO] [etcd.go:600] ["pprof is enabled"] [path=/debug/pprof]
[2020/04/08 10:43:17.982 +08:00] [INFO] [etcd.go:297] ["starting an etcd server"] [etcd-version=3.3.0+git] [git-sha="Not provided (use ./build instead of go build)"] [go-version=go1.12] [go-os=linux] [go-arch=amd64] [max-cpu-set=8] [max-cpu-available=8] [member-initialized=true] [name=pd_tidb2] [data-dir=/home/tidb/deploy/data.pd] [wal-dir=] [wal-dir-dedicated=] [member-dir=/home/tidb/deploy/data.pd/member] [force-new-cluster=false] [heartbeat-interval=500ms] [election-timeout=3s] [initial-election-tick-advance=true] [snapshot-count=100000] [snapshot-catchup-entries=5000] [initial-advertise-peer-urls="[http://10.156.1.203:2380]"] [listen-peer-urls="[http://10.156.1.203:2380]"] [advertise-client-urls="[http://10.156.1.203:2379]"] [listen-client-urls="[http://10.156.1.203:2379]"] [listen-metrics-urls="[]"] [cors="[]"] [host-whitelist="[]"] [initial-cluster=] [initial-cluster-state=new] [initial-cluster-token=] [quota-size-bytes=2147483648] [pre-vote=true] [initial-corrupt-check=false] [corrupt-check-time-interval=0s] [auto-compaction-mode=periodic] [auto-compaction-retention=1h0m0s] [auto-compaction-interval=1h0m0s] [discovery-url=] [discovery-proxy=]
[2020/04/08 10:43:17.985 +08:00] [INFO] [backend.go:79] ["opened backend db"] [path=/home/tidb/deploy/data.pd/member/snap/db] [took=1.756127ms]
[2020/04/08 10:43:17.986 +08:00] [INFO] [server.go:435] ["recovered v2 store from snapshot"] [snapshot-index=5000050] [snapshot-size="14 kB"]
[2020/04/08 10:43:17.986 +08:00] [INFO] [kvstore.go:373] ["restored last compact revision"] [meta-bucket-name=meta] [meta-bucket-name-key=finishedCompactRev] [restored-compact-revision=5069889]
[2020/04/08 10:43:17.990 +08:00] [INFO] [server.go:453] ["recovered v3 backend from snapshot"] [backend-size-bytes=524288] [backend-size="524 kB"] [backend-size-in-use-bytes=217088] [backend-size-in-use="217 kB"]
[2020/04/08 10:43:18.097 +08:00] [INFO] [raft.go:496] ["restarting local member"] [cluster-id=6596f4375111b1ab] [local-member-id=32bb8fecb44cc81f] [commit-index=4973598]
[2020/04/08 10:43:18.097 +08:00] [PANIC] [raft.go:1475] ["32bb8fecb44cc81f state.commit 4973598 is out of range [5000050, 5000050]"]
[2020/04/08 10:43:18.097 +08:00] [FATAL] [log.go:294] [panic] [recover="32bb8fecb44cc81f state.commit 4973598 is out of range [5000050, 5000050]"] [stack="
github.com/pingcap/log.Fatal
	/home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/github.com/pingcap/log@v0.0.0-20190715063458-479153f07ebd/global.go:59
github.com/pingcap/pd/pkg/logutil.LogPanic
	/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/pkg/logutil/log.go:294
runtime.gopanic
	/usr/local/go/src/runtime/panic.go:522
go.uber.org/zap/zapcore.(*CheckedEntry).Write
	/home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/go.uber.org/zap@v1.9.1/zapcore/entry.go:229
go.uber.org/zap.(*SugaredLogger).log
	/home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/go.uber.org/zap@v1.9.1/sugar.go:234
go.uber.org/zap.(*SugaredLogger).Panicf
	/home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/go.uber.org/zap@v1.9.1/sugar.go:159
go.etcd.io/etcd/pkg/logutil.(*zapRaftLogger).Panicf
	/home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190320044326-77d4b742cdbf/pkg/logutil/zap_raft.go:96
go.etcd.io/etcd/raft.(*raft).loadState
	/home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190320044326-77d4b742cdbf/raft/raft.go:1475
go.etcd.io/etcd/raft.newRaft
	/home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190320044326-77d4b742cdbf/raft/raft.go:377
go.etcd.io/etcd/raft.RestartNode
	/home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190320044326-77d4b742cdbf/raft/node.go:242
go.etcd.io/etcd/etcdserver.restartNode
	/home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190320044326-77d4b742cdbf/etcdserver/raft.go:536
go.etcd.io/etcd/etcdserver.NewServer
	/home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190320044326-77d4b742cdbf/etcdserver/server.go:464
go.etcd.io/etcd/embed.StartEtcd
	/home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190320044326-77d4b742cdbf/embed/etcd.go:209
github.com/pingcap/pd/server.(*Server).startEtcd
	/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/server/server.go:149
github.com/pingcap/pd/server.(*Server).Run
	/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/server/server.go:302
main.main
	/home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:110
runtime.main
	/usr/local/go/src/runtime/proc.go:200"]

Thanks for providing these; we'll take a look.

Hello,

Our current read of the PD error is that PD has probably lost data, but that by itself should not affect the TiKV service. For the PD data loss, follow the PD recovery procedure.
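
For reference, the PD recovery procedure is typically driven by the pd-recover tool. A rough sketch only: the endpoint below assumes the healthy PD on 10.156.1.202 is reachable, the cluster-id is taken from the TiKV logs above, and the alloc-id value is an assumed placeholder that must be larger than any ID already allocated; please follow the official PD recovery documentation before running anything.

pd-recover -endpoints http://10.156.1.202:2379 -cluster-id 6748076547420207434 -alloc-id 100000000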

As for TiKV failing to start, please upload the complete TiKV startup logs taken after the parameter adjustment.

One moment, let me clean up and send you a complete, clean set of logs.

After adjusting the parameter, the complete TiKV startup logs are attached: the 10.156.1.202 log is 202-tikv.log (15.2 KB), the 10.156.1.203 log is empty, and the 10.156.1.204 log is 204-tikv.log (15.2 KB).

OK.

Hello,

If those are all the logs there are, then take a look at tikv_stderr.log and check whether /var/log/messages contains any coredump records.
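
A quick check might look something like the following; the tikv_stderr.log path is an assumption based on the default tidb-ansible deploy layout, so adjust it to your environment:

# last lines of the TiKV stderr log (assumed path under the deploy dir)
tail -n 200 /home/tidb/deploy/log/tikv_stderr.log
# tikv / segfault / OOM / coredump records in the system log
grep -iE 'tikv|segfault|oom|coredump' /var/log/messages | tail -n 100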

The messages file on the 202 server is full of entries like these:
Apr 8 15:59:38 tidb1 kernel: abrt-dump-oops[29879]: segfault at 7f246c324cd4 ip 00007f246c324cd4 sp 00007ffc4749dc08 error 7 in libsatyr.so.3.0.0[7f246c1ff000+178000]
Apr 8 15:59:40 tidb1 kernel: abrt-dump-oops[29880]: segfault at 7f6e571a4cd4 ip 00007f6e571a4cd4 sp 00007ffc1b7e7548 error 7 in libsatyr.so.3.0.0[7f6e5707f000+178000]
Apr 8 15:59:42 tidb1 kernel: abrt-dump-oops[29881]: segfault at 7fe8f794acd4 ip 00007fe8f794acd4 sp 00007ffedd36bbe8 error 7 in libsatyr.so.3.0.0[7fe8f7825000+178000]
Apr 8 15:59:44 tidb1 kernel: abrt-dump-oops[29882]: segfault at 7f15bba95cd4 ip 00007f15bba95cd4 sp 00007ffdb054fdf8 error 7 in libsatyr.so.3.0.0[7f15bb970000+178000]
Apr 8 15:59:46 tidb1 kernel: abrt-dump-oops[29883]: segfault at 7fbf2aeb8cd4 ip 00007fbf2aeb8cd4 sp 00007ffce02fe6e8 error 7 in libsatyr.so.3.0.0[7fbf2ad93000+178000]
Apr 8 15:59:48 tidb1 kernel: abrt-dump-oops[29884]: segfault at 7fca68875cd4 ip 00007fca68875cd4 sp 00007ffdd3eb4148 error 7 in libsatyr.so.3.0.0[7fca68750000+178000]
Apr 8 15:59:50 tidb1 kernel: abrt-dump-oops[29885]: segfault at 7fae67f32cd4 ip 00007fae67f32cd4 sp 00007ffec7a9f0d8 error 7 in libsatyr.so.3.0.0[7fae67e0d000+178000]
Apr 8 15:59:52 tidb1 kernel: abrt-dump-oops[29886]: segfault at 7f797ee3fcd4 ip 00007f797ee3fcd4 sp 00007ffc47b03df8 error 7 in libsatyr.so.3.0.0[7f797ed1a000+178000]
Apr 8 15:59:54 tidb1 kernel: abrt-dump-oops[29887]: segfault at 7f19a9e39cd4 ip 00007f19a9e39cd4 sp 00007fff7e0ef488 error 7 in libsatyr.so.3.0.0[7f19a9d14000+178000]
Apr 8 15:59:56 tidb1 kernel: abrt-dump-oops[29888]: segfault at 7f1ac0d95cd4 ip 00007f1ac0d95cd4 sp 00007ffe660bed48 error 7 in libsatyr.so.3.0.0[7f1ac0c70000+178000]
Apr 8 15:59:58 tidb1 kernel: abrt-dump-oops[29897]: segfault at 7f3c8be78cd4 ip 00007f3c8be78cd4 sp 00007ffcca0e6be8 error 7 in libsatyr.so.3.0.0[7f3c8bd53000+178000]
Apr 8 16:00:00 tidb1 kernel: abrt-dump-oops[29898]: segfault at 7fdb78e66cd4 ip 00007fdb78e66cd4 sp 00007ffe7d4baaa8 error 7 in libsatyr.so.3.0.0[7fdb78d41000+178000]