After a power outage, the TiDB cluster was restarted once power came back. PD fails to start, with errors on two of the three PD nodes.

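  • [OS & kernel version] CentOS Linux release 7.6.1810 (Core), 3.10.0-957.el7.x86_64
  • [TiDB version] v3.0.0
  • [Disk type] Ordinary disks
  • [Cluster topology] 2 TiDB, 3 PD, 3 TiKV on 5 machines: 2 machines run TiDB, the other 3 run PD and TiKV
  • [Data volume & region count & replica count]
  • [Problem description (what I did)] Power outage; the systems were rebooted after power came back, and 2 PD nodes fail to start with errors.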
  • [Keywords] power outage, TiDB cluster restarted after power came back, PD errors

The error messages are as follows.

Node 103:

goroutine 158 [running]:
go.etcd.io/bbolt.(*DB).freepages.func2(0xc0002216e0)
    /home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/go.etcd.io/bbolt@v1.3.2/db.go:997 +0xe5
created by go.etcd.io/bbolt.(*DB).freepages
    /home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/go.etcd.io/bbolt@v1.3.2/db.go:995 +0x1b5
panic: freepages: failed to get all reachable pages (page 122: multiple references)

goroutine 163 [running]:
go.etcd.io/bbolt.(*DB).freepages.func2(0xc000446180)
    /home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/go.etcd.io/bbolt@v1.3.2/db.go:997 +0xe5
created by go.etcd.io/bbolt.(*DB).freepages
    /home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/go.etcd.io/bbolt@v1.3.2/db.go:995 +0x1b5
panic: freepages: failed to get all reachable pages (page 122: multiple references)

goroutine 121 [running]:
go.etcd.io/bbolt.(*DB).freepages.func2(0xc00024a240)
    /home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/go.etcd.io/bbolt@v1.3.2/db.go:997 +0xe5
created by go.etcd.io/bbolt.(*DB).freepages
    /home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/go.etcd.io/bbolt@v1.3.2/db.go:995 +0x1b5
panic: freepages: failed to get all reachable pages (page 122: multiple references)

Node 105:

[2019/10/11 13:12:39.141 +08:00] [INFO] [util.go:59] ["Welcome to Placement Driver (PD)"]
[2019/10/11 13:12:39.142 +08:00] [INFO] [util.go:60] [PD] [release-version=v3.0.0]
[2019/10/11 13:12:39.142 +08:00] [INFO] [util.go:61] [PD] [git-hash=bfbaa06620407aaa262758a0342d919868db3916]
[2019/10/11 13:12:39.142 +08:00] [INFO] [util.go:62] [PD] [git-branch=HEAD]
…
[2019/10/11 13:12:39.151 +08:00] [INFO] [systime_mon.go:25] ["start system time monitor"]
[2019/10/11 13:12:39.163 +08:00] [INFO] [backend.go:79] ["opened backend db"] [path=/data01/deploy/data.pd/member/snap/db] [took=13.149427ms]
[2019/10/11 13:12:39.166 +08:00] [INFO] [server.go:435] ["recovered v2 store from snapshot"] [snapshot-index=1444898] [snapshot-size="15 kB"]

[2019/10/11 13:12:39.169 +08:00] [WARN] [db.go:92] ["failed to find [SNAPSHOT-INDEX].snap.db"] [snapshot-index=1444898] [snapshot-file-path=/data01/deploy/data.pd/member/snap/0000000000160c22.snap.db] [error="snap: snapshot file doesn't exist"]

(On checking, the snap file does exist.)

[2019/10/11 13:12:39.169 +08:00] [PANIC] [server.go:446] ["failed to recover v3 backend from snapshot"] [error="failed to find database snapshot file (snap: snapshot file doesn't exist)"] [stack="
go.etcd.io/etcd/etcdserver.NewServer
    /home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190320044326-77d4b742cdbf/etcdserver/server.go:446
go.etcd.io/etcd/embed.StartEtcd
    /home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190320044326-77d4b742cdbf/embed/etcd.go:209
github.com/pingcap/pd/server.(*Server).startEtcd
    /home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/server/server.go:149
github.com/pingcap/pd/server.(*Server).Run
    /home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/server/server.go:302
main.main
    /home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:110
runtime.main
    /usr/local/go/src/runtime/proc.go:200"]

[2019/10/11 13:12:39.170 +08:00] [FATAL] [log.go:294] [panic] [recover=""invalid memory address or nil pointer dereference""] [stack="
github.com/pingcap/log.Fatal
    /home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/github.com/pingcap/log@v0.0.0-20190214045112-b37da76f67a7/global.go:59
github.com/pingcap/pd/pkg/logutil.LogPanic
    /home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/pkg/logutil/log.go:294
runtime.gopanic
    /usr/local/go/src/runtime/panic.go:522
runtime.panicmem
    /usr/local/go/src/runtime/panic.go:82
runtime.sigpanic
    /usr/local/go/src/runtime/signal_unix.go:390
go.etcd.io/etcd/etcdserver.NewServer.func1
    /home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190320044326-77d4b742cdbf/etcdserver/server.go:327
runtime.gopanic
    /usr/local/go/src/runtime/panic.go:522
go.uber.org/zap/zapcore.(*CheckedEntry).Write
    /home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/go.uber.org/zap@v1.9.1/zapcore/entry.go:229
go.uber.org/zap.(*Logger).Panic
    /home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/go.uber.org/zap@v1.9.1/logger.go:225
go.etcd.io/etcd/etcdserver.NewServer
    /home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190320044326-77d4b742cdbf/etcdserver/server.go:446
go.etcd.io/etcd/embed.StartEtcd
    /home/jenkins/workspace/release_tidb_3.0/go/pkg/mod/go.etcd.io/etcd@v0.0.0-20190320044326-77d4b742cdbf/embed/etcd.go:209
github.com/pingcap/pd/server.(*Server).startEtcd
    /home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/server/server.go:149
github.com/pingcap/pd/server.(*Server).Run
    /home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/server/server.go:302
main.main
    /home/jenkins/workspace/release_tidb_3.0/go/src/github.com/pingcap/pd/cmd/pd-server/main.go:110
runtime.main
    /usr/local/go/src/runtime/proc.go:200"]

The cluster is currently unavailable and cannot be connected to.

Also: starting the PD service manually on nodes 103 and 105 fails as well.
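
The WARN line on node 105 reports snapshot-index=1444898 and looks for a file named 0000000000160c22.snap.db. The file name appears to be the snapshot index written as 16 zero-padded hex digits with a .snap.db suffix, so it may be worth confirming that the file found under snap/ has exactly that name, suffix included. A minimal sketch of that check follows; `snap_db_name` and `find_snap_db` are hypothetical helper names, and the naming rule is inferred from the log line rather than taken from the PD source.

```python
from pathlib import Path


def snap_db_name(snapshot_index: int) -> str:
    """File name the WARN line asks for: index as 16 hex digits plus '.snap.db'.

    Assumption: the rule is inferred from the log above (1444898 -> 0000000000160c22).
    """
    return f"{snapshot_index:016x}.snap.db"


def find_snap_db(snap_dir: str, snapshot_index: int) -> dict:
    """Report whether the exact file exists and list what the directory holds."""
    wanted = snap_db_name(snapshot_index)
    directory = Path(snap_dir)
    present = sorted(p.name for p in directory.iterdir()) if directory.is_dir() else []
    return {"wanted": wanted, "exists": wanted in present, "present": present}


if __name__ == "__main__":
    # Path taken from the log line in this thread.
    print(find_snap_db("/data01/deploy/data.pd/member/snap", 1444898))
```

If `exists` is false while `present` lists other files carrying the same hex number, the file that was checked by hand is probably not the one PD is looking for.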

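PD lost power, and judging by the errors above on these two nodes, etcd probably failed to load its data. We recommend restoring service with pd-recover. The steps are described here: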

https://pingcap.com/docs-cn/v3.0/reference/tools/pd-recover/#pd-recover-使用文档
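
pd-recover is invoked with a cluster ID and an alloc ID (the -cluster-id and -alloc-id flags seen later in this thread). One way to recover the cluster ID is to search old PD logs for the message that records it. The sketch below assumes that message looks like `["init cluster id"] [cluster-id=<number>]`; that format, the helper name `find_cluster_id`, and the sample line are assumptions for illustration, so check them against the real pd.log before relying on the result.

```python
import re
from typing import Optional

# Assumed log format; verify against an actual PD log.
CLUSTER_ID_RE = re.compile(r'\["init cluster id"\]\s*\[cluster-id=(\d+)\]')


def find_cluster_id(log_text: str) -> Optional[int]:
    """Return the last cluster ID recorded in the log text, or None if absent."""
    last = None
    for match in CLUSTER_ID_RE.finditer(log_text):
        last = int(match.group(1))
    return last


# Made-up sample line, not copied from any real log.
SAMPLE = '[INFO] ["init cluster id"] [cluster-id=6964583659800111049]'

if __name__ == "__main__":
    print(find_cluster_id(SAMPLE))
```

The alloc ID passed to pd-recover has to be larger than any ID the old cluster already handed out, so a generous value is the safer choice.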

Version: v5.0.1
After a power outage, 2 of the 3 PD nodes fail to start, and ./pd-recover reports the following error:
./pd-recover -endpoints http://10.10.23.54:2379 -cluster-id 6964583659800111049 -alloc-id 10000

{"level":"warn","ts":"2021-05-26T02:02:52.292Z","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-7c5618fb-4914-4273-9682-9064babf0a28/10.10.23.54:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
context deadline exceeded
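
One possible reading of this timeout, offered as an assumption rather than a diagnosis: with 2 of 3 PD members down, the embedded etcd no longer has a majority, so a request sent to the one surviving endpoint cannot be committed and eventually hits its deadline. The majority arithmetic can be sketched as follows; `quorum` and `has_quorum` are hypothetical helper names.

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def has_quorum(members: int, healthy: int) -> bool:
    """True if enough members are up to elect a leader and commit writes."""
    return healthy >= quorum(members)


if __name__ == "__main__":
    # 3 PD members with only 1 still running, as described in this thread.
    print(quorum(3), has_quorum(3, 1))
```

If that is what is happening here, pd-recover would need an endpoint belonging to a PD cluster that can elect a leader. The pd-recover documentation for the version in use describes how to prepare one.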
