PD fails to start normally

[TiDB Environment] Test / PoC
[TiDB Version] v8.

The PD instance on 192.168.2.72 keeps restarting and its status never returns to normal.

ID                  Role          Host          Ports        OS/Arch       Status   Since    Data Dir                          Deploy Dir
--                  ----          ----          -----        -------       ------   -----    --------                          ----------
192.168.2.71:9093   alertmanager  192.168.2.71  9093/9094    linux/x86_64  Up       1h4m34s  /u01/tidb-data/alertmanager-9093  /u01/tidb-deploy/alertmanager-9093
192.168.2.72:8300   cdc           192.168.2.72  8300         linux/x86_64  Up       1h4m39s  /u01/tidb-data/cdc-8300           /u01/tidb-deploy/cdc-8300
192.168.2.73:8300   cdc           192.168.2.73  8300         linux/x86_64  Up       1h4m39s  /u01/tidb-data/cdc-8300           /u01/tidb-deploy/cdc-8300
192.168.2.71:3000   grafana       192.168.2.71  3000         linux/x86_64  Up       1h4m36s  -                                 /u01/tidb-deploy/grafana-3000
192.168.2.71:2379   pd            192.168.2.71  2379/2380    linux/x86_64  Up       1h5m15s  /u01/tidb-data/pd-2379            /u01/tidb-deploy/pd-2379
192.168.2.72:2379   pd            192.168.2.72  2379/2380    linux/x86_64  Down|UI  3s       /u01/tidb-data/pd-2379            /u01/tidb-deploy/pd-2379
192.168.2.73:2379   pd            192.168.2.73  2379/2380    linux/x86_64  Up|L     1h5m15s  /u01/tidb-data/pd-2379            /u01/tidb-deploy/pd-2379
192.168.2.71:9090   prometheus    192.168.2.71  9090/12020   linux/x86_64  Up       1h4m37s  /u01/tidb-data/prometheus-9090    /u01/tidb-deploy/prometheus-9090
192.168.2.71:4000   tidb          192.168.2.71  4000/10080   linux/x86_64  Up       1h4m50s  -                                 /u01/tidb-deploy/tidb-4000
192.168.2.72:4000   tidb          192.168.2.72  4000/10080   linux/x86_64  Up       1h4m51s  -                                 /u01/tidb-deploy/tidb-4000
192.168.2.73:4000   tidb          192.168.2.73  4000/10080   linux/x86_64  Up       1h4m51s  -                                 /u01/tidb-deploy/tidb-4000
192.168.2.71:20160  tikv          192.168.2.71  20160/20180  linux/x86_64  Up       1h5m14s  /u01/tidb-data/tikv-20160         /u01/tidb-deploy/tikv-20160
192.168.2.72:20160  tikv          192.168.2.72  20160/20180  linux/x86_64  Up       1h5m14s  /u01/tidb-data/tikv-20160         /u01/tidb-deploy/tikv-20160
192.168.2.73:20160  tikv          192.168.2.73  20160/20180  linux/x86_64  Up       1h5m13s  /u01/tidb-data/tikv-20160         /u01/tidb-deploy/tikv-20160
Total nodes: 14
[tidb@tidb01 ~]$ 

The corresponding PD log:

[2025/12/01 16:16:09.957 +08:00] [INFO] [stream.go:412] ["established TCP streaming connection with remote peer"] [stream-reader-type="stream MsgApp v2"] [local-member-id=36bda1842c68a23a] [remote-peer-id=5b383763007fc455]
[2025/12/01 16:16:09.957 +08:00] [INFO] [stream.go:412] ["established TCP streaming connection with remote peer"] [stream-reader-type="stream Message"] [local-member-id=36bda1842c68a23a] [remote-peer-id=42843e713797aa2a]
[2025/12/01 16:16:09.959 +08:00] [INFO] [etcd.go:599] ["serving peer traffic"] [address="[::]:2380"]
[2025/12/01 16:16:09.959 +08:00] [INFO] [etcd.go:571] [cmux::serve] [address="[::]:2380"]
[2025/12/01 16:16:09.959 +08:00] [INFO] [etcd.go:279] ["now serving peer/client/metrics"] [local-member-id=36bda1842c68a23a] [initial-advertise-peer-urls="[http://192.168.2.72:2380]"] [listen-peer-urls="[http://0.0.0.0:2380]"] [advertise-client-urls="[http://192.168.2.72:2379]"] [listen-client-urls="[http://0.0.0.0:2379]"] [listen-metrics-urls="[]"]
[2025/12/01 16:16:09.959 +08:00] [INFO] [raft] [zap_raft.go:77] ["raft.node: 36bda1842c68a23a elected leader 42843e713797aa2a at term 266"]
[2025/12/01 16:16:09.959 +08:00] [INFO] [stream.go:249] ["set message encoder"] [from=36bda1842c68a23a] [to=5b383763007fc455] [stream-type="stream MsgApp v2"]
[2025/12/01 16:16:09.960 +08:00] [INFO] [stream.go:274] ["established TCP streaming connection with remote peer"] [stream-writer-type="stream MsgApp v2"] [local-member-id=36bda1842c68a23a] [remote-peer-id=5b383763007fc455]
[2025/12/01 16:16:09.960 +08:00] [INFO] [stream.go:249] ["set message encoder"] [from=36bda1842c68a23a] [to=42843e713797aa2a] [stream-type="stream Message"]
[2025/12/01 16:16:09.960 +08:00] [INFO] [stream.go:274] ["established TCP streaming connection with remote peer"] [stream-writer-type="stream Message"] [local-member-id=36bda1842c68a23a] [remote-peer-id=42843e713797aa2a]
[2025/12/01 16:16:09.960 +08:00] [INFO] [stream.go:249] ["set message encoder"] [from=36bda1842c68a23a] [to=5b383763007fc455] [stream-type="stream Message"]
[2025/12/01 16:16:09.960 +08:00] [INFO] [stream.go:274] ["established TCP streaming connection with remote peer"] [stream-writer-type="stream Message"] [local-member-id=36bda1842c68a23a] [remote-peer-id=5b383763007fc455]
[2025/12/01 16:16:09.960 +08:00] [INFO] [stream.go:249] ["set message encoder"] [from=36bda1842c68a23a] [to=42843e713797aa2a] [stream-type="stream MsgApp v2"]
[2025/12/01 16:16:09.960 +08:00] [INFO] [stream.go:274] ["established TCP streaming connection with remote peer"] [stream-writer-type="stream MsgApp v2"] [local-member-id=36bda1842c68a23a] [remote-peer-id=42843e713797aa2a]
[2025/12/01 16:16:10.007 +08:00] [INFO] [server.go:790] ["initialized peer connections; fast-forwarding election ticks"] [local-member-id=36bda1842c68a23a] [forward-ticks=4] [forward-duration=2s] [election-ticks=6] [election-timeout=3s] [active-remote-members=2]
[2025/12/01 16:16:10.072 +08:00] [INFO] [server.go:2118] ["published local member to cluster through raft"] [local-member-id=36bda1842c68a23a] [local-member-attributes="{Name:pd-192.168.2.72-2379 ClientURLs:[http://192.168.2.72:2379]}"] [request-path=/0/members/36bda1842c68a23a/attributes] [cluster-id=9e63bd2785c31a28] [publish-timeout=11s]
[2025/12/01 16:16:10.072 +08:00] [INFO] [serve.go:103] ["ready to serve client requests"]
[2025/12/01 16:16:10.072 +08:00] [INFO] [health.go:61] ["grpc service status changed"] [service=] [status=SERVING]
[2025/12/01 16:16:10.073 +08:00] [INFO] [registry.go:69] ["gRPC service already registered"] [prefix=pd-192.168.2.72-2379] [service-name=MetaStorage]
[2025/12/01 16:16:10.073 +08:00] [INFO] [registry.go:69] ["gRPC service already registered"] [prefix=pd-192.168.2.72-2379] [service-name=ResourceManager]
[2025/12/01 16:16:10.073 +08:00] [INFO] [serve.go:187] ["serving client traffic insecurely; this is strongly discouraged!"] [traffic=grpc+http] [address="[::]:2379"]
[2025/12/01 16:16:10.076 +08:00] [INFO] [cluster_id.go:43] ["existed cluster id"] [cluster-id=7555433645641740151]
[2025/12/01 16:16:10.076 +08:00] [INFO] [member.go:350] ["member joining election"] [member-info="name:\"pd-192.168.2.72-2379\" member_id:3944486437699232314 peer_urls:\"http://192.168.2.72:2380\" client_urls:\"http://192.168.2.72:2379\" "] [root-path=/pd/7555433645641740151]
[2025/12/01 16:16:10.095 +08:00] [FATAL] [main.go:288] ["run server failed"] [error="[PD:leveldb:ErrLevelDBOpen]leveldb: manifest corrupted (field 'comparer'): missing [file=MANIFEST-000137]: leveldb: manifest corrupted (field 'comparer'): missing [file=MANIFEST-000137]"] [stack="main.start\n\t/workspace/source/pd/cmd/pd-server/main.go:288\nmain.createServerWrapper\n\t/workspace/source/pd/cmd/pd-server/main.go:194\ngithub.com/spf13/cobra.(*Command).execute\n\t/root/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:987\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/root/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\t/root/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\t/workspace/source/pd/cmd/pd-server/main.go:71\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:272"]

I think the key message is:

error="[PD:leveldb:ErrLevelDBOpen]leveldb: manifest corrupted (field 'comparer'): missing [file=MANIFEST-000137]: leveldb: manifest corrupted (field 'comparer'): missing [file=MANIFEST-000137]"]

How can this be fixed?

The file is probably corrupted. Scale the node in and then scale it back out.

"manifest corrupted" — the manifest file is damaged.

Are there any other logs?

Is there any other way to repair it? Could pd-recover possibly restore this?

The MANIFEST is corrupted. Use TiUP to scale in the problem node online, then scale out a new node.
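The scale-in / scale-out procedure above can be sketched with TiUP roughly as follows (a sketch only: the cluster name `tidb-test` is a placeholder for your actual cluster name, and the topology fields reuse the paths from the `tiup cluster display` output above):

```shell
# Scale in the faulty PD node.
tiup cluster scale-in tidb-test --node 192.168.2.72:2379

# Describe the replacement PD node in a topology file.
cat > scale-out.yaml <<'EOF'
pd_servers:
  - host: 192.168.2.72
    client_port: 2379
    peer_port: 2380
    data_dir: /u01/tidb-data/pd-2379
    deploy_dir: /u01/tidb-deploy/pd-2379
EOF

# Scale out the new PD node.
tiup cluster scale-out tidb-test scale-out.yaml

# Verify that all PD members show Up again.
tiup cluster display tidb-test
```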

Scale-in and scale-out can be done online; with only one PD down, the fastest fix is to scale the node in and back out.
pd-recover is generally used to recover when the majority of PD nodes are down, and recovering with pd-recover requires taking the cluster offline.
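For reference, a pd-recover invocation looks roughly like this (a sketch only: the cluster name `tidb-test` and the `-alloc-id` value are assumptions; the cluster ID 7555433645641740151 is taken from the log above, and `-alloc-id` must be larger than any ID the old cluster ever allocated):

```shell
# Stop all PD instances first: pd-recover rebuilds PD metadata
# and requires downtime.
tiup cluster stop tidb-test -R pd

# Rebuild PD metadata against one freshly started PD endpoint.
# -cluster-id comes from the old cluster's logs;
# -alloc-id must exceed the largest previously allocated ID.
tiup pd-recover -endpoints http://192.168.2.71:2379 \
    -cluster-id 7555433645641740151 \
    -alloc-id 100000000

# Restart the cluster afterwards.
tiup cluster restart tidb-test
```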

I have already done scale-out / scale-in, but it does not seem to have recovered.

[2025/12/02 10:11:02.099 +08:00] [WARN] [http.go:413] ["failed to find remote peer in cluster"] [local-member-id=36bda1842c68a23a] [remote-peer-id-stream-handler=36bda1842c68a23a] [remote-peer-id-from=42843e713797aa2a] [cluster-id=9e63bd2785c31a28]
[2025/12/02 10:11:02.099 +08:00] [WARN] [http.go:413] ["failed to find remote peer in cluster"] [local-member-id=36bda1842c68a23a] [remote-peer-id-stream-handler=36bda1842c68a23a] [remote-peer-id-from=42843e713797aa2a] [cluster-id=9e63bd2785c31a28]
[2025/12/02 10:11:02.158 +08:00] [WARN] [http.go:413] ["failed to find remote peer in cluster"] [local-member-id=36bda1842c68a23a] [remote-peer-id-stream-handler=36bda1842c68a23a] [remote-peer-id-from=5b383763007fc455] [cluster-id=9e63bd2785c31a28]
[2025/12/02 10:11:02.158 +08:00] [WARN] [http.go:413] ["failed to find remote peer in cluster"] [local-member-id=36bda1842c68a23a] [remote-peer-id-stream-handler=36bda1842c68a23a] [remote-peer-id-from=5b383763007fc455] [cluster-id=9e63bd2785c31a28]
[2025/12/02 10:11:02.190 +08:00] [WARN] [server.go:995] ["rejected Raft message to mismatch member"] [local-member-id=36bda1842c68a23a] [mismatch-member-id=77091302262bd505]
[2025/12/02 10:11:02.190 +08:00] [WARN] [http.go:145] ["failed to process Raft message"] [local-member-id=36bda1842c68a23a] [error="cannot process message to mismatch member"]
[2025/12/02 10:11:02.198 +08:00] [WARN] [http.go:413] ["failed to find remote peer in cluster"] [local-member-id=36bda1842c68a23a] [remote-peer-id-stream-handler=36bda1842c68a23a] [remote-peer-id-from=42843e713797aa2a] [cluster-id=9e63bd2785c31a28]
[2025/12/02 10:11:02.198 +08:00] [WARN] [http.go:413] ["failed to find remote peer in cluster"] [local-member-id=36bda1842c68a23a] [remote-peer-id-stream-handler=36bda1842c68a23a] [remote-peer-id-from=42843e713797aa2a] [cluster-id=9e63bd2785c31a28]
[2025/12/02 10:11:02.213 +08:00] [WARN] [server.go:995] ["rejected Raft message to mismatch member"] [local-member-id=36bda1842c68a23a] [mismatch-member-id=77091302262bd505]
[2025/12/02 10:11:02.213 +08:00] [WARN] [http.go:145] ["failed to process Raft message"] [local-member-id=36bda1842c68a23a] [error="cannot process message to mismatch member"]
[2025/12/02 10:11:02.223 +08:00] [WARN] [server.go:995] ["rejected Raft message to mismatch member"] [local-member-id=36bda1842c68a23a] [mismatch-member-id=77091302262bd505]
[2025/12/02 10:11:02.223 +08:00] [WARN] [http.go:145] ["failed to process Raft message"] [local-member-id=36bda1842c68a23a] [error="cannot process message to mismatch member"]
[2025/12/02 10:11:02.237 +08:00] [WARN] [server.go:995] ["rejected Raft message to mismatch member"] [local-member-id=36bda1842c68a23a] [mismatch-member-id=77091302262bd505]
[2025/12/02 10:11:02.237 +08:00] [WARN] [http.go:145] ["failed to process Raft message"] [local-member-id=36bda1842c68a23a] [error="cannot process message to mismatch member"]
[2025/12/02 10:11:02.257 +08:00] [WARN] [http.go:413] ["failed to find remote peer in cluster"] [local-member-id=36bda1842c68a23a] [remote-peer-id-stream-handler=36bda1842c68a23a] [remote-peer-id-from=5b383763007fc455] [cluster-id=9e63bd2785c31a28]
[2025/12/02 10:11:02.257 +08:00] [WARN] [http.go:413] ["failed to find remote peer in cluster"] [local-member-id=36bda1842c68a23a] [remote-peer-id-stream-handler=36bda1842c68a23a] [remote-peer-id-from=5b383763007fc455] [cluster-id=9e63bd2785c31a28]
[2025/12/02 10:11:02.298 +08:00] [WARN] [http.go:413] ["failed to find remote peer in cluster"] [local-member-id=36bda1842c68a23a] [remote-peer-id-stream-handler=36bda1842c68a23a] [remote-peer-id-from=42843e713797aa2a] [cluster-id=9e63bd2785c31a28]
[2025/12/02 10:11:02.298 +08:00] [WARN] [http.go:413] ["failed to find remote peer in cluster"] [local-member-id=36bda1842c68a23a] [remote-peer-id-stream-handler=36bda1842c68a23a] [remote-peer-id-from=42843e713797aa2a] [cluster-id=9e63bd2785c31a28]
[2025/12/02 10:11:02.298 +08:00] [WARN] [server.go:995] ["rejected Raft message to mismatch member"] [local-member-id=36bda1842c68a23a] [mismatch-member-id=77091302262bd505]
[2025/12/02 10:11:02.299 +08:00] [WARN] [http.go:145] ["failed to process Raft message"] [local-member-id=36bda1842c68a23a] [error="cannot process message to mismatch member"]
[2025/12/02 10:11:02.357 +08:00] [WARN] [http.go:413] ["failed to find remote peer in cluster"] [local-member-id=36bda1842c68a23a] [remote-peer-id-stream-handler=36bda1842c68a23a] [remote-peer-id-from=5b383763007fc455] [cluster-id=9e63bd2785c31a28]
[2025/12/02 10:11:02.357 +08:00] [WARN] [http.go:413] ["failed to find remote peer in cluster"] [local-member-id=36bda1842c68a23a] [remote-peer-id-stream-handler=36bda1842c68a23a] [remote-peer-id-from=5b383763007fc455] [cluster-id=9e63bd2785c31a28]
^C

Only one PD is down, so scaling in and then scaling back out should also work. That manifest belongs to RocksDB; once it is corrupted there is no way to repair it. Does the new PD version use RocksDB? I need to look into that.

It is file corruption: the core issue is that PD's data file MANIFEST-000137 is corrupted (the 'comparer' field is missing), which prevents PD from starting or running normally. This is corruption of PD's key metadata. As a first step, take a backup: stop the faulty node, back up its data, and restore from the most recent good copy.
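A minimal backup sketch along those lines, assuming the paths from the `tiup cluster display` output above (run on 192.168.2.72; the cluster name `tidb-test` and the backup destination are placeholders):

```shell
# Stop only the faulty PD instance.
tiup cluster stop tidb-test -N 192.168.2.72:2379

# Archive the PD data directory before touching anything,
# so a failed repair attempt can be rolled back.
tar czf /tmp/pd-2379-backup-$(date +%Y%m%d).tar.gz \
    -C /u01/tidb-data pd-2379
```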

This looks like LevelDB metadata file corruption.

Anyway, it is not the leader node: kick it out, wipe its data directory, and add it back. All the data it needs will be synced over from the leader.
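One thing worth checking when re-adding the node: if the old, corrupted data directory survives the scale-in, the new PD can hit the same MANIFEST error again. A cleanup sketch, assuming TiUP manages the paths shown in the display output (the cluster name `tidb-test` is a placeholder; `--force` is needed to scale in a node that is Down):

```shell
# Force-remove the Down PD node; TiUP normally cleans its
# data and deploy directories during scale-in.
tiup cluster scale-in tidb-test --node 192.168.2.72:2379 --force

# On 192.168.2.72, double-check that the stale LevelDB files
# (including the corrupted MANIFEST) are really gone before
# scaling the node back out.
ls /u01/tidb-data/pd-2379 2>/dev/null && rm -rf /u01/tidb-data/pd-2379
```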

Is there any trick to adding it back? The problem I ran into is that after adding it back, the same error occurs.

Does it still report the same error after you add it back?