Dashboard shows a different number of instances than tiup

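EricSong · January 6, 2023, 03:27

[TiDB environment] Test
[Reproduction steps] None
[Problem: symptoms and impact]
The node serves multiple roles: both Prometheus and PD are deployed on it. The node went down earlier because Prometheus had cached a large amount of data. After it was restarted, tiup cluster display tidb-lab no longer lists this node among the PD nodes, but the Dashboard still shows it in the PD node list.
Which of the two should be treated as authoritative, and how can the two be brought back in sync?
[Attachments: screenshots/logs/monitoring]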

10.247.168.18:2378   pd            10.247.168.18   2378/2380    linux/x86_64  Up       /tidb-data/pd-2378            /tidb-deploy/pd-2378
10.247.168.77:2378   pd            10.247.168.77   2378/2380    linux/x86_64  Up|L|UI  /tidb-data/pd-2378            /tidb-deploy/pd-2378
10.247.168.75:9090   prometheus    10.247.168.75   9090         linux/x86_64  Up       /tidb-data/prometheus-9090    /tidb-deploy/prometheus-9090

Billmay表妹 · January 6, 2023, 03:36

Did this problem only appear after an upgrade?

Kongdom · January 6, 2023, 03:36

The tiup output should probably be treated as authoritative. Try restarting the cluster.

EricSong · January 6, 2023, 03:46

No, this cluster has never been upgraded. The node simply went down after its disk filled up and was then restarted.

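EricSong · January 6, 2023, 03:49

I'll first look for another approach. Because this problem appeared after a node restart, I'm worried the same thing could happen in production, where restarting the cluster would not be an option.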

tidb菜鸟一只 · January 6, 2023, 06:32

SELECT * FROM INFORMATION_SCHEMA.CLUSTER_INFO;

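srstack · January 6, 2023, 09:00

Is tiup missing a PD node? Has any operation been performed on that PD node through tiup before?
tiup stores its topology information locally on the machine where tiup runs, so unless something was changed through tiup, tiup cluster display should in theory not lose any entries.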
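
Since tiup reads its own local topology while the Dashboard and INFORMATION_SCHEMA.CLUSTER_INFO reflect what the running cluster reports, one way to pin down the difference is to diff the two PD lists. The following is a minimal sketch, not an official procedure. The function names pd_from_display and only_in_second are made up for illustration, and the mysql host, port, and user in the comments are placeholders.

```shell
#!/bin/sh
# Sketch: compare the PD instances tiup knows about with the ones the cluster reports.

# Read `tiup cluster display <cluster>` output on stdin and print the ID
# ("host:port") of every row whose Role column is exactly "pd".
pd_from_display() {
  awk '$2 == "pd" { print $1 }' | LC_ALL=C sort -u
}

# Given two files of "host:port" lines, print the entries that appear
# only in the second file.
only_in_second() {
  sorted_a=$(mktemp) && sorted_b=$(mktemp) || return 1
  LC_ALL=C sort -u "$1" > "$sorted_a"
  LC_ALL=C sort -u "$2" > "$sorted_b"
  # comm -13 hides lines unique to the first file and lines common to both.
  LC_ALL=C comm -13 "$sorted_a" "$sorted_b"
  rm -f "$sorted_a" "$sorted_b"
}

# Possible usage (host, port and user are placeholders, adjust to your cluster):
#   tiup cluster display tidb-lab | pd_from_display > tiup_pd.txt
#   mysql -h <tidb-host> -P 4000 -u root -N -B \
#     -e "SELECT INSTANCE FROM INFORMATION_SCHEMA.CLUSTER_INFO WHERE TYPE = 'pd'" > cluster_pd.txt
#   only_in_second tiup_pd.txt cluster_pd.txt
```

Any line printed by only_in_second is a PD that the cluster reports but that tiup's local topology does not contain.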

EricSong · January 10, 2023, 03:17

That SQL returns three PD nodes, which matches the Dashboard.
Screenshot 2023-01-10 11.16.43

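EricSong · January 10, 2023, 03:18

Yes, tiup display is missing one PD node. A scale-out was performed on this node a few months ago, but as far as I know nothing similar has been done recently.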

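tidb菜鸟一只 · January 10, 2023, 03:36

Is the pd process actually still running on that host? Also run tiup cluster edit-config tidb-lab and check whether this PD's entry is still present in the online configuration.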

EricSong · January 10, 2023, 03:42

It is no longer in edit-config:

pd_servers:
- host: 10.247.168.18
  ssh_port: 22
  name: pd-10.247.168.18-2378
  client_port: 2378
  peer_port: 2380
  deploy_dir: /tidb-deploy/pd-2378
  data_dir: /tidb-data/pd-2378
  log_dir: /tidb-deploy/pd-2378/log
  arch: amd64
  os: linux
- host: 10.247.168.77
  ssh_port: 22
  name: pd-10.247.168.77-2378
  client_port: 2378
  peer_port: 2380
  deploy_dir: /tidb-deploy/pd-2378
  data_dir: /tidb-data/pd-2378
  log_dir: /tidb-deploy/pd-2378/log
  arch: amd64
  os: linux
cdc_servers:

tidb菜鸟一只 · January 10, 2023, 06:17

Is the pd process still running on that host? I think you could use tiup to scale out again, specifying this PD node, and see what happens.

EricSong · January 10, 2023, 06:51

The PD process is still running on the machine. I'll try scaling it out again with this node specified.
tidb 1430 1.9 2.0 20237948 334792 ? Ssl Jan04 174:45 bin/pd-server --name=pd-10.247.168.75-2378 --client-urls=http://0.0.0.0:2378 --advertise-client-urls=http://10.247.168.75:2378 --peer-urls=http://0.0.0.0:2380 --advertise-peer-urls=http://10.247.168.75:2380 --data-dir=/tidb-data/pd-2378 --join=http://10.247.168.18:2378,http://10.247.168.77:2378 --config=conf/pd.toml --log-file=/tidb-deploy/pd-2378/log/pd.log
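
The process line above already carries the facts needed to re-declare the node in a scale-out topology (its name, ports, data directory, and the PDs it joined). As a rough sketch, a small helper can pull individual --flag=value arguments out of such a command line. flag_value is a hypothetical name used only for this illustration.

```shell
#!/bin/sh
# Sketch: print the value of one --flag=value argument from a command line on stdin.
# $1 is the flag name without the leading dashes, for example "name" or "join".
flag_value() {
  # Put each argument on its own line, then keep only the first one that
  # starts with --<flag>= and strip that prefix.
  tr ' ' '\n' | sed -n "s/^--$1=//p" | head -n 1
}

# Possible usage on the PD host:
#   ps -eo args | grep '[p]d-server' | flag_value name
#   ps -eo args | grep '[p]d-server' | flag_value join
```

Because the match is anchored at the start of the argument, asking for client-urls does not accidentally return the value of --advertise-client-urls.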

EricSong · January 10, 2023, 07:37

The scale-out failed with an error. Running systemctl enable node_exporter-9100.service manually produces the same error.

2023-01-10T07:31:22.234Z ERROR CheckPoint {"host": "10.247.168.75", "port": 22, "user": "tidb", "sudo": true, "cmd": "systemctl daemon-reload && systemctl enable node_exporter-9100.service", "stdout": "", "stderr": "Failed to execute operation: No such file or directory\n", "error": "executor.ssh.execute_failed: Failed to execute command over SSH for 'tidb@10.247.168.75:22' {ssh_stderr: Failed to execute operation: No such file or directory\n, ssh_stdout: , ssh_command: export LANG=C; PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin /usr/bin/sudo -H bash -c "systemctl daemon-reload && systemctl enable node_exporter-9100.service"}, cause: Process exited with status 1", "errorVerbose": "executor.ssh.execute_failed: Failed to execute command over SSH for 'tidb@10.247.168.75:22' {ssh_stderr: Failed to execute operation: No such file or directory\n, ssh_stdout: , ssh_command: export LANG=C; PATH=$PATH:/bin:/sbin:/usr/bin:/usr/sbin /usr/bin/sudo -H bash -c "systemctl daemon-reload && systemctl enable node_exporter-9100.service"}, cause: Process exited with status 1\n at github.com/pingcap/tiup/pkg/cluster/executor.(*EasySSHExecutor).Execute()\n\tgithub.com/pingcap/tiup/pkg/cluster/executor/ssh.go:174\n at github.com/pingcap/tiup/pkg/cluster/executor.(*CheckPointExecutor).Execute()\n\tgithub.com/pingcap/tiup/pkg/cluster/executor/checkpoint.go:85\n at github.com/pingcap/tiup/pkg/cluster/module.(*SystemdModule).Execute()\n\tgithub.com/pingcap/tiup/pkg/cluster/module/systemd.go:98\n at github.com/pingcap/tiup/pkg/cluster/operation.systemctl()\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:376\n at github.com/pingcap/tiup/pkg/cluster/operation.systemctlMonitor.func1()\n\tgithub.com/pingcap/tiup/pkg/cluster/operation/action.go:286\n at golang.org/x/sync/errgroup.(*Group).Go.func1()\n\tgolang.org/x/sync@v0.0.0-20210220032951-036812b2e83c/errgroup/errgroup.go:57\n at runtime.goexit()\n\truntime/asm_amd64.s:1581", "hash": "ce8eb0a645cc3ead96a44d67b1ecd5034d112cf0", "func": "github.com/pingcap/tiup/pkg/cluster/executor.(*CheckPointExecutor).Execute", "hit": false}

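srstack · January 10, 2023, 14:23

Manually take the PD on the abnormal node offline, then scale it out again through tiup. The earlier tiup scale-out probably went wrong.
You can also run tiup cluster audit and upload the audit log of that scale-out operation so we can take a look.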

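tidb菜鸟一只 · January 11, 2023, 00:38

Something is wrong with how this pd is being started now. Connect with pd-ctl and run health to check the status, then use member to find the ID of the abnormal node, remove it with member delete id 1319539429105371180, and scale out again.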
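
Spelled out, the pd-ctl sequence described above could look like the sketch below. It only prints the commands so they can be reviewed before anything is run. pd_remove_plan is a hypothetical helper name, and the version tag passed to tiup ctl is a placeholder that has to match the cluster's actual version.

```shell
#!/bin/sh
# Sketch: print (do not run) the pd-ctl commands for checking PD health,
# listing members, and removing one member by ID.
#   $1: cluster version tag for `tiup ctl` (placeholder, must match your cluster)
#   $2: client URL of a healthy PD
#   $3: member ID taken from the `member` output
pd_remove_plan() {
  if [ "$#" -ne 3 ]; then
    echo "usage: pd_remove_plan <version> <pd-url> <member-id>" >&2
    return 2
  fi
  ctl="tiup ctl:$1 pd -u $2"
  printf '%s\n' \
    "$ctl health" \
    "$ctl member" \
    "$ctl member delete id $3"
}

# Possible usage:
#   pd_remove_plan <version> http://10.247.168.18:2378 <member-id>
```

Copy the member ID from the member output as text. It is a 64-bit integer, and JSON tools that treat numbers as floating point may round it.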

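EricSong · January 17, 2023, 07:29

The earlier scale-out failure turned out to be caused by node_exporter.service not being registered on that machine. After creating the relevant link and registering the service, the scale-out succeeded, and tiup and the Dashboard now show the same nodes.
My guess is that the earlier display inconsistency may have been related to the node_exporter problem.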
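
For anyone hitting the same "No such file or directory" error from systemctl enable, a quick check is whether the unit file is actually present where systemd looks for it. The sketch below assumes the unit lives in /etc/systemd/system, which may differ on your system. unit_status is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: report whether a systemd unit file exists in a unit directory.
#   $1: unit file name, for example node_exporter-9100.service
#   $2: directory to look in (assumed default: /etc/systemd/system)
unit_status() {
  unit=$1
  dir=${2:-/etc/systemd/system}
  # -e follows symlinks, so a dangling link is reported as missing too.
  if [ -e "$dir/$unit" ]; then
    echo "present: $dir/$unit"
  else
    echo "missing: $dir/$unit"
    return 1
  fi
}

# Possible usage on the affected host:
#   unit_status node_exporter-9100.service
```

If the file is reported missing, restoring the unit file or link and then running systemctl daemon-reload before systemctl enable matches the fix described in this post.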

system · March 18, 2023, 07:30

This topic was automatically closed 60 days after the last reply. New replies are no longer allowed.