Question: scale-out fails after force scaling in TiFlash with --force

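To help us resolve this quickly, please provide the following information; a clear problem description gets answered faster:
【TiDB environment】
Physical-machine cluster
【Overview】Scenario and problem summary

【Background】Operations performed
Force scaled in the TiFlash component with --force
【Symptoms】Application and database behavior
Scale-out fails with an error
【Business impact】

【TiDB version】
v5.1.0
【Attachments】
Scaling out TiFlash again fails with the error below (the same error also appears when scaling out other components):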

  • [ Serial ] - Mkdir: host=100.73.36.83, directories='/apps/tidbdeploy/tiflash-9000','/apps/tidbdeploy/tiflash-9000/bin','/apps/tidbdeploy/tiflash-9000/conf','/apps/tidbdeploy/tiflash-9000/scripts'

Error: executor.ssh.execute_failed: Failed to execute command over SSH for 'apps@100.73.36.84:22' {ssh_stderr: , ssh_stdout: , ssh_command: export LANG=C; PATH=$PATH:/usr/bin:/usr/sbin sudo -H bash -c "test -d /apps || (mkdir -p /apps && chown apps:$(id -g -n apps) /apps)"}, cause: ssh: handshake failed: ssh: unable to authenticate, attempted methods [none publickey], no supported methods remain

Verbose debug logs has been written to /home/apps/.tiup/logs/tiup-cluster-debug-2021-07-12-10-48-14.log.
Error: run /home/apps/.tiup/components/cluster/v1.5.2/tiup-cluster (wd:/home/apps/.tiup/data/Scuk0yq) failed: exit status 1


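I'd like to ask:
1. Setting up SSH mutual trust between the nodes did not help. When I replaced the IP address with a hostname and mapped that hostname to the IP in the hosts file, scale-out succeeded. Why is that?
2. We are required to use IP addresses. What do I need to change so that scale-out works with IP addresses?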

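When you scaled in, did you first remove the TiFlash replicas, for example:
ALTER TABLE <db_name>.<table_name> SET TIFLASH REPLICA 0;
Also, please post the error log.
Then check the output of the following command:
curl http://pd-ip:pd-port/pd/api/v1/config/rules/group/tiflash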
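
For the replica question above, one way to see which tables still have TiFlash replicas is to query `information_schema.tiflash_replica` and generate the `ALTER TABLE` statements from the result. This is only a minimal sketch: it assumes a `mysql` client is installed, the TiDB host, port, and user are placeholders, and the function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Placeholders (assumptions): adjust to your TiDB endpoint and account.
TIDB_HOST="${TIDB_HOST:-127.0.0.1}"
TIDB_PORT="${TIDB_PORT:-4000}"
TIDB_USER="${TIDB_USER:-root}"

# Print "schema<TAB>table" for every table that has a TiFlash replica.
# Add -p to the mysql call if the account needs a password.
list_tiflash_tables() {
  mysql -h "$TIDB_HOST" -P "$TIDB_PORT" -u "$TIDB_USER" -N -B -e \
    "SELECT TABLE_SCHEMA, TABLE_NAME FROM information_schema.tiflash_replica"
}

# Turn "schema<TAB>table" lines on stdin into ALTER TABLE statements.
gen_drop_replica_sql() {
  tab="$(printf '\t')"
  while IFS="$tab" read -r schema table; do
    # Skip blank or incomplete lines.
    if [ -z "$schema" ] || [ -z "$table" ]; then
      continue
    fi
    printf 'ALTER TABLE `%s`.`%s` SET TIFLASH REPLICA 0;\n' "$schema" "$table"
  done
}
```

Running `list_tiflash_tables | gen_drop_replica_sql` only prints the statements, so they can be reviewed before being executed.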

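The ALTER TABLE ... SET TIFLASH REPLICA 0 statement was probably not run; as far as I remember it was a forced scale-in. Because we urgently needed TiFlash yesterday, I configured hostnames and scaled out that way. The current output is: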
[
  {
    "group_id": "tiflash",
    "id": "table-193-r",
    "override": true,
    "start_key": "7480000000000000FFC15F720000000000FA",
    "end_key": "7480000000000000FFC200000000000000F8",
    "role": "learner",
    "count": 1,
    "label_constraints": [
      {"key": "engine", "op": "in", "values": ["tiflash"]}
    ]
  },
  {
    "group_id": "tiflash",
    "id": "table-195-r",
    "override": true,
    "start_key": "7480000000000000FFC35F720000000000FA",
    "end_key": "7480000000000000FFC400000000000000F8",
    "role": "learner",
    "count": 1,
    "label_constraints": [
      {"key": "engine", "op": "in", "values": ["tiflash"]}
    ]
  },
  {
    "group_id": "tiflash",
    "id": "table-197-r",
    "override": true,
    "start_key": "7480000000000000FFC55F720000000000FA",
    "end_key": "7480000000000000FFC600000000000000F8",
    "role": "learner",
    "count": 1,
    "label_constraints": [
      {"key": "engine", "op": "in", "values": ["tiflash"]}
    ]
  },
  {
    "group_id": "tiflash",
    "id": "table-199-r",
    "override": true,
    "start_key": "7480000000000000FFC75F720000000000FA",
    "end_key": "7480000000000000FFC800000000000000F8",
    "role": "learner",
    "count": 1,
    "label_constraints": [
      {"key": "engine", "op": "in", "values": ["tiflash"]}
    ]
  },
  {
    "group_id": "tiflash",
    "id": "table-201-r",
    "override": true,
    "start_key": "7480000000000000FFC95F720000000000FA",
    "end_key": "7480000000000000FFCA00000000000000F8",
    "role": "learner",
    "count": 1,
    "label_constraints": [
      {"key": "engine", "op": "in", "values": ["tiflash"]}
    ]
  },
  {
    "group_id": "tiflash",
    "id": "table-203-r",
    "override": true,
    "start_key": "7480000000000000FFCB5F720000000000FA",
    "end_key": "7480000000000000FFCC00000000000000F8",
    "role": "learner",
    "count": 1,
    "label_constraints": [
      {"key": "engine", "op": "in", "values": ["tiflash"]}
    ]
  },
  {
    "group_id": "tiflash",
    "id": "table-205-r",
    "override": true,
    "start_key": "7480000000000000FFCD5F720000000000FA",
    "end_key": "7480000000000000FFCE00000000000000F8",
    "role": "learner",
    "count": 1,
    "label_constraints": [
      {"key": "engine", "op": "in", "values": ["tiflash"]}
    ]
  },
  {
    "group_id": "tiflash",
    "id": "table-207-r",
    "override": true,
    "start_key": "7480000000000000FFCF5F720000000000FA",
    "end_key": "7480000000000000FFD000000000000000F8",
    "role": "learner",
    "count": 1,
    "label_constraints": [
      {"key": "engine", "op": "in", "values": ["tiflash"]}
    ]
  }
]

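Example:
curl -v -X DELETE http://<pd_ip>:<pd_port>/pd/api/v1/config/rule/tiflash/table-193-r
Delete every one of these rules with the command above, one call per rule ID; after that you should be able to scale out. Please report back once it works.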
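
The per-rule DELETE call above can be looped over every rule ID returned by the `rules/group/tiflash` query. Below is a minimal sketch, assuming `curl`, `grep`, and `sed` are available; `PD_ADDR` is a placeholder for your PD endpoint and the function names are hypothetical, not part of any TiDB tooling.

```shell
#!/usr/bin/env bash
# Placeholder (assumption): set PD_ADDR to your PD endpoint as host:port.
PD_ADDR="${PD_ADDR:-127.0.0.1:2379}"

# Extract rule IDs from the JSON printed by the rules/group/tiflash endpoint.
# Uses grep/sed so it works without jq; "group_id" is not matched because
# the pattern requires a double quote directly before "id".
extract_rule_ids() {
  grep -o '"id": *"[^"]*"' | sed 's/^"id": *"\(.*\)"$/\1/'
}

# Build the DELETE URL for one rule ID in the tiflash group.
rule_delete_url() {
  printf 'http://%s/pd/api/v1/config/rule/tiflash/%s\n' "$PD_ADDR" "$1"
}

# List the rules in the tiflash group, then delete them one by one.
delete_tiflash_rules() {
  curl -s "http://${PD_ADDR}/pd/api/v1/config/rules/group/tiflash" \
    | extract_rule_ids \
    | while read -r id; do
        echo "deleting ${id}"
        curl -s -X DELETE "$(rule_delete_url "$id")"
      done
}
```

Calling `delete_tiflash_rules` performs the deletions; piping the query output through `extract_rule_ids` alone is a safe dry run that only lists the IDs.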

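OK. Once the testing is finished I'll force scale-in again and then try your method. By the way, why did scale-out work when I configured hostnames, as described above?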

That needs a closer look. My blind guess is that it is related to SSH trust; check how the IPs you configured map to hostnames.

Everything is configured on the business network :joy:, and the SSH trust was set up on the business network as well.

Please apply the fix I gave above first and see whether your problem still occurs, so that I can confirm whether that is the cause.

OK, I'll check whether the testing has finished.

Wait, one more thing I should add: we also force scaled in Drainer earlier, and hit this same problem when scaling it out again.

Was that before or after the TiFlash operation?
Please write out the exact command you used for scale-out, and which user ran it.

Scale-out command: tiup cluster scale-out he5db scale-tiflash.yaml --user apps -p

Hmm, why run it this way if SSH trust is already configured? And is apps the account you used when deploying the cluster?

Yes, everything uses the apps user.

Then check whether that user can log in to the machine being added, whether it has sudo privileges, and so on.
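
One way to run this check the way the failing step does is to attempt a key-only login followed by a non-interactive `sudo`. The sketch below rests on an assumption that should be verified on your control machine: that TiUP keeps its per-cluster private key under `~/.tiup/storage/cluster/clusters/<cluster-name>/ssh/id_rsa`. The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Assumed location of the key TiUP uses for a cluster (verify before relying on it).
tiup_key_path() {
  printf '%s/.tiup/storage/cluster/clusters/%s/ssh/id_rsa\n' "$HOME" "$1"
}

# Usage: check_ssh_sudo <cluster-name> <user@host>
check_ssh_sudo() {
  key="$(tiup_key_path "$1")"
  # BatchMode=yes forbids password prompts, so only public-key auth is tried.
  # sudo -n fails instead of prompting when a password would be required.
  if ssh -i "$key" -o BatchMode=yes -o ConnectTimeout=5 "$2" 'sudo -n true'; then
    echo "ok: $2"
  else
    echo "FAILED: $2"
    return 1
  fi
}
```

For this thread that would be, for example, `check_ssh_sudo he5db apps@100.73.36.84`. The log above shows the Mkdir step for 100.73.36.83 but the SSH failure for 100.73.36.84, so it may be worth testing both addresses, and the hostname form as well, to see where they differ.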

It can log in, and it has sudo privileges too :joy:

I tried it. After deleting the tiflash rules, scale-out still fails.

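Three things need to be confirmed:
1. Does curl http://pd-ip:pd-port/pd/api/v1/config/rules/group/tiflash now return an empty result?
2. Does the apps directory exist on the machine being added, and are its owner and group correct?
3. Could you share the scale-out file?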
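
For the second point, the command in the error log creates `/apps` if it is missing and runs `chown apps:$(id -g -n apps) /apps`, so the expected owner and group are the `apps` user and its primary group. A small sketch for comparing that against what is actually on disk; the function names are hypothetical, and the owner is parsed from `ls -ld` only for portability.

```shell
#!/usr/bin/env bash
# Print "owner group" of a directory, parsed from `ls -ld`.
dir_owner_group() {
  ls -ld "$1" | awk '{print $3, $4}'
}

# Usage: check_dir_owner <dir> <expected-user>
# Compares the directory's owner and group with the user and its primary group.
check_dir_owner() {
  expected="$2 $(id -g -n "$2")"
  actual="$(dir_owner_group "$1")"
  if [ "$actual" = "$expected" ]; then
    echo "ok: $1 is owned by $actual"
  else
    echo "mismatch: $1 is owned by '$actual', expected '$expected'"
    return 1
  fi
}
```

On the target machine this would be run as `check_dir_owner /apps apps`.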

1. [image]
2. The apps directory exists, and its permissions are the same as on the other machines.
[image]
3. Scale-out file:
[image]

Point data_dir to a path under the apps directory, or leave it out altogether.
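
Since the original scale-out file was only posted as a screenshot, here is a rough sketch of what a topology following that advice could look like, written from a shell helper so the host can be passed in. The `deploy_dir` is taken from the Mkdir step in the log; the `data_dir` path is hypothetical and only illustrates "somewhere under /apps". The helper name is made up for illustration.

```shell
#!/usr/bin/env bash
# Write a minimal TiFlash scale-out topology.
# Usage: write_tiflash_topology <host> <output-file>
write_tiflash_topology() {
  cat > "$2" <<EOF
tiflash_servers:
  - host: $1
    deploy_dir: /apps/tidbdeploy/tiflash-9000
    # data_dir is optional; the path below is a hypothetical example under /apps.
    # Delete this line to fall back on the cluster's defaults.
    data_dir: /apps/tidbdata/tiflash-9000
EOF
}
```

For example, `write_tiflash_topology 100.73.36.83 scale-tiflash.yaml` produces a file that can be reviewed before it is passed to `tiup cluster scale-out`.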