tidb lightning import reports "scatter region failed"

[TiDB Environment] Production / Test / PoC
[TiDB Version] v6.5.2
[Reproduction Path] Exported a TiDB cluster with dumpling, then used tidb lightning to import the data into another TiDB cluster; the log reports errors:
[2023/07/27 16:36:01.874 +08:00] [WARN] [localhelper.go:448] ["scatter region failed"] [regionCount=13] [failedCount=2] [error="region 83640 not found"] [errorVerbose="region 83640 not found\ngithub.com/pingcap/tidb/br/pkg/lightning/backend/local.(*Backend).BatchSplitRegions.func2\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/lightning/backend/local/localhelper.go:428\ngithub.com/pingcap/tidb/br/pkg/utils.WithRetry\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/utils/retry.go:56\ngithub.com/pingcap/tidb/br/pkg/lightning/backend/local.(*Backend).BatchSplitRegions\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/lightning/backend/local/localhelper.go:420\ngithub.com/pingcap/tidb/br/pkg/lightning/backend/local.(*Backend).SplitAndScatterRegionByRanges.func3\n\t/home/jenkins/agent/workspace/build-common/go/src/github.com/pingcap/br/br/pkg/lightning/backend/local/localhelper.go:293\ngolang.org/x/sync/errgroup.(*Group).Go.func1\n\t/go/pkg/mod/golang.org/x/sync@v0.2.0/errgroup/errgroup.go:75\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1598"]
[2023/07/27 16:36:01.878 +08:00] [INFO] [localhelper.go:317] ["batch split region"] [region_id=82615] [keys=1] [firstKey="dIAAAAAAAAC2X2mAAAAAAAAAAgFoAAAAAAAAAPgBbTE1ODU4Mzn/MzQ2NgAAAAD7ATE1ODU4Mzkz/zQ2NgAAAAAA+g=="] [end="dIAAAAAAAAC2X2mAAAAAAAAAAgFoAAAAAAAAAPgBbTE1ODU4Mzn/MzQ2NgAAAAD7ATE1ODU4Mzkz/zQ2NgAAAAAA+g=="]
[2023/07/27 16:36:01.879 +08:00] [INFO] [localhelper.go:317] ["batch split region"] [region_id=82623] [keys=1] [firstKey="dIAAAAAAAAC2X2mAAAAAAAAAAgFqAAAAAAAAAPgBbTE1ODUxNzn/MjcxNQAAAAD7ATE1ODUxNzky/zcxNQAAAAAA+g=="] [end="dIAAAAAAAAC2X2mAAAAAAAAAAgFqAAAAAAAAAPgBbTE1ODUxNzn/MjcxNQAAAAD7ATE1ODUxNzky/zcxNQAAAAAA+g=="]
[2023/07/27 16:36:01.879 +08:00] [INFO] [localhelper.go:317] ["batch split region"] [region_id=82635] [keys=1] [firstKey="dIAAAAAAAAC2X2mAAAAAAAAAAgFtAAAAAAAAAPgBbTE1NjM0MDH/MjQzNwAAAAD7ATE1NjM0MDEy/zQzNwAAAAAA+g=="] [end="dIAAAAAAAAC2X2mAAAAAAAAAAgFtAAAAAAAAAPgBbTE1NjM0MDH/MjQzNwAAAAD7ATE1NjM0MDEy/zQzNwAAAAAA+g=="]
[2023/07/27 16:36:01.879 +08:00] [INFO] [localhelper.go:317] ["batch split region"] [region_id=82631] [keys=1] [firstKey="dIAAAAAAAAC2X2mAAAAAAAAAAgFsAAAAAAAAAPgBbTE1NzExMzn/Mzg2NQAAAAD7ATE1NzExMzkz/zg2NQAAAAAA+g=="] [end="dIAAAAAAAAC2X2mAAAAAAAAAAgFsAAAAAAAAAPgBbTE1NzExMzn/Mzg2NQAAAAD7ATE1NzExMzkz/zg2NQAAAAAA+g=="]
[2023/07/27 16:36:01.879 +08:00] [INFO] [localhelper.go:317] ["batch split region"] [region_id=82643] [keys=2] [firstKey="dIAAAAAAAAC2X2mAAAAAAAAAAgFvAAAAAAAAAPgBbTE1NTE2MDn/NTk2NgAAAAD7ATE1NTE2MDk1/zk2NgAAAAAA+g=="] [end="dIAAAAAAAAC2X2mAAAAAAAAAAgFvAAAAAAAAAPgBenp6enp6enr/enp6enpkdAD+ATE1NjA0NjI3/zY4NQAAAAAA+gA="]
[2023/07/27 16:36:01.879 +08:00] [INFO] [localhelper.go:317] ["batch split region"] [region_id=82599] [keys=1] [firstKey="dIAAAAAAAAC2X2mAAAAAAAAAAgFkAAAAAAAAAPgBbTE3MDc2ODX/MzA5NQAAAAD7ATE3MDc2ODUz/zA5NQAAAAAA+g=="] [end="dIAAAAAAAAC2X2mAAAAAAAAAAgFkAAAAAAAAAPgBbTE3MDc2ODX/MzA5NQAAAAD7ATE3MDc2ODUz/zA5NQAAAAAA+g=="]
[2023/07/27 16:36:01.879 +08:00] [INFO] [localhelper.go:317] ["batch split region"] [region_id=82611] [keys=1] [firstKey="dIAAAAAAAAC2X2mAAAAAAAAAAgFnAAAAAAAAAPgBbTE1OTEwMjn/NjY1NgAAAAD7ATE1OTEwMjk2/zY1NgAAAAAA+g=="] [end="dIAAAAAAAAC2X2mAAAAAAAAAAgFnAAAAAAAAAPgBbTE1OTEwMjn/NjY1NgAAAAD7ATE1OTEwMjk2/zY1NgAAAAAA+g=="]
[2023/07/27 16:36:01.879 +08:00] [WARN] [localhelper.go:448] ["scatter region failed"] [regionCount=1] [failedCount=1] [error="rpc error: code = Unknown desc = region 83680 is not fully replicated"]
[2023/07/27 16:36:01.879 +08:00] [INFO] [localhelper.go:317] ["batch split region"] [region_id=82627] [keys=1] [firstKey="dIAAAAAAAAC2X2mAAAAAAAAAAgFrAAAAAAAAAPgBbTE1NzIwMzj/NDkxOAAAAAD7ATE1NzIwMzg0/zkxOAAAAAAA+g=="] [end="dIAAAAAAAAC2X2mAAAAAAAAAAgFrAAAAAAAAAPgBbTE1NzIwMzj/NDkxOAAAAAD7ATE1NzIwMzg0/zkxOAAAAAAA+g=="]
[2023/07/27 16:36:01.879 +08:00] [INFO] [localhelper.go:317] ["batch split region"] [region_id=82603] [keys=1] [firstKey="dIAAAAAAAAC2X2mAAAAAAAAAAgFlAAAAAAAAAPgBbTE1OTk4ODj/NTI5OQAAAAD7ATE1OTk4ODg1/zI5OQAAAAAA+g=="] [end="dIAAAAAAAAC2X2mAAAAAAAAAAgFlAAAAAAAAAPgBbTE1OTk4ODj/NTI5OQAAAAD7ATE1OTk4ODg1/zI5OQAAAAAA+g=="]
[2023/07/27 16:36:01.879 +08:00] [INFO] [localhelper.go:317] ["batch split region"] [region_id=82639] [keys=1] [firstKey="dIAAAAAAAAC2X2mAAAAAAAAAAgFuAAAAAAAAAPgBbTE1NTU2NjD/MDUyMQAAAAD7ATE1NTU2NjAw/zUyMQAAAAAA+g=="] [end="dIAAAAAAAAC2X2mAAAAAAAAAAgFuAAAAAAAAAPgBbTE1NTU2NjD/MDUyMQAAAAD7ATE1NTU2NjAw/zUyMQAAAAAA+g=="]
[2023/07/27 16:36:01.879 +08:00] [INFO] [localhelper.go:317] ["batch split region"] [region_id=82607] [keys=1] [firstKey="dIAAAAAAAAC2X2mAAAAAAAAAAgFmAAAAAAAAAPgBbTE1OTMxNjP/NTY5MgAAAAD7ATE1OTMxNjM1/zY5MgAAAAAA+g=="] [end="dIAAAAAAAAC2X2mAAAAAAAAAAgFmAAAAAAAAAPgBbTE1OTMxNjP/NTY5MgAAAAD7ATE1OTMxNjM1/zY5MgAAAAAA+g=="]
[2023/07/27 16:36:01.879 +08:00] [INFO] [localhelper.go:317] ["batch split region"] [region_id=82595] [keys=1] [firstKey="dIAAAAAAAAC2X2mAAAAAAAAAAgFjAAAAAAAAAPgBbTE3NjAwMzH/NDk1MQAAAAD7ATE3NjAwMzE0/zk1MQAAAAAA+g=="] [end="dIAAAAAAAAC2X2mAAAAAAAAAAgFjAAAAAAAAAPgBbTE3NjAwMzH/NDk1MQAAAAD7ATE3NjAwMzE0/zk1MQAAAAAA+g=="]
[2023/07/27 16:36:01.886 +08:00] [INFO] [localhelper.go:317] ["batch split region"] [region_id=82587] [keys=13] [firstKey=dIAAAAAAAAC2X2mAAAAAAAAAAQOAAAAAAAAAAQExNTAyOTIxNv85MTcAAAAAAPoBaQAAAAAAAAD4] [end="dIAAAAAAAAC2X2mAAAAAAAAAAgFiAAAAAAAAAPgBbTE3ODU4OTD/MDkxOQAAAAD7ATE3ODU4OTAw/zkxOQAAAAAA+g=="]
[2023/07/27 16:36:01.900 +08:00] [WARN] [localhelper.go:448] ["scatter region failed"] [regionCount=1] [failedCount=1] [error="rpc error: code = Unknown desc = region 83680 is not fully replicated"]
[2023/07/27 16:36:01.941 +08:00] [WARN] [localhelper.go:448] ["scatter region failed"] [regionCount=1] [failedCount=1] [error="rpc error: code = Unknown desc = region 83680 is not fully replicated"]
[2023/07/27 16:36:02.022 +08:00] [WARN] [localhelper.go:448] ["scatter region failed"] [regionCount=1] [failedCount=1] [error="rpc error: code = Unknown desc = region 83680 is not fully replicated"]
[2023/07/27 16:36:02.183 +08:00] [WARN] [localhelper.go:448] ["scatter region failed"] [regionCount=1] [failedCount=1] [error="rpc error: code = Unknown desc = region 83680 is not fully replicated"]
[2023/07/27 16:36:02.505 +08:00] [WARN] [localhelper.go:448] ["scatter region failed"] [regionCount=1] [failedCount=1] [error="rpc error: code = Unknown desc = region 83680 is not fully replicated"]

My tidb lightning configuration is as follows:
[lightning]
status-addr = ':8289'
level = "info"
file = "/home/tidb/tidb-lightning/tidb-lightning.log"
check-requirements = true
region-concurrency = 32

[checkpoint]
enable = true
schema = "tidb_lightning_checkpoint"
driver = "file"
dsn = "/data1/tidb-lightning/tidb_lightning_checkpoint.pb"

[tikv-importer]
disk-quota = "10GB"
backend = "local"
on-duplicate = "error"
sorted-kv-dir = "/data1/tidb-lightning/some-dir"
duplicate-resolution = 'remove'

[mydumper]
data-source-dir = "/home/tidb/tmp/onlinedata"

filter = ['*.*', '!mysql.*', '!sys.*', '!INFORMATION_SCHEMA.*', '!PERFORMANCE_SCHEMA.*', '!METRICS_SCHEMA.*', '!INSPECTION_SCHEMA.*']

[tidb]
host = "192.168.1.1"
port = 4000
user = "root"
password = "rootroot"
status-port = 10080

pd-addr = "192.168.1.1:2379"
log-level = "error"

The data source is about 240 GB. Right now this log line keeps spamming non-stop: [2023/07/27 16:48:08.951 +08:00] [WARN] [localhelper.go:448] ["scatter region failed"] [regionCount=1] [failedCount=1] [error="rpc error: code = Unknown desc = region 83680 is not fully replicated"]. Will this have any impact, such as data loss?
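
To gauge how often it fires, the occurrences can be counted against the log file configured above (a simple one-liner, assuming the log path from the [lightning] section):

# count how many times the warning appears in the lightning log
grep -c 'scatter region failed' /home/tidb/tidb-lightning/tidb-lightning.log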

I also checked with pd-ctl and saw the following, which I don't understand. The docs say extra-peer means a Region with extra replicas. But the cluster was just initialized and nothing was configured, so every Region should have exactly 3 replicas. Where does the extra replica come from?
» region check extra-peer
{
  "count": 1,
  "regions": [
    {
      "id": 83680,
      "start_key": "7480000000000000FFB65F698000000000FF0000020169000000FF00000000F8016D31FF333831303035FF33FF33373600000000FBFF0131333831303035FF33FF333736000000FF0000FA0000000000FA",
      "end_key": "7480000000000000FFB65F698000000000FF0000020169000000FF00000000F8016D31FF353834373538FF35FF33383100000000FBFF0131353834373538FF35FF333831000000FF0000FA0000000000FA",
      "epoch": {
        "conf_ver": 42,
        "version": 116
      },
      "peers": [
        {
          "id": 83681,
          "store_id": 5,
          "role_name": "Voter"
        },
        {
          "id": 83682,
          "store_id": 6,
          "role_name": "Voter"
        },
        {
          "id": 83683,
          "store_id": 2,
          "role_name": "Voter"
        },
        {
          "id": 83684,
          "store_id": 7,
          "role": 1,
          "role_name": "Learner",
          "is_learner": true
        }
      ],
      "leader": {
        "id": 83681,
        "store_id": 5,
        "role_name": "Voter"
      },
      "cpu_usage": 0,
      "written_bytes": 1921387,
      "read_bytes": 0,
      "written_keys": 2365,
      "read_keys": 0,
      "approximate_size": 549,
      "approximate_keys": 3655815
    }
  ]
}

Normally, with 3 replicas configured, any region that ends up with 4 or more peers is flagged as an abnormal extra-peer region.
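
The configured replica count can be confirmed in pd-ctl (a minimal check; on a default cluster the output should include "max-replicas": 3):

» config show replication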

How should I fix this? The cluster has 5 TiKV nodes and was just initialized.

The WARN logs don't matter. You won't lose data.
During an import the data changes a lot, so errors like region miss or leader not found are very common.
If it were an ERROR-level log, lightning would have stopped by itself.
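
If you want to double-check, one quick way (assuming the log path from the config above) is to search for ERROR entries:

# any hits here would indicate a real problem; no output means no errors logged
grep '\[ERROR\]' /home/tidb/tidb-lightning/tidb-lightning.log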

Besides, your cluster is freshly initialized, so relax and import with confidence.
If something really goes wrong, rebuilding it isn't much trouble. This is exactly the moment when a problem costs the least.

An extra replica is not a missing replica, so it doesn't affect the import. After the import finishes, the extra replica should be cleaned up automatically after a while. If you want to clean it up manually, you can run the command below in pd-ctl to remove that Learner peer (remove-peer takes the region ID and the store ID, so here it's region 83680 on store 7):
operator add remove-peer 83680 7
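
After scheduling it, the pending operator can be watched in pd-ctl until it completes:

» operator show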

As long as lightning hasn't exited by itself, the problem shouldn't be serious.
After the whole run finishes, check the lightning log; it prints a summary of the import at the end. If the log ends with "the whole procedure completed", the import succeeded. If you see an error at the end of the log, the import may genuinely have a problem.
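
For example, to inspect the tail of the log (assuming the log path from the config above):

# the last lines should contain the import summary and "the whole procedure completed"
tail -n 20 /home/tidb/tidb-lightning/tidb-lightning.log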

Try pd-ctl operator add remove-peer 83680 7 and see whether it removes the Learner-role peer.
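
Afterwards, query the region again in pd-ctl to confirm that only the three Voter peers remain:

» region 83680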