用tidb-lightning导数据报错:error="rpc error: code = Canceled desc = context canceled"

为提高效率,请提供以下信息,问题描述清晰能够更快得到解决:
【 TiDB 使用环境】
tidb 5.2集群

【概述】 场景 + 问题概述
数据是用dumpling导出来的,在用tidb-lightning导入时,报错如下:
[2021/12/24 17:04:00.465 +08:00] [WARN] [local.go:1762] [“scan region failed”] [error=“rpc error: code = Canceled desc = context canceled”] [region_len=0] [startKey=7480000000000001FF5B5F698000000000FF0000020380000000FF06A71CF603800000FF0005A3081F000000FC] [endKey=7480000000000001FF5B5F698000000000FF0000020380000000FF06AA736E03800000FF0002CE5784000000FD] [retry=0]
[2021/12/24 17:04:02.526 +08:00] [WARN] [local.go:1762] [“scan region failed”] [error=“rpc error: code = Canceled desc = context canceled”] [region_len=0] [startKey=7480000000000001FF5B5F698000000000FF0000020380000000FF06AA736E03800000FF0002CE5785000000FC] [endKey=7480000000000001FF5B5F698000000000FF0000020380000000FF06B27A2603800000FF0004D1B414000000FD] [retry=0]
[2021/12/24 17:04:04.835 +08:00] [WARN] [local.go:1762] [“scan region failed”] [error=“rpc error: code = Canceled desc = context canceled”] [region_len=0] [startKey=7480000000000001FF5B5F698000000000FF0000020380000000FF06B27A2E03800000FF0004D1B407000000FC] [endKey=7480000000000001FF5B5F698000000000FF0000020380000000FF06B8CEEE03800000FF0002A1CEF4000000FD] [retry=0]
[2021/12/24 17:04:20.929 +08:00] [WARN] [local.go:1762] [“scan region failed”] [error=“rpc error: code = Canceled desc = context canceled”] [region_len=0] [startKey=7480000000000001FF5B5F698000000000FF0000020380000000FF06B8CEEE03800000FF0002A67B00000000FC] [endKey=7480000000000001FF5B5F698000000000FF0000020380000000FF06C6023603800000FF000515A03F000000FD] [retry=0]
[2021/12/24 17:04:22.541 +08:00] [WARN] [local.go:1762] [“scan region failed”] [error=“rpc error: code = Canceled desc = context canceled”] [region_len=0] [startKey=7480000000000001FF5B5F698000000000FF0000020380000000FF06C6023E03800000FF0004DADC11000000FC] [endKey=7480000000000001FF5B5F698000000000FF0000020380000000FF06CE9F4603800000FF0001378FDA000000FD] [retry=0]
[2021/12/24 17:04:25.206 +08:00] [WARN] [local.go:1762] [“scan region failed”] [error=“rpc error: code = Canceled desc = context canceled”] [region_len=0] [startKey=7480000000000001FF5B5F698000000000FF0000020380000000FF06E6DEDE03800000FF00017843B8000000FC] [endKey=7480000000000001FF5B5F698000000000FF0000020380000000FF06EF833E03800000FF0004EF9B24000000FD] [retry=0]
[2021/12/24 17:04:26.008 +08:00] [WARN] [local.go:1762] [“scan region failed”] [error=“rpc error: code = Canceled desc = context canceled”] [region_len=0] [startKey=7480000000000001FF5B5F698000000000FF0000020380000000FF06EF833E03800000FF0004EF9B25000000FC] [endKey=7480000000000001FF5B5F698000000000FF0000020380000000FF06F641C603800000FF00027F275E000000FD] [retry=0]
[2021/12/24 17:04:26.179 +08:00] [WARN] [local.go:1762] [“scan region failed”] [error=“rpc error: code = Canceled desc = context canceled”] [region_len=0] [startKey=7480000000000001FF5B5F698000000000FF0000020380000000FF06CE9F4603800000FF000137B8AA000000FC] [endKey=7480000000000001FF5B5F698000000000FF0000020380000000FF06D7403603800000FF00035B751D000000FD] [retry=0]
[2021/12/24 17:04:27.855 +08:00] [WARN] [local.go:1762] [“scan region failed”] [error=“rpc error: code = Canceled desc = context canceled”] [region_len=0] [startKey=7480000000000001FF5B5F698000000000FF0000020380000000FF06D7403E03800000FF00035B2660000000FC] [endKey=7480000000000001FF5B5F698000000000FF0000020380000000FF06E1044E03800000FF00035EBE51000000FD] [retry=0]
[2021/12/24 17:04:28.065 +08:00] [WARN] [local.go:1762] [“scan region failed”] [error=“rpc error: code = Canceled desc = context canceled”] [region_len=0] [startKey=7480000000000001FF5B5F698000000000FF0000020380000000FF06E1044E03800000FF00035EBE52000000FC] [endKey=7480000000000001FF5B5F698000000000FF0000020380000000FF06E6DED603800000FF0001643A21000000FD] [retry=0]

【备份和数据迁移策略逻辑】
用dumpling导出,tidb-lightning导入
【背景】 做过哪些操作

【现象】 业务和数据库现象

【问题】 当前遇到的问题
用tidb-lightning导入,报错
【业务影响】
数据导入不完全,没成功导入
【TiDB 版本】
v5.2.2
【附件】

  • 相关日志、配置文件、Grafana 监控(https://metricstool.pingcap.com/)
  • TiUP Cluster Display 信息
  • TiUP CLuster Edit config 信息
  • TiDB-Overview 监控
  • 对应模块的 Grafana 监控(如有 BR、TiDB-binlog、TiCDC 等)tidb-lightning.log (2.7 MB)
  • 对应模块日志(包含问题前后 1 小时日志)

若提问为性能优化、故障排查类问题,请下载脚本运行。终端输出的打印结果,请务必全选并复制粘贴上传。

1 个赞

[2021/12/24 16:58:02.327 +08:00] [INFO] [pd.go:433] [“pause scheduler(configs)”] [name=“[balance-hot-region-scheduler,balance-leader-scheduler,balance-region-scheduler]”] [cfg=“{"enable-location-replacement":"false","leader-schedule-limit":24,"max-merge-region-keys":0,"max-merge-region-size":0,"max-pending-peer-count":2147483647,"max-snapshot-count":18,"region-schedule-limit":40}”]

看日志有pause scheduler,具体是做了什么操作吗。

1 个赞

没做操作,这个集群基本上现在没有人用,就是在导数据而已。

1 个赞

开二个任务同时导,有没有关系?现在tikv的数据量956G了,tikv6节点,tidb和pd都是三个节点。

1 个赞

你换成tidb模式导入看看

1 个赞

lightling.toml配置文件贴出来看下

1 个赞

[lightning]

转换数据的并发数,默认为逻辑 CPU 数量,不需要配置。

# # 混合部署的情况下可以配置为逻辑 CPU 的 75% 大小。

# # region-concurrency =

level = “info”
file = “/data/backup/eshop_db6m/tidb-lightning.log”
table-concurrency = 8
io-concurrency = 5

[tikv-importer]

# # backend 设置为 local 模式

backend = “local”
[mydumper]
batch-size = ‘100GiB’
[checkpoint]
enable = true
[tidb]
host = “X.X.X.X”
port = 4000
user = “root”
password = “XXXXXXXXXX”

表架构信息在从 TiDB 的“状态端口”获取。

status-port = 10080

pd-server 的地址,填一个即可

pd-addr = “X.X.X.X:2379”

1 个赞

用tidb模式导入之前清空了目标端的表吗

1 个赞

tidb模式导入之前没清空目标端的表的数据(现在只用tidb模式导入一张表,目标端库下面还有其它的表),报错如下:
[2021/12/25 22:26:27.481 +08:00] [INFO] [restore.go:2232] [“restore file completed”] [table=pagoda_customer6.store_goodsorder_goods_rel] [engineNumber=0] [fileIndex=1894] [path=pagoda_customer6.store_goodsorder_goods_rel.0000018940000.sql:0] [readDur=383.016904ms] [encodeDur=159.593555ms] [deliverDur=14.84553076s] [checksum="{cksum=0,size=23315312,kvs=119861}"] [takeTime=14.877126267s] []
[2021/12/25 22:26:27.481 +08:00] [INFO] [restore.go:2222] [“restore file start”] [table=pagoda_customer6.store_goodsorder_goods_rel] [engineNumber=0] [fileIndex=1902] [path=pagoda_customer6.store_goodsorder_goods_rel.0000019020000.sql:0]
[2021/12/25 22:26:28.242 +08:00] [INFO] [main.go:45] [“got signal to exit”] [signal=hangup]
[2021/12/25 22:26:28.242 +08:00] [WARN] [restore.go:1069] [“stopping periodic actions”] [error=“context canceled”]
[2021/12/25 22:26:28.242 +08:00] [INFO] [restore.go:1173] [“cancel periodic actions”] [do=false]
[2021/12/25 22:26:28.242 +08:00] [INFO] [restore.go:416] [“task canceled”] [step=3] [error=“restore table pagoda_customer6.store_goodsorder_goods_rel failed: context canceled”]
[2021/12/25 22:26:28.242 +08:00] [INFO] [restore.go:2222] [“restore file start”] [table=pagoda_customer6.store_goodsorder_goods_rel] [engineNumber=0] [fileIndex=1903] [path=pagoda_customer6.store_goodsorder_goods_rel.0000019030000.sql:0]
[2021/12/25 22:26:28.245 +08:00] [INFO] [restore.go:431] [“the whole procedure completed”] [takeTime=1h0m5.689551256s] []
[2021/12/25 22:26:28.255 +08:00] [INFO] [main.go:93] [“tidb lightning exit”]

现在把目标端的表store_goodsorder_goods_rel清空后采用tidb模式再导入,现在还在进行中,不知道能不能成功?

现在把目标端的表store_goodsorder_goods_rel清空后采用tidb模式再导入,重新导入,还是报错,就是只导入部份sql,就直接退出来了
[2021/12/25 23:58:35.255 +08:00] [INFO] [restore.go:2222] [“restore file start”] [table=pagoda_customer6.store_goodsorder_goods_rel] [engineNumber=0] [fileIndex=505] [path=pagoda_customer6.store_goodsorder_goods_rel.0000005050000.sql:0]
[2021/12/25 23:58:35.611 +08:00] [INFO] [main.go:45] [“got signal to exit”] [signal=hangup]
[2021/12/25 23:58:35.611 +08:00] [WARN] [restore.go:1069] [“stopping periodic actions”] [error=“context canceled”]
[2021/12/25 23:58:35.611 +08:00] [INFO] [restore.go:1173] [“cancel periodic actions”] [do=false]
[2021/12/25 23:58:35.611 +08:00] [INFO] [restore.go:416] [“task canceled”] [step=3] [error=“restore table pagoda_customer6.store_goodsorder_goods_rel failed: context canceled”]
[2021/12/25 23:58:35.611 +08:00] [INFO] [restore.go:431] [“the whole procedure completed”] [takeTime=23m2.488605913s] []
[2021/12/25 23:58:35.611 +08:00] [INFO] [restore.go:2222] [“restore file start”] [table=pagoda_customer6.store_goodsorder_goods_rel] [engineNumber=0] [fileIndex=506] [path=pagoda_customer6.store_goodsorder_goods_rel.0000005060000.sql:0]
[2021/12/25 23:58:35.611 +08:00] [INFO] [main.go:93] [“tidb lightning exit”]

最后一个sql是pagoda_customer6.store_goodsorder_goods_rel.0000019950000.sql
sql文件总数是2002个

被OOM kill了,加大点内存

从哪里能看出来oom了呢,/var/log/messages中无相关信息。

你可以执行命令,等程序报错后用echo $?看一下返回值。或者监控看下内存使用。

你是用nohup运行的lightning吗

是的,是这样放后台运行的。

nohup是直接运行的还是放在脚本文件里面。

如果直接在命令行中用 nohup 启动程序,可能会因为 SIGHUP 信号而退出,建议把 nohup 放到脚本里面,如:

#!/bin/bash
nohup ./tidb-lightning -config tidb-lightning.toml > nohup.out &

nohup是直接运行的,现在把它放到脚本文件中去试下,看还会不会退出

nohup放在脚本中,执行不报错,但是sql文件没应用完,直接退出来了
cat exec3.sh
nohup ./tidb-lightning --analyze=false --checksum=false --check-requirements=false --sorted-kv-dir=“/data/backup/tmptikv” -d “/data/backup/eshop_db6m/store_goodsorder_goods_rel” -config /data/backup/eshop_db6m/lightning.toml > nohup.out &
nohup sh exec3.sh >/data/backup/eshop_db6m/exec.log &

[2021/12/27 16:02:14.488 +08:00] [INFO] [table_restore.go:866] [“import and cleanup engine completed”] [engineTag=pagoda_customer6.store_goodsorder_goods_rel:0] [engineUUID=5496026c-6cdd-50d5-a15d-72cd867b8b31] [takeTime=60.157µs]
[2021/12/27 16:02:14.488 +08:00] [INFO] [table_restore.go:319] [“import whole table completed”] [table=pagoda_customer6.store_goodsorder_goods_rel] [takeTime=23m6.581488323s]
[2021/12/27 16:02:14.488 +08:00] [INFO] [backend.go:428] [“engine close start”] [engineTag=pagoda_customer6.store_goodsorder_goods_rel:-1] [engineUUID=76c3820a-76f4-5df9-a8b5-951ee941d514]
[2021/12/27 16:02:14.488 +08:00] [INFO] [backend.go:430] [“engine close completed”] [engineTag=pagoda_customer6.store_goodsorder_goods_rel:-1] [engineUUID=76c3820a-76f4-5df9-a8b5-951ee941d514] [takeTime=356ns]
[2021/12/27 16:02:14.488 +08:00] [INFO] [table_restore.go:857] [“import and cleanup engine start”] [engineTag=pagoda_customer6.store_goodsorder_goods_rel:-1] [engineUUID=76c3820a-76f4-5df9-a8b5-951ee941d514]
[2021/12/27 16:02:14.488 +08:00] [INFO] [backend.go:442] [“import start”] [engineTag=pagoda_customer6.store_goodsorder_goods_rel:-1] [engineUUID=76c3820a-76f4-5df9-a8b5-951ee941d514] [retryCnt=0]
[2021/12/27 16:02:14.488 +08:00] [INFO] [backend.go:445] [“import completed”] [engineTag=pagoda_customer6.store_goodsorder_goods_rel:-1] [engineUUID=76c3820a-76f4-5df9-a8b5-951ee941d514] [retryCnt=0] [takeTime=589ns]
[2021/12/27 16:02:14.488 +08:00] [INFO] [backend.go:457] [“cleanup start”] [engineTag=pagoda_customer6.store_goodsorder_goods_rel:-1] [engineUUID=76c3820a-76f4-5df9-a8b5-951ee941d514]
[2021/12/27 16:02:14.488 +08:00] [INFO] [backend.go:459] [“cleanup completed”] [engineTag=pagoda_customer6.store_goodsorder_goods_rel:-1] [engineUUID=76c3820a-76f4-5df9-a8b5-951ee941d514] [takeTime=313ns]
[2021/12/27 16:02:14.488 +08:00] [INFO] [table_restore.go:866] [“import and cleanup engine completed”] [engineTag=pagoda_customer6.store_goodsorder_goods_rel:-1] [engineUUID=76c3820a-76f4-5df9-a8b5-951ee941d514] [takeTime=71.533µs]
[2021/12/27 16:02:14.488 +08:00] [INFO] [tidb.go:363] [“alter table auto_increment start”] [table=pagoda_customer6.store_goodsorder_goods_rel] [auto_increment=2]
[2021/12/27 16:02:15.015 +08:00] [INFO] [tidb.go:365] [“alter table auto_increment completed”] [table=pagoda_customer6.store_goodsorder_goods_rel] [auto_increment=2] [takeTime=527.23718ms]
[2021/12/27 16:02:15.016 +08:00] [INFO] [table_restore.go:695] [“skip checksum”] [table=pagoda_customer6.store_goodsorder_goods_rel]
[2021/12/27 16:02:15.016 +08:00] [INFO] [table_restore.go:760] [“skip analyze”] [table=pagoda_customer6.store_goodsorder_goods_rel]
[2021/12/27 16:02:15.016 +08:00] [INFO] [restore.go:1295] [“restore table completed”] [table=pagoda_customer6.store_goodsorder_goods_rel] [takeTime=23m7.109855158s]
[2021/12/27 16:02:15.016 +08:00] [INFO] [restore.go:1371] [“restore all tables data completed”] [takeTime=23m7.111373226s]
[2021/12/27 16:02:15.016 +08:00] [INFO] [restore.go:1173] [“cancel periodic actions”] [do=false]
[2021/12/27 16:02:15.016 +08:00] [INFO] [restore.go:1488] [“skip full compaction”]
[2021/12/27 16:02:15.016 +08:00] [INFO] [restore.go:1072] [“everything imported, stopping periodic actions”]
[2021/12/27 16:02:15.029 +08:00] [INFO] [restore.go:1660] [“clean checkpoints start”] [keepAfterSuccess=false] [taskID=1640586793297545211]
[2021/12/27 16:02:15.033 +08:00] [INFO] [restore.go:1667] [“clean checkpoints completed”] [keepAfterSuccess=false] [taskID=1640586793297545211] [takeTime=4.278185ms]
[2021/12/27 16:02:15.033 +08:00] [INFO] [restore.go:431] [“the whole procedure completed”] [takeTime=23m7.250078203s]
[2021/12/27 16:02:15.034 +08:00] [INFO] [main.go:93] [“tidb lightning exit”]

数据只导进去了1/3,还有2/3没导进去。

从日志来看程序是正常退出的,没有导入进去的数据有没有办法两头验证一下?

我对了源端和目标端的数据,确实只导进去了1/3,继续再导的话,数据又能导进去一部份,但是没导完,又退出来了
[2021/12/27 17:31:59.164 +08:00] [INFO] [restore.go:2222] [“restore file start”] [table=pagoda_customer6.store_goodsorder_goods_rel] [engineNumber=0] [fileIndex=1399] [path=pagoda_customer6.store_goodsorder_goods_rel.0000013990000.sql:0]
[2021/12/27 17:32:02.187 +08:00] [INFO] [restore.go:2232] [“restore file completed”] [table=pagoda_customer6.store_goodsorder_goods_rel] [engineNumber=0] [fileIndex=1392] [path=pagoda_customer6.store_goodsorder_goods_rel.0000013920000.sql:0] [readDur=407.748294ms] [encodeDur=160.821254ms] [deliverDur=18.232594518s] [checksum="{cksum=0,size=23158346,kvs=119861}"] [takeTime=18.293565006s] []
[2021/12/27 17:32:02.187 +08:00] [INFO] [restore.go:2222] [“restore file start”] [table=pagoda_customer6.store_goodsorder_goods_rel] [engineNumber=0] [fileIndex=1400] [path=pagoda_customer6.store_goodsorder_goods_rel.0000014000000.sql:0]
[2021/12/27 17:32:02.878 +08:00] [INFO] [restore.go:2232] [“restore file completed”] [table=pagoda_customer6.store_goodsorder_goods_rel] [engineNumber=0] [fileIndex=1393] [path=pagoda_customer6.store_goodsorder_goods_rel.0000013930000.sql:0] [readDur=379.868789ms] [encodeDur=176.533193ms] [deliverDur=18.545114912s] [checksum="{cksum=0,size=23172106,kvs=119861}"] [takeTime=18.579020322s] []
[2021/12/27 17:32:02.878 +08:00] [INFO] [restore.go:2222] [“restore file start”] [table=pagoda_customer6.store_goodsorder_goods_rel] [engineNumber=0] [fileIndex=1401] [path=pagoda_customer6.store_goodsorder_goods_rel.0000014010000.sql:0]
[2021/12/27 17:32:03.634 +08:00] [INFO] [restore.go:1151] [progress] [total=55.9%] [tables=“0/1 (0.0%)”] [chunks=“1394/1996 (69.8%)”] [engines=“0/2 (0.0%)”] [speed(MiB/s)=8.440296836877138] [state=writing] [remaining=47m23s]
[2021/12/27 17:32:06.478 +08:00] [INFO] [restore.go:2232] [“restore file completed”] [table=pagoda_customer6.store_goodsorder_goods_rel] [engineNumber=0] [fileIndex=1394] [path=pagoda_customer6.store_goodsorder_goods_rel.0000013940000.sql:0] [readDur=389.448079ms] [encodeDur=163.955421ms] [deliverDur=18.007150948s] [checksum="{cksum=0,size=23204719,kvs=119861}"] [takeTime=18.043067126s] []
[2021/12/27 17:32:06.478 +08:00] [INFO] [restore.go:2222] [“restore file start”] [table=pagoda_customer6.store_goodsorder_goods_rel] [engineNumber=0] [fileIndex=1402] [path=pagoda_customer6.store_goodsorder_goods_rel.0000014020000.sql:0]
[2021/12/27 17:32:08.551 +08:00] [INFO] [main.go:45] [“got signal to exit”] [signal=hangup]
[2021/12/27 17:32:08.551 +08:00] [WARN] [restore.go:1069] [“stopping periodic actions”] [error=“context canceled”]
[2021/12/27 17:32:08.551 +08:00] [INFO] [restore.go:1173] [“cancel periodic actions”] [do=false]
[2021/12/27 17:32:08.551 +08:00] [INFO] [restore.go:2222] [“restore file start”] [table=pagoda_customer6.store_goodsorder_goods_rel] [engineNumber=0] [fileIndex=1403] [path=pagoda_customer6.store_goodsorder_goods_rel.0000014030000.sql:0]
[2021/12/27 17:32:08.551 +08:00] [INFO] [restore.go:416] [“task canceled”] [step=3] [error=“restore table pagoda_customer6.store_goodsorder_goods_rel failed: context canceled”]
[2021/12/27 17:32:08.552 +08:00] [INFO] [restore.go:431] [“the whole procedure completed”] [takeTime=1h0m5.020661215s] []
[2021/12/27 17:32:08.552 +08:00] [INFO] [main.go:93] [“tidb lightning exit”]