tidb-lightning reports an error at the end of a data import


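【TiDB version】
5.0.0
【Problem description】
After importing data with tidb-lightning, it reports an error at the end of the run; the error log is below.
The data is already queryable in the database, but I am not sure whether any data was lost or otherwise affected, because the error occurred while fetching the data checksum.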

[2021/04/24 16:14:15.676 +08:00] [INFO] [local.go:1405] [“import engine success”] [uuid=12ba0845-7d86-51f1-8fc6-9a331b646b1d] [size=57808745971] [kvs=27292317]
[2021/04/24 16:14:15.676 +08:00] [INFO] [backend.go:401] [“import completed”] [engineTag=testperformance2.testtable:0] [engineUUID=12ba0845-7d86-51f1-8fc6-9a331b646b1d] [retryCnt=0] [takeTime=17m43.973046465s] []
[2021/04/24 16:14:15.676 +08:00] [INFO] [backend.go:413] [“cleanup start”] [engineTag=testperformance2.testtable:0] [engineUUID=12ba0845-7d86-51f1-8fc6-9a331b646b1d]
[2021/04/24 16:14:17.523 +08:00] [INFO] [backend.go:415] [“cleanup completed”] [engineTag=testperformance2.testtable:0] [engineUUID=12ba0845-7d86-51f1-8fc6-9a331b646b1d] [takeTime=1.846906023s] []
[2021/04/24 16:14:17.523 +08:00] [INFO] [restore.go:2222] [“import and cleanup engine completed”] [engineTag=testperformance2.testtable:0] [engineUUID=12ba0845-7d86-51f1-8fc6-9a331b646b1d] [takeTime=17m45.820069435s] []
[2021/04/24 16:14:17.523 +08:00] [INFO] [restore.go:1386] [“import whole table completed”] [table=testperformance2.testtable] [takeTime=52m27.562425781s] []
[2021/04/24 16:14:17.523 +08:00] [INFO] [backend.go:384] [“engine close start”] [engineTag=testperformance2.testtable:-1] [engineUUID=082c18e4-06dd-509c-890f-5af1f90de5e5]
[2021/04/24 16:14:17.544 +08:00] [INFO] [backend.go:386] [“engine close completed”] [engineTag=testperformance2.testtable:-1] [engineUUID=082c18e4-06dd-509c-890f-5af1f90de5e5] [takeTime=20.890699ms] []
[2021/04/24 16:14:17.544 +08:00] [INFO] [restore.go:2214] [“import and cleanup engine start”] [engineTag=testperformance2.testtable:-1] [engineUUID=082c18e4-06dd-509c-890f-5af1f90de5e5]
[2021/04/24 16:14:17.544 +08:00] [INFO] [backend.go:398] [“import start”] [engineTag=testperformance2.testtable:-1] [engineUUID=082c18e4-06dd-509c-890f-5af1f90de5e5] [retryCnt=0]
[2021/04/24 16:14:17.544 +08:00] [INFO] [local.go:1356] [“engine contains no kv, skip import”] [engine=082c18e4-06dd-509c-890f-5af1f90de5e5]
[2021/04/24 16:14:17.544 +08:00] [INFO] [backend.go:401] [“import completed”] [engineTag=testperformance2.testtable:-1] [engineUUID=082c18e4-06dd-509c-890f-5af1f90de5e5] [retryCnt=0] [takeTime=34.228µs] []
[2021/04/24 16:14:17.544 +08:00] [INFO] [backend.go:413] [“cleanup start”] [engineTag=testperformance2.testtable:-1] [engineUUID=082c18e4-06dd-509c-890f-5af1f90de5e5]
[2021/04/24 16:14:17.561 +08:00] [INFO] [backend.go:415] [“cleanup completed”] [engineTag=testperformance2.testtable:-1] [engineUUID=082c18e4-06dd-509c-890f-5af1f90de5e5] [takeTime=17.495262ms] []
[2021/04/24 16:14:17.562 +08:00] [INFO] [restore.go:2222] [“import and cleanup engine completed”] [engineTag=testperformance2.testtable:-1] [engineUUID=082c18e4-06dd-509c-890f-5af1f90de5e5] [takeTime=17.611058ms] []
[2021/04/24 16:14:17.562 +08:00] [INFO] [tidb.go:355] [“alter table auto_increment start”] [table=testperformance2.testtable] [auto_increment=27292318]
[2021/04/24 16:14:19.727 +08:00] [INFO] [tidb.go:357] [“alter table auto_increment completed”] [table=testperformance2.testtable] [auto_increment=27292318] [takeTime=2.165242974s] []
[2021/04/24 16:14:19.727 +08:00] [INFO] [restore.go:1070] [“restore table completed”] [table=testperformance2.testtable] [takeTime=52m29.991658214s] []
[2021/04/24 16:14:19.727 +08:00] [INFO] [restore.go:1694] [“local checksum”] [table=testperformance2.testtable] [checksum="{cksum=11487513479586547264,size=57808745971,kvs=27292317}"]
[2021/04/24 16:14:19.727 +08:00] [INFO] [checksum.go:159] [“remote checksum start”] [table=testtable]
[2021/04/24 16:14:19.984 +08:00] [ERROR] [checksum.go:162] [“remote checksum failed”] [table=testtable] [takeTime=256.81432ms] [error=“fetch tso from pd failed: rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster”]
[2021/04/24 16:14:19.984 +08:00] [ERROR] [restore.go:1215] [“restore all tables data failed”] [takeTime=52m30.317565972s] [error=“fetch tso from pd failed: rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster”]
[2021/04/24 16:14:19.984 +08:00] [INFO] [pd.go:407] [“resume scheduler”] [schedulers="[balance-hot-region-scheduler,balance-leader-scheduler,balance-region-scheduler]"]
[2021/04/24 16:14:19.984 +08:00] [INFO] [pd.go:393] [“exit pause scheduler and configs successful”]
[2021/04/24 16:14:19.984 +08:00] [INFO] [restore.go:904] [“everything imported, stopping periodic actions”]
[2021/04/24 16:14:19.985 +08:00] [INFO] [pd.go:427] [“resume scheduler successful”] [scheduler=balance-hot-region-scheduler]
[2021/04/24 16:14:19.986 +08:00] [INFO] [pd.go:427] [“resume scheduler successful”] [scheduler=balance-leader-scheduler]
[2021/04/24 16:14:19.987 +08:00] [INFO] [pd.go:427] [“resume scheduler successful”] [scheduler=balance-region-scheduler]
[2021/04/24 16:14:19.987 +08:00] [INFO] [pd.go:518] [“restoring config”] [config="{“enable-cross-table-merge”:“true”,“enable-debug-metrics”:“false”,“enable-joint-consensus”:“true”,“enable-location-replacement”:“true”,“enable-make-up-replica”:“true”,“enable-one-way-merge”:“false”,“enable-remove-down-replica”:“true”,“enable-remove-extra-replica”:“true”,“enable-replace-offline-replica”:“true”,“high-space-ratio”:0.7,“hot-region-cache-hits-threshold”:3,“hot-region-schedule-limit”:4,“leader-schedule-limit”:4,“leader-schedule-policy”:“count”,“low-space-ratio”:0.8,“max-merge-region-keys”:200000,“max-merge-region-size”:20,“max-pending-peer-count”:16,“max-snapshot-count”:3,“max-store-down-time”:“30m0s”,“merge-schedule-limit”:8,“patrol-region-interval”:“100ms”,“region-schedule-limit”:2048,“region-score-formula-version”:“v2”,“replica-schedule-limit”:64,“scheduler-max-waiting-operator”:5,“schedulers-payload”:null,“schedulers-v2”:[{“args”:null,“args-payload”:"",“disable”:false,“type”:“balance-region”},{“args”:null,“args-payload”:"",“disable”:false,“type”:“balance-leader”},{“args”:null,“args-payload”:"",“disable”:false,“type”:“hot-region”},{“args”:null,“args-payload”:"",“disable”:false,“type”:“label”}],“split-merge-interval”:“1h0m0s”,“store-limit”:{“1”:{“add-peer”:15,“remove-peer”:15},“2”:{“add-peer”:15,“remove-peer”:15},“7”:{“add-peer”:15,“remove-peer”:15}},“store-limit-mode”:“manual”,“tolerant-size-ratio”:0}"]
[2021/04/24 16:14:20.048 +08:00] [INFO] [restore.go:1030] [“add back PD leader&region schedulers”]
[2021/04/24 16:14:20.048 +08:00] [ERROR] [restore.go:331] [“run failed”] [step=3] [error=“fetch tso from pd failed: rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster”]
[2021/04/24 16:14:20.048 +08:00] [ERROR] [restore.go:342] [“the whole procedure failed”] [takeTime=52m30.735349512s] [error=“fetch tso from pd failed: rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster”]
[2021/04/24 16:14:20.048 +08:00] [ERROR] [restore.go:122] [“tables failed to be imported”] [count=1]
[2021/04/24 16:14:20.048 +08:00] [ERROR] [restore.go:124] [-] [table=testperformance2.testtable] [status=checksum] [error=“fetch tso from pd failed: rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster”]
[2021/04/24 16:14:20.050 +08:00] [INFO] [checksum.go:456] [“service safe point keeper exited”]
[2021/04/24 16:14:20.050 +08:00] [ERROR] [main.go:91] [“tidb lightning encountered error stack info”] [error=“fetch tso from pd failed: rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster”] [errorVerbose=“rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster\ngithub.com/tikv/pd/client.(*client).processTSORequests\ \tgithub.com/tikv/pd@v1.1.0-beta.0.20210323123936-c8fa72502f16/client/client.go:717\ github.com/tikv/pd/client.(*client).handleDispatcher\ \tgithub.com/tikv/pd@v1.1.0-beta.0.20210323123936-c8fa72502f16/client/client.go:587\ runtime.goexit\ \truntime/asm_amd64.s:1357\ github.com/tikv/pd/client.(*tsoRequest).Wait\ \tgithub.com/tikv/pd@v1.1.0-beta.0.20210323123936-c8fa72502f16/client/client.go:913\ github.com/tikv/pd/client.(*client).GetTS\ \tgithub.com/tikv/pd@v1.1.0-beta.0.20210323123936-c8fa72502f16/client/client.go:933\ github.com/pingcap/br/pkg/lightning/restore.(*tikvChecksumManager).checksumDB\ \tgithub.com/pingcap/br@/pkg/lightning/restore/checksum.go:270\ github.com/pingcap/br/pkg/lightning/restore.(*tikvChecksumManager).Checksum\ \tgithub.com/pingcap/br@/pkg/lightning/restore/checksum.go:322\ github.com/pingcap/br/pkg/lightning/restore.DoChecksum\ \tgithub.com/pingcap/br@/pkg/lightning/restore/checksum.go:161\ github.com/pingcap/br/pkg/lightning/restore.(*TableRestore).compareChecksum\ \tgithub.com/pingcap/br@/pkg/lightning/restore/restore.go:2237\ github.com/pingcap/br/pkg/lightning/restore.(*TableRestore).postProcess\ \tgithub.com/pingcap/br@/pkg/lightning/restore/restore.go:1695\ github.com/pingcap/br/pkg/lightning/restore.(*RestoreController).restoreTables.func3\ \tgithub.com/pingcap/br@/pkg/lightning/restore/restore.go:1206\ runtime.goexit\ \truntime/asm_amd64.s:1357\ fetch tso from pd failed”]
[2021/04/24 16:14:20.050 +08:00] [ERROR] [main.go:92] [“tidb lightning encountered error”] [error=“fetch tso from pd failed: rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster”]
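
Since the failure happened only at the remote checksum step, one way to check for data loss is to compare by hand the "local checksum" that Lightning logged above with what TiDB reports for `ADMIN CHECKSUM TABLE testperformance2.testtable;`. The sketch below is only an illustration: the helper names are made up, and it assumes the remote result exposes the columns `Checksum_crc64_xor`, `Total_kvs` and `Total_bytes`, whose values you paste in yourself.

```python
import re

# Matches the checksum triple inside Lightning's "local checksum" log line.
LOCAL_CHECKSUM_RE = re.compile(r"cksum=(\d+),size=(\d+),kvs=(\d+)")


def parse_local_checksum(log_line):
    """Extract checksum, byte size and KV count from a log line (hypothetical helper)."""
    m = LOCAL_CHECKSUM_RE.search(log_line)
    if m is None:
        return None
    cksum, size, kvs = (int(g) for g in m.groups())
    return {"checksum": cksum, "total_bytes": size, "total_kvs": kvs}


def checksums_match(local, remote):
    """True only if checksum, KV count and byte size are all identical."""
    return all(local[k] == remote[k] for k in ("checksum", "total_kvs", "total_bytes"))
```

To use it, feed it the `local checksum` line from the log, then build `remote` from the `ADMIN CHECKSUM TABLE` output (`Checksum_crc64_xor` as `checksum`, `Total_kvs` as `total_kvs`, `Total_bytes` as `total_bytes`). If all three values agree, the imported data matches what Lightning wrote.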



  1. What is the cluster topology? Please check with tiup cluster display.
  2. Is TiCDC also deployed? Are the TiDB nodes reporting errors?

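6 virtual machines: 3 for TiKV, and 3 nodes each co-hosting TiDB and PD.
My current guess is that the error occurred because I ran tidb-lightning directly on one of the TiDB machines; after I moved it to an idle machine, the error no longer appeared.
I will test again tomorrow.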

[2021/04/24 16:14:20.050 +08:00] [ERROR] [main.go:92] [“tidb lightning encountered error”] [error=“fetch tso from pd failed: rpc error: code = Unknown desc = [PD:tso:ErrGenerateTimestamp]generate timestamp failed, requested pd is not leader of cluster”]

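Judging from the log and your description, the likely cause is that TiDB and PD are co-located on the same hosts, and one of those hosts was also carrying the Lightning workload, so PD could not serve TSO to Lightning reliably.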

TSO is key metadata for guaranteeing consistent writes; if a TSO cannot be obtained, writes are bound to run into problems.

Moving Lightning off that host made the error go away, which suggests the resource bottleneck has been relieved.

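Make sure the servers co-hosting TiDB and PD meet the officially recommended configuration, to avoid PD being starved of resources (mainly CPU and network bandwidth) when TiDB is under heavy load.
For the recommended configuration for co-located deployment, see the official documentation.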
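
Because the error says "requested pd is not leader of cluster", it can help to check which PD member currently holds leadership, and whether that changes while the import is running. Below is a rough sketch that reads PD's HTTP API; it assumes the `/pd/api/v1/members` endpoint returns a JSON object with a `leader` entry containing `name` and `client_urls`, so please confirm this against your PD version. The function names are illustrative only.

```python
import json
from urllib.request import urlopen


def fetch_members(pd_addr, timeout=3.0):
    """Fetch the member list from one PD node, e.g. pd_addr="127.0.0.1:2379"."""
    with urlopen(f"http://{pd_addr}/pd/api/v1/members", timeout=timeout) as resp:
        return json.load(resp)


def leader_from_members(payload):
    """Return (name, client_urls) of the PD leader, or ("", []) if none is reported."""
    leader = payload.get("leader") or {}
    return leader.get("name", ""), list(leader.get("client_urls", []))
```

Calling `leader_from_members(fetch_members("<pd-host>:2379"))` every few seconds during an import would show whether the leader keeps moving between the co-located nodes, which would be consistent with PD being short on resources.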

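Also, for Lightning in 4.0 and later, the local mode is recommended for importing data; it is faster than the tidb mode, at the cost of using as many resources as it can. See the official documentation for how it works.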
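
For reference, a minimal sketch of the part of the tidb-lightning configuration file that selects local mode. The directory path and all addresses are placeholders, not values from this cluster.

```toml
[tikv-importer]
# Use the local backend ("local mode").
backend = "local"
# Scratch directory for locally sorted KV pairs (placeholder path).
sorted-kv-dir = "/data/lightning/sorted-kv"

[tidb]
# Placeholder endpoints; replace with the real TiDB and PD addresses.
host = "127.0.0.1"
port = 4000
user = "root"
status-port = 10080
pd-addr = "127.0.0.1:2379"
```

Since local mode sorts data on the machine running Lightning and uses that machine's CPU, disk and network heavily, running it on a dedicated host rather than on a TiDB/PD node fits the observation above.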

Thank you very much.

I am already using local mode, and the import is very fast, haha.

:+1: