Lightning migration unresponsive: import reach max retry 3 and still failed

To help us respond efficiently, please provide the following information when asking a question; clearly described problems get prioritized.

  • 【TiDB version】: v3.1.0-beta.1
  • 【Problem description】: Since loader was slow, I tried importing with lightning, but ran into problems: 1) the import showed failed, so using the checkpoint I dropped the failing table and re-uploaded it; 2) during the re-upload the service became unresponsive; the data set is 33 GB. The relevant log entries are below. lightning.log:

[2020/03/27 19:28:16.330 +08:00] [ERROR] [main.go:59] ["tidb lightning encountered error"] [error="[41c24811-f498-5f61-83ed-ceacb685ed01] import reach max retry 3 and still failed: rpc error: code = Unknown desc = EngineInUse(Uuid { bytes: [65, 194, 72, 17, 244, 152, 95, 97, 131, 237, 206, 172, 182, 133, 237, 1] })"] [errorVerbose="rpc error: code = Unknown desc = EngineInUse(Uuid { bytes: [65, 194, 72, 17, 244, 152, 95, 97, 131, 237, 206, 172, 182, 133, 237, 1] })\ngithub.com/pingcap/errors.AddStack\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/pkg/mod/github.com/pingcap/errors@v0.11.4/errors.go:174\ngithub.com/pingcap/errors.Trace\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/pkg/mod/github.com/pingcap/errors@v0.11.4/juju_adaptor.go:15\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*importer).ImportEngine\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/src/github.com/pingcap/tidb-lightning/lightning/backend/importer.go:135\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*ClosedEngine).Import\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/src/github.com/pingcap/tidb-lightning/lightning/backend/backend.go:281\ngithub.com/pingcap/tidb-lightning/lightning/restore.(*TableRestore).importKV\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:1399\ngithub.com/pingcap/tidb-lightning/lightning/restore.(*TableRestore).importEngine\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:958\ngithub.com/pingcap/tidb-lightning/lightning/restore.(*TableRestore).restoreEngines.func1\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:786\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1337\n[41c24811-f498-5f61-83ed-ceacb685ed01] import reach max retry 3 and still failed"]

importer.log:

[2020/03/27 11:54:58.737 +00:00] [INFO] [engine.rs:204] ["cleanup SST"] [file_path="/home/daslab/importer/data.import/41c24811-f498-5f61-83ed-ceacb685ed01/.e113b245-c959-456b-8f61-650dd60856de:write"]

[2020/03/27 11:54:58.738 +00:00] [WARN] [import.rs:82] ["still has ranges need to retry"] [current_round=4] [retry_count=31] [tag="[ImportJob 41c24811-f498-5f61-83ed-ceacb685ed01]"]

[2020/03/27 11:54:58.738 +00:00] [ERROR] [import.rs:103] ["import engine failed"] [err="retry 5 times still 31 ranges failed"] [tag="[ImportJob 41c24811-f498-5f61-83ed-ceacb685ed01]"]

[2020/03/27 11:54:58.794 +00:00] [ERROR] [kv_importer.rs:142] ["import failed"] [err="retry 5 times still 31 ranges failed"] [uuid=41c24811-f498-5f61-83ed-ceacb685ed01]

[2020/03/27 11:54:59.018 +00:00] [WARN] [kv_service.rs:223] ["send rpc response"] [err=RemoteStopped]

  1. Also, I tried the command below to bring up the web interface, but it shows every status counter as 0, e.g. 0 pending.
./tidb-lightning --server-mode --status-addr :8289

Could you confirm whether the cluster version is 3.0.1 or 3.1.0? Is TiFlash deployed in the cluster?

v3.1.0-beta.1. The cluster has only 3 PD, 2 TiDB, and 5 TiKV nodes (TiKV co-located with PD and TiDB), plus lightning and importer on two separate machines.

The error code = Unknown desc = EngineInUse means another Lightning instance has this table's engine open; for example, an old Lightning did not exit cleanly, or two Lightning instances are importing the same table through the same importer.

Please make sure lightning has exited completely and cleanly, then retry the import.
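Before retrying, it is worth confirming that no stale lightning process is still holding the engine open. A minimal sketch (the process check is generic, not specific to this deployment):

```shell
# List any surviving tidb-lightning process on the lightning host.
# The bracket trick keeps grep from matching its own command line.
ps -ef | grep '[t]idb-lightning' || echo "no lightning process running"
```

If a process shows up, stop it and wait for it to exit before restarting the import; only then does the importer release the engine.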

Hello, your suggested fix makes sense and did let me proceed with the import. However, during the import I hit a new problem (one I had also seen before):

1) The 33 GB import hangs, unresponsive, at 96%. The log is as follows:

[2020/03/28 20:15:58.981 +08:00] [INFO] [restore.go:473] [progress] [files="570/592 (96.3%)"] [tables="9/10 (90.0%)"] [speed(MiB/s)=4.563793917038412] [state=writing] [remaining=4m38s]

[2020/03/28 20:22:59.898 +08:00] [WARN] [backend.go:286] ["import spuriously failed, going to retry again"] [engineTag=Block_Base.block_info:0] [engineUUID=41c24811-f498-5f61-83ed-ceacb685ed01] [retryCnt=1] [error="rpc error: code = Unknown desc = ImportJobFailed(\"retry 5 times still 15 ranges failed\")"]

[2020/03/28 20:23:02.898 +08:00] [INFO] [backend.go:280] ["import start"] [engineTag=Block_Base.block_info:0] [engineUUID=41c24811-f498-5f61-83ed-ceacb685ed01] [retryCnt=2]

[2020/03/28 20:25:58.974 +08:00] [INFO] [restore.go:473] [progress] [files="570/592 (96.3%)"] [tables="9/10 (90.0%)"] [speed(MiB/s)=4.212737123035598] [state=writing] [remaining=5m1s]

[2020/03/28 21:29:03.428 +08:00] [ERROR] [restore.go:1404] ["import and cleanup engine failed"] [engineTag=Block_Base.block_info:0] [engineUUID=41c24811-f498-5f61-83ed-ceacb685ed01] [takeTime=3h4m28.953039226s] [error="[41c24811-f498-5f61-83ed-ceacb685ed01] import reach max retry 3 and still failed: rpc error: code = Unknown desc = ImportJobFailed(\"retry 5 times still 5 ranges failed\")"]

[2020/03/28 21:29:03.428 +08:00] [ERROR] [restore.go:795] ["import whole table failed"] [table=Block_Base.block_info] [takeTime=3h13m4.430763272s] [error="[41c24811-f498-5f61-83ed-ceacb685ed01] import reach max retry 3 and still failed: rpc error: code = Unknown desc = ImportJobFailed(\"retry 5 times still 5 ranges failed\")"]

[2020/03/28 21:29:03.429 +08:00] [ERROR] [restore.go:569] ["restore table failed"] [table=Block_Base.block_info] [takeTime=3h13m4.465582643s] [error="[41c24811-f498-5f61-83ed-ceacb685ed01] import reach max retry 3 and still failed: rpc error: code = Unknown desc = ImportJobFailed(\"retry 5 times still 5 ranges failed\")"]

[2020/03/28 21:29:03.429 +08:00] [ERROR] [restore.go:664] ["restore all tables data failed"] [takeTime=3h13m4.466333445s] [error="[41c24811-f498-5f61-83ed-ceacb685ed01] import reach max retry 3 and still failed: rpc error: code = Unknown desc = ImportJobFailed(\"retry 5 times still 5 ranges failed\")"]

[2020/03/28 21:29:03.429 +08:00] [ERROR] [restore.go:260] ["run failed"] [step=2] [error="[41c24811-f498-5f61-83ed-ceacb685ed01] import reach max retry 3 and still failed: rpc error: code = Unknown desc = ImportJobFailed(\"retry 5 times still 5 ranges failed\")"]

[2020/03/28 21:29:03.429 +08:00] [INFO] [restore.go:442] ["everything imported, stopping periodic actions"]

[2020/03/28 21:29:03.429 +08:00] [ERROR] [restore.go:266] ["the whole procedure failed"] [takeTime=3h13m4.550782774s] [error="[41c24811-f498-5f61-83ed-ceacb685ed01] import reach max retry 3 and still failed: rpc error: code = Unknown desc = ImportJobFailed(\"retry 5 times still 5 ranges failed\")"]

[2020/03/28 21:29:03.429 +08:00] [ERROR] [restore.go:115] ["tables failed to be imported"] [count=1]

[2020/03/28 21:29:03.429 +08:00] [ERROR] [restore.go:117] [-] [table=Block_Base.block_info] [status=imported] [error="[41c24811-f498-5f61-83ed-ceacb685ed01] import reach max retry 3 and still failed: rpc error: code = Unknown desc = ImportJobFailed(\"retry 5 times still 5 ranges failed\")"]

[2020/03/28 21:29:03.430 +08:00] [ERROR] [main.go:59] ["tidb lightning encountered error"] [error="[41c24811-f498-5f61-83ed-ceacb685ed01] import reach max retry 3 and still failed: rpc error: code = Unknown desc = ImportJobFailed(\"retry 5 times still 5 ranges failed\")"] [errorVerbose="rpc error: code = Unknown desc = ImportJobFailed(\"retry 5 times still 5 ranges failed\")\ngithub.com/pingcap/errors.AddStack\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/pkg/mod/github.com/pingcap/errors@v0.11.4/errors.go:174\ngithub.com/pingcap/errors.Trace\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/pkg/mod/github.com/pingcap/errors@v0.11.4/juju_adaptor.go:15\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*importer).ImportEngine\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/src/github.com/pingcap/tidb-lightning/lightning/backend/importer.go:135\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*ClosedEngine).Import\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/src/github.com/pingcap/tidb-lightning/lightning/backend/backend.go:281\ngithub.com/pingcap/tidb-lightning/lightning/restore.(*TableRestore).importKV\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:1399\ngithub.com/pingcap/tidb-lightning/lightning/restore.(*TableRestore).importEngine\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:958\ngithub.com/pingcap/tidb-lightning/lightning/restore.(*TableRestore).restoreEngines.func1\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:786\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1337\n[41c24811-f498-5f61-83ed-ceacb685ed01] import reach max retry 3 and still failed"]

2) After restarting the lightning service again, the log reads:

[2020/03/28 22:05:11.138 +08:00] [ERROR] [restore.go:604] ["TiDB Lightning has failed last time. To prevent data loss, this run will stop now. Please resolve errors first"] [count=1]

[2020/03/28 22:05:11.138 +08:00] [INFO] [restore.go:623] [-] [table=Block_Base.block_info] [status=12] [failedStep=imported] [recommendedAction="./tidb-lightning-ctl --checkpoint-errors-destroy='Block_Base.block_info' --config=..."]

[2020/03/28 22:05:11.138 +08:00] [INFO] [restore.go:631] ["You may also run ./tidb-lightning-ctl --checkpoint-errors-destroy=all --config=... to start from scratch"]

[2020/03/28 22:05:11.138 +08:00] [INFO] [restore.go:632] ["For details of this failure, read the log file from the PREVIOUS run"]

[2020/03/28 22:05:11.138 +08:00] [ERROR] [restore.go:260] ["run failed"] [step=2] [error="TiDB Lightning has failed last time; please resolve these errors first"]

[2020/03/28 22:05:11.138 +08:00] [ERROR] [restore.go:266] ["the whole procedure failed"] [takeTime=3.04949469s] [error="TiDB Lightning has failed last time; please resolve these errors first"]

[2020/03/28 22:05:11.138 +08:00] [ERROR] [main.go:59] ["tidb lightning encountered error"] [error="TiDB Lightning has failed last time; please resolve these errors first"] [errorVerbose="TiDB Lightning has failed last time; please resolve these errors first\ngithub.com/pingcap/tidb-lightning/lightning/restore.(*RestoreController).restoreTables\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:634\ngithub.com/pingcap/tidb-lightning/lightning/restore.(*RestoreController).Run\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:250\ngithub.com/pingcap/tidb-lightning/lightning.(*Lightning).run\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/src/github.com/pingcap/tidb-lightning/lightning/lightning.go:213\ngithub.com/pingcap/tidb-lightning/lightning.(*Lightning).RunOnce\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/src/github.com/pingcap/tidb-lightning/lightning/lightning.go:138\nmain.main\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/src/github.com/pingcap/tidb-lightning/cmd/tidb-lightning/main.go:56\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:200\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1337"]

Following the hint in the log, I ran the command below, intending to re-run the import:

./tidb-lightning-ctl --checkpoint-errors-destroy='`Block_Base`.`block_info`' --config=...

3) The import then ran into further problems:

[2020/03/28 22:13:50.246 +08:00] [ERROR] [restore.go:1602] ["write to data engine failed"] [table=Block_Base.block_info] [engineNumber=0] [fileIndex=15] [path=/home/daslab/mydumper/block_base/Block_Base.block_info.000000017.sql:0] [task=deliver] [error=EOF]

[2020/03/28 22:13:50.247 +08:00] [INFO] [restore.go:256] ["user terminated"] [step=2] [error="rpc error: code = Canceled desc = Cancelled"]

[2020/03/28 22:13:50.247 +08:00] [INFO] [restore.go:442] ["everything imported, stopping periodic actions"]

[2020/03/28 22:13:50.247 +08:00] [INFO] [restore.go:266] ["the whole procedure completed"] [takeTime=792.948656ms] []

[2020/03/28 22:13:50.247 +08:00] [INFO] [main.go:61] ["tidb lightning exit"]

4) After seeing lightning exit, I started the service again; this time the problem is that it did not resume importing data:

[2020/03/28 22:25:03.591 +08:00] [INFO] [lightning.go:143] ["Lightning server is running, post to /tasks to start an import task"] [address="[::]:8289"]

Meanwhile, I queried the data in the database and found that the table to be imported is empty.

In summary, I am quite confused about how to use this tool. My steps were: 1) place the dumper data files in the specified dir; 2) start importer; 3) start lightning; 4) watch lightning.log to track the import.

I am not sure whether I am misusing the tool or misreading the official documentation. Please advise. Alternatively, could we switch to another messaging tool for more timely communication?

It looks like there is a problem with the procedure. To pin down the exact cause, please provide the importer log for each import attempt so we can investigate.

PS: The reason the last restart did not resume importing data is that server-mode was enabled.
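For context: with --server-mode, Lightning only starts an HTTP server and waits; nothing is imported until a task is submitted. A minimal sketch of submitting one (the config filename and localhost address are illustrative):

```shell
# Submit an import task to a tidb-lightning instance started with
# --server-mode --status-addr :8289, as in the command earlier in this thread.
curl http://127.0.0.1:8289/tasks --data-binary @tidb-lightning.toml
```

Without server-mode, Lightning reads its config and starts importing immediately, which matches the behavior seen once it was disabled.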

Hello, after disabling server-mode I restarted importer and lightning. The logs now read:

importer.log:

[2020/03/29 05:53:16.187 +00:00] [INFO] [setup.rs:75] ["start prometheus client"]

[2020/03/29 05:53:16.188 +00:00] [INFO] [tikv-importer.rs:41] ["Welcome to TiKV Importer."]

[2020/03/29 05:53:16.188 +00:00] [INFO] [tikv-importer.rs:43] []

[2020/03/29 05:53:16.188 +00:00] [INFO] [tikv-importer.rs:43] ["Release Version: 3.1.0-beta.1"]

[2020/03/29 05:53:16.188 +00:00] [INFO] [tikv-importer.rs:43] ["Git Commit Hash: 5e7a7568362063184003634916145992160338f1"]

[2020/03/29 05:53:16.188 +00:00] [INFO] [tikv-importer.rs:43] ["Git Commit Branch: release-3.1"]

[2020/03/29 05:53:16.188 +00:00] [INFO] [tikv-importer.rs:43] ["UTC Build Time: 2020-01-17 01:01:08"]

[2020/03/29 05:53:16.188 +00:00] [INFO] [tikv-importer.rs:43] ["Rust Version: rustc 1.42.0-nightly (0de96d37f 2019-12-19)"]

[2020/03/29 05:53:16.188 +00:00] [INFO] [tikv-importer.rs:45] []

[2020/03/29 05:53:16.204 +00:00] [INFO] [tikv-importer.rs:160] ["import server started"]

[2020/03/29 05:53:20.805 +00:00] [INFO] [kv_importer.rs:63] ["open engine"] [engine="EngineFile { uuid: Uuid { bytes: [204, 128, 138, 159, 220, 253, 82, 137, 159, 226, 252, 80, 28, 57, 65, 169] }, path: EnginePath { save: \"/home/daslab/importer/data.import/cc808a9f-dcfd-5289-9fe2-fc501c3941a9\", temp: \"/home/daslab/importer/data.import/.temp/cc808a9f-dcfd-5289-9fe2-fc501c3941a9\" } }"]

[2020/03/29 05:53:20.806 +00:00] [ERROR] [kv_importer.rs:68] ["open engine failed"] [err="File \"/home/daslab/importer/data.import/41c24811-f498-5f61-83ed-ceacb685ed01\" exists"] [uuid=41c24811-f498-5f61-83ed-ceacb685ed01]

[2020/03/29 05:53:20.825 +00:00] [WARN] [kv_service.rs:170] ["send rpc response"] [err=RemoteStopped]

[2020/03/29 05:53:20.825 +00:00] [WARN] [kv_service.rs:170] ["send rpc response"] [err=RemoteStopped]

[2020/03/29 05:53:20.825 +00:00] [WARN] [kv_service.rs:170] ["send rpc response"] [err=RemoteStopped]

lightning.log:

[2020/03/29 13:53:20.730 +08:00] [INFO] [version.go:48] ["Welcome to lightning"] ["Release Version"=v3.1.0-beta.1] ["Git Commit Hash"=605760d1b2025d1e1a8b7d0c668c74863d7d1271] ["Git Branch"=HEAD] ["UTC Build Time"="2020-01-10 12:16:24"] ["Go Version"="go version go1.12 linux/amd64"]

[2020/03/29 13:53:20.730 +08:00] [INFO] [lightning.go:165] [cfg] [cfg="{“id”:1585461200730201032,“lightning”:{“table-concurrency”:6,“index-concurrency”:2,“region-concurrency”:32,“io-concurrency”:5,“check-requirements”:true},“tidb”:{“host”:“10.12.5.233”,“port”:4000,“user”:“root”,“status-port”:10080,“pd-addr”:“10.12.5.234:2379”,“sql-mode”:“ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION”,“max-allowed-packet”:67108864,“distsql-scan-concurrency”:100,“build-stats-concurrency”:20,“index-serial-scan-concurrency”:20,“checksum-table-concurrency”:16},“checkpoint”:{“enable”:true,“schema”:“tidb_lightning_checkpoint”,“driver”:“file”,“keep-after-success”:false},“mydumper”:{“read-block-size”:65536,“batch-size”:107374182400,“batch-import-ratio”:0,“data-source-dir”:"/home/daslab/mydumper",“no-schema”:false,“character-set”:“auto”,“csv”:{“separator”:",",“delimiter”:"\"",“header”:true,“trim-last-separator”:false,“not-null”:false,“null”:"\\N",“backslash-escape”:true},“case-sensitive”:false},“black-white-list”:{“do-tables”:null,“do-dbs”:null,“ignore-tables”:null,“ignore-dbs”:[“mysql”,“information_schema”,“performance_schema”,“sys”]},“tikv-importer”:{“addr”:“10.12.5.112:8287”,“backend”:“importer”,“on-duplicate”:“replace”},“post-restore”:{“level-1-compact”:false,“compact”:false,“checksum”:true,“analyze”:true},“cron”:{“switch-mode”:“5m0s”,“log-progress”:“5m0s”},“routes”:null}"]

[2020/03/29 13:53:20.730 +08:00] [INFO] [lightning.go:194] ["load data source start"]

[2020/03/29 13:53:20.739 +08:00] [INFO] [lightning.go:197] ["load data source completed"] [takeTime=8.854188ms] []

[2020/03/29 13:53:20.745 +08:00] [INFO] [restore.go:245] ["the whole procedure start"]

[2020/03/29 13:53:20.751 +08:00] [INFO] [restore.go:283] ["restore table schema start"] [db=Block_Base]

[2020/03/29 13:53:20.752 +08:00] [INFO] [tidb.go:99] ["create tables start"] [db=Block_Base]

[2020/03/29 13:53:20.753 +08:00] [INFO] [tidb.go:117] ["create tables completed"] [db=Block_Base] [takeTime=1.134321ms] []

[2020/03/29 13:53:20.753 +08:00] [INFO] [restore.go:291] ["restore table schema completed"] [db=Block_Base] [takeTime=2.211599ms] []

[2020/03/29 13:53:20.756 +08:00] [INFO] [restore.go:545] ["restore all tables data start"]

[2020/03/29 13:53:20.756 +08:00] [INFO] [restore.go:566] ["restore table start"] [table=Block_Base.block_info]

[2020/03/29 13:53:20.756 +08:00] [INFO] [restore.go:683] ["reusing engines and files info from checkpoint"] [table=Block_Base.block_info] [enginesCnt=2] [filesCnt=570]

[2020/03/29 13:53:20.782 +08:00] [INFO] [backend.go:187] ["open engine"] [engineTag=Block_Base.block_info:-1] [engineUUID=cc808a9f-dcfd-5289-9fe2-fc501c3941a9]

[2020/03/29 13:53:20.782 +08:00] [INFO] [restore.go:747] ["import whole table start"] [table=Block_Base.block_info]

[2020/03/29 13:53:20.782 +08:00] [INFO] [restore.go:776] ["restore engine start"] [table=Block_Base.block_info] [engineNumber=0]

[2020/03/29 13:53:20.782 +08:00] [INFO] [restore.go:853] ["encode kv data and write start"] [table=Block_Base.block_info] [engineNumber=0]

[2020/03/29 13:53:20.783 +08:00] [INFO] [backend.go:187] ["open engine"] [engineTag=Block_Base.block_info:0] [engineUUID=41c24811-f498-5f61-83ed-ceacb685ed01]

[2020/03/29 13:53:20.785 +08:00] [INFO] [restore.go:1776] ["restore file start"] [table=Block_Base.block_info] [engineNumber=0] [fileIndex=13] [path=/home/daslab/mydumper/block_base/Block_Base.block_info.000000015.sql:0]

[2020/03/29 13:53:20.795 +08:00] [INFO] [restore.go:1776] ["restore file start"] [table=Block_Base.block_info] [engineNumber=0] [fileIndex=27] [path=/home/daslab/mydumper/block_base/Block_Base.block_info.000000029.sql:0]

[2020/03/29 13:53:20.803 +08:00] [ERROR] [restore.go:1602] ["write to data engine failed"] [table=Block_Base.block_info] [engineNumber=0] [fileIndex=7] [path=/home/daslab/mydumper/block_base/Block_Base.block_info.000000009.sql:0] [task=deliver] [error="rpc error: code = Canceled desc = Cancelled"]

[2020/03/29 13:53:20.804 +08:00] [ERROR] [restore.go:1602] ["write to data engine failed"] [table=Block_Base.block_info] [engineNumber=0] [fileIndex=13] [path=/home/daslab/mydumper/block_base/Block_Base.block_info.000000015.sql:0] [task=deliver] [error="rpc error: code = Canceled desc = Cancelled"]

[2020/03/29 13:53:20.804 +08:00] [ERROR] [restore.go:1602] ["write to data engine failed"] [table=Block_Base.block_info] [engineNumber=0] [fileIndex=10] [path=/home/daslab/mydumper/block_base/Block_Base.block_info.000000012.sql:0] [task=deliver] [error="rpc error: code = Canceled desc = Cancelled"]

[2020/03/29 13:53:20.846 +08:00] [ERROR] [restore.go:1602] ["write to data engine failed"] [table=Block_Base.block_info] [engineNumber=0] [fileIndex=30] [path=/home/daslab/mydumper/block_base/Block_Base.block_info.000000032.sql:0] [task=deliver] [error="rpc error: code = Canceled desc = Cancelled"]

[2020/03/29 13:53:20.909 +08:00] [ERROR] [restore.go:1602] ["write to data engine failed"] [table=Block_Base.block_info] [engineNumber=0] [fileIndex=21] [path=/home/daslab/mydumper/block_base/Block_Base.block_info.000000023.sql:0] [task=deliver] [error="rpc error: code = Canceled desc = Cancelled"]

[2020/03/29 13:53:20.953 +08:00] [INFO] [restore.go:256] ["user terminated"] [step=2] [error="rpc error: code = Canceled desc = Cancelled"]

[2020/03/29 13:53:20.953 +08:00] [INFO] [restore.go:266] ["the whole procedure completed"] [takeTime=207.864072ms] []

[2020/03/29 13:53:20.953 +08:00] [INFO] [restore.go:442] ["everything imported, stopping periodic actions"]

[2020/03/29 13:53:20.953 +08:00] [INFO] [main.go:61] ["tidb lightning exit"]

Additionally, attached are the complete log files for lightning and importer. Uploading: tidb_lightning.log…

From the logs, after lightning started, the process was killed. Was there some operation on your side that did this?

After starting the lightning service, I realized loader was still importing, so I stopped the loader service. The loader service does not run on these two machines. Could that have caused this effect? Other than that, I performed no other operations.

loader would not have this effect. What the latest lightning log shows is that the import was actively canceled while lightning was importing data. Does this happen on every start? From the latest importer and lightning logs, there are no other obvious errors.

To make sure no ports are left occupied, every time I use the importer and lightning components I first run the stop script to confirm the previous services are cleared, then run the start script to bring them up. Is there anything wrong with this process?

I wiped the data in the database and re-ran the import, but still hit the problem described above: lightning.log:

[2020/03/29 19:38:00.144 +08:00] [ERROR] [restore.go:260] ["run failed"] [step=2] [error="[41c24811-f498-5f61-83ed-ceacb685ed01] import reach max retry 3 and still failed: rpc error: code = Unknown desc = ImportJobFailed(\"retry 5 times still 28 ranges failed\")"]

[2020/03/29 19:38:00.144 +08:00] [ERROR] [restore.go:266] ["the whole procedure failed"] [takeTime=2h9m7.671443124s] [error="[41c24811-f498-5f61-83ed-ceacb685ed01] import reach max retry 3 and still failed: rpc error: code = Unknown desc = ImportJobFailed(\"retry 5 times still 28 ranges failed\")"]

[2020/03/29 19:38:00.144 +08:00] [ERROR] [restore.go:115] ["tables failed to be imported"] [count=1]

[2020/03/29 19:38:00.144 +08:00] [ERROR] [restore.go:117] [-] [table=Block_Base.block_info] [status=imported] [error="[41c24811-f498-5f61-83ed-ceacb685ed01] import reach max retry 3 and still failed: rpc error: code = Unknown desc = ImportJobFailed(\"retry 5 times still 28 ranges failed\")"]

[2020/03/29 19:38:00.144 +08:00] [INFO] [restore.go:442] ["everything imported, stopping periodic actions"]

[2020/03/29 19:38:00.145 +08:00] [ERROR] [main.go:59] [“tidb lightning encountered error”] [error="[41c24811-f498-5f61-83ed-ceacb685ed01] import reach max retry 3 and still failed: rpc error: code = Unknown desc = ImportJobFailed(“retry 5 times still 28 ranges failed”)"] [errorVerbose=“rpc error: code = Unknown desc = ImportJobFailed(“retry 5 times still 28 ranges failed”)\ngithub.com/pingcap/errors.AddStack\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/pkg/mod/github.com/pingcap/errors@v0.11.4/errors.go:174\ngithub.com/pingcap/errors.Trace\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/pkg/mod/github.com/pingcap/errors@v0.11.4/juju_adaptor.go:15\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*importer).ImportEngine\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/src/github.com/pingcap/tidb-lightning/lightning/backend/importer.go:135\ngithub.com/pingcap/tidb-lightning/lightning/backend.(*ClosedEngine).Import\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/src/github.com/pingcap/tidb-lightning/lightning/backend/backend.go:281\ngithub.com/pingcap/tidb-lightning/lightning/restore.(*TableRestore).importKV\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:1399\ngithub.com/pingcap/tidb-lightning/lightning/restore.(*TableRestore).importEngine\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:958\ngithub.com/pingcap/tidb-lightning/lightning/restore.(*TableRestore).restoreEngines.func1\n\t/home/jenkins/agent/workspace/release_tidb_3.1/go/src/github.com/pingcap/tidb-lightning/lightning/restore/restore.go:786\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1337\n[41c24811-f498-5f61-83ed-ceacb685ed01] import reach max retry 3 and still failed”]

importer.log:

[2020/03/29 11:37:47.861 +00:00] [INFO] [import.rs:428] [upload] [store=10] [tag="[ImportSSTJob 41c24811-f498-5f61-83ed-ceacb685ed01:11:846]"]

[2020/03/29 11:37:56.977 +00:00] [INFO] [import.rs:463] [ingest] [store=4] [tag="[ImportSSTJob 41c24811-f498-5f61-83ed-ceacb685ed01:11:846]"]

[2020/03/29 11:37:56.977 +00:00] [INFO] [import.rs:333] ["import sst"] [takes=14.236807784s] [tag="[ImportSSTJob 41c24811-f498-5f61-83ed-ceacb685ed01:11:846]"]

[2020/03/29 11:37:56.977 +00:00] [INFO] [engine.rs:204] ["cleanup SST"] [file_path="/home/daslab/importer/data.import/41c24811-f498-5f61-83ed-ceacb685ed01/.1f57b11d-ad9b-4a9e-9b9d-1279c000a295:write"]

[2020/03/29 11:37:56.978 +00:00] [WARN] [import.rs:82] ["still has ranges need to retry"] [current_round=4] [retry_count=28] [tag="[ImportJob 41c24811-f498-5f61-83ed-ceacb685ed01]"]

[2020/03/29 11:37:56.978 +00:00] [ERROR] [import.rs:103] ["import engine failed"] [err="retry 5 times still 28 ranges failed"] [tag="[ImportJob 41c24811-f498-5f61-83ed-ceacb685ed01]"]

[2020/03/29 11:37:56.998 +00:00] [ERROR] [kv_importer.rs:142] ["import failed"] [err="retry 5 times still 28 ranges failed"] [uuid=41c24811-f498-5f61-83ed-ceacb685ed01]

I am uploading the detailed log files here; perhaps you can resolve my confusion from them:

tidb_lightning.log (361.8 KB)

tikv_importer_1.log (4.6 MB)

tikv_importer_2.log (4.7 MB)

tikv_importer_3.log (595.3 KB)

From tikv_importer_3.log, some TiKV nodes are running out of disk space and need to be cleaned up; this causes importer upload sst failed.

I am currently using 5 TiKV nodes deployed on separate machines. From Grafana I can see that 232 is far below the other 4 TiKV nodes in used space, region count, and leader count. But I can also see that the available space on all five TiKV nodes is sufficient. Is there some further reason behind "insufficient disk space"?

TiKV info

PD info

  1. From the monitoring, the store size of 232 is relatively small. Please upload the PD and TiKV monitoring info: (1) install this Chrome extension: https://chrome.google.com/webstore/detail/full-page-screen-capture/fdpohaocaechififmbbbbbknoalclacl

(2) With the mouse focused on the Dashboard, press ? to show all shortcuts; press d then E to expand the panels of all Rows, then wait a while for the page to finish loading.

(3) Capture and save the page using the full-page-screen-capture extension.

  1. Please upload the inventory.ini file.
  2. In pd-ctl, run config show all, store, and member, and share the results, thanks.
  3. The monitoring shows 3 disconnected stores; please check whether the network is normal.

config show all

tidb@three:~/tidb-ansible/resources/bin$ ./pd-ctl -u http://10.12.5.231:2379 config show all { “client-urls”: “http://10.12.5.232:2379”, “peer-urls”: “http://10.12.5.232:2380”, “advertise-client-urls”: “http://10.12.5.232:2379”, “advertise-peer-urls”: “http://10.12.5.232:2380”, “name”: “pd_three”, “data-dir”: “/home/tidb/deploy/data.pd”, “force-new-cluster”: false, “enable-grpc-gateway”: true, “initial-cluster”: “pd_one=http://10.12.5.234:2380,pd_three=http://10.12.5.232:2380,pd_four=http://10.12.5.231:2380”, “initial-cluster-state”: “new”, “join”: “”, “lease”: 3, “log”: { “level”: “info”, “format”: “text”, “disable-timestamp”: false, “file”: { “filename”: “/home/tidb/deploy/log/pd.log”, “log-rotate”: true, “max-size”: 300, “max-days”: 0, “max-backups”: 0 }, “development”: false, “disable-caller”: false, “disable-stacktrace”: false, “disable-error-verbose”: true, “sampling”: null }, “log-file”: “”, “log-level”: “”, “tso-save-interval”: “3s”, “metric”: { “job”: “pd_three”, “address”: “”, “interval”: “15s” }, “schedule”: { “max-snapshot-count”: 3, “max-pending-peer-count”: 16, “max-merge-region-size”: 20, “max-merge-region-keys”: 200000, “split-merge-interval”: “1h0m0s”, “enable-one-way-merge”: “false”, “patrol-region-interval”: “100ms”, “max-store-down-time”: “30m0s”, “leader-schedule-limit”: 4, “region-schedule-limit”: 4, “replica-schedule-limit”: 8, “merge-schedule-limit”: 8, “hot-region-schedule-limit”: 4, “hot-region-cache-hits-threshold”: 3, “store-balance-rate”: 15, “tolerant-size-ratio”: 5, “low-space-ratio”: 0.8, “high-space-ratio”: 0.6, “scheduler-max-waiting-operator”: 3, “disable-raft-learner”: “false”, “disable-remove-down-replica”: “false”, “disable-replace-offline-replica”: “false”, “disable-make-up-replica”: “false”, “disable-remove-extra-replica”: “false”, “disable-location-replacement”: “false”, “disable-namespace-relocation”: “false”, “schedulers-v2”: [ { “type”: “balance-region”, “args”: null, “disable”: false }, { “type”: “balance-leader”, 
“args”: null, “disable”: false }, { “type”: “hot-region”, “args”: null, “disable”: false }, { “type”: “label”, “args”: null, “disable”: false } ] }, “replication”: { “max-replicas”: 3, “location-labels”: “”, “strictly-match-label”: “false” }, “namespace”: {}, “pd-server”: { “use-region-storage”: “true”, “max-reset-ts-gap”: 86400000000000 }, “cluster-version”: “3.1.0-beta.1”, “quota-backend-bytes”: “0B”, “auto-compaction-mode”: “periodic”, “auto-compaction-retention-v2”: “1h”, “TickInterval”: “500ms”, “ElectionInterval”: “3s”, “PreVote”: true, “security”: { “cacert-path”: “”, “cert-path”: “”, “key-path”: “” }, “label-property”: {}, “WarningMsgs”: null, “namespace-classifier”: “table”, “LeaderPriorityCheckInterval”: “1m0s” }

store

tidb@three:~/tidb-ansible/resources/bin$ ./pd-ctl -u http://10.12.5.231:2379 store { “count”: 5, “stores”: [ { “store”: { “id”: 7, “address”: “10.12.5.232:20160”, “version”: “3.1.0-beta.1”, “state_name”: “Up” }, “status”: { “capacity”: “1.968TiB”, “available”: “806.1GiB”, “leader_count”: 971, “leader_weight”: 1, “leader_score”: 83795, “leader_size”: 83795, “region_count”: 1050, “region_weight”: 1, “region_score”: 90620, “region_size”: 90620, “start_ts”: “2020-03-30T11:00:06Z”, “last_heartbeat_ts”: “2020-03-30T11:07:07.230688229Z”, “uptime”: “7m1.230688229s” } }, { “store”: { “id”: 10, “address”: “10.12.5.231:20160”, “version”: “3.1.0-beta.1”, “state_name”: “Up” }, “status”: { “capacity”: “1007GiB”, “available”: “704.4GiB”, “leader_count”: 1215, “leader_weight”: 1, “leader_score”: 107822, “leader_size”: 107822, “region_count”: 4092, “region_weight”: 1, “region_score”: 364827, “region_size”: 364827, “start_ts”: “2020-03-30T10:59:56Z”, “last_heartbeat_ts”: “2020-03-30T11:07:06.829900063Z”, “uptime”: “7m10.829900063s” } }, { “store”: { “id”: 340, “address”: “10.12.5.226:20160”, “version”: “3.1.0-beta.1”, “state_name”: “Up” }, “status”: { “capacity”: “1007GiB”, “available”: “721.7GiB”, “leader_count”: 1186, “leader_weight”: 1, “leader_score”: 107184, “leader_size”: 107184, “region_count”: 4073, “region_weight”: 1, “region_score”: 365093, “region_size”: 365093, “start_ts”: “2020-03-30T10:59:57Z”, “last_heartbeat_ts”: “2020-03-30T11:07:07.78239106Z”, “uptime”: “7m10.78239106s” } }, { “store”: { “id”: 1, “address”: “10.12.5.234:20160”, “version”: “3.1.0-beta.1”, “state_name”: “Up” }, “status”: { “capacity”: “1007GiB”, “available”: “700.6GiB”, “leader_count”: 1204, “leader_weight”: 1, “leader_score”: 108787, “leader_size”: 108787, “region_count”: 4080, “region_weight”: 1, “region_score”: 364662, “region_size”: 364662, “start_ts”: “2020-03-30T10:59:58Z”, “last_heartbeat_ts”: “2020-03-30T11:06:48.966995808Z”, “uptime”: “6m50.966995808s” } }, { “store”: { “id”: 4, “address”: 
“10.12.5.233:20160”, “version”: “3.1.0-beta.1”, “state_name”: “Up” }, “status”: { “capacity”: “1007GiB”, “available”: “709.4GiB”, “leader_count”: 1213, “leader_weight”: 1, “leader_score”: 108704, “leader_size”: 108704, “region_count”: 4072, “region_weight”: 1, “region_score”: 363674, “region_size”: 363674, “start_ts”: “2020-03-30T10:59:59Z”, “last_heartbeat_ts”: “2020-03-30T11:07:00.247442233Z”, “uptime”: “7m1.247442233s” } } ] }
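For reference, PD only treats a store as low on space once used space crosses low-space-ratio (0.8 in the config show all output above). A rough sanity check against the store 7 figures reported here (GiB values rounded; note this says nothing about the importer machine's own data.import directory, which also needs free space):

```shell
# Store 7 from `pd-ctl store`: capacity 1.968 TiB ~= 2015 GiB, available 806 GiB.
capacity=2015
available=806
used_ratio=$(awk -v c="$capacity" -v a="$available" 'BEGIN { printf "%.2f", (c - a) / c }')
echo "used_ratio=$used_ratio"   # well below the 0.8 low-space-ratio threshold
```

The same arithmetic applied to the other stores also stays under 0.8, so the "insufficient disk space" message is worth tracing to the specific disk (TiKV data dir vs. importer data dir) named in tikv_importer_3.log.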

member

{ “header”: { “cluster_id”: 6807312755917103041 }, “members”: [ { “name”: “pd_three”, “member_id”: 16490134911803683457, “peer_urls”: [ “http://10.12.5.232:2380” ], “client_urls”: [ “http://10.12.5.232:2379” ] }, { “name”: “pd_one”, “member_id”: 17149276931683771649, “peer_urls”: [ “http://10.12.5.234:2380” ], “client_urls”: [ “http://10.12.5.234:2379” ] }, { “name”: “pd_four”, “member_id”: 17567682691042087964, “peer_urls”: [ “http://10.12.5.231:2380” ], “client_urls”: [ “http://10.12.5.231:2379” ] } ], “leader”: { “name”: “pd_three”, “member_id”: 16490134911803683457, “peer_urls”: [ “http://10.12.5.232:2380” ], “client_urls”: [ “http://10.12.5.232:2379” ] }, “etcd_leader”: { “name”: “pd_three”, “member_id”: 16490134911803683457, “peer_urls”: [ “http://10.12.5.232:2380” ], “client_urls”: [ “http://10.12.5.232:2379” ] } }

The inventory.ini file:

# TiDB Cluster Part
[tidb_servers]
10.12.5.233
10.12.5.232

[tikv_servers]
10.12.5.234
10.12.5.233
10.12.5.232
10.12.5.231
10.12.5.226

[pd_servers]
10.12.5.234
10.12.5.232
10.12.5.231

[spark_master]

[spark_slaves]

[lightning_server]
10.12.5.111

[importer_server]
10.12.5.112

# Monitoring Part
# prometheus and pushgateway servers
[monitoring_servers]
10.12.5.232

[grafana_servers]
10.12.5.232

# node_exporter and blackbox_exporter servers
[monitored_servers]
10.12.5.234
10.12.5.233
10.12.5.232
10.12.5.231
10.12.5.226

[alertmanager_servers]
10.12.5.232

[kafka_exporter_servers]

# Binlog Part
[pump_servers]

[drainer_servers]

# Group variables
[pd_servers:vars]
location_labels = ["zone","rack","host"]

# Global variables
[all:vars]
deploy_dir = /home/tidb/deploy

# Connection
# ssh via normal user
ansible_user = tidb

cluster_name = test-cluster

tidb_version = v3.1.0-beta.1

# process supervision, [systemd, supervise]
process_supervision = systemd

timezone = Asia/Shanghai

enable_firewalld = False
# check NTP service
enable_ntpd = True
set_hostname = False

# binlog trigger
enable_binlog = False
# kafka cluster address for monitoring, example:
# kafka_addrs = "192.168.0.11:9092,192.168.0.12:9092,192.168.0.13:9092"
kafka_addrs = ""
# zookeeper address of kafka cluster for monitoring, example:
# zookeeper_addrs = "192.168.0.11:2181,192.168.0.12:2181,192.168.0.13:2181"
zookeeper_addrs = ""

# enable TLS authentication in the TiDB cluster
enable_tls = False

# KV mode
deploy_without_tidb = False

# wait for region replication complete before start tidb-server.
wait_replication = True

# Optional: Set if you already have a alertmanager server.
# Format: alertmanager_host:alertmanager_port
alertmanager_target = ""

grafana_admin_user = "admin"
grafana_admin_password = "123456"

# Collect diagnosis
collect_log_recent_hours = 2

enable_bandwidth_limit = True
# default: 10Mb/s, unit: Kbit/s
collect_bandwidth_limit = 10000

TiKV info:

PD info 1

PD info 2