Lightning migration hangs: import reach max retry 3 and still failed

I found the error in tikv_importer_1.log. At that point in time, stores 1, 4, 10, and 340 all look abnormal.

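Could you upload the tikv.log for that time window? For example, pull the log for that period from the machine hosting store 1, so we can see what happened to TiKV during that time.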

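Hello,

Based on the answers above, stores 1, 4, 10, and 340 all have a "capacity" of "1007GiB", while store 7 has a "capacity" of "1.968TiB". If the import keeps writing to store 7, the other stores may run out of available space. Please make sure every TiKV instance has the same capacity. Thanks.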
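As a rough way to double-check the capacity mismatch described above, here is a minimal Python sketch. It assumes you have already collected each store's capacity string (in the form quoted in this thread) into a dict. The helper names `parse_size` and `capacity_outliers` and the 5% tolerance are made up for illustration and are not part of any TiDB tool.

```python
import re

# Binary unit multipliers for size strings such as "1007GiB" or "1.968TiB".
_UNITS = {
    "B": 1,
    "KiB": 1024,
    "MiB": 1024 ** 2,
    "GiB": 1024 ** 3,
    "TiB": 1024 ** 4,
    "PiB": 1024 ** 5,
}


def parse_size(text):
    """Convert a size string such as '1.968TiB' to a number of bytes."""
    m = re.fullmatch(r"\s*([0-9]*\.?[0-9]+)\s*([KMGTP]iB|B)\s*", text)
    if m is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return float(m.group(1)) * _UNITS[m.group(2)]


def capacity_outliers(capacities, tolerance=0.05):
    """Return the store IDs whose capacity exceeds the smallest store's
    capacity by more than `tolerance` (a hypothetical 5% threshold)."""
    sizes = {store_id: parse_size(cap) for store_id, cap in capacities.items()}
    smallest = min(sizes.values())
    return sorted(
        store_id
        for store_id, size in sizes.items()
        if (size - smallest) / smallest > tolerance
    )


# Capacity values as reported in this thread.
capacities = {1: "1007GiB", 4: "1007GiB", 10: "1007GiB", 340: "1007GiB", 7: "1.968TiB"}
print(capacity_outliers(capacities))  # prints [7]
```

With the values from this thread, only store 7 is flagged, because its capacity is roughly twice that of the other four stores.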

I reran the import. While the lightning service is stalled, the tikv.log for store 1 mainly shows the kinds of log entries below. However, Grafana shows a disconnect count of 0:

Type 1:

[2020/03/30 15:01:37.769 +00:00] [INFO] [peer_storage.rs:1327] [“finish clear peer meta”] [takes=40.157069ms] [raft_logs=47] [raft_key=1] [apply_key=1] [meta_key=1] [region_id=41405]

[2020/03/30 15:01:37.769 +00:00] [INFO] [peer.rs:588] [“peer destroy itself”] [takes=40.282729ms] [peer_id=44624] [region_id=41405]

[2020/03/30 15:01:37.769 +00:00] [INFO] [router.rs:446] ["[region 41405] shutdown mailbox"]

[2020/03/30 15:01:37.769 +00:00] [INFO] [region.rs:452] [“register deleting data in range”] [end_key=7A7480000000000000FF4B5F698000000000FF0000030400000000FF0A1FB40303800000FF0005ED7795000000FC] [start_key=7A7480000000000000FF4B5F698000000000FF0000030400000000FF0A110E0303800000FF0001B955E8000000FC] [region_id=41405]

[2020/03/30 15:01:38.696 +00:00] [INFO] [peer.rs:724] [“failed to schedule peer tick”] [err=“sending on a disconnected channel”] [tick=RAFT] [peer_id=44624] [region_id=41405]

[2020/03/30 15:01:38.993 +00:00] [INFO] [peer.rs:724] [“failed to schedule peer tick”] [err=“sending on a disconnected channel”] [tick=PD_HEARTBEAT] [peer_id=35027] [region_id=35026]

[2020/03/30 15:01:41.918 +00:00] [INFO] [peer.rs:724] [“failed to schedule peer tick”] [err=“sending on a disconnected channel”] [tick=SPLIT_REGION_CHECK] [peer_id=44624] [region_id=41405]

[2020/03/30 15:01:41.918 +00:00] [INFO] [peer.rs:724] [“failed to schedule peer tick”] [err=“sending on a disconnected channel”] [tick=RAFT_LOG_GC] [peer_id=44624] [region_id=41405]

[2020/03/30 15:01:46.066 +00:00] [INFO] [peer.rs:724] [“failed to schedule peer tick”] [err=“sending on a disconnected channel”] [tick=SPLIT_REGION_CHECK] [peer_id=35027] [region_id=35026]

[2020/03/30 15:01:46.066 +00:00] [INFO] [peer.rs:724] [“failed to schedule peer tick”] [err=“sending on a disconnected channel”] [tick=RAFT_LOG_GC] [peer_id=35027] [region_id=35026]

Type 2:

[2020/03/30 15:24:10.422 +00:00] [INFO] [raft.rs:891] ["[region 90958] 90961 [logterm: 7, index: 12, vote: 90961] ignored MsgRequestPreVote vote from 90959 [logterm: 7, index: 12] at term 7: lease is not expired (remaining ticks: 8)"]

[2020/03/30 15:24:10.422 +00:00] [ERROR] [client.rs:342] [“failed to send heartbeat”] [err=“Grpc(RpcFinished(Some(RpcStatus { status: Unknown, details: Some(“rpc error: code = Unavailable desc = not leader”) })))”]

[2020/03/30 15:24:10.422 +00:00] [ERROR] [util.rs:287] [“request failed, retry”] [err=“Grpc(RpcFinished(Some(RpcStatus { status: Unknown, details: Some(“rpc error: code = Unavailable desc = not leader”) })))”]

[2020/03/30 15:24:10.422 +00:00] [ERROR] [util.rs:287] [“request failed, retry”] [err=“Other(SendError(”…"))"]

[2020/03/30 15:24:10.422 +00:00] [ERROR] [util.rs:287] [“request failed, retry”] [err=“Other(SendError(”…"))"]

[2020/03/30 15:24:10.422 +00:00] [INFO] [util.rs:233] [“updating PD client, block the tokio core”]

[2020/03/30 15:24:10.422 +00:00] [INFO] [util.rs:397] [“connecting to PD endpoint”] [endpoints=http://10.12.5.232:2379]

[2020/03/30 15:24:10.422 +00:00] [INFO] [raft.rs:891] ["[region 2204] 2205 [logterm: 193, index: 1680, vote: 2205] ignored MsgRequestPreVote vote from 7022 [logterm: 193, index: 1680] at term 193: lease is not expired (remaining ticks: 10)"]

[2020/03/30 15:24:10.422 +00:00] [INFO] [raft.rs:891] ["[region 58211] 64662 [logterm: 56, index: 3916, vote: 64662] ignored MsgRequestPreVote vote from 58214 [logterm: 56, index: 3916] at term 56: lease is not expired (remaining ticks: 10)"]

[2020/03/30 15:24:10.423 +00:00] [INFO] [subchannel.cc:841] [“New connected subchannel at 0x7fa6fbc3c670 for subchannel 0x7fa6fb822000”]

[2020/03/30 15:24:10.424 +00:00] [INFO] [util.rs:397] [“connecting to PD endpoint”] [endpoints=http://10.12.5.234:2379]

[2020/03/30 15:24:10.424 +00:00] [INFO] [raft.rs:891] ["[region 21066] 21178 [logterm: 165, index: 12395, vote: 21069] ignored MsgRequestPreVote vote from 21069 [logterm: 165, index: 12395] at term 165: lease is not expired (remaining ticks: 4)"]

[2020/03/30 15:24:10.429 +00:00] [INFO] [util.rs:456] [“connected to PD leader”] [endpoints=http://10.12.5.234:2379]

[2020/03/30 15:24:10.429 +00:00] [WARN] [util.rs:176] [“heartbeat sender and receiver are stale, refreshing …”]

[2020/03/30 15:24:10.429 +00:00] [INFO] [raft.rs:891] ["[region 13103] 15202 [logterm: 171, index: 27821, vote: 13105] ignored MsgRequestPreVote vote from 13105 [logterm: 171, index: 27821] at term 171: lease is not expired (remaining ticks: 8)"]

[2020/03/30 15:24:10.429 +00:00] [WARN] [util.rs:195] [“updating PD client done”] [spend=6.678606ms]

[2020/03/30 15:24:10.429 +00:00] [INFO] [client.rs:323] [“heartbeat sender is refreshed”]

[2020/03/30 15:24:10.429 +00:00] [INFO] [util.rs:73] [“heartbeat receiver is refreshed”]

[2020/03/30 15:24:10.429 +00:00] [ERROR] [client.rs:342] [“failed to send heartbeat”] [err=Grpc(RpcFinished(None))]

[2020/03/30 15:24:10.429 +00:00] [ERROR] [util.rs:287] [“request failed, retry”] [err=Grpc(RpcFinished(None))]

[2020/03/30 15:24:10.429 +00:00] [ERROR] [util.rs:287] [“request failed, retry”] [err=“Other(SendError(”…"))"]

[2020/03/30 15:24:10.429 +00:00] [ERROR] [util.rs:287] [“request failed, retry”] [err=“Other(SendError(”…"))"]

[2020/03/30 15:24:10.429 +00:00] [INFO] [util.rs:233] [“updating PD client, block the tokio core”]

[2020/03/30 15:24:10.429 +00:00] [ERROR] [util.rs:287] [“request failed, retry”] [err=“Other(SendError(”…"))"]

[2020/03/30 15:24:10.429 +00:00] [ERROR] [util.rs:287] [“request failed, retry”] [err=“Other(SendError(”…"))"]

[2020/03/30 15:24:10.429 +00:00] [ERROR] [util.rs:287] [“request failed, retry”] [err=“Other(SendError(”…"))"]

[2020/03/30 15:24:10.429 +00:00] [INFO] [util.rs:233] [“updating PD client, block the tokio core”]

[2020/03/30 15:24:10.429 +00:00] [ERROR] [util.rs:287] [“request failed, retry”] [err=“Other(SendError(”…"))"]

[2020/03/30 15:24:10.430 +00:00] [ERROR] [util.rs:287] [“request failed, retry”] [err=“Other(SendError(”…"))"]

[2020/03/30 15:24:10.430 +00:00] [INFO] [util.rs:233] [“updating PD client, block the tokio core”]

[2020/03/30 15:24:10.430 +00:00] [ERROR] [util.rs:287] [“request failed, retry”] [err=“Other(SendError(”…"))"]

[2020/03/30 15:24:10.430 +00:00] [INFO] [peer.rs:1834] [“transfer leader”] [peer=“id: 55431 store_id: 10”] [peer_id=37287] [region_id=3190]

[2020/03/30 15:24:10.430 +00:00] [INFO] [raft.rs:1271] ["[region 3190] 37287 [term 175] transfer leadership to 55431 is in progress, ignores request to same node 55431"]

[2020/03/30 15:24:10.430 +00:00] [ERROR] [util.rs:287] [“request failed, retry”] [err=“Other(SendError(”…"))"]

[2020/03/30 15:24:10.430 +00:00] [ERROR] [util.rs:287] [“request failed, retry”] [err=“Other(SendError(”…"))"]

[2020/03/30 15:24:10.430 +00:00] [INFO] [util.rs:233] [“updating PD client, block the tokio core”]
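For a tikv.log much longer than the excerpts above, it can help to count entries by level and source location before reading them one by one. The following is only a sketch: it assumes each line starts with `[timestamp] [LEVEL] [file:line]` as in the excerpts, and the function name `summarize` is hypothetical. The sample lines are copied from the Type 2 excerpt, with quotation marks normalized to straight quotes.

```python
import re
from collections import Counter

# Leading fields of a log line as seen in the excerpts:
# [timestamp] [LEVEL] [file:line] followed by the message and key=value fields.
_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] \[(?P<level>[A-Z]+)\] \[(?P<src>[^\]]+)\] "
)


def summarize(lines):
    """Count log lines per (level, source location); unparsable lines are skipped."""
    counts = Counter()
    for line in lines:
        m = _LINE.match(line.strip())
        if m:
            counts[(m.group("level"), m.group("src"))] += 1
    return counts


sample = [
    '[2020/03/30 15:24:10.429 +00:00] [ERROR] [client.rs:342] ["failed to send heartbeat"] [err=Grpc(RpcFinished(None))]',
    '[2020/03/30 15:24:10.429 +00:00] [ERROR] [util.rs:287] ["request failed, retry"] [err=Grpc(RpcFinished(None))]',
    '[2020/03/30 15:24:10.429 +00:00] [INFO] [util.rs:233] ["updating PD client, block the tokio core"]',
]

for (level, src), n in summarize(sample).most_common():
    print(level, src, n)
```

To run it against a real file, pass an open file object instead of `sample`, for example `summarize(open("tikv.log", encoding="utf-8"))`.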

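Hello, scaling in or out is not convenient on our side. Could we instead make the storage capacity of all TiKV instances consistent by modifying capacity?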

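  1. The PD monitoring shows that a PD leader anomaly occurred.
  2. The system logs show an abnormal exit.
  3. Since these are virtual machines on mechanical disks (HDDs), we suggest first adjusting the TiKV instances to the same capacity, then testing again with lightning.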