kettle获取字段,tikv挂掉

  • 【TiDB 版本】:v.4.0.1
  • 【问题描述】:kettle获取表字段会导致多个tikv挂掉,必须重启集群才行. 自己把字段写上去是没事的,只有用kettle自己获取字段的时候才会有问题
  1. 使用kettle的Table input
    image
  2. 点击选择的表
  3. 会提示是否需要获取所有字段,这步kettle就会卡住,tikv也会挂掉

1.请问,tidb是什么版本?
2. 这步获取的时候,具体执行的什么SQL?
3. 麻烦看下 dmesg 是否有OOM信息。
4. 看下 tidb.log 和 tikv.log 是否有报错信息

  1. tidb版本:Release Version: v4.0.1
  2. 就是普通的select * from table
  3. 没有OOM信息
  4. tikv一直刷新一下日志
    希望你们测试下kettle 谢谢
[2020/06/28 16:57:49.230 +08:00] [INFO] [raft.rs:923] ["[logterm: 102, index: 8258] sent request to 2851"] [msg=MsgRequestPreVote] [term=102] [id=2851] [log_index=8258] [log_term=102] [raft_id=2850] [region_id=2849]
[2020/06/28 16:57:49.230 +08:00] [INFO] [raft.rs:1177] ["starting a new election"] [term=106] [raft_id=1698] [region_id=1697]
[2020/06/28 16:57:49.230 +08:00] [INFO] [raft.rs:833] ["became pre-candidate at term 106"] [term=106] [raft_id=1698] [region_id=1697]
[2020/06/28 16:57:49.230 +08:00] [INFO] [raft.rs:902] ["1698 received message from 1698"] [term=106] [msg=MsgRequestPreVote] [from=1698] [id=1698] [raft_id=1698] [region_id=1697]
[2020/06/28 16:57:49.230 +08:00] [INFO] [raft.rs:923] ["[logterm: 106, index: 113] sent request to 1655261"] [msg=MsgRequestPreVote] [term=106] [id=1655261] [log_index=113] [log_term=106] [raft_id=1698] [region_id=1697]
[2020/06/28 16:57:49.230 +08:00] [INFO] [raft.rs:923] ["[logterm: 106, index: 113] sent request to 1699"] [msg=MsgRequestPreVote] [term=106] [id=1699] [log_index=113] [log_term=106] [raft_id=1698] [region_id=1697]
[2020/06/28 16:57:49.230 +08:00] [INFO] [raft.rs:1177] ["starting a new election"] [term=99] [raft_id=4370] [region_id=4369]
[2020/06/28 16:57:49.230 +08:00] [INFO] [raft.rs:833] ["became pre-candidate at term 99"] [term=99] [raft_id=4370] [region_id=4369]
[2020/06/28 16:57:49.230 +08:00] [INFO] [raft.rs:902] ["4370 received message from 4370"] [term=99] [msg=MsgRequestPreVote] [from=4370] [id=4370] [raft_id=4370] [region_id=4369]
[2020/06/28 16:57:49.230 +08:00] [INFO] [raft.rs:923] ["[logterm: 99, index: 108] sent request to 1655361"] [msg=MsgRequestPreVote] [term=99] [id=1655361] [log_index=108] [log_term=99] [raft_id=4370] [region_id=4369]
[2020/06/28 16:57:49.230 +08:00] [INFO] [raft.rs:923] ["[logterm: 99, index: 108] sent request to 4371"] [msg=MsgRequestPreVote] [term=99] [id=4371] [log_index=108] [log_term=99] [raft_id=4370] [region_id=4369]
[2020/06/28 16:57:49.230 +08:00] [INFO] [raft.rs:1177] ["starting a new election"] [term=95] [raft_id=2086] [region_id=2085]
[2020/06/28 16:57:49.230 +08:00] [INFO] [raft.rs:833] ["became pre-candidate at term 95"] [term=95] [raft_id=2086] [region_id=2085]
[2020/06/28 16:57:49.230 +08:00] [INFO] [raft.rs:902] ["2086 received message from 2086"] [term=95] [msg=MsgRequestPreVote] [from=2086] [id=2086] [raft_id=2086] [region_id=2085]
[2020/06/28 16:57:49.230 +08:00] [INFO] [raft.rs:923] ["[logterm: 95, index: 106] sent request to 1186539"] [msg=MsgRequestPreVote] [term=95] [id=1186539] [log_index=106] [log_term=95] [raft_id=2086] [region_id=2085]
[2020/06/28 16:57:49.230 +08:00] [INFO] [raft.rs:923] ["[logterm: 95, index: 106] sent request to 2087"] [msg=MsgRequestPreVote] [term=95] [id=2087] [log_index=106] [log_term=95] [raft_id=2086] [region_id=2085]
[2020/06/28 16:57:49.230 +08:00] [INFO] [<unknown>] ["Connect failed: {\"created\":\"@1593334669.230401197\",\"description\":\"Failed to connect to remote host: Connection refused\",\"errno\":111,\"file\":\"/rust/registry/src/github.com-1ecc6299db9ec823/grpcio-sys-0.5.3/grpc/src/core/lib/iomgr/tcp_client_posix.cc\",\"file_line\":200,\"os_error\":\"Connection refused\",\"syscall\":\"connect\",\"target_address\":\"ipv4:192.168.1.134:20160\"}"]
[2020/06/28 16:57:49.230 +08:00] [INFO] [<unknown>] ["Subchannel 0x7f12266a8cc0: Retry in 1000 milliseconds"]
[2020/06/28 16:57:49.230 +08:00] [WARN] [raft_client.rs:296] ["RPC batch_raft fail"] [err="Some(RpcFailure(RpcStatus { status: 14-UNAVAILABLE, details: Some(\"failed to connect to all addresses\") }))"] [sink_err="Some(RpcFinished(Some(RpcStatus { status: 14-UNAVAILABLE, details: Some(\"failed to connect to all addresses\") })))"] [to_addr=192.168.1.134:20160]

  1. 麻烦在命令行执行 explain analyze select * from table 反馈下执行计划,多谢
  2. 麻烦上传执行过程中前后的, tidb.log 和 tikv.log 辛苦了,多谢。
  1. 执行计划
TableReader_5	25956238.00	25956238	root		time:4m58.078042568s, loops:25370, rpc num: 48, rpc max:2m54.755333038s, min:1.954343353s, avg:33.618897478s, p80:1m1.634654984s, p95:2m50.024557712s, proc keys max:773937, p95:742868	data:TableFullScan_4	480.0145454406738 MB	N/A
└─TableFullScan_4	25956238.00	25956238	cop[tikv]	table:t_fund_netvalue	proc max:9.77s, min:308ms, p80:3.054s, p95:4.263s, iters:25562, tasks:48	keep order:false	N/A	N/A

2.日志,tikv我清空了以前的日志 保留的只是执行产生的日志
tikv.log (4.8 MB) tidb.log (3.0 MB)

  1. tidb 日志看应该是有重启

  1. tikv 日志给的时间不对,和 tidb 对应不上

  1. 这个sql查看占用了480M,请问当前这个table表有多少数据量?

1.我又重新操作了一遍,我执行了两遍语句所以有三次Welcome to TiKV,第一次是我重启了集群产生的
tikv.log.7z (1.0 MB)
tidb.log.7z (37.0 KB)
2. 执行计划

TableReader_5	25956238.00	25956238	root		time:4m57.156118107s, loops:25370, rpc num: 48, rpc max:2m25.416974677s, min:3.542409359s, avg:17.556123349s, p80:16.884544167s, p95:58.096784273s, proc keys max:773937, p95:742868	data:TableFullScan_4	475.29619693756104 MB	N/A
└─TableFullScan_4	25956238.00	25956238	cop[tikv]	table:t_fund_netvalue	proc max:5.505s, min:656ms, p80:3.573s, p95:4.945s, iters:25562, tasks:48	keep order:false	N/A	N/A

请检查下 tidb 和 tikv 服务器是系统时间不一致吗? 看tidb 和 tikv 日志时间相差很大

我看了一下其中有一台tikv的机器时区和其他的不一样但是时间是对的啊

  1. 麻烦帮忙测试下,执行 select查询 加条件 limit 1 是否能出结果,麻烦展示下,多谢。
  2. 随便查个业务表 limit 1 是否能出结果,麻烦展示下,多谢

加条件limit 1能出来结果,查询都没有问题.我发现大表执行explain analyze ,tikv都会重启,而且还好慢.

好的,麻烦了,我们分析下,会尽快回复,多谢。

好的,麻烦了

  1. 麻烦取一下问题发生时的监控信息 over-view,tidb ,tikv-detail ,麻烦了

(1)、chrome 安装这个插件https://chrome.google.com/webstore/detail/full-page-screen-capture/fdpohaocaechififmbbbbbknoalclacl

(2)、鼠标焦点置于 Dashboard 上,按 ?可显示所有快捷键,先按 d 再按 E 可将所有 Rows 的 Panels 打开,需等待一段时间待页面加载完成。

(3)、使用这个 full-page-screen-capture 插件进行截屏保存

  1. 由于文件有点大我上传到网盘了
    https://wws.lanzous.com/iGRtGe6lqif
    2.我一共运行了两次explain analyze SELECT * FROM语句,因为运行第一的时候运行到一半链接直接断开了

根据监控看,当执行查询的时候,内存会占用的较多,达到了5,6个G。咨询研发同事,这里之后会有改善的计划。当前是否有可能不使用 select * from table 绕过,加上 limit 条件,或者走索引字段试试吗?
image

加limit和走索引字段是没有问题的

如果这样业务上能够满足的话,辛苦调整下sql,先这样使用,多谢。

kettle源码我加上limit也不行..还是会造成tikv重启

这里麻烦您再检测下。 如果命令行可以执行,为什么通过kettle就不行呢?