dumpling estimate count

h5n1 · November 3, 2021 07:18

[Version] v5.2.1
The dump export prints ["skip concurrent dump due to estimate count < rows"] ["estimate count"=10000] [conf.rows=200000]. How is the estimate count calculated here? Is 10000 the default when there are no statistics?

[2021/11/03 14:56:21.934 +08:00] [INFO] [dump.go:490] ["get estimated rows count"] [database=d_xxx] [table=TF_F_xxxx] [estimateCount=10000]
[2021/11/03 14:56:21.934 +08:00] [WARN] [dump.go:496] ["skip concurrent dump due to estimate count < rows"] ["estimate count"=10000] [conf.rows=200000] [database=d_xxxx] [table=TF_F_xxxx]
[2021/11/03 14:56:21.934 +08:00] [INFO] [dump.go:448] ["didn't build tidb concat sqls, will select all from table now"] [database=d_xxxx] [table=TF_Fxxxxxx]
[2021/11/03 14:56:21.935 +08:00] [WARN] [writer.go:230] ["no data written in table chunk"] [database=d_xxxx] [table=TF_F_xxxx] [chunkIdx=0]

这道题我不会 · November 3, 2021 13:52

Judging from the code, the estimate count comes from the estRows in the execution plan produced by explain select xxx. If the statistics are inaccurate, this value can be far off.
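
As a rough illustration of that logic (a sketch, not Dumpling's actual code): take the estimate from the first row of the EXPLAIN result and compare it with the rows setting, as the "skip concurrent dump due to estimate count < rows" log line suggests. The function names, the column list, and the sample plan rows are all made up for this example. An estimate of 10000 would be consistent with TiDB falling back to pseudo statistics for a table that has none, but nothing in this thread confirms that.

```python
from typing import Mapping, Sequence

# Columns that may hold the row estimate (assumption: "estRows" on recent
# TiDB versions, "rows" on MySQL).
ESTIMATE_COLUMNS = ("estRows", "rows")


def estimate_count(explain_rows: Sequence[Mapping[str, str]]) -> int:
    """Return the row estimate found in the first row of an EXPLAIN result."""
    if not explain_rows:
        return 0
    first = explain_rows[0]
    for column in ESTIMATE_COLUMNS:
        value = first.get(column)
        if value is not None:
            # TiDB prints estRows as a decimal string such as "10000.00".
            return int(float(value))
    return 0


def should_dump_concurrently(estimate: int, conf_rows: int) -> bool:
    """Split the table into chunks only when the estimate reaches conf.rows."""
    return estimate >= conf_rows


# Hypothetical EXPLAIN output, already fetched as one dict per plan row.
plan = [
    {"id": "TableReader_5", "estRows": "10000.00", "task": "root"},
    {"id": "TableFullScan_4", "estRows": "10000.00", "task": "cop[tikv]"},
]

estimate = estimate_count(plan)
print(estimate, should_dump_concurrently(estimate, conf_rows=200000))
```

With the values from the log above (estimate 10000, conf.rows 200000) the check fails, so the table would be exported with a single select instead of being split into chunks.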

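h5n1 · November 4, 2021 00:05

Why is explain used here to get the row count? Does it serve any other purpose? Wouldn't querying the statistics directly be faster and use fewer resources?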

HHHHHHULK · November 4, 2021 01:17

The estRows returned by explain is based on the statistics, isn't it?

h5n1 · November 4, 2021 01:30

It is based on the statistics, but why does dumpling use explain to get it? Wouldn't querying the statistics directly be better?
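
For comparison, here is a minimal sketch of the two statements being discussed: an EXPLAIN on the select, and a direct statistics lookup through TiDB's SHOW STATS_META. The helper names and the sample filter are hypothetical, and this is not taken from Dumpling's code. One visible difference is that the EXPLAIN form can carry a WHERE filter, while SHOW STATS_META reports the whole table's Row_count. Whether that played any part in the design is not confirmed anywhere in this thread.

```python
import re
from typing import Optional

# Only plain identifiers are accepted, so the names can be embedded safely.
_NAME = re.compile(r"[A-Za-z0-9_]+")


def _checked(name: str) -> str:
    """Reject anything that is not a plain identifier."""
    if not _NAME.fullmatch(name):
        raise ValueError(f"unsupported identifier: {name!r}")
    return name


def explain_query(database: str, table: str, where: Optional[str] = None) -> str:
    """Build the EXPLAIN statement whose plan carries the estRows column."""
    sql = f"EXPLAIN SELECT * FROM `{_checked(database)}`.`{_checked(table)}`"
    if where:
        # The filter text is assumed to be trusted input in this sketch.
        sql += f" WHERE {where}"
    return sql


def stats_meta_query(database: str, table: str) -> str:
    """Build a direct statistics lookup (TiDB only)."""
    return (
        "SHOW STATS_META "
        f"WHERE db_name = '{_checked(database)}' "
        f"AND table_name = '{_checked(table)}'"
    )


print(explain_query("d_xxx", "TF_F_xxxx"))
print(explain_query("d_xxx", "TF_F_xxxx", where="id > 100"))
print(stats_meta_query("d_xxx", "TF_F_xxxx"))
```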

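这道题我不会 · November 5, 2021 02:07

I'm not sure what design considerations the developers had here. As I understand it, though, over the whole Dumpling export the cost of getting table statistics should be negligible compared with the cost of scanning regions.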
