tiflash cache 导致查询异常缓慢

为提高效率,提问时请提供以下信息,问题描述清晰可优先响应。

  • 【TiDB 版本】:
    v4.0.5
  • 【问题描述】:
    通过 explain 发现走tiflash,查询巨慢,40s+
    大量日志
[2020/09/10 11:06:13.553 +08:00] [INFO] [coprocessor.go:928] ["[TIME_COP_PROCESS] resp_time:1.622336401s txnStartTS:419353633513537578 region_id:17168 store_addr:172.19.20.138:3930 backoff_ms:14288 backoff_types:[tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC]"] [conn=150]
[2020/09/10 11:06:13.675 +08:00] [WARN] [region_request.go:407] ["receive a grpc cancel signal from remote"] [error="rpc error: code = Canceled desc = context canceled"]
[2020/09/10 11:06:13.676 +08:00] [INFO] [region_cache.go:590] ["mark store's regions need be refill"] [store=172.19.20.138:3930]
[2020/09/10 11:06:13.676 +08:00] [INFO] [region_cache.go:609] ["switch region peer to next due to send request fail"] [conn=150] [current="region ID: 13011, meta: id:13011 start_key:\"t\\200\\000\\000\\000\\000\\000\\020}_r\\200\\000\\000\\000\\000\\007\\031\\237\" end_key:\"t\\200\\000\\000\\000\\000\\000\\020}_r\\200\\000\\000\\000\\000\\010\\255j\" region_epoch:<conf_ver:4 version:2062 > peers:<id:13012 store_id:1 > peers:<id:13013 store_id:2 > peers:<id:18067 store_id:68 is_learner:true > , peer: id:18067 store_id:68 is_learner:true , addr: 172.19.20.138:3930, idx: 0, reqStoreType: TiFlashOnly, runStoreType: tiflash"] [needReload=false] [error="rpc error: code = Canceled desc = context canceled"]
[2020/09/10 11:06:13.686 +08:00] [INFO] [region_cache.go:467] ["invalidate current region, because others failed on same store"] [region=12945] [store=172.19.20.139:3930]
[2020/09/10 11:06:13.686 +08:00] [INFO] [coprocessor.go:928] ["[TIME_COP_PROCESS] resp_time:1.521358194s txnStartTS:419353633513537578 region_id:12945 store_addr:172.19.20.139:3930 backoff_ms:14422 backoff_types:[tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC]"] [conn=150]
[2020/09/10 11:06:13.733 +08:00] [INFO] [region_cache.go:467] ["invalidate current region, because others failed on same store"] [region=16424] [store=172.19.20.138:3930]
[2020/09/10 11:06:13.733 +08:00] [INFO] [coprocessor.go:928] ["[TIME_COP_PROCESS] resp_time:1.426339308s txnStartTS:419353633513537578 region_id:16424 store_addr:172.19.20.138:3930 backoff_ms:14468 backoff_types:[tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC]"] [conn=150]
[2020/09/10 11:06:13.871 +08:00] [WARN] [region_request.go:407] ["receive a grpc cancel signal from remote"] [error="rpc error: code = Canceled desc = context canceled"]
[2020/09/10 11:06:13.871 +08:00] [INFO] [region_cache.go:590] ["mark store's regions need be refill"] [store=172.19.20.138:3930]
[2020/09/10 11:06:13.871 +08:00] [INFO] [region_cache.go:609] ["switch region peer to next due to send request fail"] [conn=150] [current="region ID: 13080, meta: id:13080 start_key:\"t\\200\\000\\000\\000\\000\\000\\020}_r\\200\\000\\000\\000\\000\\010\\255j\" end_key:\"t\\200\\000\\000\\000\\000\\000\\020}_r\\200\\000\\000\\000\\000\\nB\\213\" region_epoch:<conf_ver:4 version:2063 > peers:<id:13081 store_id:1 > peers:<id:13082 store_id:2 > peers:<id:18068 store_id:68 is_learner:true > , peer: id:18068 store_id:68 is_learner:true , addr: 172.19.20.138:3930, idx: 0, reqStoreType: TiFlashOnly, runStoreType: tiflash"] [needReload=false] [error="rpc error: code = Canceled desc = context canceled"]
[2020/09/10 11:06:13.892 +08:00] [INFO] [region_cache.go:467] ["invalidate current region, because others failed on same store"] [region=13257] [store=172.19.20.139:3930]
[2020/09/10 11:06:13.892 +08:00] [INFO] [coprocessor.go:928] ["[TIME_COP_PROCESS] resp_time:1.359374604s txnStartTS:419353633513537578 region_id:13257 store_addr:172.19.20.139:3930 backoff_ms:14625 backoff_types:[tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC]"] [conn=150]
[2020/09/10 11:06:13.943 +08:00] [WARN] [region_request.go:407] ["receive a grpc cancel signal from remote"] [error="rpc error: code = Canceled desc = context canceled"]

通过 SELECT * FROM information_schema.tiflash_replica; 查看,数据正常

执行语句为 select * from xxx limit 500 offset 1000;

目前已经关闭改db所对应的tiflash,恢复正常,响应时间 20ms 左右

重新开启后,依然报错

[2020/09/10 11:26:50.649 +08:00] [INFO] [coprocessor.go:928] ["[TIME_COP_PROCESS] resp_time:1.883329513s txnStartTS:419353950786420802 region_id:17420 store_addr:172.19.20.139:3930 backoff_ms:41003 backoff_types:[regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,regionMiss,tikvRPC,regionMiss,regionMiss,tikvRPC,regionMiss,regionMiss,tikvRPC,regionMiss,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,regionMiss,tikvRPC,regionMiss,tikvRPC]"] [conn=225]
[2020/09/10 11:26:50.649 +08:00] [WARN] [backoff.go:316] ["regionMiss backoffer.maxSleep 40000ms is exceeded, errors:\nregion_id:17420, region_ver:2073, store_type:tiflash, peer_addr:172.19.20.139:3930, error:epoch_not_match:<>  at 2020-09-10T11:26:48.265896402+08:00\nsend tikv request error: rpc error: code = Canceled desc = context canceled, ctx: region ID: 17420, meta: id:17420 start_key:\"t\\200\\000\\000\\000\\000\\000\\020}_r\\200\\000\\000\\000\\000\\032\\020\\226\" end_key:\"t\\200\\000\\000\\000\\000\\000\\020}_r\\200\\000\\000\\000\\000\\034\\013\\273\" region_epoch:<conf_ver:8 version:2073 > peers:<id:17421 store_id:1 > peers:<id:17422 store_id:2 > peers:<id:23019 store_id:18157 is_learner:true > , peer: id:23019 store_id:18157 is_learner:true , addr: 172.19.20.139:3930, idx: 0, reqStoreType: TiFlashOnly, runStoreType: tiflash, try next peer later at 2020-09-10T11:26:48.766190406+08:00\nregion_id:17420, region_ver:2073, store_type:tiflash, peer_addr:172.19.20.139:3930, error:epoch_not_match:<>  at 2020-09-10T11:26:50.649392364+08:00"]
[2020/09/10 11:26:50.744 +08:00] [INFO] [region_cache.go:467] ["invalidate current region, because others failed on same store"] [region=12945] [store=172.19.20.138:3930]
[2020/09/10 11:26:50.744 +08:00] [INFO] [coprocessor.go:928] ["[TIME_COP_PROCESS] resp_time:1.203336061s txnStartTS:419353950786420802 region_id:12945 store_addr:172.19.20.138:3930 backoff_ms:41098 backoff_types:[regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC]"] [conn=225]
[2020/09/10 11:26:50.744 +08:00] [WARN] [backoff.go:316] ["regionMiss backoffer.maxSleep 40000ms is exceeded, errors:\nregion_id:12945, region_ver:2061, store_type:tiflash, peer_addr:172.19.20.138:3930, error:epoch_not_match:<>  at 2020-09-10T11:26:49.040579365+08:00\nsend tikv request error: rpc error: code = Canceled desc = context canceled, ctx: region ID: 12945, meta: id:12945 start_key:\"t\\200\\000\\000\\000\\000\\000\\020}_r\\200\\000\\000\\000\\000\\005\\204\\344\" end_key:\"t\\200\\000\\000\\000\\000\\000\\020}_r\\200\\000\\000\\000\\000\\007\\031\\237\" region_epoch:<conf_ver:8 version:2061 > peers:<id:12946 store_id:1 > peers:<id:12947 store_id:2 > peers:<id:23009 store_id:68 is_learner:true > , peer: id:23009 store_id:68 is_learner:true , addr: 172.19.20.138:3930, idx: 0, reqStoreType: TiFlashOnly, runStoreType: tiflash, try next peer later at 2020-09-10T11:26:49.541135504+08:00\nregion_id:12945, region_ver:2061, store_type:tiflash, peer_addr:172.19.20.138:3930, error:epoch_not_match:<>  at 2020-09-10T11:26:50.744346859+08:00"]
[2020/09/10 11:26:50.981 +08:00] [INFO] [region_cache.go:467] ["invalidate current region, because others failed on same store"] [region=12759] [store=172.19.20.138:3930]
[2020/09/10 11:26:50.981 +08:00] [INFO] [coprocessor.go:928] ["[TIME_COP_PROCESS] resp_time:1.688309744s txnStartTS:419353950786420802 region_id:12759 store_addr:172.19.20.138:3930 backoff_ms:41337 backoff_types:[regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,regionMiss,regionMiss,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,regionMiss,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC]"] [conn=225]
[2020/09/10 11:26:50.981 +08:00] [WARN] [backoff.go:316] ["regionMiss backoffer.maxSleep 40000ms is exceeded, errors:\nregion_id:12759, region_ver:2061, store_type:tiflash, peer_addr:172.19.20.138:3930, error:epoch_not_match:<>  at 2020-09-10T11:26:48.7932128+08:00\nsend tikv request error: rpc error: code = Canceled desc = context canceled, ctx: region ID: 12759, meta: id:12759 start_key:\"t\\200\\000\\000\\000\\000\\000\\020}_r\" end_key:\"t\\200\\000\\000\\000\\000\\000\\020}_r\\200\\000\\000\\000\\000\\002[x\" region_epoch:<conf_ver:6 version:2061 > peers:<id:12760 store_id:1 > peers:<id:12761 store_id:2 > peers:<id:23007 store_id:68 is_learner:true > , peer: id:23007 store_id:68 is_learner:true , addr: 172.19.20.138:3930, idx: 0, reqStoreType: TiFlashOnly, runStoreType: tiflash, try next peer later at 2020-09-10T11:26:49.293741434+08:00\nregion_id:12759, region_ver:2061, store_type:tiflash, peer_addr:172.19.20.138:3930, error:epoch_not_match:<>  at 2020-09-10T11:26:50.981960744+08:00"]
[2020/09/10 11:26:51.083 +08:00] [INFO] [region_cache.go:467] ["invalidate current region, because others failed on same store"] [region=13011] [store=172.19.20.138:3930]
[2020/09/10 11:26:51.083 +08:00] [INFO] [coprocessor.go:928] ["[TIME_COP_PROCESS] resp_time:1.676328245s txnStartTS:419353950786420802 region_id:13011 store_addr:172.19.20.138:3930 backoff_ms:41436 backoff_types:[regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC]"] [conn=225]
[2020/09/10 11:26:51.083 +08:00] [WARN] [backoff.go:316] ["regionMiss backoffer.maxSleep 40000ms is exceeded, errors:\nregion_id:13011, region_ver:2062, store_type:tiflash, peer_addr:172.19.20.138:3930, error:epoch_not_match:<>  at 2020-09-10T11:26:48.906761904+08:00\nsend tikv request error: rpc error: code = Canceled desc = context canceled, ctx: region ID: 13011, meta: id:13011 start_key:\"t\\200\\000\\000\\000\\000\\000\\020}_r\\200\\000\\000\\000\\000\\007\\031\\237\" end_key:\"t\\200\\000\\000\\000\\000\\000\\020}_r\\200\\000\\000\\000\\000\\010\\255j\" region_epoch:<conf_ver:6 version:2062 > peers:<id:13012 store_id:1 > peers:<id:13013 store_id:2 > peers:<id:23010 store_id:68 is_learner:true > , peer: id:23010 store_id:68 is_learner:true , addr: 172.19.20.138:3930, idx: 0, reqStoreType: TiFlashOnly, runStoreType: tiflash, try next peer later at 2020-09-10T11:26:49.407319591+08:00\nregion_id:13011, region_ver:2062, store_type:tiflash, peer_addr:172.19.20.138:3930, error:epoch_not_match:<>  at 2020-09-10T11:26:51.083512728+08:00"]
[2020/09/10 11:26:51.266 +08:00] [INFO] [region_cache.go:467] ["invalidate current region, because others failed on same store"] [region=13128] [store=172.19.20.138:3930]

疑似与该问题有关 tidb 配置修改,pd dashboard 时间显示错误

enable-streaming: true

若提问为性能优化、故障排查类问题,请下载脚本运行。终端输出的打印结果,请务必全选并复制粘贴上传。

1)提供下在开启和关闭 tiflash 情况下的 select * from xxx limit 500 offset 1000; 分别对应的 explain analyze 信息。

2)提到的关闭该 DB 对应的 tiflash 恢复正常-> 这个操作是怎么做的?

3)给到的日志截图应该是 tidb.log 里面的信息,warning 信息是 TiDB 的重试机制,一般是因为 cache 信息过旧,如果影响很大,辛苦提供下 TiDB 监控面板的截图,具体操作:
(1)使用 chrome 浏览器,安装“Full Page Screen Capture”插件:https://chrome.google.com/webstore/detail/full-page-screen-capture/fdpohaocaechififmbbbbbknoalclacl

(2)展开grafana 监控的 “cluster-name-overview” 的所有 dashboard (先按 d 再按 E 可将所有 Rows 的 Panels 打开,需等待一段时间待页面加载完成)

(3)使用插件导出 pdf

1)
目前已无法提供,因影响业务,目前 tiflash 已移除。
关闭 tiflash 下,是走 tikv,tidb.log 中无异常日志。
开启 tiflash 下,是走 tiflash, tidb.log 中的异常日志参考如上。

2)ALTER TABLE xxx.yyy SET TIFLASH REPLICA 0;

3)导出时间3小时
screencapture-172-19-20-139-3000-d-eDbRZpnWk-fsk-overview-2020-09-10-12_31_13.pdf (9.3 MB)

cache 过旧,这个首次查询导致变慢可以接受,但实际上,在有 tiflash 的情况下,重复执行语句,发现每次都是相应的错误。

1)没有 explain analyze 信息不太好确认是哪里的问题,有个类似的帖子可以参考下:

2)关闭的姿势不正确,可以通过设置 session 和 instance 的查询 engine 来强制走 TiDB 查询,不用删掉 tiflash 上面的副本信息

3)监控里面看到有 2 个 TiKV 和 PD,建议 TiKV 至少三个节点,同时 PD 保证奇数个。

4)监控信息部分是 TiDB 面板部分,不是 overview,包含走 TiFlash 查询时候的信息,另外也可以提供 TiFlash 正常时的监控信息,监控信息获取方法如上介绍。

删除磁盘中原有的tiflash相关目录,并使用 tiup scale-out tiflash 节点

explain 结果参考如下,执行时 tidb 节点大量信息如截图2所示
世界执行时间如 DataGrap 中的读秒所示,需要 40s+,疑似本身 tidb 部分存在问题,explain 的结果看 tiflash 返回时间正常

改用 tikv 正常

请问使用 tiflash 查询耗时 41s?走 tikv 耗时多久?

猜想:

tiflash 从 explain 的结果看是 ms 级别,但是本身 tidb 里一堆 cache 输出,导致整体的执行时间是 40s +

tikv 整体执行时间 < 1s

这部分看起来像是 tiflash 返回结果后,tidb 又做了一堆查询。两边跟 tikv 之间 cache 相关的有些不对等。

多次查询的结果耗时都稳定复现吗?

是的,每次看 tidb 里都是那一堆日志。

查询时指定具体字段也是同样的现象,比如 select id

但是 select count(*) from xyz; 通过 explain 发现也是走 tiflash, 但是执行时间正常

查看对应的 tikv, tiflash 无明显大量异常日志。

你好,可以尝试一下执行下面这个语句么?(请忽视这个语句本身的语义,它只是作为实验帮助我们排除一些可能)

select count(*) from xyz limit 500

然后看下是否走到tiflash,以及执行时间。

谢谢。

count 正常

执行总时间也是正常么?没有出现40多秒的现象是么?

是的,count(*) 执行总时间正常。

只有 select * 或者 select column_name 这类语句,走 tiflash store,才会出现执行时间超长的现象。而且其执行时间跟 limit 无关,比如 limit 1,时间也是 40s 左右

与本帖子关联的帖子中,特性 tidb.enable-streaming 已经关闭且 reload 过。

tiflash scale-out 部署时,原有数据文件已经做过清理。

好的,谢谢提供的信息。我们内部再查一下,会尽快给您回复。

v4.0.6 问题依旧存在

可供参考的日志如下

==> tidb.log <==
[2020/09/16 09:11:59.181 +08:00] [WARN] [backoff.go:317] ["regionMiss backoffer.maxSleep 40000ms is exceeded, errors:\nregion_id:13395, region_ver:2068, store_type:tiflash, peer_addr:, error:epoch_not_match:<>  at 2020-09-16T09:11:57.248975101+08:00\nsend tikv request error: rpc error: code = Canceled desc = context canceled, ctx: region ID: 13395, meta: id:13395 start_key:\"t\\200\\000\\000\\000\\000\\000\\020}_r\\200\\000\\000\\000\\000\\020\\225\\036\" end_key:\"t\\200\\000\\000\\000\\000\\000\\020}_r\\200\\000\\000\\000\\000\\022*\\364\" region_epoch:<conf_ver:10 version:2068 > peers:<id:13396 store_id:1 > peers:<id:13397 store_id:2 > peers:<id:25148 store_id:25133 is_learner:true > , peer: id:25148 store_id:25133 is_learner:true , addr: 172.19.20.138:3930, idx: 0, reqStoreType: TiFlashOnly, runStoreType: tiflash, try next peer later at 2020-09-16T09:11:57.74960308+08:00\nregion_id:13395, region_ver:2068, store_type:tiflash, peer_addr:172.19.20.138:3930, error:epoch_not_match:<>  at 2020-09-16T09:11:59.181037929+08:00"]
[2020/09/16 09:11:59.181 +08:00] [INFO] [region_cache.go:467] ["invalidate current region, because others failed on same store"] [region=17168] [store=172.19.20.138:3930]
[2020/09/16 09:11:59.181 +08:00] [INFO] [coprocessor.go:926] ["[TIME_COP_PROCESS] resp_time:1.031305707s txnStartTS:419487725154468076 region_id:17168 store_addr:172.19.20.138:3930 backoff_ms:40717 backoff_types:[tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC]"] [conn=40876]
[2020/09/16 09:11:59.181 +08:00] [WARN] [backoff.go:317] ["regionMiss backoffer.maxSleep 40000ms is exceeded, errors:\nregion_id:17168, region_ver:2072, store_type:tiflash, peer_addr:172.19.20.138:3930, error:epoch_not_match:<>  at 2020-09-16T09:11:57.650499532+08:00\nsend tikv request error: rpc error: code = Canceled desc = context canceled, ctx: region ID: 17168, meta: id:17168 start_key:\"t\\200\\000\\000\\000\\000\\000\\020}_r\\200\\000\\000\\000\\000\\030\\177\\205\" end_key:\"t\\200\\000\\000\\000\\000\\000\\020}_r\\200\\000\\000\\000\\000\\032\\020\\226\" region_epoch:<conf_ver:8 version:2072 > peers:<id:17169 store_id:1 > peers:<id:17170 store_id:2 > peers:<id:25152 store_id:25133 is_learner:true > , peer: id:25152 store_id:25133 is_learner:true , addr: 172.19.20.138:3930, idx: 0, reqStoreType: TiFlashOnly, runStoreType: tiflash, try next peer later at 2020-09-16T09:11:58.150766942+08:00\nregion_id:17168, region_ver:2072, store_type:tiflash, peer_addr:172.19.20.138:3930, error:epoch_not_match:<>  at 2020-09-16T09:11:59.181948838+08:00"]
[2020/09/16 09:11:59.455 +08:00] [INFO] [region_cache.go:467] ["invalidate current region, because others failed on same store"] [region=17420] [store=172.19.20.138:3930]
[2020/09/16 09:11:59.455 +08:00] [INFO] [coprocessor.go:926] ["[TIME_COP_PROCESS] resp_time:1.038291276s txnStartTS:419487725154468076 region_id:17420 store_addr:172.19.20.138:3930 backoff_ms:40991 backoff_types:[tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,regionMiss,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC]"] [conn=40876]
[2020/09/16 09:11:59.455 +08:00] [WARN] [backoff.go:317] ["regionMiss backoffer.maxSleep 40000ms is exceeded, errors:\nregion_id:17420, region_ver:2073, store_type:tiflash, peer_addr:172.19.20.138:3930, error:epoch_not_match:<>  at 2020-09-16T09:11:57.916988278+08:00\nsend tikv request error: rpc error: code = Canceled desc = context canceled, ctx: region ID: 17420, meta: id:17420 start_key:\"t\\200\\000\\000\\000\\000\\000\\020}_r\\200\\000\\000\\000\\000\\032\\020\\226\" end_key:\"t\\200\\000\\000\\000\\000\\000\\020}_r\\200\\000\\000\\000\\000\\034\\013\\273\" region_epoch:<conf_ver:10 version:2073 > peers:<id:17421 store_id:1 > peers:<id:17422 store_id:2 > peers:<id:25138 store_id:25133 is_learner:true > , peer: id:25138 store_id:25133 is_learner:true , addr: 172.19.20.138:3930, idx: 0, reqStoreType: TiFlashOnly, runStoreType: tiflash, try next peer later at 2020-09-16T09:11:58.417620369+08:00\nregion_id:17420, region_ver:2073, store_type:tiflash, peer_addr:172.19.20.138:3930, error:epoch_not_match:<>  at 2020-09-16T09:11:59.455806017+08:00"]
[2020/09/16 09:11:59.555 +08:00] [INFO] [region_cache.go:467] ["invalidate current region, because others failed on same store"] [region=13323] [store=172.19.20.139:3930]
[2020/09/16 09:11:59.556 +08:00] [INFO] [coprocessor.go:926] ["[TIME_COP_PROCESS] resp_time:1.657290717s txnStartTS:419487725154468076 region_id:13323 store_addr:172.19.20.139:3930 backoff_ms:41081 backoff_types:[tikvRPC,regionMiss,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC,regionMiss,tikvRPC]"] [conn=40876]
[2020/09/16 09:11:59.556 +08:00] [WARN] [backoff.go:317] ["regionMiss backoffer.maxSleep 40000ms is exceeded, errors:\nregion_id:13323, region_ver:2067, store_type:tiflash, peer_addr:172.19.20.139:3930, error:epoch_not_match:<>  at 2020-09-16T09:11:57.398214542+08:00\nsend tikv request error: rpc error: code = Canceled desc = context canceled, ctx: region ID: 13323, meta: id:13323 start_key:\"t\\200\\000\\000\\000\\000\\000\\020}_r\\200\\000\\000\\000\\000\\016\\377:\" end_key:\"t\\200\\000\\000\\000\\000\\000\\020}_r\\200\\000\\000\\000\\000\\020\\225\\036\" region_epoch:<conf_ver:10 version:2067 > peers:<id:13324 store_id:1 > peers:<id:13325 store_id:2 > peers:<id:25147 store_id:25132 is_learner:true > , peer: id:25147 store_id:25132 is_learner:true , addr: 172.19.20.139:3930, idx: 0, reqStoreType: TiFlashOnly, runStoreType: tiflash, try next peer later at 2020-09-16T09:11:57.898871275+08:00\nregion_id:13323, region_ver:2067, store_type:tiflash, peer_addr:172.19.20.139:3930, error:epoch_not_match:<>  at 2020-09-16T09:11:59.556059181+08:00"]

==> tidb_slow_query.log <==
# Plan_from_cache: false
# Has_more_results: false
# KV_total: 4.531818309
# PD_total: 0.138581784
# Backoff_total: 606.082
# Write_sql_response_total: 0.000002055
# Succ: true
# Plan: tidb_decode_plan('DjQwCTBfMAkwCTAJTi9BCg==')
use prod_fsk_workorder;
/* ApplicationName=DataGrip 2020.2.2 */ explain analyze SELECT   /*+ read_from_storage(tiflash[w]) */   * FROM   prod_fsk_workorder.wo_workorder_out w LIMIT   1;

==> tidb_stderr.log <==
{"level":"warn","ts":"2020-09-02T22:01:40.746+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-6cc46c6c-9522-4593-ba94-9a1b7a5156d0/172.19.20.138:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2020-09-02T22:01:46.751+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-6cc46c6c-9522-4593-ba94-9a1b7a5156d0/172.19.20.138:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2020-09-02T22:01:52.751+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-6cc46c6c-9522-4593-ba94-9a1b7a5156d0/172.19.20.138:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2020-09-08T02:16:50.444+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-4b0c5c1d-1f4c-43a5-a0d6-abea1a6996e2/172.19.20.138:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2020-09-08T02:16:33.840+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-4b0c5c1d-1f4c-43a5-a0d6-abea1a6996e2/172.19.20.138:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2020-09-08T02:18:30.839+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-4b0c5c1d-1f4c-43a5-a0d6-abea1a6996e2/172.19.20.138:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2020-09-08T02:18:40.035+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-4b0c5c1d-1f4c-43a5-a0d6-abea1a6996e2/172.19.20.138:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2020-09-08T02:18:50.736+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-4b0c5c1d-1f4c-43a5-a0d6-abea1a6996e2/172.19.20.138:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
{"level":"warn","ts":"2020-09-08T02:18:57.339+0800","caller":"clientv3/retry_interceptor.go:61","msg":"retrying of unary invoker failed","target":"endpoint://client-4b0c5c1d-1f4c-43a5-a0d6-abea1a6996e2/172.19.20.138:2379","attempt":0,"error":"rpc error: code = Unknown desc = context deadline exceeded"}

您好,这个问题我们昨天找到原因并且在 master 分枝上提交了修复,见这个 PR

这个 fix 会在之后 backport 到 4.0 版本并且在下一个小版本(4.0.7,预计 9 月 29 日发布)中带入。

这个问题会影响 limit 语句,其他类型的语句不受影响。如果您这边对这个问题有比较迫切的需求,我们可以给您提供一个 hotfix 包;如果不紧急的话,可以等待我们下一个小版本发布。

谢谢。

那么逻辑上,该系统查询的 SQL 也是受此影响的吧?跟 tikv/tiflash 本身没有关系

是的,图中的 query 执行很慢也是同一个问题导致的。

另外如您所说,这个问题跟 tikv/tiflash 本身也没有关系,只不过我们观察下来 tiflash 相对更容易暴露这个问题,我们本地也重现出了 tikv 上同样的问题。