TiDB Bug List

Billmay表妹 · July 25, 2024, 02:08

Issue

github.com/pingcap/tidb

planner, txn: `select ... for update` using Plan Cache can not lock data correctly in some cases

opened 03:41AM - 16 Jul 24 UTC

closed 03:18AM - 17 Jul 24 UTC

qw4990

type/bug sig/planner sig/transaction severity/critical affects-6.1 affects-6.5 affects-7.1 affects-7.5 affects-8.1

## Bug Report

Please answer these questions before submitting your issue. Tha…nks!

### 1. Minimal reproduce step (Required)

```
mysql> select @@autocommit; -- enable autocommit
+--------------+
| @@autocommit |
+--------------+
| 1 |
+--------------+
create table t (pk int, a int, primary key(pk)); -- create a table with PK
prepare st from 'select * from t where pk=? for update'; -- prepare a PointPlan statement
set @pk=1;
execute st using @pk; -- execute this statement to generate a PointPlan cached in Plan Cache
-- plan of this exec-statement, Lock operations for "for update" are optimized by auto-commit
+-------------+---------+---------+------+---------------+------------------------------------------------------------+---------------+--------+------+
| id | estRows | actRows | task | access object | execution info | operator info | memory | disk |
+-------------+---------+---------+------+---------------+------------------------------------------------------------+---------------+--------+------+
| Point_Get_1 | 1.00 | 0 | root | table:t | time:94.1µs, loops:1, Get:{num_rpc:1, total_time:42.5µs} | handle:2 | N/A | N/A |
+-------------+---------+---------+------+---------------+------------------------------------------------------------+---------------+--------+------+
begin;
set @pk=1;
execute st using @pk; -- the optimizer decided to reuse the prior PointPlan, which is incorrect.
mysql> select @@last_plan_from_cache;
+------------------------+
| @@last_plan_from_cache |
+------------------------+
| 1 |
+------------------------+
```

Reusing this PointPlan without Lock in the second exec-statement can cause wrong results. The correct plan for the second exec-statement should have Lock operations:

```
+-------------+---------+---------+------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+--------+------+
| id | estRows | actRows | task | access object | execution info | operator info | memory | disk |
+-------------+---------+---------+------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+--------+------+
| Point_Get_1 | 1.00 | 0 | root | table:t | time:1.74ms, loops:1, lock_keys: {time:1.69ms, region:1, keys:1, slowest_rpc: {total: 0.000s, region_id: 93, store: store1, }, lock_rpc:165µs, rpc_count:1} | handle:1, lock | N/A | N/A |
+-------------+---------+---------+------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+--------+------+
```

### 2. What did you expect to see? (Required)

Shouldn't reuse the first PointPlan for the second exec-statement and the second exec-statement's plan should have Lock operations:

```
+-------------+---------+---------+------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+--------+------+
| id | estRows | actRows | task | access object | execution info | operator info | memory | disk |
+-------------+---------+---------+------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+--------+------+
| Point_Get_1 | 1.00 | 0 | root | table:t | time:1.74ms, loops:1, lock_keys: {time:1.69ms, region:1, keys:1, slowest_rpc: {total: 0.000s, region_id: 93, store: store1, }, lock_rpc:165µs, rpc_count:1} | handle:1, lock | N/A | N/A |
+-------------+---------+---------+------+---------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+--------+------+
```

### 3. What did you see instead (Required)

The second exec-statement's plan has no Lock:

```
+-------------+---------+---------+------+---------------+-------------------------------------------------------------+---------------+--------+------+
| id | estRows | actRows | task | access object | execution info | operator info | memory | disk |
+-------------+---------+---------+------+---------------+-------------------------------------------------------------+---------------+--------+------+
| Point_Get_1 | 1.00 | 0 | root | table:t | time:123.7µs, loops:1, Get:{num_rpc:1, total_time:63.3µs} | handle:1 | N/A | N/A |
+-------------+---------+---------+------+---------------+-------------------------------------------------------------+---------------+--------+------+
```

### 4. What is your TiDB version? (Required)

Master

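Problem description

Because the LOCK semantics are not enforced, this issue can cause concurrency anomalies such as lost updates, leaving concurrent transactions with incorrect write results or lost writes.

Root cause analysis

As the example below shows, when a "select for update" statement is executed outside a transaction in autocommit mode, the "for update" semantics do not take effect (see item 5 of https://docs.pingcap.com/zh/tidb/stable/pessimistic-transaction#和-mysql-innodb-的差异), so the generated execution plan carries no lock semantics (see the Point_Get_1 operator in the example below).

When the session later opens an explicit transaction and executes the same "select for update" statement, the earlier execution plan is reused. The plan without LOCK semantics is therefore used, the LOCK semantics are lost, and the "select for update" statement inside the transaction runs with plain "select" semantics.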

mysql> select @@autocommit; -- enable autocommit
+--------------+
| @@autocommit |
+--------------+
| 1 |
+--------------+
create table t (pk int, a int, primary key(pk)); -- create a table with PK
prepare st from 'select * from t where pk=? for update'; -- prepare a PointPlan statement
set @pk=1;
execute st using @pk; -- execute this statement to generate a PointPlan cached in Plan Cache
-- plan of this exec-statement, Lock operations for "for update" are optimized by auto-commit
+-------------+---------+---------+------+---------------+------------------------------------------------------------+---------------+--------+------+
| id | estRows | actRows | task | access object | execution info | operator info | memory | disk |
+-------------+---------+---------+------+---------------+------------------------------------------------------------+---------------+--------+------+
| Point_Get_1 | 1.00 | 0 | root | table:t | time:94.1µs, loops:1, Get:{num_rpc:1, total_time:42.5µs} | handle:2 | N/A | N/A |
+-------------+---------+---------+------+---------------+------------------------------------------------------------+---------------+--------+------+
begin;
set @pk=1;
execute st using @pk; -- the optimizer decided to reuse the prior PointPlan, which is incorrect.
mysql> select @@last_plan_from_cache;
+------------------------+
| @@last_plan_from_cache |
+------------------------+
| 1 |
+------------------------+

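Diagnosis

Follow steps similar to the example above and check whether the execution plan of the later "select for update" statement was taken from the plan cache, and whether its operator info contains the lock flag.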
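The check described here can be written down as a small helper. This is only an illustrative sketch, not TiDB code: the function name `lock_missing` and its arguments are made up for the example, and the caller is assumed to have copied the value of `@@last_plan_from_cache` and the `operator info` text of the `Point_Get` row by hand from the session.

```python
def lock_missing(last_plan_from_cache: int, operator_info: str, in_explicit_txn: bool) -> bool:
    """Return True when a `select ... for update` executed inside an explicit
    transaction reused a cached plan whose operator info has no `lock` flag."""
    # operator info looks like "handle:1" or "handle:1, lock"
    flags = [part.strip() for part in operator_info.split(",")]
    has_lock = "lock" in flags
    return in_explicit_txn and last_plan_from_cache == 1 and not has_lock


# The second execution from the example: inside `begin`, plan from cache, no lock flag.
suspicious = lock_missing(1, "handle:1", in_explicit_txn=True)
# The plan the issue expects instead: operator info carries the lock flag.
healthy = lock_missing(1, "handle:1, lock", in_explicit_txn=True)
```

A `True` result only says that the observed values match the symptom described in this post; it does not replace checking the TiDB version against the affected list.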

Affected versions

v6.1.0 - v6.1.7

v6.5.0 - v6.5.10

v7.1.0 - v7.1.5

v7.5.0 - v7.5.2

v8.1.0
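For a fleet of clusters, the ranges listed here can be checked mechanically. The sketch below simply encodes this list; the names `AFFECTED_PATCH_RANGES` and `is_affected` are hypothetical, and a `False` result for a release series that the list does not mention means only that the list says nothing about it.

```python
# (major, minor) -> (first affected patch, last affected patch), copied from the list above
AFFECTED_PATCH_RANGES = {
    (6, 1): (0, 7),
    (6, 5): (0, 10),
    (7, 1): (0, 5),
    (7, 5): (0, 2),
    (8, 1): (0, 0),
}


def is_affected(version: str) -> bool:
    """Check a version string such as 'v7.5.2' against the ranges above."""
    major, minor, patch = (int(part) for part in version.lstrip("v").split("."))
    bounds = AFFECTED_PATCH_RANGES.get((major, minor))
    return bounds is not None and bounds[0] <= patch <= bounds[1]
```

For example, `is_affected("v7.5.2")` is true while `is_affected("v7.5.3")` is false.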

Fixed versions

Fix PR:

github.com/pingcap/tidb

planner: fix the issue of reusing wrong point-plan for "select ... for update"

pingcap:master ← qw4990:fix-54652

opened 08:06AM - 16 Jul 24 UTC

qw4990

+54 -9

### What problem does this PR solve?

Issue Number: close #54652

Problem Summary: planner: fix the issue of reusing wrong point-plan for "select ... for update"

### What changed and how does it work?

Encode more txn state into the plan cache key, and check whether the key has changed before reusing point-get plans.

### Check List

Tests

- [x] Unit test
- [ ] Integration test
- [ ] Manual test (add detailed scripts or steps below)
- [ ] No need to test
  > - [ ] I checked and no code files have been changed.

Side effects

- [ ] Performance regression: Consumes more CPU
- [ ] Performance regression: Consumes more Memory
- [ ] Breaking backward compatibility

Documentation

- [ ] Affects user behaviors
- [ ] Contains syntax changes
- [ ] Contains variable changes
- [ ] Contains experimental features
- [ ] Changes MySQL compatibility

### Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

```release-note
None
```

Fixed in: v6.1.8, v6.5.11, v7.1.6, v7.5.3, v8.1.1

Workaround

There are two options:

DB side: disable the prepared plan cache by setting "tidb_enable_prepared_plan_cache" to OFF (https://docs.pingcap.com/tidb/stable/system-variables#tidb_enable_prepared_plan_cache-new-in-v610). The setting only takes effect for new sessions, so you need to restart tidb-server or re-establish the TiDB connections.
Application side: avoid executing "select for update" statements in autocommit mode.

Hacker_xUwtuKxa · August 6, 2024, 15:22

Regarding the BR log backup critical bug: is the 6.5.10 in the solution the database version or the BR version?
If I use BR 6.5.10 to back up a TiDB 6.5.7 database, will this problem occur?

Billmay表妹 · August 15, 2024, 08:53

The BR version needs to match the TiDB version.

Billmay表妹 · August 15, 2024, 08:53

[Critical bug] TiDB parallel aggregation spill may produce wrong results
Issue

github.com/pingcap/tidb

hash agg parallel spill get wrong result

opened 05:32AM - 08 Aug 24 UTC

closed 06:46AM - 12 Aug 24 UTC

windtalker

type/bug sig/execution severity/critical affects-8.0 affects-8.1 impact/wrong-result affects-8.2

## Bug Report

Please answer these questions before submitting your issue. Tha…nks!

### 1. Minimal reproduce step (Required)

for a simple query like

```
select count(*), id from t group by id
```

if parallel spill is triggered, it will return wrong result.

### 2. What did you expect to see? (Required)

### 3. What did you see instead (Required)

### 4. What is your TiDB version? (Required)

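For the hash aggregation operator in parallel mode (the hash aggregation operator runs in parallel mode by default), if spill is triggered, certain aggregation functions (count and avg) return wrong results.

Root cause
The parallel spill implementation of the TiDB aggregation operator has a bug that makes the final result wrong in the following two cases:

The hash aggregation operator contains a count function.

The hash aggregation operator is not pushed down to TiKV/TiFlash and contains an avg function.

Diagnostic steps
Determine whether the hash aggregation operator triggered spill in parallel mode.

This is judged mainly from the explain analyze result. For a hash aggregation operator in that result, a partial_worker or final_worker concurrency greater than 1 means the aggregation runs in parallel mode, and a disk usage other than N/A means the hash aggregation triggered spill.

Take the explain analyze result below as an example. The hash aggregation in this query triggered parallel spill because:

The concurrency of partial_worker and final_worker is 5.

The disk usage of the hash aggregation operator is 1.85 GB.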

mysql> explain analyze select count(*), value from spill_test group by value having count(*) > 0;
+----+---------+---------+------+---------------+----------------+---------------+--------+------+
| id | estRows | actRows | task | access object | execution info | operator info | memory | disk |
+----+---------+---------+------+---------------+----------------+---------------+--------+------+
| Selection_7 | 6848921.60 | 0 | root | | time:2m33.6s, loops:1, RU:44821.700304 | gt(Column#4, 0) | 61.7 KB | N/A |
| └─HashAgg_12 | 8561152.00 | 17099136 | root | | time:2m33.3s, loops:16702, partial_worker:{wall_time:15.445530168s, concurrency:5, task_num:625, tot_wait:32.672876439s, tot_exec:44.553148301s, tot_time:1m17.22698924s, max:15.445404668s, p95:15.445404668s}, final_worker:{wall_time:2m33.608198045s, concurrency:5, task_num:5, tot_wait:2.8µs, tot_exec:0s, tot_time:12m45.060517323s, max:2m33.608064545s, p95:2m33.608064545s} | group by:test.spill_test.value, funcs:count(Column#7)->Column#4, funcs:firstrow(test.spill_test.value)->test.spill_test.value | 3.24 GB | 1.85 GB |
|   └─TableReader_13 | 8561152.00 | 17099216 | root | | time:3.07s, loops:626, cop_task: {num: 625, max: 632.4ms, min: 1.04ms, avg: 135.4ms, p95: 444.5ms, max_proc_keys: 51200, p95_proc_keys: 51200, tot_proc: 1m12.3s, tot_wait: 252ms, copr_cache_hit_ratio: 0.00, build_task_duration: 48.8µs, max_distsql_concurrency: 15}, rpc_info:{Cop:{num_rpc:625, total_time:1m24.6s}} | data:HashAgg_8 | 40.3 MB | N/A |
|     └─HashAgg_8 | 8561152.00 | 17099216 | cop[tikv] | | tikv_task:{proc max:620ms, min:0s, avg: 123.8ms, p80:240ms, p95:400ms, iters:16740, tasks:625}, scan_detail: {total_process_keys: 17122304, total_process_keys_size: 1338451200, total_keys: 17122929, get_snapshot_time: 68.6ms, rocksdb: {key_skipped_count: 17122304, block: {cache_hit_count: 4366, read_count: 47582, read_byte: 126.9 MB, read_time: 28.3s}}}, time_detail: {total_process_time: 1m12.3s, total_suspend_time: 7.42s, total_wait_time: 252ms, total_kv_read_wall_time: 52.1s, tikv_wall_time: 1m20.8s} | group by:test.spill_test.value, funcs:count(1)->Column#7 | N/A | N/A |
|       └─TableFullScan_11 | 17122304.00 | 17122304 | cop[tikv] | table:spill_test | tikv_task:{proc max:450ms, min:0s, avg: 83.4ms, p80:160ms, p95:300ms, iters:16740, tasks:625} | keep order:false | N/A | N/A |
+----+---------+---------+------+---------------+----------------+---------------+--------+------+
5 rows in set (2 min 33.87 sec)
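The two conditions used for this judgment (worker concurrency greater than 1, disk usage other than N/A) can be applied to a HashAgg row programmatically. The following is a rough sketch: `parallel_spill_triggered` is a hypothetical helper, and it assumes the caller passes the `execution info` and `disk` cells of the root HashAgg row as plain strings in the format shown in the output above.

```python
import re

# Matches the concurrency value inside partial_worker:{...} or final_worker:{...}
WORKER_CONCURRENCY = re.compile(r"(?:partial|final)_worker:\{[^}]*?concurrency:(\d+)")


def parallel_spill_triggered(execution_info: str, disk: str) -> bool:
    """True if the HashAgg ran in parallel mode and spilled to disk."""
    concurrencies = [int(value) for value in WORKER_CONCURRENCY.findall(execution_info)]
    parallel = any(value > 1 for value in concurrencies)
    spilled = disk.strip() != "N/A"
    return parallel and spilled


# Abbreviated execution info of HashAgg_12 from the output above.
info = ("time:2m33.3s, loops:16702, "
        "partial_worker:{wall_time:15.445530168s, concurrency:5, task_num:625}, "
        "final_worker:{wall_time:2m33.608198045s, concurrency:5, task_num:5}")
hit = parallel_spill_triggered(info, "1.85 GB")
```

The pattern is anchored on `partial_worker` and `final_worker` so that unrelated fields such as `max_distsql_concurrency` on the TableReader row are not picked up by mistake.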
Determine whether the hash aggregation is pushed down to TiKV/TiFlash.

This is judged mainly from the query plan. A hash aggregation that is pushed down to TiKV/TiFlash is converted into a two-stage hash aggregation: one hash aggregation operator has the task type cop[tikv]/cop[tiflash]/batchcop[tiflash]/mpp[tiflash], and the other has the task type root.

Take the plans of the following two queries as an example: in plan 1 the hash aggregation is pushed down to TiKV, and in plan 2 it is not pushed down to TiKV.

plan 1

mysql> explain select count(*), value from spill_test group by value having count(*) > 0;
+----+---------+------+---------------+---------------+
| id | estRows | task | access object | operator info |
+----+---------+------+---------------+---------------+
| Selection_7 | 13697843.20 | root | | gt(Column#4, 0) |
| └─HashAgg_10 | 17122304.00 | root | | group by:test.spill_test.value, funcs:count(1)->Column#4, funcs:firstrow(test.spill_test.value)->test.spill_test.value |
|   └─TableReader_15 | 17122304.00 | root | | data:TableFullScan_14 |
|     └─TableFullScan_14 | 17122304.00 | cop[tikv] | table:spill_test | keep order:false |
+----+---------+------+---------------+---------------+
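The two-stage rule (one HashAgg on a storage-side task, one on root) can be expressed as a short check over the `id` and `task` columns of a plan. This is a sketch under assumptions: `hashagg_pushed_down` is a made-up name, and the input rows are typed in by hand from the `explain analyze` output earlier in this post and from the plan directly above.

```python
def hashagg_pushed_down(plan_rows) -> bool:
    """plan_rows: iterable of (operator id, task) pairs taken from EXPLAIN output."""
    tasks = [task.lower() for op_id, task in plan_rows if "HashAgg" in op_id]
    on_root = any(task == "root" for task in tasks)
    on_storage = any(task.startswith(("cop[", "batchcop[", "mpp[")) for task in tasks)
    return on_root and on_storage


# Rows from the explain analyze output: HashAgg_12 on root, HashAgg_8 on cop[tikv].
analyzed_plan = [
    ("Selection_7", "root"),
    ("HashAgg_12", "root"),
    ("TableReader_13", "root"),
    ("HashAgg_8", "cop[tikv]"),
    ("TableFullScan_11", "cop[tikv]"),
]
# Rows from the plan directly above: the only HashAgg runs on root.
plan_above = [
    ("Selection_7", "root"),
    ("HashAgg_10", "root"),
    ("TableReader_15", "root"),
    ("TableFullScan_14", "cop[tikv]"),
]

pushed = hashagg_pushed_down(analyzed_plan)
not_pushed = hashagg_pushed_down(plan_above)
```

Here `pushed` is true and `not_pushed` is false, because only the first plan has a HashAgg operator on a `cop[tikv]` task in addition to the one on root.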
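Check whether the hash aggregation operator contains aggregation functions that may output wrong results.

When the hash aggregation triggers parallel spill:

If the hash aggregation is pushed down to TiKV/TiFlash: the count function returns wrong results.

If the hash aggregation is not pushed down to TiKV/TiFlash: both the count and avg functions return wrong results.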
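Putting the three diagnostic steps together gives a simple decision rule. The sketch below restates that rule in code; `functions_at_risk` and `query_at_risk` are hypothetical names, and the inputs are assumed to come from the manual checks described in the previous steps.

```python
def functions_at_risk(pushed_down: bool) -> set:
    """Aggregation functions that may return wrong results under parallel spill."""
    return {"count"} if pushed_down else {"count", "avg"}


def query_at_risk(parallel_spill: bool, pushed_down: bool, agg_funcs) -> bool:
    """agg_funcs: names of the aggregation functions used by the HashAgg operator."""
    if not parallel_spill:
        return False
    used = {name.lower() for name in agg_funcs}
    return bool(used & functions_at_risk(pushed_down))
```

For example, a query that only uses `avg` is flagged when the aggregation stays on TiDB (`query_at_risk(True, False, ["avg"])`) but not when it is pushed down (`query_at_risk(True, True, ["avg"])`).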

Resolution
In v8.1.0, there is no fix other than disabling parallel spill for hash aggregation.

This bug will be fixed in v8.1.1.

Workaround
Disable the parallel spill feature of hash aggregation:

set tidb_enable_parallel_hashagg_spill=0;

Affected versions
v8.0.0

v8.1.0

v8.2.0

Fixed versions
v8.1.1

v8.3.0

Additional notes
Parallel hash agg spill is not GA in v8.1.0, so enabling it by default in production on v8.1.x clusters is not recommended.

In v8.1.1 the default value of tidb_enable_parallel_hashagg_spill will be changed to off, so parallel hash agg spill is disabled by default on newly deployed v8.1.x (x != 0) clusters. Clusters upgraded from v8.0.0/v8.1.0 to v8.1.x, however, keep their existing value of tidb_enable_parallel_hashagg_spill (on by default); you can explicitly disable parallel hash agg spill with the following command:

set @@global.tidb_enable_parallel_hashagg_spill=0;

Billmay表妹 · November 22, 2024, 02:49

Issue

https://github.com/pingcap/tiflow/issues/11744

Root Cause

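Redo log saves data and metadata to external storage. However, the redo data-processing module has some bugs in its error handling. When a network partition occurs between the TiCDC cluster and the redo log external storage, the following can happen:

If writing the meta fails, the changefeed restarts and tries to resume replication. This is expected behavior.
If writing the meta succeeds but writing the data fails, the changefeed keeps running normally even though the redo data-processing module has already stopped. This is unexpected behavior: it effectively disables the redo log feature and can lead to data inconsistency in a disaster scenario.

Note: whether these problems occur depends on how long the network failure lasts. Because the exponential backoff mechanism of the redo log module adds random jitter, file writes fail probabilistically only when the failure lasts approximately as long as the retry module's timeout (5 minutes).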
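To see why the outcome is only probabilistic near the timeout, here is a toy model of exponential backoff with jitter. It is not TiCDC code: apart from the 5-minute timeout mentioned in the note, every name and parameter (`last_retry_time`, `write_survives`, the 1 s base delay, the 30 s cap, the 50% to 100% jitter range) is an assumption chosen for illustration.

```python
import random


def last_retry_time(timeout_s: float, base_s: float, cap_s: float, rng: random.Random) -> float:
    """Seconds after the first failure at which the last retry before the timeout starts."""
    elapsed, delay, last = 0.0, base_s, 0.0
    while True:
        # exponential backoff with jitter: wait between 50% and 100% of the current delay
        wait = delay * rng.uniform(0.5, 1.0)
        if elapsed + wait >= timeout_s:
            return last  # the next retry would start after the timeout, so give up
        elapsed += wait
        last = elapsed
        delay = min(delay * 2, cap_s)


def write_survives(outage_s: float, seed: int, timeout_s: float = 300.0,
                   base_s: float = 1.0, cap_s: float = 30.0) -> bool:
    """A write survives if at least one retry starts after the outage has ended."""
    rng = random.Random(seed)
    return last_retry_time(timeout_s, base_s, cap_s, rng) > outage_s


# Fraction of simulated writers that outlive an outage of the given length.
def survival_rate(outage_s: float, trials: int = 200) -> float:
    return sum(write_survives(outage_s, seed) for seed in range(trials)) / trials
```

In this model every writer survives a short outage and none survives one that lasts the full timeout, while for outages slightly shorter than the timeout the result depends on the jitter each writer happened to draw. That matches the note: two independent writes (meta and data) can end up on different sides only in that narrow window.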

Diagnostic Steps

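Deploy TiDB (upstream) + TiCDC + TiDB/MySQL (downstream).
Create a changefeed with the following setting in the changefeed configuration:

[consistent]

level = "eventual"

Inject a network partition of about 5 minutes between the TiCDC cluster and the redo external storage.
While the upstream TiDB is under write load, watch the monitoring status on the TiCDC-Redo panel:

Normal: both Redo Writer rows/s and Redo Writer bytes/s show write load.
Abnormal: Redo Writer bytes/s is empty while the other metric shows write load.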


Affected versions

v6.5.10, v6.5.11
v7.5.2, v7.5.3, v7.5.4
v8.1.0, v8.1.1

Resolution

Upgrade TiCDC to v6.1.12 / v7.5.5 / v8.1.2 or later.

Workaround

After spotting this problem in the monitoring, restart the changefeed.

Billmay表妹 · November 26, 2024, 09:19

Issue

https://github.com/pingcap/tidb/issues/56809

Impact

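If stale read is used, there is some risk of breaking the linearizability of Async Commit / 1PC transactions. In that case, an Async Commit / 1PC transaction may commit with an excessively large commit_ts, so the data it wrote is invisible to another read whose start time is strictly later than the time the transaction successfully committed. Once the application reads and uses such inconsistent results, it may go on to write more incorrect data.

Root Cause

When stale read is used (with any syntax), TiDB does not guarantee that the read ts stays within the maximum ts PD has ever allocated (read_ts <= pd_allocated_max_ts). In some cases, stale read may therefore run with a future ts (larger than any ts PD has allocated). Possible causes include but are not limited to:

The user manually specified the time of the data to read.
The physical clock drifts.
PD write latency is high, which creates a lag between the allocated ts and physical time.

When a stale read request sent to TiKV is not processed successfully and TiDB retries it as a normal leader read, the new request no longer carries the stale_read flag, which tells TiKV to handle it in the regular non-stale-read mode.

For a non-stale-read request sent to TiKV, max_ts must be advanced to the read_ts of the request. This max_ts is used to compute the commit_ts of Async Commit / 1PC transactions and should always be less than or equal to the maximum ts PD has ever allocated. Here, however, a stale read operation can advance it to a future ts, which causes the problem.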
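The chain of events can be replayed with a toy model. Everything in the sketch is an assumption made for illustration rather than TiKV code: the `ToyStore` class, the 18-bit logical part of a ts, and in particular the rule that the commit_ts is one larger than the maximum of start_ts and max_ts.

```python
LOGICAL_BITS = 18  # assumption: a ts packs physical milliseconds << 18 | logical counter


def compose_ts(physical_ms: int, logical: int = 0) -> int:
    return (physical_ms << LOGICAL_BITS) | logical


class ToyStore:
    """Hypothetical stand-in for the max_ts bookkeeping of one TiKV node."""

    def __init__(self) -> None:
        self.max_ts = 0

    def read(self, read_ts: int, stale_read: bool) -> None:
        # Only requests without the stale_read flag advance max_ts to their read_ts.
        if not stale_read:
            self.max_ts = max(self.max_ts, read_ts)

    def async_commit_ts(self, start_ts: int) -> int:
        # Assumed rule for this sketch: commit_ts must exceed both start_ts and max_ts.
        return max(start_ts, self.max_ts) + 1


store = ToyStore()
pd_latest = compose_ts(1_000_000, 7)    # newest ts PD has allocated so far
future_read_ts = compose_ts(1_000_500)  # stale read ts from a fast clock, logical part 0

store.read(future_read_ts, stale_read=True)   # original stale read: max_ts untouched
store.read(future_read_ts, stale_read=False)  # retried as leader read: max_ts passes PD

commit_ts = store.async_commit_ts(start_ts=pd_latest)
later_read_ts = compose_ts(1_000_001)   # a read that starts after the commit, ts from PD

write_visible = commit_ts <= later_read_ts
commit_logical = commit_ts & ((1 << LOGICAL_BITS) - 1)
```

In this model the later read cannot see the committed write because `commit_ts` is ahead of anything PD has handed out. Under the assumed "+1" rule the logical part of `commit_ts` also comes out as 1, since the future read ts had a logical part of 0.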

Diagnostic Steps

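Confirm that stale read is used in the cluster. If you can confirm that stale read has never been used, the cluster is not affected by this issue.
Confirm (or suspect) that the linearizability of Async Commit / 1PC transactions is sometimes broken: data written by an Async Commit / 1PC transaction is sometimes not immediately visible to the read statement that follows its successful commit. If you can find the concrete data of such an abnormal Async Commit / 1PC transaction, you will see that the logical part of its commit_ts always equals 1.
The probability of hitting this can be very low, so it may be very hard to reproduce. Even when it does happen in a cluster, there is not always a way to detect it.

The following observations increase confidence that this issue is the cause:

The TiDB log may contain this error:

Retrying aggressive locking with ForUpdateTS (…) less than previous LockedWithConflictTS (…)

The logical part of the ts referred to by LockedWithConflictTS in this error message always equals 1.

Note that this issue does not necessarily produce this error, so not finding the error does not rule the issue out.

Clock drift is confirmed in the cluster, or PD has write performance problems.

To check the logical part of a ts, use the pd-ctl tso command. An example of the error seen in the TiDB log: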

// TiDB related error log

Error 1105 (HY000): Txn 453423637808807957 Retrying aggressive locking with ForUpdateTS (453423637808807963) less than previous LockedWithConflictTS (453423637831352321)

tiup ctl:v7.5.1 pd tso 453423637831352321

// get logic of LockedWithConflictTS

Starting component ctl: /Users/wuxuelian/.tiup/components/ctl/v7.5.1/ctl pd tso 453423637831352321

system: 2024-10-23 01:58:31.405 -0700 PDT

logic: 1
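The same split can be done without pd-ctl. The sketch below assumes the usual TSO layout, in which the low 18 bits hold the logical counter and the remaining high bits hold the physical time in milliseconds since the Unix epoch; `split_tso` is a name chosen for this example.

```python
from datetime import datetime, timezone

LOGICAL_BITS = 18  # assumption: low 18 bits are the logical counter


def split_tso(ts: int):
    """Return (physical milliseconds since the Unix epoch, logical counter)."""
    return ts >> LOGICAL_BITS, ts & ((1 << LOGICAL_BITS) - 1)


# LockedWithConflictTS from the error log above
physical_ms, logical = split_tso(453423637831352321)
wall_clock = datetime.fromtimestamp(physical_ms / 1000, tz=timezone.utc)
```

For this ts the logical part is 1 and the physical part corresponds to 2024-10-23 08:58:31 UTC, which is the 01:58:31 PDT reported by `pd tso`.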

Resolution

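Upgrade to a newer version in which this issue has been fixed.

Workaround

If possible, consider stopping the use of stale read.
If stale read cannot be disabled, there is no other way to strictly guarantee that this issue is not triggered. However, the probability of triggering it is negligible if the following hold:
- Avoid using stale read to read data that is too new (that is, a read time too close to the current time).
- Make sure the system time on the machines hosting the cluster is accurate and stable.
- Make sure PD performs well.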

Affected Versions

v6.5.0~v6.5.11
v7.1.0~v7.1.5
v7.5.0~v7.5.4
v8.1.0~v8.1.1