How to resolve: ERROR 9001 (HY000): PD server timeout

【TiDB environment】Production
【TiDB version】8.5.1
【Operating system】AWS
【Deployment method】Deployed on AWS (cloud)
【Cluster data size】4 TB
【Cluster node count】9
【Reproduction path】Upgraded from v7.1.1 to v8.5.1
【Problem encountered: symptoms and impact】
Background:
After the upgrade, INSERT statements started showing prewrite and commit phases of 20 s or more, with some taking several minutes to complete. This lasted about one day, after which two TiKV nodes restarted one after the other; TiKV disk writes then rose, and prewrite/commit latency returned to the millisecond level.

Problem:
A production table, w_item_rakuten_sku, has roughly 150 million rows.
The Regions involved are as follows:

Executed: select count(*) from api.w_item_rakuten_sku where id>1152921504630682995;
Error: ERROR 9001 (HY000): PD server timeout
Explain result:

Workaround: after running analyze table api.w_item_rakuten_sku; the SQL above executes normally.

However, other sessions still hit the error when they run queries:

Executed: select * from w_item_rakuten_sku force index (PRIMARY) limit 1
Error: ERROR 9001 (HY000): PD server timeout
Explain result:

Workaround: after running analyze table again, the query returns normally.

Running analyze every time is not a real solution. What directions should we investigate? Could this be caused by some parameter or configuration setting?
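For reference, since ANALYZE temporarily fixes it, checking how fresh the table's statistics are seems like one starting point. A minimal diagnostic sketch (schema and table names as above):

-- Statistics health as seen by the optimizer (0-100; low values mean stale statistics).
SHOW STATS_HEALTHY WHERE Db_name = 'api' AND Table_name = 'w_item_rakuten_sku';

-- Row count, modify count, and last statistics update time for the table.
SHOW STATS_META WHERE Db_name = 'api' AND Table_name = 'w_item_rakuten_sku';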

【Resource configuration】Go to TiDB Dashboard → Cluster Info → Hosts and take a screenshot of that page

Ping from a TiDB server to the PD server: about 0.25 ms.
The PD server's CPU is 99% idle and its disks are idle.

【Copy-pasted ERROR log】
pd.log and tikv.log: nothing at this point in time.
tidb.log:
[2025/04/29 15:10:55.488 +09:00] [WARN] [backoff.go:179] ["pdRPC backoffer.maxSleep 10000ms is exceeded, errors:\nPD returned regions have gaps, range num: 2, limit: 128 at 2025-04-29T15:10:49.551409367+09:00\nPD returned regions have gaps, range num: 2, limit: 128 at 2025-04-29T15:10:51.285390702+09:00\nPD returned regions have gaps, range num: 2, limit: 128 at 2025-04-29T15:10:53.492388735+09:00\ntotal-backoff-times: 7, backoff-detail: pdRPC:7, maxBackoffTimeExceeded: true, maxExcludedTimeExceeded: false\nlongest sleep type: pdRPC, time: 10899ms"] [conn=641932222] [session_alias=]
[2025/04/29 15:10:55.489 +09:00] [INFO] [conn.go:1184] [“command dispatched failed”] [conn=641932222] [session_alias=] [connInfo=“id:641932222, addr:127.0.0.1:61334 status:10, collation:utf8_general_ci, user:root”] [command=Query] [status=“inTxn:0, autocommit:1”] [sql=“select * from w_item_rakuten_sku force index (PRIMARY) limit 1”] [txn_mode=PESSIMISTIC] [timestamp=457679056292806657] [err=“[tikv:9001]PD server timeout: \ngithub.com/pingcap/errors.AddStack\n\t/root/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20240318064555-6bd07397691f/errors.go:178\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\t/root/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20240318064555-6bd07397691f/normalize.go:175\ngithub.com/pingcap/tidb/pkg/store/driver/error.ToTiDBErr\n\t/workspace/source/tidb/pkg/store/driver/error/error.go:119\ngithub.com/pingcap/tidb/pkg/store/copr.(*RegionCache).SplitKeyRangesByLocations\n\t/workspace/source/tidb/pkg/store/copr/region_cache.go:190\ngithub.com/pingcap/tidb/pkg/store/copr.(*RegionCache).SplitKeyRangesByBuckets\n\t/workspace/source/tidb/pkg/store/copr/region_cache.go:232\ngithub.com/pingcap/tidb/pkg/store/copr.buildCopTasks\n\t/workspace/source/tidb/pkg/store/copr/coprocessor.go:353\ngithub.com/pingcap/tidb/pkg/store/copr.(*CopClient).BuildCopIterator.func3\n\t/workspace/source/tidb/pkg/store/copr/coprocessor.go:159\ngithub.com/pingcap/tidb/pkg/kv.(*KeyRanges).ForEachPartitionWithErr\n\t/workspace/source/tidb/pkg/kv/kv.go:476\ngithub.com/pingcap/tidb/pkg/store/copr.(*CopClient).BuildCopIterator\n\t/workspace/source/tidb/pkg/store/copr/coprocessor.go:173\ngithub.com/pingcap/tidb/pkg/store/copr.(*CopClient).Send\n\t/workspace/source/tidb/pkg/store/copr/coprocessor.go:100\ngithub.com/pingcap/tidb/pkg/distsql.Select\n\t/workspace/source/tidb/pkg/distsql/distsql.go:91\ngithub.com/pingcap/tidb/pkg/distsql.SelectWithRuntimeStats\n\t/workspace/source/tidb/pkg/distsql/distsql.go:146\ngithub.com/pingcap/tidb/pkg/executor.selectResultHook.SelectResult\n\t/workspace/source/tidb/pkg/executor/table_reader.go:70\ngithub.com/pingcap/tidb/pkg/executor.(*TableReaderExecutor).buildResp\n\t/workspace/source/tidb/pkg/executor/table_reader.go:421\ngithub.com/pingcap/tidb/pkg/executor.(*TableReaderExecutor).Open\n\t/workspace/source/tidb/pkg/executor/table_reader.go:298\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.Open\n\t/workspace/source/tidb/pkg/executor/internal/exec/executor.go:433\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.(*BaseExecutorV2).Open\n\t/workspace/source/tidb/pkg/executor/internal/exec/executor.go:303\ngithub.com/pingcap/tidb/pkg/executor.(*LimitExec).Open\n\t/workspace/source/tidb/pkg/executor/select.go:496\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.Open\n\t/workspace/source/tidb/pkg/executor/internal/exec/executor.go:433\ngithub.com/pingcap/tidb/pkg/executor.(*ExecStmt).openExecutor\n\t/workspace/source/tidb/pkg/executor/adapter.go:1259\ngithub.com/pingcap/tidb/pkg/executor.(*ExecStmt).Exec\n\t/workspace/source/tidb/pkg/executor/adapter.go:592\ngithub.com/pingcap/tidb/pkg/session.runStmt\n\t/workspace/source/tidb/pkg/session/session.go:2288\ngithub.com/pingcap/tidb/pkg/session.(*session).ExecuteStmt\n\t/workspace/source/tidb/pkg/session/session.go:2150\ngithub.com/pingcap/tidb/pkg/server.(*TiDBContext).ExecuteStmt\n\t/workspace/source/tidb/pkg/server/driver_tidb.go:291\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).handleStmt\n\t/workspace/source/tidb/pkg/server/conn.go:2026\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).handleQuery\
n\t/workspace/source/tidb/pkg/server/conn.go:1779\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).dispatch\n\t/workspace/source/tidb/pkg/server/conn.go:1378\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).Run\n\t/workspace/source/tidb/pkg/server/conn.go:1147\ngithub.com/pingcap/tidb/pkg/server.(*Server).onConn\n\t/workspace/source/tidb/pkg/server/server.go:741\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1700”]

【Other attachments: screenshots/logs/monitoring】

Try running analyze once on each TiDB node, or restarting the TiDB nodes.

Speaking of analyze, I forgot to mention something.
Before the upgrade, tidb_analyze_version and tidb_cost_model_version were set to 1; with this upgrade, both were changed back to their default value of 2.

Could that be related? Do we need to run analyze manually on every table once?
(That seems like a lot of work, with a potentially large impact.)
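If we do end up re-analyzing, a minimal sketch of the idea: confirm the variable settings first, then generate the ANALYZE statements per table and run them in controlled batches (on a 4 TB cluster this is heavy). As far as I know, existing statistics are not regenerated automatically just by changing the variables; they keep their old version until the table is analyzed again.

-- Confirm the current global settings of the two variables changed during the upgrade.
SHOW GLOBAL VARIABLES WHERE Variable_name IN ('tidb_analyze_version', 'tidb_cost_model_version');

-- Generate one ANALYZE statement per user table; review the output before running it.
SELECT CONCAT('ANALYZE TABLE `', table_schema, '`.`', table_name, '`;') AS stmt
FROM information_schema.tables
WHERE table_schema NOT IN ('mysql', 'INFORMATION_SCHEMA', 'PERFORMANCE_SCHEMA', 'METRICS_SCHEMA');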

This looks like that bug; the description is a very close match.

The good news is that the bug has been fixed; the bad news is that the fix only went in two weeks ago, so no released patch version includes it yet.


Thanks. From the description it really does look like the same thing...

All the key elements are present: a large table, more than 128 Regions, and a count.

In our case, even select * from w_item_rakuten_sku limit 1; hits the timeout,
and after analyze it goes back to normal.
It still feels slightly different, though, so for now we have scheduled a restart of the TiDB nodes.
Thanks.

The root cause is:

For a large table with multiple range queries, if the region count exceeds 128, the last range may be not limited.

In other words, the error is triggered once the scan covers more than 128 Regions. The reason analyze makes it work again is probably that, with more accurate statistics, the number of Regions scanned drops below 128.
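If you want to confirm that, a quick sketch to count the Regions behind the table (schema/table names taken from this thread; is_index = 0 keeps only row-data Regions):

-- Count the Regions holding the table's row data.
SELECT COUNT(*) AS data_regions
FROM information_schema.TIKV_REGION_STATUS
WHERE db_name = 'api' AND table_name = 'w_item_rakuten_sku' AND is_index = 0;

-- Or list row and index Regions directly.
SHOW TABLE api.w_item_rakuten_sku REGIONS;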

Of course it would be great if a restart fixes it, but I suspect it will not help much.

Also, the code change for this issue is actually very small.

Moving that check to the right place in the corresponding file of the PD code is enough to fix it.

If you are in a hurry, you could take the 8.5.1 branch of the code yourself, apply the change by hand, and rebuild PD to get the fix.