How to resolve: ERROR 9001 (HY000): PD server timeout

【TiDB environment】Production
【TiDB version】8.5.1
【Operating system】AWS
【Deployment method】Deployed on AWS (cloud)
【Cluster data size】4 TB
【Cluster node count】9
【Reproduction path】Upgraded from v7.1.1 to v8.5.1
【Problem encountered: symptoms and impact】
Background:
After the upgrade, INSERT statements started showing prewrite and commit phases of 20 s or more, with some taking several minutes to complete. This lasted about one day, after which two TiKV nodes restarted one after the other; TiKV disk writes then rose, and prewrite/commit latency returned to the millisecond level.

Problem:
A production table, w_item_rakuten_sku, has roughly 150 million rows.
The Regions involved are as follows:

Executed: select count(*) from api.w_item_rakuten_sku where id>1152921504630682995;
Error: ERROR 9001 (HY000): PD server timeout
Explain result:

Workaround: after running analyze table api.w_item_rakuten_sku; the SQL above executes normally.

However, other sessions still hit the error when they run queries:

Executed: select * from w_item_rakuten_sku force index (PRIMARY) limit 1
Error: ERROR 9001 (HY000): PD server timeout
Explain result:

Workaround: after running analyze table again, the query returns normally.

Running analyze every time is not a real solution. What directions should we investigate? Could this be caused by some parameter or configuration setting?
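For reference, since ANALYZE temporarily fixes it, checking how fresh the table's statistics are seems like one starting point. A minimal diagnostic sketch (schema and table names as above):

-- Statistics health as seen by the optimizer (0-100; low values mean stale statistics).
SHOW STATS_HEALTHY WHERE Db_name = 'api' AND Table_name = 'w_item_rakuten_sku';

-- Row count, modify count, and last statistics update time for the table.
SHOW STATS_META WHERE Db_name = 'api' AND Table_name = 'w_item_rakuten_sku';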

【Resource configuration】Go to TiDB Dashboard → Cluster Info → Hosts and take a screenshot of that page

Ping from a TiDB server to the PD server: about 0.25 ms.
The PD server's CPU is 99% idle and its disks are idle.

【Copy-pasted ERROR log】
pd.log and tikv.log: nothing at this point in time.
tidb.log:
[2025/04/29 15:10:55.488 +09:00] [WARN] [backoff.go:179] ["pdRPC backoffer.maxSleep 10000ms is exceeded, errors:\nPD returned regions have gaps, range num: 2, limit: 128 at 2025-04-29T15:10:49.551409367+09:00\nPD returned regions have gaps, range num: 2, limit: 128 at 2025-04-29T15:10:51.285390702+09:00\nPD returned regions have gaps, range num: 2, limit: 128 at 2025-04-29T15:10:53.492388735+09:00\ntotal-backoff-times: 7, backoff-detail: pdRPC:7, maxBackoffTimeExceeded: true, maxExcludedTimeExceeded: false\nlongest sleep type: pdRPC, time: 10899ms"] [conn=641932222] [session_alias=]
[2025/04/29 15:10:55.489 +09:00] [INFO] [conn.go:1184] [“command dispatched failed”] [conn=641932222] [session_alias=] [connInfo=“id:641932222, addr:127.0.0.1:61334 status:10, collation:utf8_general_ci, user:root”] [command=Query] [status=“inTxn:0, autocommit:1”] [sql=“select * from w_item_rakuten_sku force index (PRIMARY) limit 1”] [txn_mode=PESSIMISTIC] [timestamp=457679056292806657] [err=“[tikv:9001]PD server timeout: \ngithub.com/pingcap/errors.AddStack\n\t/root/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20240318064555-6bd07397691f/errors.go:178\ngithub.com/pingcap/errors.(*Error).GenWithStackByArgs\n\t/root/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20240318064555-6bd07397691f/normalize.go:175\ngithub.com/pingcap/tidb/pkg/store/driver/error.ToTiDBErr\n\t/workspace/source/tidb/pkg/store/driver/error/error.go:119\ngithub.com/pingcap/tidb/pkg/store/copr.(*RegionCache).SplitKeyRangesByLocations\n\t/workspace/source/tidb/pkg/store/copr/region_cache.go:190\ngithub.com/pingcap/tidb/pkg/store/copr.(*RegionCache).SplitKeyRangesByBuckets\n\t/workspace/source/tidb/pkg/store/copr/region_cache.go:232\ngithub.com/pingcap/tidb/pkg/store/copr.buildCopTasks\n\t/workspace/source/tidb/pkg/store/copr/coprocessor.go:353\ngithub.com/pingcap/tidb/pkg/store/copr.(*CopClient).BuildCopIterator.func3\n\t/workspace/source/tidb/pkg/store/copr/coprocessor.go:159\ngithub.com/pingcap/tidb/pkg/kv.(*KeyRanges).ForEachPartitionWithErr\n\t/workspace/source/tidb/pkg/kv/kv.go:476\ngithub.com/pingcap/tidb/pkg/store/copr.(*CopClient).BuildCopIterator\n\t/workspace/source/tidb/pkg/store/copr/coprocessor.go:173\ngithub.com/pingcap/tidb/pkg/store/copr.(*CopClient).Send\n\t/workspace/source/tidb/pkg/store/copr/coprocessor.go:100\ngithub.com/pingcap/tidb/pkg/distsql.Select\n\t/workspace/source/tidb/pkg/distsql/distsql.go:91\ngithub.com/pingcap/tidb/pkg/distsql.SelectWithRuntimeStats\n\t/workspace/source/tidb/pkg/distsql/distsql.go:146\ngithub.com/pingcap/tidb/pkg/executor.selectResultHook.SelectResult\n\t/workspace/source/tidb/pkg/executor/table_reader.go:70\ngithub.com/pingcap/tidb/pkg/executor.(*TableReaderExecutor).buildResp\n\t/workspace/source/tidb/pkg/executor/table_reader.go:421\ngithub.com/pingcap/tidb/pkg/executor.(*TableReaderExecutor).Open\n\t/workspace/source/tidb/pkg/executor/table_reader.go:298\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.Open\n\t/workspace/source/tidb/pkg/executor/internal/exec/executor.go:433\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.(*BaseExecutorV2).Open\n\t/workspace/source/tidb/pkg/executor/internal/exec/executor.go:303\ngithub.com/pingcap/tidb/pkg/executor.(*LimitExec).Open\n\t/workspace/source/tidb/pkg/executor/select.go:496\ngithub.com/pingcap/tidb/pkg/executor/internal/exec.Open\n\t/workspace/source/tidb/pkg/executor/internal/exec/executor.go:433\ngithub.com/pingcap/tidb/pkg/executor.(*ExecStmt).openExecutor\n\t/workspace/source/tidb/pkg/executor/adapter.go:1259\ngithub.com/pingcap/tidb/pkg/executor.(*ExecStmt).Exec\n\t/workspace/source/tidb/pkg/executor/adapter.go:592\ngithub.com/pingcap/tidb/pkg/session.runStmt\n\t/workspace/source/tidb/pkg/session/session.go:2288\ngithub.com/pingcap/tidb/pkg/session.(*session).ExecuteStmt\n\t/workspace/source/tidb/pkg/session/session.go:2150\ngithub.com/pingcap/tidb/pkg/server.(*TiDBContext).ExecuteStmt\n\t/workspace/source/tidb/pkg/server/driver_tidb.go:291\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).handleStmt\n\t/workspace/source/tidb/pkg/server/conn.go:2026\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).handleQuery\
n\t/workspace/source/tidb/pkg/server/conn.go:1779\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).dispatch\n\t/workspace/source/tidb/pkg/server/conn.go:1378\ngithub.com/pingcap/tidb/pkg/server.(*clientConn).Run\n\t/workspace/source/tidb/pkg/server/conn.go:1147\ngithub.com/pingcap/tidb/pkg/server.(*Server).onConn\n\t/workspace/source/tidb/pkg/server/server.go:741\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1700”]

【Other attachments: screenshots/logs/monitoring】

Try running analyze once on each TiDB node, or restarting the TiDB nodes.

Speaking of analyze, I forgot to mention something.
Before the upgrade, tidb_analyze_version and tidb_cost_model_version were set to 1; with this upgrade, both were changed back to their default value of 2.

Could that be related? Do we need to run analyze manually on every table once?
(That seems like a lot of work, with a potentially large impact.)
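If we do end up re-analyzing, a minimal sketch of the idea: confirm the variable settings first, then generate the ANALYZE statements per table and run them in controlled batches (on a 4 TB cluster this is heavy). As far as I know, existing statistics are not regenerated automatically just by changing the variables; they keep their old version until the table is analyzed again.

-- Confirm the current global settings of the two variables changed during the upgrade.
SHOW GLOBAL VARIABLES WHERE Variable_name IN ('tidb_analyze_version', 'tidb_cost_model_version');

-- Generate one ANALYZE statement per user table; review the output before running it.
SELECT CONCAT('ANALYZE TABLE `', table_schema, '`.`', table_name, '`;') AS stmt
FROM information_schema.tables
WHERE table_schema NOT IN ('mysql', 'INFORMATION_SCHEMA', 'PERFORMANCE_SCHEMA', 'METRICS_SCHEMA');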

This looks like that bug; the description is a very close match.

The good news is that the bug has been fixed; the bad news is that the fix only went in two weeks ago, so no released patch version includes it yet.


Thanks. From the description it really does look like the same thing...

All the key elements are present: a large table, more than 128 Regions, and a count.

In our case, even select * from w_item_rakuten_sku limit 1; hits the timeout,
and after analyze it goes back to normal.
It still feels slightly different, though, so for now we have scheduled a restart of the TiDB nodes.
Thanks.

The root cause is:

For a large table with multiple range queries, if the region count exceeds 128, the last range may be not limited.

In other words, the error is triggered once the scan covers more than 128 Regions. The reason analyze makes it work again is probably that, with more accurate statistics, the number of Regions scanned drops below 128.
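If you want to confirm that, a quick sketch to count the Regions behind the table (schema/table names taken from this thread; is_index = 0 keeps only row-data Regions):

-- Count the Regions holding the table's row data.
SELECT COUNT(*) AS data_regions
FROM information_schema.TIKV_REGION_STATUS
WHERE db_name = 'api' AND table_name = 'w_item_rakuten_sku' AND is_index = 0;

-- Or list row and index Regions directly.
SHOW TABLE api.w_item_rakuten_sku REGIONS;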

Of course it would be great if a restart fixes it, but I suspect it will not help much.

Also, the code change for this issue is actually very small.

Moving that check to the right place in the corresponding file of the PD code is enough to fix it.

If you are in a hurry, you could take the 8.5.1 branch of the code yourself, apply the change by hand, and rebuild PD to get the fix.