TiDB gets PD timeout (9001) when accessing PD, or gets stale information (?) and reports Region unavailable (9005), so data cannot be written

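[TiDB environment] Production
[TiDB version] 4.0.14
[Problem encountered]
The cluster is fairly large: about 160 TB of data and roughly 7.5 million / 3 Regions. TiDB currently cannot obtain correct Region routing information, so most reads and writes fail. Monitoring shows two kinds of errors, 9001 and 9005, with 9005 dominating. Our initial suspicion was that PD was under too much scheduling pressure, so we adjusted the following parameters, but that did not solve the problem: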

tikv:
    raftstore.hibernate-regions: true
    raftstore.pd-heartbeat-tick-interval: 1m30s
    raftstore.pd-store-heartbeat-tick-interval: 20s
Scheduling-related parameters in PD:
   leader-schedule-limit
   merge-schedule-limit
   region-schedule-limit
   replica-schedule-limit
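
To put the "PD is under too much pressure" suspicion into numbers, here is a rough estimate of the Region heartbeat rate PD would have to absorb. It is only a sketch under assumptions that the post does not confirm: that "7.5 million / 3" means 7.5 million Region replicas at 3 replicas each, and that every Region leader sends one heartbeat per `pd-heartbeat-tick-interval` when it is not hibernating. The function name is made up for illustration.

```python
# Back-of-the-envelope estimate of Region heartbeat traffic reaching PD.
# Assumptions (not stated in the thread): "7.5 million / 3" means 7.5 million
# Region replicas with 3 replicas each, and every non-hibernating Region leader
# sends one heartbeat per pd-heartbeat-tick-interval.


def heartbeats_per_second(region_count: int, interval_seconds: float) -> float:
    """Upper bound on Region heartbeats per second sent to PD."""
    return region_count / interval_seconds


REGION_COUNT = 7_500_000 // 3  # about 2.5 million Regions (assumed reading)
DEFAULT_INTERVAL = 60          # TiKV default pd-heartbeat-tick-interval ("1m")
TUNED_INTERVAL = 90            # the 1m30s value from the post

for label, interval in (("default 1m", DEFAULT_INTERVAL),
                        ("tuned 1m30s", TUNED_INTERVAL)):
    rate = heartbeats_per_second(REGION_COUNT, interval)
    print(f"{label}: up to {rate:,.0f} heartbeats/s")
```

With `hibernate-regions` enabled the real rate should be lower than this bound, since idle Regions stop ticking; the figure is only meant to show the order of magnitude.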

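Because the external writer consumes a message queue sequentially and writes the data into TiDB, even a small number of errors blocks the entire write pipeline. Our current mitigation is to restart PD at regular intervals, which keeps queued data from piling up long enough to go stale, but it feels like a workaround rather than a fix. Does anyone have experience with this, or ideas we could look into? The error logs are attached below.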
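
For readers with a similar sequential consumer, one client-side stopgap is to retry a failed write with exponential backoff instead of stalling on the first error. The sketch below is not the poster's code: `WriteError`, `write_with_retry`, and the choice to treat 9001 and 9005 as retryable are all assumptions made for illustration, and it does nothing about the underlying routing problem.

```python
import time

# Error codes taken from the logs in this thread; treating them as retryable
# is an assumption of this sketch.
RETRYABLE_CODES = {9001, 9005}


class WriteError(Exception):
    """Hypothetical wrapper for a failed write; `code` is the TiDB error code."""

    def __init__(self, code: int, message: str = ""):
        super().__init__(message)
        self.code = code


def write_with_retry(write, message, max_attempts=5, base_delay=0.5,
                     sleep=time.sleep):
    """Call write(message), retrying retryable errors with exponential backoff.

    Returns the number of attempts used. Re-raises the last error when the
    attempts are exhausted or the error code is not retryable.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            write(message)
            return attempt
        except WriteError as err:
            if err.code not in RETRYABLE_CODES or attempt == max_attempts:
                raise
            # Delay doubles on each retry: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Here `write` would be whatever function executes the statement for one queue message and raises `WriteError` with the server's error code on failure. Message order is preserved because the consumer still handles one message at a time.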

9001:

[2022/06/06 17:19:05.900 +08:00] [WARN] [backoff.go:329] ["pdRPC backoffer.maxSleep 40000ms is exceeded, errors:\
region not found for key \"t\\x80\\x00\\x00\\x00\\x00\\x00\\x00\\xcf_r\\x81\\xa7T\\xbd\\xb3\\x8aR\\x17\" at 2022-06-06T17:19:02.483532549+08:00\
region not found for key \"t\\x80\\x00\\x00\\x00\\x00\\x00\\x00\\xcf_r\\x81\\xa7T\\xbd\\xb3\\x8aR\\x17\" at 2022-06-06T17:19:04.123025368+08:00\
region not found for key \"t\\x80\\x00\\x00\\x00\\x00\\x00\\x00\\xcf_r\\x81\\xa7T\\xbd\\xb3\\x8aR\\x17\" at 2022-06-06T17:19:05.900545063+08:00"]
[2022/06/06 17:19:05.900 +08:00] [WARN] [session.go:1384] ["run statement failed"] [conn=1413] [schemaVersion=412] [error="[tikv:9001]PD server timeout"] [session="{\
  \"currDBName\": \"gifshow\",\
  \"id\": 1413,\
  \"status\": 1,\
  \"strictMode\": true,\
  \"txn\": \"433719110697222149\",\
  \"user\": {\
    \"Username\": \"pay_gateway_rw\",\
    \"Hostname\": \"**.**.**.**\",\
    \"CurrentUser\": false,\
    \"AuthUsername\": \"pay_gateway_rw\",\
    \"AuthHostname\": \"%\"\
  }\
}"]
[2022/06/06 17:19:05.901 +08:00] [INFO] [conn.go:864] ["command dispatched failed"] [conn=1413] [connInfo="id:1413, addr:**.**.**.**:45704 status:1, collation:utf8_general_ci, user:pay_gateway_rw"] [command=Query] [status="inTxn:1, autocommit:0"] [sql="/* \
ktrace:CAISGhCZgICAoKaduwoY3xAggN61vZMwKKX6/7UOGhoQx4CAgLCFk7YKGPACILq3isKTMCig2ozPCiASKhdrc3BheS1jb3JlLWRhdGFidXMuUFJPRDIFa3NwYXk=\
trace_ctx:EgAyAA==\
 */
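
The key in the `region not found for key` lines can be decoded to see which table and row the failed lookup was for. The sketch below relies on TiDB's row key layout: `t`, an 8-byte table ID, `_r`, then an 8-byte integer handle, with each integer stored big-endian and its sign bit flipped. The helper names are made up, and it only handles row keys with an integer handle.

```python
SIGN_MASK = 1 << 63


def decode_cmp_int(raw: bytes) -> int:
    """Decode an 8-byte big-endian integer stored with its sign bit flipped."""
    value = int.from_bytes(raw, "big") ^ SIGN_MASK
    # Map the unsigned result back into the signed 64-bit range.
    return value - (1 << 64) if value >= SIGN_MASK else value


def decode_record_key(key: bytes):
    """Split a TiDB row key 't{tableID}_r{handle}' into (table_id, handle)."""
    if len(key) != 19 or key[:1] != b"t" or key[9:11] != b"_r":
        raise ValueError("not a row key with an integer handle")
    return decode_cmp_int(key[1:9]), decode_cmp_int(key[11:19])


# Key copied from the "region not found for key" log line, with the log's
# double escaping removed.
KEY = b"t\x80\x00\x00\x00\x00\x00\x00\xcf_r\x81\xa7T\xbd\xb3\x8aR\x17"

table_id, handle = decode_record_key(KEY)
print(f"table_id={table_id} handle={handle:#x}")
```

For this key the table ID comes out as 207, which can then be matched against the table IDs in `information_schema.tables` (column `TIDB_TABLE_ID`) to find the affected table.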

9005:

[2022/06/06 17:18:53.445 +08:00] [WARN] [backoff.go:329] ["regionMiss backoffer.maxSleep 40000ms is exceeded, errors:\
message:\"region 241388794 is missing\" region_not_found:<region_id:241388794 >  at 2022-06-06T17:18:52.442196869+08:00\
message:\"region 241388794 is missing\" region_not_found:<region_id:241388794 >  at 2022-06-06T17:18:52.94406682+08:00\
message:\"region 241388794 is missing\" region_not_found:<region_id:241388794 >  at 2022-06-06T17:18:53.445932518+08:00"]
[2022/06/06 17:18:53.446 +08:00] [WARN] [session.go:1384] ["run statement failed"] [conn=567] [schemaVersion=412] [error="[tikv:9005]Region is unavailable"] [session="{\
  \"currDBName\": \"gifshow\",\
  \"id\": 567,\
  \"status\": 1,\
  \"strictMode\": true,\
  \"txn\": \"433719107446637014\",\
  \"user\": {\
    \"Username\": \"pay_gateway_rw\",\
    \"Hostname\": \"**.**.**.**\",\
    \"CurrentUser\": false,\
    \"AuthUsername\": \"pay_gateway_rw\",\
    \"AuthHostname\": \"%\"\
  }\
}"]
[2022/06/06 17:18:53.446 +08:00] [INFO] [conn.go:864] ["command dispatched failed"]

What we can confirm is that pd-ctl shows the corresponding Region in a normal state.
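
To repeat that check for every Region named in the 9005 errors, the IDs can be pulled out of the TiDB log and turned into `pd-ctl region <id>` invocations. This is only a sketch: the function names are made up, the PD address `http://127.0.0.1:2379` is a placeholder, and the regular expression only matches the `region_not_found:<region_id:...>` form seen in the log above.

```python
import re

# Matches the region_not_found payload as it appears in the log above.
REGION_ID_PATTERN = re.compile(r"region_not_found:<region_id:(\d+)")


def missing_region_ids(log_text: str) -> list:
    """Return the distinct Region IDs reported as missing, in first-seen order."""
    seen = {}
    for match in REGION_ID_PATTERN.finditer(log_text):
        seen.setdefault(int(match.group(1)), None)
    return list(seen)


def pd_ctl_commands(region_ids, pd_addr="http://127.0.0.1:2379"):
    """Build one `pd-ctl region <id>` invocation per Region ID.

    pd_addr is a placeholder; substitute the real PD endpoint.
    """
    return [f"pd-ctl -u {pd_addr} region {rid}" for rid in region_ids]


SAMPLE = (
    'message:"region 241388794 is missing" '
    'region_not_found:<region_id:241388794 >\n'
    'message:"region 241388794 is missing" '
    'region_not_found:<region_id:241388794 >\n'
)

for command in pd_ctl_commands(missing_region_ids(SAMPLE)):
    print(command)
```

Comparing the leader and peers PD reports for each ID with the store TiDB actually sent the request to can help show whether TiDB's Region cache or PD's view is the stale one.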

Please post the full set of monitoring data.

Remainder of the first log:

[err="[tikv:9001]PD server timeout\
github.com/pingcap/errors.AddStack\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\
github.com/pingcap/errors.Trace\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/juju_adaptor.go:15\
github.com/pingcap/tidb/store/tikv.(*RegionCache).loadRegion\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/region_cache.go:996\
github.com/pingcap/tidb/store/tikv.(*RegionCache).findRegionByKey\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/region_cache.go:575\
github.com/pingcap/tidb/store/tikv.(*RegionCache).LocateKey\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/region_cache.go:535\
github.com/pingcap/tidb/store/tikv.(*RegionCache).GroupKeysByRegion\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/region_cache.go:711\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetKeysByRegions\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:222\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).BatchGet\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:148\
github.com/pingcap/tidb/kv.(*BufferBatchGetter).BatchGet\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/kv/memdb_buffer.go:227\
github.com/pingcap/tidb/store/tikv.(*tikvTxn).BatchGet\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/txn.go:192\
github.com/pingcap/tidb/session.(*TxnState).BatchGet\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/session/txn.go:345\
github.com/pingcap/tidb/executor.prefetchUniqueIndices\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/insert.go:136\
github.com/pingcap/tidb/executor.(*InsertValues).batchCheckAndInsert\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/insert_common.go:1041\
github.com/pingcap/tidb/executor.(*InsertExec).exec\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/insert.go:81\
github.com/pingcap/tidb/executor.insertRows\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/insert_common.go:272\
github.com/pingcap/tidb/executor.(*InsertExec).Next\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/insert.go:288\
github.com/pingcap/tidb/executor.Next\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/executor.go:262\
github.com/pingcap/tidb/executor.(*ExecStmt).handleNoDelayExecutor\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/adapter.go:531\
github.com/pingcap/tidb/executor.(*ExecStmt).handlePessimisticDML\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/adapter.go:550\
github.com/pingcap/tidb/executor.(*ExecStmt).handleNoDelay\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/adapter.go:411\
github.com/pingcap/tidb/executor.(*ExecStmt).Exec\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/executor/adapter.go:366\
github.com/pingcap/tidb/session.runStmt\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/session/tidb.go:322\
github.com/pingcap/tidb/session.(*session).ExecuteStmt\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/session/session.go:1381\
github.com/pingcap/tidb/server.(*TiDBContext).ExecuteStmt\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/driver_tidb.go:270\
github.com/pingcap/tidb/server.(*clientConn).handleStmt\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/conn.go:1513\
github.com/pingcap/tidb/server.(*clientConn).handleQuery\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/conn.go:1502\
github.com/pingcap/tidb/server.(*clientConn).dispatch\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/conn.go:1080\
github.com/pingcap/tidb/server.(*clientConn).Run\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/conn.go:849\
github.com/pingcap/tidb/server.(*Server).onConn\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/server/server.go:453\
runtime.goexit\
\t/usr/local/go/src/runtime/asm_amd64.s:1357"]

Remainder of the second log:

[err="[tikv:9005]Region is unavailable\
github.com/pingcap/errors.AddStack\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/errors.go:174\
github.com/pingcap/errors.Trace\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/pkg/mod/github.com/pingcap/errors@v0.11.5-0.20201126102027-b0a155152ca3/juju_adaptor.go:15\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetSingleRegion\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:301\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetKeysByRegions\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:238\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetSingleRegion\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:303\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetKeysByRegions\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:238\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetSingleRegion\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:303\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetKeysByRegions\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:238\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetSingleRegion\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:303\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetKeysByRegions\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:238\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetSingleRegion\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:303\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetKeysByRegions\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:238\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetSingleRegion\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:303\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetKeysByRegions\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:238\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetSingleRegion\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:303\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetKeysByRegions\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:238\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetSingleRegion\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:303\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetKeysByRegions\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:238\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetSingleRegion\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:303\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetKeysByRegions\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:238\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetSingleRegion\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:303\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetKeysByRegions\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:238\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetSingleRegion\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:303\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetKeysByRegions\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:238\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetSingleRegion\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:303\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetKeysByRegions\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:238\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetSingleRegion\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:303\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetKeysByRegions\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:238\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetSingleRegion\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:303\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetKeysByRegions\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:238\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetSingleRegion\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:303\
github.com/pingcap/tidb/store/tikv.(*tikvSnapshot).batchGetKeysByRegions\
\t/home/jenkins/agent/workspace/optimization-build-tidb-linux-amd/go/src/github.com/pingcap/tidb/store/tikv/snapshot.go:238"]

You can take a look at this for reference.

We have already adjusted this parameter, and it did not make much difference.

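Please use Clinic to collect the relevant logs and monitoring information; with what we have so far, the root cause cannot be located yet. → https://docs.pingcap.com/zh/tidb-in-kubernetes/stable/clinic-user-guide#使用-pingcap-clinic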
