PD leader OOM

【TiDB environment】Production
【TiDB version】7.1.0
【Reproduction steps】None
【Problem: symptoms and impact】
The PD leader node ran out of memory (OOM), and TiDB's access to PD failed. Log:

[2024/01/17 15:30:11.960 +08:00] [WARN] [util.go:163] ["apply request took too long"] [took=182.652093ms] [expected-duration=100ms] [prefix=] [request="header:<ID:15889696433221208426 > txn:<compare:<target:MOD key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-enterprise-collection\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/prod-tidb-rd-stat-deal-suffix\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-corporate\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-realtime\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-wxpay-datalake\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/ticdc-v504-rd-payment-orders-many-prd\" mod_revision:7048499857  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-alipay\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/ticdc-v504-stat-ads-adi-prd\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/broker-mutex-user-remit-period\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/ticdc-v504-broker-stat-v3-income-record-prd\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-enterprise\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/ticdc-v504-rd-broker-bills-many-prod\" mod_revision:7048499857  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/prod-tidb-broker-vol3\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/ticdc-v504-stat-online-bp-many-prod-c\" mod_revision:7048499856  target:MOD key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-external\" 
mod_revision:7048499856  target:VALUE key:\"/tidb/cdc/default/__cdc_meta__/meta/ticdc-delete-etcd-key-count\" value_size:2 > success:<request_put:<key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-enterprise-collection\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/prod-tidb-rd-stat-deal-suffix\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-corporate\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-realtime\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-wxpay-datalake\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/ticdc-v504-rd-payment-orders-many-prd\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-alipay\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/ticdc-v504-stat-ads-adi-prd\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/broker-mutex-user-remit-period\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/ticdc-v504-broker-stat-v3-income-record-prd\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-enterprise\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/ticdc-v504-rd-broker-bills-many-prod\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/prod-tidb-broker-vol3\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/ticdc-v504-stat-online-bp-many-prod-c\" value_size:130 > request_put:<key:\"/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-external\" value_size:130 >> failure:<>>"] [response=size:190] []

Is all the memory on the machine being used by PD? Are there other processes running? Is this a mixed deployment?

Is too little memory allocated to it?

Is something wrong with your CDC?

/tidb/cdc/default/default/changefeed/status/broker-b-broker-bills-external" value_size:130 >> failure:<>>

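I suspect TiCDC and PD are deployed on the same machine. Possibly TiCDC replication was interrupted and TiCDC ran out of memory when it restarted, which caused PD to report errors as well.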

CDC is running normally, and it is deployed on its own machine.

The pd-server is deployed on its own as well.

Could you provide a couple more lines of log before and after the error? And the cluster configuration as well.

PD is very unlikely to OOM. Check the Region status, check whether concurrency is too high, and so on.

It is rare for a PD node to hit OOM, and given how PD works it is also unlikely in principle.

Confirm whether other processes are co-deployed on the machine and driving up memory usage.

Confirm the cluster's data volume, Region count, and similar information.

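Also focus on that PD node's log, pd.log, and look at what was logged before the restart: check the content before and after the restart keyword "Welcome" to confirm whether anything abnormal appears.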
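
The log check suggested above could be sketched roughly as follows. This is only a minimal illustration: the function name `restart_context`, its parameters, and the window sizes are hypothetical, and the only thing taken from this thread is that a restart shows up as a line containing "Welcome" in pd.log.

```python
from collections import deque


def restart_context(lines, keyword="Welcome", before=20, after=5):
    """Return one list of log lines per line that contains `keyword`.

    Each list holds up to `before` lines preceding the match, the matching
    line itself, and up to `after` lines following it.
    """
    recent = deque(maxlen=before)  # rolling window of lines seen before a match
    open_blocks = []               # [block, lines_still_wanted] pairs
    blocks = []
    for raw in lines:
        line = raw.rstrip("\n")
        # Feed this line to blocks that still need trailing context.
        for entry in open_blocks:
            entry[0].append(line)
            entry[1] -= 1
        open_blocks = [e for e in open_blocks if e[1] > 0]
        if keyword in line:
            block = list(recent) + [line]
            blocks.append(block)
            if after > 0:
                open_blocks.append([block, after])
        recent.append(line)
    return blocks


# Example use (the path is an assumption, adjust to your deployment):
# with open("pd.log", encoding="utf-8", errors="replace") as f:
#     for block in restart_context(f):
#         print("\n".join(block))
#         print("-" * 60)
```

The lines just before each "Welcome" are the last thing PD wrote before it went down, so that is where any abnormal entries would be expected. If the process was killed by the kernel OOM killer, PD itself may not have logged anything, and the system log would be the place to look.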

The Region count is fairly high, 2.3 million+.
The logs are all INFO or WARN level.

I queried the Region count with SQL at the time. Not sure whether that is related.

SELECT s.store_id, s.address, COUNT(DISTINCT r.REGION_ID)
FROM INFORMATION_SCHEMA.TIKV_REGION_STATUS AS r,
     INFORMATION_SCHEMA.TIKV_REGION_PEERS AS p,
     INFORMATION_SCHEMA.TIKV_STORE_STATUS AS s
WHERE r.REGION_ID = p.REGION_ID AND p.STORE_ID = s.STORE_ID