TiDB 6.5.0: PD leader memory leak grows to 28 GB

【TiDB Environment】Test / PoC, deployed on a 易捷行云 cloud environment
【TiDB Version】6.5.0
【Reproduction Path】A normal Kubernetes deployment of this version
【Problem Encountered: Symptoms and Impact】
After deployment, memory on the pd-server leader keeps climbing for no obvious reason. Once it reaches 28 GB, PD starts failing, the process gets killed automatically, a new leader is elected, and the new leader's memory then keeps climbing in the same way. The test workload currently running on the system is fairly light.

Going through the failed PD's logs turns up nothing particularly suspicious. The only suspicious point in the whole system is that IO is high, with the system iowait reaching 5, but this problem did not appear during earlier stress tests, when system IO was similarly high.

【Resource Configuration】64 GB memory, 32-thread CPU
【Attachments: Screenshots / Logs / Monitoring】

Check the IO monitoring metrics in Grafana, especially the PD-related panels.
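If Grafana isn't handy, a quick read on disk pressure can also be taken directly on the node; a rough sketch (device names and intervals are up to you):

# per-device utilization and await, refreshed every second
iostat -x 1
# overall CPU iowait and run queue
vmstat 1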

If the IO capacity isn't enough, running TiDB is going to be a struggle.

Have you benchmarked the disks with fio?
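For reference, a mixed random read/write fio run along these lines could be used; this is only a sketch, and the target file path, size, and runtime are placeholders to adjust (run it against the same disk that backs the PD data dir):

fio -ioengine=psync -bs=32k -fdatasync=1 -thread -rw=randrw -percentage_random=100 -size=10G -filename=/var/lib/pd/fio_test.dat -name=randrw_test -iodepth=4 -runtime=60 -numjobs=4 -group_reporting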

Large-scale writes from PD are definitely abnormal, and that has little to do with whether IO is high or not. The real question is what data PD is actually writing, and what triggers those writes.

You can capture a heap profile from PD to see what is consuming the memory:
curl http://xxx.xxx.xxx.xxx:2379/debug/pprof/heap?seconds=60 >pd_heap
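Once captured, the profile can be read with the Go pprof tool; a minimal sketch, assuming go is available on the analysis machine and pd_heap is the file saved above:

# top consumers by in-use memory
go tool pprof -top pd_heap
# or browse interactively in a web UI
go tool pprof -http=:8080 pd_heap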

Are you co-locating all the components on a single machine?

I see two related fixes:
Fix the issue that PD OOMs due to overly frequent ReportMinResolvedTS calls #5965
Fix the issue that the table ID cannot be pushed down when certain virtual tables are queried via Prepare/Execute, which causes PD to OOM when there are a large number of Regions #39605
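To gauge whether the second fix could apply here, you could check the Region scale of the cluster; a rough sketch against the PD HTTP API (address is a placeholder, and the exact endpoints may vary slightly by version):

curl http://<pd-addr>:2379/pd/api/v1/regions/count
# per-store region counts
curl http://<pd-addr>:2379/pd/api/v1/stores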

Do the "virtual tables" here refer to views, or to partitioned tables?

For TiDB deployed on K8s, are you using local disks? I see the PD data directory points to /var/lib/pd? How big is this cluster? Normally PD shouldn't have that much IO pressure.

I deployed just three nodes, with components co-located on each of them. All three nodes sit in the same hyper-converged cloud environment. It is a test environment and the write pressure is light; in earlier tests the write pressure was much higher and PD did not have this problem. The PD data directory points to /var/lib/pd because the cluster is deployed in K8s container mode.

Why are there so many pd-server processes? Did you deploy several clusters?

Is PD co-located with TiKV or TiDB?

Why are there so many PD nodes? Are they all in the same cluster?

These are the individual threads under the single PD process, hasn't anyone noticed? Several PD threads are writing heavily to disk. Take a look at the stacks of these threads:
Thread 8 (LWP 14156):
#0 runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:560
#1 0x0000000000decb96 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=14845859) at /usr/local/go/src/runtime/os_linux.go:69
#2 0x0000000000dc2de7 in runtime.notesleep (n=0xc000316148) at /usr/local/go/src/runtime/lock_futex.go:160
#3 0x0000000000df78ac in runtime.mPark () at /usr/local/go/src/runtime/proc.go:2247
#4 runtime.stopm () at /usr/local/go/src/runtime/proc.go:2247
#5 0x0000000000df8f48 in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at /usr/local/go/src/runtime/proc.go:2874
#6 0x0000000000df9d7e in runtime.schedule () at /usr/local/go/src/runtime/proc.go:3214
#7 0x0000000000dfa2ad in runtime.park_m (gp=0xc00096da00) at /usr/local/go/src/runtime/proc.go:3363
#8 0x0000000000e24663 in runtime.mcall () at /usr/local/go/src/runtime/asm_amd64.s:448
#9 0x0000000000000000 in ?? ()
Thread 7 (LWP 14155):
#0 runtime.futex () at /usr/local/go/src/runtime/sys_linux_amd64.s:560
#1 0x0000000000decb96 in runtime.futexsleep (addr=0xfffffffffffffe00, val=0, ns=14845859) at /usr/local/go/src/runtime/os_linux.go:69
#2 0x0000000000dc2de7 in runtime.notesleep (n=0xc000100548) at /usr/local/go/src/runtime/lock_futex.go:160
#3 0x0000000000df78ac in runtime.mPark () at /usr/local/go/src/runtime/proc.go:2247
#4 runtime.stopm () at /usr/local/go/src/runtime/proc.go:2247
#5 0x0000000000df8f48 in runtime.findRunnable (gp=<optimized out>, inheritTime=<optimized out>, tryWakeP=<optimized out>) at /usr/local/go/src/runtime/proc.go:2874
#6 0x0000000000df9d7e in runtime.schedule () at /usr/local/go/src/runtime/proc.go:3214
#7 0x0000000000dfa2ad in runtime.park_m (gp=0xc00096da00) at /usr/local/go/src/runtime/proc.go:3363
#8 0x0000000000e24663 in runtime.mcall () at /usr/local/go/src/runtime/asm_amd64.s:448
#9 0x0000000000000000 in ?? ()
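To confirm which threads are actually doing the writes, per-thread IO can be sampled and the LWP ids mapped back to thread names; a rough sketch, with <pid> and <tid> as placeholders for the pd-server process and thread ids:

# per-thread disk IO, one-second samples
pidstat -d -t -p <pid> 1
# map an LWP/TID back to its thread name
cat /proc/<pid>/task/<tid>/comm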

Check what tasks the TiDB cluster is currently running.
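For example, running sessions and background DDL can be checked from any tidb-server; a sketch using the MySQL client, with connection details as placeholders:

mysql -h <tidb-host> -P 4000 -u root -p -e "SELECT * FROM information_schema.cluster_processlist\G"
mysql -h <tidb-host> -P 4000 -u root -p -e "ADMIN SHOW DDL JOBS;"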

Have you collected the heap profile that the person above mentioned?
Also check the cluster monitoring and whether the other tidb-server / tikv-server nodes have any abnormal log messages,
and whether there is any unusual workload hitting the cluster.

The information provided so far isn't enough to pin down where the problem is; keep investigating.

Haven't looked at that yet. I'm currently going through the data dumped with gdb, and it's huge! Here are a few samples:
peer.mycluster
pu.svc.clu
vel":“info”,“log-file”:“”,“log-format”:“text”,“log-rotation-timespan”:“0s”,“log-rotation-size”:“300MiB”,“slow-log-file”:“”,“slow-log-threshold”:“1s”,“abort-on-panic”:false,“memory-usage-limit”:“5000MiB”,“memory-usage-high-water”:0.9,“log”:{“level”:“info”,“format”:“text”,“enable-timestamp”:true,“file”:{“filename”:“”,“max-size”:300,“max-days”:0,“max-backups”:0}},“quota”:{“foreground-cpu-time”:0,“foreground-write-bandwidth”:“0KiB”,“foreground-read-bandwidth”:“0KiB”,“max-delay-duration”:“500ms”,“background-cpu-time”:0,“background-write-bandwidth”:“0KiB”,“background-read-bandwidth”:“0KiB”,“enable-auto-tune”:false},“readpool”:{“unified”:{“min-thread-count”:1,“max-thread-count”:10,“stack-size”:“10MiB”,“max-tasks-per-worker”:2000,“auto-adjust-pool-size”:false},“storage”:{“use-unified-pool”:true,“high-concurrency”:8,“normal-concurrency”:8,“low-concurrency”:8,“max-tasks-per-worker-high”:2000,“max-tasks-per-worker-normal”:2000,“max-tasks-per-worker-low”:2000,“stack-size”:“10MiB”},“coprocessor”:{“use-unified-pool”:true,“high-concurrency”:12,“normal-concurrency”:12,“low-concurrency”:12,“max-tasks-per-worker-high”:2000,“max-tasks-per-worker-normal”:2000,“max-tasks-per-worker-low”:2000,“stack-size”:“10MiB”}},“server”:{“addr”:“0.0.0.0:20160”,“advertise-addr”:“basic-tikv-2.basic-tikv-peer.mycluster.svc:20160”,“status-addr”:“0.0.0.0:20180”,“advertise-status-addr”:“”,“status-thread-pool-size”:1,“max-grpc-send-msg-len”:10485760,“raft-client-grpc-send-msg-buffer”:524288,“raft-client-queue-size”:8192,“raft-msg-max-batch-size”:128,“grpc-compression-type”:“none”,“grpc-gzip-compression-level”:2,“grpc-min-message-size-to-compress”:4096,“grpc-concurrency”:5,“grpc-concurrent-stream”:1024,“grpc-raft-conn-num”:1,“grpc-memory-pool-quota”:“9223372036854775807B”,“grpc-stream-initial-window-size”:“2MiB”,“grpc-keepalive-time”:“10s”,“grpc-keepalive-timeout”:“3s”,“concurrent-send-snap-limit”:32,“concurrent-recv-snap-limit”:32,“end-point-recursion-limit”:1000,“end-point-stream-channel-size”:8,“end-point-batch-row-limit”:64,“end-point-stream-batch-row-limit”:128,“end-point-enable-batch-if-possible”:true,“end-point-request-max-handle-duration”:“1m”,“end-point-max-concurrency”:16,“end-point-perf-level”:0,“snap-max-write-bytes-per-sec”:“100MiB”,“snap-max-total-size”:“0KiB”,“stats-concurrency”:1,“heavy-load-threshold”:75,“heavy-load-wait-duration”:null,“enable-request-batch”:true,“background-thread-count”:2,“end-point-slow-log-threshold”:“1s”,“forward-max-connections-per-address”:4,“reject-messages-on-memory-ratio”:0.2,“simplify-metrics”:false,“labels”:{}},“storage”:{“data-dir”:“/var/lib/tikv”,“gc-ratio-threshold”:1.1,“max-key-size”:8192,“scheduler-concurrency”:524288,“scheduler-worker-pool-size”:8,“scheduler-pending-write-threshold”:“100MiB”,“reserve-space”:“0KiB”,“reserve-raft-space”:“1GiB”,“enable-async-apply-prewrite”:false,“api-version”:1,“enable-ttl”:false,“background-error-recovery-window”:“1h”,“ttl-check-poll-interval”:“12h”,“flow-control”:{“enable”:true,“soft-pending-compaction-bytes-limit”:“192GiB”,“hard-pending-compaction-bytes-limit”:“1TiB”,“memtables-threshold”:5,“l0-files-threshold”:20},“block-cache”:{“shared”:true,“capacity”:“3000MiB”,“num-shard-bits”:6,“strict-capacity-limit”:true,“high-pri-pool-ratio”:0.8,“memory-allocator”:“nodump”},“io-rate-limit”:{“max-bytes-per-sec”:“0KiB”,“mode”:“write-only”,“strict”:false,“foreground-read-priority”:“high”,“foreground-write-priority”:“high”,“flush-priority”:“high”,“level-zero-compaction-priority”:“medium”,“compaction-priority”:“low”,“replication-priority”:“high”,“load-balance-p
riority”:“high”,“gc-priority”:“high”,“import-priority”:“medium”,“export-priority”:“medium”,“other-priority”:“high”}},“pd”:{“endpoints”:[“http://basic-pd:2379”],“retry-interval”:“3
very”:10,"update

pdmonitor-0
stats-monitor
deadlock-0
refreash-config
purge-worker-0
backup-stream-0
sst-importer6
advance-ts
grpc-server-1
check_leader-0
grpc-server-2
sst-importer7
deadlock-detect
grpc-server-0
raft-stream-0
snap-handler-0
sst-importer
default-executo
raftstore-5-0
sst-importer4
sst-importer2
raftlog-fetch-w
rocksdb:low
region-collecto
inc-scanslogger
cleanup-worker-
backup-stream
apply-low-0
log-backup-scan
background-1
resource-meteri
gc-manager
snap-sender
grpc_global_tim
sst-importer5
rocksdb:high
re-metricstso
timer
pd-worker-0
sst-importer0
tikv-server
cdc-0
time updater
flow-checker

3/pd/7311282686139874110/raft/s/00000000000000000004
2basic-tikv-0.basic-tikv-peer.mycluster.svc:20160*
6.5.022basic-tikv-0.basic-tikv-peer.mycluster.svc:20160:
0.0.0.0:20180B(47b81680f75adc4b7200480cea5dbe46ae07c4b5H
3/pd/7311282686139874110/raft/s/00000000000000000001
2basic-tikv-1.basic-tikv-peer.mycluster.svc:20160*
6.5.022basic-tikv-1.basic-tikv-peer.mycluster.svc:20160:
0.0.0.0:20180B(47b81680f75adc4b7200480cea5dbe46ae07c4b5H
3/pd/7311282686139874110/raft/s/00000000000000000001
2basic-tikv-1.basic-tikv-peer.mycluster.svc:20160*
6.5.022basic-tikv-1.basic-tikv-peer.mycluster.svc:20160:
0.0.0.0:20180B(47b81680f75adc4b7200480cea5dbe46ae07c4b5H
3/pd/7311282686139874110/raft/s/00000000000000000001
2basic-tikv-1.basic-tikv-peer.mycluster.svc:20160*
6.5.022basic-tikv-1.basic-tikv-peer.mycluster.svc:20160:
0.0.0.0:20180B(47b81680f75adc4b7200480cea5dbe46ae07c4b5H
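Rather than scrolling through the raw dump, the core file could also be run through strings to count what dominates it; a sketch, assuming gcore is available in the PD container and <pid> is the pd-server process:

gcore -o pd_core <pid>
# most frequently repeated strings in the core
strings pd_core.<pid> | sort | uniq -c | sort -rn | head -n 50
# occurrences of one of the key prefixes seen above
strings pd_core.<pid> | grep -c "/pd/7311282686139874110/raft/s/"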

Check from the angle of PD's core responsibilities (TSO, global ID allocation, Region metadata, and so on).
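A few of these can be poked at directly with pd-ctl; a rough sketch, with the address as a placeholder (run it inside the PD pod or wherever pd-ctl is available; commands may differ slightly across versions):

pd-ctl -u http://<pd-addr>:2379 health
pd-ctl -u http://<pd-addr>:2379 member
pd-ctl -u http://<pd-addr>:2379 store
# decode a TSO value to verify TSO allocation looks sane
pd-ctl -u http://<pd-addr>:2379 tso <tso-value>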

How is the cluster issue going?
Since this is a K8s deployment, I'd also suggest asking the relevant colleagues to check whether the K8s cluster itself has problems. We've run into similar cases before where the underlying K8s turned out to be the cause, so it's worth verifying that in parallel.
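On the K8s side, the basics could be checked with kubectl; a sketch, with the namespace as a placeholder and the PD pod name only inferred from the naming seen in the dump above:

kubectl -n <namespace> get pods -o wide
kubectl -n <namespace> describe pod basic-pd-0
kubectl -n <namespace> get events --sort-by=.metadata.creationTimestamp
# requires metrics-server
kubectl -n <namespace> top pod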

Check whether you can find unusually large directories or files on the PD nodes.
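For example, inside the PD pod or on the host path backing it (paths are placeholders):

# largest directories under the PD data dir
du -x -h --max-depth=2 /var/lib/pd | sort -rh | head -n 20
# individual files over 1 GiB
find /var/lib/pd -xdev -type f -size +1G -exec ls -lh {} \;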