A TiDB cluster with 4 load-balanced tidb-server instances suddenly started hitting OOM and restarting one after another today

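[TiDB environment] Production
[TiDB version] v6.5.0
[Reproduction steps] What operations led to the problem
[Problem: symptoms and impact] The TiDB cluster has four tidb-server instances behind a load balancer. Today they suddenly began hitting OOM and restarting automatically, one after another.
1. How can I quickly identify which slow SQL statements are causing this? Currently the Dashboard does not show the slow SQL from the time of the incident.
2. I checked the oom folder on the tidb-server instances that hit OOM, and the SQL statements recorded there do not use much memory.
3. I checked the tidb-server logs and found an error: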


[Resource configuration] Go to TiDB Dashboard - Cluster Info - Hosts and take a screenshot of this page
[Attachments: screenshots/logs/monitoring]

The resource configuration is as follows:

Refer to this document to locate the slow SQL.


OK, I will take a look. Thanks.

Grafana gives more detail.

You can grep for "expensive_query" in tidb.log. That log entry records SQL statements that ran past the time threshold or used more memory than the memory threshold.
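
The grep suggested above can be sketched in Python as follows. This is a minimal sketch, assuming the entries carry the message as a bracketed `[expensive_query]` field; the function name `expensive_query_lines` is hypothetical. Matching the bracketed form avoids false hits on the startup config dump, which contains the setting name `tidb_expensive_query_time_threshold`. As far as I know these entries are written at WARN level, so an instance whose log level is set to `error` may not record them at all.

```python
from typing import Iterable, Iterator

# Assumption: the log message appears as a bracketed field, e.g.
# [2024/03/28 10:50:00.000 +08:00] [WARN] [expensivequery.go:118] [expensive_query] ...
MARKER = "[expensive_query]"


def expensive_query_lines(lines: Iterable[str]) -> Iterator[str]:
    """Yield log lines whose message field is expensive_query.

    Lines that merely mention a setting such as
    tidb_expensive_query_time_threshold do not match.
    """
    for line in lines:
        if MARKER in line:
            yield line.rstrip("\n")
```

To use it, open each instance's tidb.log with `open(path, encoding="utf-8", errors="replace")` and iterate over the result of `expensive_query_lines(f)`.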


2024-03-28 10:54:18 (UTC+08:00)

TiDB 10.3.8.193:4000

[printer.go:48] [“loaded config”] [config=“{"host":"0.0.0.0","advertise-address":"10.3.8.193","port":4000,"cors":"","store":"tikv","path":"10.3.8.244:2379,10.3.8.245:2379,10.3.8.246:2379","socket":"/tmp/tidb-4000.sock","lease":"45s","split-table":true,"token-limit":1000,"temp-dir":"/tmp/tidb","tmp-storage-path":"/acdata/tidb-memory-cache/1000_tidb/MC4wLjAuMDo0MDAwLzAuMC4wLjA6MTAwODA=/tmp-storage","tmp-storage-quota":-1,"server-version":"","version-comment":"","tidb-edition":"","tidb-release-version":"","log":{"level":"error","format":"text","disable-timestamp":null,"enable-timestamp":null,"disable-error-stack":null,"enable-error-stack":null,"file":{"filename":"/acdata/tidb-cluster/tidb-deploy/tidb-4000/log/tidb.log","max-size":300,"max-days":0,"max-backups":0},"slow-query-file":"/acdata/tidb-cluster/tidb-deploy/tidb-4000/log/tidb_slow_query.log","expensive-threshold":10000,"query-log-max-len":4096,"enable-slow-log":true,"slow-threshold":300,"record-plan-in-slow-log":1},"instance":{"tidb_general_log":false,"tidb_pprof_sql_cpu":false,"ddl_slow_threshold":300,"tidb_expensive_query_time_threshold":60,"tidb_enable_slow_log":true,"tidb_slow_log_threshold":300,"tidb_record_plan_in_slow_log":1,"tidb_check_mb4_value_in_utf8":true,"tidb_force_priority":"NO_PRIORITY","tidb_memory_usage_alarm_ratio":0.8,"tidb_enable_collect_execution_info":true,"plugin_dir":"/data/deploy/plugin","plugin_load":"","max_connections":0,"tidb_enable_ddl":true,"tidb_rc_read_check_ts":false},"security":{"skip-grant-table":false,"ssl-ca":"","ssl-cert":"","ssl-key":"","cluster-ssl-ca":"","cluster-ssl-cert":"","cluster-ssl-key":"","cluster-verify-cn":null,"session-token-signing-cert":"","session-token-signing-key":"","spilled-file-encryption-method":"plaintext","enable-sem":false,"auto-tls":false,"tls-version":"","rsa-key-size":4096,"secure-bootstrap":false,"auth-token-jwks":"","auth-token-refresh-interval":"1h0m0s","disconnect-on-expired-password":true},"status":{"status-host":"0.0.0.0","metrics-addr":"
","status-port":10080,"metrics-interval":15,"report-status":true,"record-db-qps":false,"grpc-keepalive-time":10,"grpc-keepalive-timeout":3,"grpc-concurrent-streams":1024,"grpc-initial-window-size":2097152,"grpc-max-send-msg-size":2147483647},"performance":{"max-procs":30,"max-memory":0,"server-memory-quota":0,"stats-lease":"3s","stmt-count-limit":5000,"pseudo-estimate-ratio":0.8,"bind-info-lease":"3s","txn-entry-size-limit":6291456,"txn-total-size-limit":4221225472,"tcp-keep-alive":true,"tcp-no-delay":true,"cross-join":true,"distinct-agg-push-down":false,"projection-push-down":false,"max-txn-ttl":3600000,"index-usage-sync-lease":"0s","plan-replayer-gc-lease":"10m","gogc":100,"enforce-mpp":false,"stats-load-concurrency":5,"stats-load-queue-size":1000,"analyze-partition-concurrency-quota":16,"enable-stats-cache-mem-quota":false,"committer-concurrency":128,"run-auto-analyze":true,"force-priority":"NO_PRIORITY","memory-usage-alarm-ratio":0.8,"enable-load-fmsketch":false},"prepared-plan-cache":{"enabled":true,"capacity":100,"memory-guard-ratio":0.1},"opentracing":{"enable":false,"rpc-metrics":false,"sampler":{"type":"const","param":1,"sampling-server-url":"","max-operations":0,"sampling-refresh-interval":0},"reporter":{"queue-size":0,"buffer-flush-interval":0,"log-spans":false,"local-agent-host-port":""}},"proxy-protocol":{"networks":"","header-timeout":5},"pd-client":{"pd-server-timeout":3},"tikv-client":{"grpc-connection-count":4,"grpc-keepalive-time":10,"grpc-keepalive-timeout":3,"grpc-compression-type":"none","commit-timeout":"41s","async-commit":{"keys-limit":256,"total-key-size-limit":4096,"safe-window":2000000000,"allowed-clock-drift":500000000},"max-batch-size":128,"overload-threshold":200,"max-batch-wait-time":0,"batch-wait-size":8,"enable-chunk-rpc":true,"region-cache-ttl":600,"store-limit":0,"store-liveness-timeout":"1s","copr-cache":{"capacity-mb":1000},"ttl-refreshed-txn-size":33554432,"resolve-lock-lite-threshold":16},"binlog":{"enable":false,"ignore-error"
:false,"write-timeout":"15s","binlog-socket":"","strategy":"range"},"compatible-kill-query":false,"pessimistic-txn":{"max-retry-count":256,"deadlock-history-capacity":10,"deadlock-history-collect-retryable":false,"pessimistic-auto-commit":false,"constraint-check-in-place-pessimistic":true},"max-index-length":3072,"index-limit":64,"table-column-count-limit":1017,"graceful-wait-before-shutdown":0,"alter-primary-key":false,"treat-old-version-utf8-as-utf8mb4":true,"enable-table-lock":false,"delay-clean-table-lock":0,"split-region-max-num":1000,"top-sql":{"receiver-address":""},"repair-mode":false,"repair-table-list":,"isolation-read":{"engines":["tikv","tiflash","tidb"]},"new_collations_enabled_on_first_bootstrap":true,"experimental":{"allow-expression-index":false},"skip-register-to-dashboard":false,"enable-telemetry":true,"labels":{},"enable-global-index":false,"deprecate-integer-display-length":false,"enable-enum-length-limit":true,"stores-refresh-interval":60,"enable-tcp4-only":false,"enable-forwarding":false,"max-ballast-object-size":0,"ballast-object-size":0,"transaction-summary":{"transaction-summary-capacity":500,"transaction-id-digest-min-duration":2147483647},"enable-global-kill":true,"enable-batch-dml":false,"mem-quota-query":1073741824,"oom-action":"cancel","oom-use-tmp-storage":true,"check-mb4-value-in-utf8":true,"enable-collect-execution-info":true,"plugin":{"dir":"/data/deploy/plugin","load":""},"max-server-connections":0,"run-ddl":true,"tidb-max-reuse-chunk":64,"tidb-max-reuse-column":256}”]
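
A dump this long is hard to read by eye, so here is a hedged sketch that pulls out a few memory-related settings. It uses a regular expression rather than a JSON parser because the text as pasted is not strictly valid JSON (for example, `"repair-table-list":` has no value). The function name `pick_settings` and the choice of keys are illustrative assumptions; the keys themselves all appear in the dump above.

```python
import re

# Keys relevant to memory diagnosis; all of them occur in the dump above.
KEYS = ("level", "mem-quota-query", "oom-action",
        "server-memory-quota", "tmp-storage-quota")

# A scalar JSON value: quoted string, number, or literal.
VALUE = r'("[^"]*"|-?\d+(?:\.\d+)?|true|false|null)'


def pick_settings(config_text: str, keys=KEYS) -> dict:
    """Return the first `"key":value` match for each key, as strings."""
    # In a raw log file the inner quotes may be backslash-escaped.
    text = config_text.replace('\\"', '"')
    found = {}
    for key in keys:
        m = re.search('"' + re.escape(key) + '":' + VALUE, text)
        if m:
            found[key] = m.group(1).strip('"')
    return found
```

Applied to the dump above, this should report a log level of `error`, `mem-quota-query` of 1073741824 bytes (1 GiB), and `oom-action` of `cancel`.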

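🤔 If the OOMs rotate across nodes, there is probably one large query. It fails on node A, which restarts; then it runs on node B, and so on. Note that a single query is never spread across multiple TiDB nodes; it executes on just one.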

This is just the configuration loaded at startup. Can you not find anything else?

https://docs.pingcap.com/zh/tidb/stable/release-6.5.4#错误修复

  • Fix the issue that an `index out of range` error might be reported when the STREAM_AGG() operator is pushed down #40857 @Dousir9

What is the minor version? You are not by chance running something older than 6.5.4?

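In the few minutes before the problem occurred, memory usage climbed rapidly:
[image]
I searched the logs of all tidb-server nodes for "expensive_query". The only matches were the startup configuration dump; no actual SQL statements turned up.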

The minor version is v6.5.0. Note, though, that this cluster was upgraded from v5.4.0 to v6.5.0.
[image]

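I followed the reference and changed the slow-log threshold from the default 300 ms to 1 s, which makes it easier to filter the slow SQL. Thanks.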

I did not find any other relevant information.

The connection count rose quite sharply. Something probably connected right at that moment.

https://docs.pingcap.com/zh/tidb/stable/top-sql

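If you cannot find the problematic SQL anywhere else, check what Top SQL shows, especially what the crashed tidb-server was executing just before it went down.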

It feels like a database audit feature is needed.

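Separate the AP and TP workloads. You will then see that the OOMs are all on the AP side, and you can go look for the top queries there.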

Take a look in Grafana at resource usage and the like.

Try upgrading to the latest 6.5 release. The log shows a panic, so this looks like a bug.

🤔 The screenshot you posted shows the connection count spiking, right? Could a large number of concurrent requests have come in?