【 TiDB 使用环境】生产环境
【 TiDB 版本】v5.2.1
【遇到的问题】TiDB单节点内存持续升高,最终oom,已定位到问题
【复现路径】无
【问题现象及影响】
TiDB单节点内存持续升高,周期性oom
通过内存分析,定位到一些问题,
RowSampleCollector
和AnalyzeColumnsExec
的内存消耗占比占80%以上,也就是说主要内存消耗在analyze统计信息的获取上面
(pprof) top
Showing nodes accounting for 14582.34MB, 89.52% of 16289.01MB total
Dropped 1250 nodes (cum <= 81.45MB)
Showing top 10 nodes out of 53
flat flat% sum% cum cum%
8852.69MB 54.35% 54.35% 8852.69MB 54.35% github.com/pingcap/tidb/statistics.(*RowSampleCollector).FromProto
4144.93MB 25.45% 79.79% 13286.68MB 81.57% github.com/pingcap/tidb/executor.(*AnalyzeColumnsExec).subMergeWorker
953.54MB 5.85% 85.65% 955.04MB 5.86% github.com/pingcap/tidb/store/copr.(*copIteratorWorker).handleCopResponse
289.07MB 1.77% 87.42% 289.07MB 1.77% github.com/pingcap/tidb/statistics.(*FMSketch).insertHashValue
185.85MB 1.14% 88.56% 185.85MB 1.14% github.com/pingcap/tidb/executor.getOneLine
84.57MB 0.52% 89.08% 84.57MB 0.52% bytes.makeSlice
60.04MB 0.37% 89.45% 153.48MB 0.94% github.com/pingcap/tidb/table/tables.(*TableCommon).UpdateRecord
4.50MB 0.028% 89.48% 383.46MB 2.35% github.com/pingcap/tidb/executor.(*Compiler).Compile
3.66MB 0.022% 89.50% 104.05MB 0.64% github.com/pingcap/tidb/executor.(*tableWorker).executeTask
3.50MB 0.021% 89.52% 100.01MB 0.61% github.com/pingcap/tidb/util/ranger.(*rangeDetacher).detachCNFCondAndBuildRangeForIndex
analyze相关参数SHOW VARIABLES LIKE '%analyze%'
tidb_analyze_version 2
tidb_auto_analyze_end_time 21:00 +0000
tidb_auto_analyze_ratio 0.5
tidb_auto_analyze_start_time 18:00 +0000
tidb_enable_fast_analyze OFF
TiDB日志有相关analyze报错,显示analyze操作时间超过GC life time,导致analyze操作失败,非
auto_analyze时间重试,导致内存泄漏
[2022/04/20 14:54:31.216 +08:00] [ERROR] [analyze.go:146] ["analyze failed"] [error="[tikv:9006]GC life time is shorter than transaction duration, transaction starts at 2022-04-20 14:17:31.474 +0800 CST, GC safe point is 2022-04-20 14:20:29.924 +0800 CST"]
[2022/04/20 14:54:31.216 +08:00] [INFO] [analyze.go:128] ["analyze table `db1`.`t1` has failed"] [partition=] ["job info"="auto analyze table"] ["start time"=2022/04/20 14:17:31.517 +08:00] ["end time"=2022/04/20 14:54:31.216 +08:00] [cost=36m59.698492441s]
[2022/04/20 14:54:31.216 +08:00] [INFO] [tidb.go:242] ["rollbackTxn called due to ddl/autocommit failure"]
[2022/04/20 14:54:31.216 +08:00] [WARN] [session.go:1683] ["run statement failed"] [schemaVersion=557] [error="[tikv:9006]GC life time is shorter than transaction duration, transaction starts at 2022-04-20 14:17:31.474 +0800 CST, GC safe point is 2022-04-20 14:20:29.924 +0800 CST"] [session="{\
\"currDBName\": \"\",\
\"id\": 0,\
\"status\": 2,\
\"strictMode\": true,\
\"user\": null\
}"]
[2022/04/20 14:54:31.216 +08:00] [ERROR] [update.go:1085] ["[stats] auto analyze failed"] [sql="analyze table %n.%nindex %n"] [cost_time=36m59.704161336s] [error="[tikv:9006]GC life time is shorter than transaction duration, transaction starts at 2022-04-20 14:17:31.474 +0800 CST, GC safe point is 2022-04-20 14:20:29.924 +0800 CST"]
GC相关参数SHOW VARIABLES LIKE '%gc%',tidb_gc_life_time已经调整为30min
tidb_gc_concurrency -1
tidb_gc_enable ON
tidb_gc_life_time 30m0s
tidb_gc_run_interval 10m0s
tidb_gc_scan_lock_mode LEGACY
查看analyze失败的表数据,SELECT * FROM INFORMATION_SCHEMA.
TABLES WHERE table_name='t1'
,数据行数不准确估计有4~5亿
TABLE_CATALOG TABLE_SCHEMA TABLE_NAME TABLE_TYPE ENGINE VERSION ROW_FORMAT TABLE_ROWS AVG_ROW_LENGTH DATA_LENGTH MAX_DATA_LENGTH INDEX_LENGTH DATA_FREE AUTO_INCREMENT CREATE_TIME UPDATE_TIME CHECK_TIME TABLE_COLLATION CHECKSUM CREATE_OPTIONS TABLE_COMMENT TIDB_TABLE_ID TIDB_ROW_ID_SHARDING_INFO TIDB_PK_TYPE
def db1 t1 BASE TABLE InnoDB 10 Compact 447487020 678 303842721774 0 193638835755 0 116096911662 1/4/2022 11:13:03 utf8mb4_bin 130 NOT_SHARDED(PK_IS_HANDLE) CLUSTERED
【附件】
pprof.tidb-server.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz (1.6 MB)