TiKV snap directory keeps growing in size

[TiDB Cluster Information]

  • Cluster version: v5.1.1
  • Number of TiKV nodes: 5

[Problem]

1. One TiKV node (192.168.1.17) has a snap directory under its data directory that takes up 150 GB.

2. I then adjusted the snap parameter according to the official documentation, but it made no difference at all. Normally files older than 4 hours are deleted automatically, yet the oldest files in the snap directory are from ten days ago!

3. The snap directories on the other TiKV nodes are empty, as they should be. Why does only this node fail to clean up its snap files automatically?

snap-gc-timeout

Adjusting this parameter controls the maximum retention time of snapshot files.
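For reference, a minimal sketch of where this parameter lives, assuming you edit the TiKV configuration file directly; it sits under [raftstore], and the default is already 4 hours:

# tikv.toml (sketch only)
[raftstore]
# snapshot files older than this become eligible for GC; the default is "4h"
snap-gc-timeout = "4h"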


I did adjust it. See my earlier screenshot: it is set to 4 hours!

Definition: the snap directory stores the snapshot data, from --> https://docs.pingcap.com/zh/tidb/v6.0/manage-cluster-faq#tidb-tikv-有哪些数据目录

Troubleshooting steps:

  1. Check the logs of the TiKV instance whose snap data keeps growing for anything abnormal.
  2. Check the monitoring panels for anomalies: abnormal scheduling, or a large number of queries hitting this node, either of which could produce many snapshots. Continuous growth is still not normal behavior, though; a bug or an improper operation is the more likely cause.

Possible directions:

  1. https://github.com/tikv/tikv/pull/10831 : check whether this is the problem described in that PR (a broken file in snap keeps interrupting GC of the snap directory, so the files in it keep piling up). You can search tikv.log for the key messages "region is disconnected remove snaps" or "handle gc snap failed" (a grep sketch follows this list).
  2. https://github.com/tikv/tikv/issues/11673 : or possibly this issue; check whether tikv.log contains compaction errors. The workaround for that problem is raftstore.use-delete-range = true.
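If it helps, a shell sketch for searching the log for the key messages above; the log path is an assumption, adjust it to your deployment:

# count occurrences of the messages related to tikv#10831 (snap GC being interrupted)
grep -c "region is disconnected, remove snaps" /home/tidb/deploy/log/tikv.log
grep -c "handle gc snap failed" /home/tidb/deploy/log/tikv.log
# for tikv#11673, look for compaction errors; the workaround there is raftstore.use-delete-range = true
grep -i compaction /home/tidb/deploy/log/tikv.log | grep -iE "error|fail" | head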

In short, you can investigate along the lines above yourself (more information is needed for a proper analysis), or share the TiKV logs, or the monitoring data collected by Clinic, in this thread and we can dig into it together.
How to use Clinic --> 【SOP 系列 22】TiDB 集群诊断信息收集 Clinic 使用指南&资料大全

1. Following your troubleshooting suggestions, I did find some abnormal log entries:

[2022/06/01 15:02:56.158 +08:00] [WARN] [store.rs:571] ["[store 1] handle timeout SnapGc"] [takes=2267]
[2022/06/06 06:56:31.338 +08:00] [INFO] [store.rs:2088] ["region is disconnected, remove snaps"] [snaps="[(SnapKey { region_id: 1338130143, term: 13, idx: 27 }, false)]"] [region_id=1338130143]
[2022/06/06 07:02:31.608 +08:00] [INFO] [store.rs:2088] ["region is disconnected, remove snaps"] [snaps="[(SnapKey { region_id: 1334158070, term: 17, idx: 8137 }, false)]"] [region_id=1334158070]
[2022/06/06 07:31:32.970 +08:00] [INFO] [store.rs:2088] ["region is disconnected, remove snaps"] [snaps="[(SnapKey { region_id: 1303170296, term: 20, idx: 2276 }, false)]"] [region_id=1303170296]
[2022/06/06 08:42:36.234 +08:00] [INFO] [store.rs:2088] ["region is disconnected, remove snaps"] [snaps="[(SnapKey { region_id: 1326577685, term: 13, idx: 995 }, false)]"] [region_id=1326577685]
[2022/06/06 08:52:36.741 +08:00] [INFO] [store.rs:2088] ["region is disconnected, remove snaps"] [snaps="[(SnapKey { region_id: 1340317839, term: 7, idx: 163 }, false)]"] [region_id=1340317839]
[2022/06/06 09:26:38.326 +08:00] [INFO] [store.rs:2088] ["region is disconnected, remove snaps"] [snaps="[(SnapKey { region_id: 1297309784, term: 18, idx: 709128 }, false)]"] [region_id=1297309784]
[2022/06/06 09:37:38.816 +08:00] [INFO] [store.rs:2088] ["region is disconnected, remove snaps"] [snaps="[(SnapKey { region_id: 1326162439, term: 144, idx: 36753954 }, false)]"] [region_id=1326162439]
[2022/06/06 10:25:40.982 +08:00] [INFO] [store.rs:2088] ["region is disconnected, remove snaps"] [snaps="[(SnapKey { region_id: 1338105051, term: 12, idx: 1941 }, false)]"] [region_id=1338105051]
[2022/06/06 10:35:41.455 +08:00] [INFO] [store.rs:2088] ["region is disconnected, remove snaps"] [snaps="[(SnapKey { region_id: 1338179137, term: 11, idx: 665511 }, false)]"] [region_id=1338179137]
[2022/06/06 11:06:42.871 +08:00] [INFO] [store.rs:2088] ["region is disconnected, remove snaps"] [snaps="[(SnapKey { region_id: 1310646377, term: 13, idx: 373813 }, false)]"] [region_id=1310646377]
[2022/06/06 11:16:43.443 +08:00] [INFO] [store.rs:2088] ["region is disconnected, remove snaps"] [snaps="[(SnapKey { region_id: 1335233054, term: 9, idx: 1458 }, false)]"] [region_id=1335233054]
[2022/06/06 11:46:45.517 +08:00] [INFO] [store.rs:2088] ["region is disconnected, remove snaps"] [snaps="[(SnapKey { region_id: 1338192845, term: 10, idx: 14826 }, false)]"] [region_id=1338192845]
[2022/06/06 12:06:47.263 +08:00] [INFO] [store.rs:2088] ["region is disconnected, remove snaps"] [snaps="[(SnapKey { region_id: 1340736006, term: 48, idx: 778499 }, false)]"] [region_id=1340736006]
[2022/06/06 12:17:47.751 +08:00] [INFO] [store.rs:2088] ["region is disconnected, remove snaps"] [snaps="[(SnapKey { region_id: 1340019677, term: 11, idx: 292 }, false)]"] [region_id=1340019677]
[2022/06/06 12:26:48.144 +08:00] [INFO] [store.rs:2088] ["region is disconnected, remove snaps"] [snaps="[(SnapKey { region_id: 1335858837, term: 9, idx: 69 }, false)]"] [region_id=1335858837]
[2022/06/06 12:34:48.500 +08:00] [INFO] [store.rs:2088] ["region is disconnected, remove snaps"] [snaps="[(SnapKey { region_id: 1341432610, term: 39, idx: 817762 }, false)]"] [region_id=1341432610]
[2022/06/06 13:06:49.949 +08:00] [INFO] [store.rs:2088] ["region is disconnected, remove snaps"] [snaps="[(SnapKey { region_id: 1320985794, term: 11, idx: 2730 }, false)]"] [region_id=1320985794]
[2022/06/06 13:14:50.307 +08:00] [INFO] [store.rs:2088] ["region is disconnected, remove snaps"] [snaps="[(SnapKey { region_id: 1337638098, term: 15, idx: 3668 }, false)]"] [region_id=1337638098]
[2022/06/06 13:38:51.334 +08:00] [INFO] [store.rs:2088] ["region is disconnected, remove snaps"] [snaps="[(SnapKey { region_id: 1315722051, term: 17, idx: 676750 }, false)]"] [region_id=1315722051]
[2022/06/06 15:03:55.251 +08:00] [INFO] [store.rs:2088] ["region is disconnected, remove snaps"] [snaps="[(SnapKey { region_id: 1340279880, term: 7, idx: 13 }, false)]"] [region_id=1340279880]
[2022/06/06 15:23:56.142 +08:00] [INFO] [store.rs:2088] ["region is disconnected, remove snaps"] [snaps="[(SnapKey { region_id: 1330724438, term: 7, idx: 21485 }, false)]"] [region_id=1330724438]
[2022/06/06 15:33:56.621 +08:00] [INFO] [store.rs:2088] ["region is disconnected, remove snaps"] [snaps="[(SnapKey { region_id: 1341718152, term: 6, idx: 1617 }, false)]"] [region_id=1341718152]

2. The monitoring graphs show that this node does receive a large number of queries.

[Questions]
1. What does the region is disconnected message mean, and is there a way to fix it?
2. This node receives a large number of queries; is that because the hot Region leaders being queried all happen to be on it?

Could it be that one TiKV server has a problem and its data has fallen far behind the other nodes, so a lot of snapshots are being kept?

But the snapshots are more than ten days old; they should be useless by now and ought to have been GC'd, right?

  1. region is disconnected means a GC message was sent to the local Region peer but the send failed, from → tikv/components/raftstore/src/store/fsm/store.rs at 80ce9341cd9346bf5a956effd15ccf77deb96a0a · tikv/tikv · GitHub
  2. Judging from store read rate bytes, there is indeed a hotspot: the hot Region leaders being queried happen to be on this node. So far, though, I can't see a connection between the hotspot and GC failing to release the snap files (a pd-ctl sketch to double-check the hotspot follows this list).
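If you want to confirm the read hotspot from the command line, a hedged sketch using pd-ctl; the PD address is a placeholder:

# show the current hot-read Regions and which stores carry them (v5.1.1)
tiup ctl:v5.1.1 pd -u http://<pd-address>:2379 hot read
# per-store overview (leader/region counts), to see how skewed store 1 is
tiup ctl:v5.1.1 pd -u http://<pd-address>:2379 store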

Could you share that tikv.log? I'd like to see whether there are any other clues.

Sure, please take a look:
tikv.log (7.1 MB)

Check which Region IDs the un-GC'd files in the snap directory belong to, so we can look up the corresponding entries in the log.

For example, the prefix 44274 below means the Region ID is 44274; focus on the files from 10 days ago (a shell sketch follows the listing below).

-rw-r--r-- 1 root root 17463947 Aug  1 10:55 rev_44274_1976_10018545_default.sst
-rw-r--r-- 1 root root     3824 Aug  1 10:55 rev_44274_1976_10018545_lock.sst
-rw-r--r-- 1 root root       59 Aug  1 10:55 rev_44274_1976_10018545.meta
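A shell sketch for pulling the Region IDs out of the old file names; the snap path is an assumption, and file names look like rev_<region>_<term>_<idx>_*.sst (or gen_...), so the second field is the Region ID:

SNAP_DIR=/home/tidb/deploy/data/snap        # assumption, adjust to your data dir
# distinct Region IDs of snap files older than 10 days
find "$SNAP_DIR" -maxdepth 1 -type f -mtime +10 -printf '%f\n' \
  | sed -E 's/^(gen|rev)_([0-9]+)_.*/\2/' | sort -u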

Also, so far I haven't found anything abnormal in the log. Please also break down where the space is actually going, i.e. how much is taken by files from within the last 4 hours versus files older than 4 hours.
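A quick way to get that split, as a sketch (same snap directory assumption as above; 240 minutes is 4 hours):

SNAP_DIR=/home/tidb/deploy/data/snap        # assumption
# space taken by files modified within the last 4 hours
find "$SNAP_DIR" -maxdepth 1 -type f -mmin -240 -printf '%s\n' | awk '{s+=$1} END {printf "%.1f MB within 4h\n", s/1024/1024}'
# space taken by files older than 4 hours
find "$SNAP_DIR" -maxdepth 1 -type f -mmin +240 -printf '%s\n' | awk '{s+=$1} END {printf "%.1f GB older than 4h\n", s/1024/1024/1024}'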

In addition, please grab the panels under TiKV-Details --> Thread CPU and check whether store-1 carries more load than the other stores; that will tell us whether the large number of snap files is being generated by requests.

1. I randomly picked a few snap files generated more than 4 hours ago and searched the log by their Region IDs. I can only find the entries from when the files were created; there are no GC entries. Below is the log for one of the Region IDs:

[2022/06/07 09:30:10.512 +08:00] [INFO] [peer.rs:1518] ["target peer id is larger, destroying self"] [target_peer="id: 1342411149 store_id: 1 role: Learner"] [peer_id=1342201973] [region_id=1285918525]
[2022/06/07 09:30:10.513 +08:00] [INFO] [apply.rs:3251] ["remove delegate from apply delegates"] [peer_id=1342201973] [region_id=1285918525]
[2022/06/07 09:30:10.513 +08:00] [INFO] [router.rs:300] ["[region 1285918525] shutdown mailbox"]
[2022/06/07 09:30:10.513 +08:00] [INFO] [peer.rs:1934] ["starts destroy"] [merged_by_target=false] [peer_id=1342201973] [region_id=1285918525]
[2022/06/07 09:30:10.513 +08:00] [INFO] [peer.rs:804] ["begin to destroy"] [peer_id=1342201973] [region_id=1285918525]
[2022/06/07 09:30:10.513 +08:00] [INFO] [peer_storage.rs:1677] ["finish clear peer meta"] [takes=467.427µs] [raft_key=1] [apply_key=1] [meta_key=1] [region_id=1285918525]
[2022/06/07 09:30:10.523 +08:00] [INFO] [peer.rs:844] ["peer destroy itself"] [takes=10.068683ms] [peer_id=1342201973] [region_id=1285918525]
[2022/06/07 09:30:10.523 +08:00] [INFO] [router.rs:300] ["[region 1285918525] shutdown mailbox"]
[2022/06/07 09:30:12.514 +08:00] [INFO] [peer.rs:250] ["replicate peer"] [peer_id=1342411149] [region_id=1285918525]
[2022/06/07 09:30:12.514 +08:00] [INFO] [raft.rs:2605] ["switched to configuration"] [config="Configuration { voters: Configuration { incoming: Configuration { voters: {} }, outgoing: Configuration { voters: {} } }, learners: {}, learners_next: {}, auto_leave: false }"] [raft_id=1342411149] [region_id=1285918525]
[2022/06/07 09:30:12.514 +08:00] [INFO] [raft.rs:1088] ["became follower at term 0"] [term=0] [raft_id=1342411149] [region_id=1285918525]
[2022/06/07 09:30:12.514 +08:00] [INFO] [raft.rs:389] [newRaft] [peers="Configuration { incoming: Configuration { voters: {} }, outgoing: Configuration { voters: {} } }"] ["last term"=0] ["last index"=0] [applied=0] [commit=0] [term=0] [raft_id=1342411149] [region_id=1285918525]
[2022/06/07 09:30:12.514 +08:00] [INFO] [raw_node.rs:315] ["RawNode created with id 1342411149."] [id=1342411149] [raft_id=1342411149] [region_id=1285918525]
[2022/06/07 09:30:12.514 +08:00] [INFO] [raft.rs:1332] ["received a message with higher term from 1317648823"] ["msg type"=MsgHeartbeat] [message_term=14] [term=0] [from=1317648823] [raft_id=1342411149] [region_id=1285918525]
[2022/06/07 09:30:12.514 +08:00] [INFO] [raft.rs:1088] ["became follower at term 14"] [term=14] [raft_id=1342411149] [region_id=1285918525]
[2022/06/07 09:30:14.673 +08:00] [INFO] [snap.rs:247] ["saving snapshot file"] [file=/home/tidb/deploy/data/snap/rev_1285918525_14_905_(default|lock|write).sst] [snap_key=1285918525_14_905]
[2022/06/07 09:30:14.750 +08:00] [INFO] [raft_log.rs:614] ["log [committed=0, persisted=0, applied=0, unstable.offset=1, unstable.entries.len()=0] starts to restore snapshot [index: 905, term: 14]"] [snapshot_term=14] [snapshot_index=905] [log="committed=0, persisted=0, applied=0, unstable.offset=1, unstable.entries.len()=0"] [raft_id=1342411149] [region_id=1285918525]
[2022/06/07 09:30:14.750 +08:00] [INFO] [raft.rs:2605] ["switched to configuration"] [config="Configuration { voters: Configuration { incoming: Configuration { voters: {1339756704, 1340719366, 1317648823} }, outgoing: Configuration { voters: {} } }, learners: {1342411149}, learners_next: {}, auto_leave: false }"] [raft_id=1342411149] [region_id=1285918525]
[2022/06/07 09:30:14.750 +08:00] [INFO] [raft.rs:2589] ["restored snapshot"] [snapshot_term=14] [snapshot_index=905] [last_term=14] [last_index=905] [commit=905] [raft_id=1342411149] [region_id=1285918525]
[2022/06/07 09:30:14.750 +08:00] [INFO] [raft.rs:2470] ["[commit: 905, term: 14] restored snapshot [index: 905, term: 14]"] [snapshot_term=14] [snapshot_index=905] [commit=905] [term=14] [raft_id=1342411149] [region_id=1285918525]
[2022/06/07 09:30:14.750 +08:00] [INFO] [peer_storage.rs:1262] ["begin to apply snapshot"] [peer_id=1342411149] [region_id=1285918525]
[2022/06/07 09:30:14.750 +08:00] [INFO] [peer_storage.rs:1305] ["apply snapshot with state ok"] [state="applied_index: 905 truncated_state { index: 905 term: 14 }"] [region="id: 1285918525 start_key: 7480000000000001FFDA5F698000000000FF000003016C697374FF5265636FFF72642DFF6465746169FF6C2CFF533839363437FF35FF33000000000000F9FF0380000000000000FF030419A924000000FF0000000000000000F9 end_key: 7480000000000001FFDA5F698000000000FF000003016C697374FF5265636FFF72642DFF6465746169FF6C2CFF533839373036FF38FF33000000000000F9FF0380000000000000FF020419A936000000FF0000000000000000F9 region_epoch { conf_ver: 2794 version: 1300 } peers { id: 1317648823 store_id: 1005346961 } peers { id: 1339756704 store_id: 1031573974 } peers { id: 1340719366 store_id: 1006624325 } peers { id: 1342411149 store_id: 1 role: Learner }"] [peer_id=1342411149] [region_id=1285918525]
[2022/06/07 09:30:14.752 +08:00] [INFO] [peer.rs:3024] ["snapshot is applied"] [region="id: 1285918525 start_key: 7480000000000001FFDA5F698000000000FF000003016C697374FF5265636FFF72642DFF6465746169FF6C2CFF533839363437FF35FF33000000000000F9FF0380000000000000FF030419A924000000FF0000000000000000F9 end_key: 7480000000000001FFDA5F698000000000FF000003016C697374FF5265636FFF72642DFF6465746169FF6C2CFF533839373036FF38FF33000000000000F9FF0380000000000000FF020419A936000000FF0000000000000000F9 region_epoch { conf_ver: 2794 version: 1300 } peers { id: 1317648823 store_id: 1005346961 } peers { id: 1339756704 store_id: 1031573974 } peers { id: 1340719366 store_id: 1006624325 } peers { id: 1342411149 store_id: 1 role: Learner }"] [peer_id=1342411149] [region_id=1285918525]
[2022/06/07 09:30:45.945 +08:00] [INFO] [peer.rs:799] ["deleting applied snap file"] [snap_file=1285918525_14_902] [peer_id=1342411149] [region_id=1285918525]

2. Snap files from within the last 4 hours take up roughly 200 MB; the older files go back to May 19, and the whole snap directory takes up a bit over 200 GB.

3. The monitoring graph is below; the green line is store-1.

Could you collect the cluster information with Clinic? I don't have a good lead for now.

Can I just delete the expired files manually, say anything older than 7 days? Those snapshots should be useless by now, right?

Some workarounds like that do work, but it is best to understand the current problem before acting; these are internal database files, and a lot of the behavior is hard to predict.
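If you do go the manual route after confirming the problem, a cautious sketch; the path and the 7-day cutoff are assumptions, and deleting files under a TiKV data directory is at your own risk:

SNAP_DIR=/home/tidb/deploy/data/snap        # assumption
# dry run first: only list snap files older than 7 days
find "$SNAP_DIR" -maxdepth 1 -type f -mtime +7 -ls
# only after confirming the list (and ideally with the TiKV instance stopped), uncomment to delete:
# find "$SNAP_DIR" -maxdepth 1 -type f -mtime +7 -delete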

Check whether the TiKV instance on the machine with the large snap directory is frequently hitting OM. I once ran into a TiKV with frequent OM, and the repeated OMs caused the snap directory to take up a lot of space. A large snap directory usually means PD has scheduled a lot of Regions onto this TiKV and those migrations haven't finished, right?

1. By OM do you mean OOM, i.e. the process being killed by the system? The memory monitoring looks fine.

2. This node does have more Regions than the other nodes. How can I check whether Regions are being migrated?

If you can't provide the Clinic data, then purely from the angle of getting the problem resolved:

  1. Try whether a restart works around it: first evict the Region leaders from the TiKV on store-1, then restart that node and see whether the data under snap gets cleaned up, and finally remove the evict scheduler (a sketch follows this list).
  2. Or, more crudely, scale the node in first and then scale it back out; you will need to assess the business impact of that.
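A sketch of option 1; the PD address, cluster name, and TiKV port are placeholders/assumptions:

# 1. evict all Region leaders from store 1 (the problem TiKV)
tiup ctl:v5.1.1 pd -u http://<pd-address>:2379 scheduler add evict-leader-scheduler 1
# 2. restart only that TiKV instance, then watch whether the snap directory shrinks
tiup cluster restart <cluster-name> -N 192.168.1.17:20160
# 3. remove the evict-leader scheduler afterwards
tiup ctl:v5.1.1 pd -u http://<pd-address>:2379 scheduler remove evict-leader-scheduler-1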
