Paper Reading Preview | Efficient Query Processing with Optimistically Compressed Hash Tables & Strings in the USSR

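Paper Reading is an event series in which members of the TiDB community share what they have learned from studying papers on databases, distributed systems, and related fields.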
At 19:00 on September 7, 2021, Xu Huaiyu, an R&D engineer on the TiDB compute engine, will present the paper "Efficient Query Processing with Optimistically Compressed Hash Tables & Strings in the USSR" on Zoom. Everyone is welcome to sign up.

Scan the QR code in WeChat -> tap to register -> once registered, join the discussion group

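The three optimization ideas presented in the paper are quite simple once summarized. We hope the audience will take inspiration from them and try similar techniques in their own engineering systems.
The paper received the ICDE 2020 best paper award.
Hash tables are among the most common data structures in query engines. The paper's main contribution is three orthogonal hash table compression techniques that improve hash table access performance: Domain-Guided Prefix Suppression, Optimistic Splitting, and the Unique Strings Self-aligned Region (USSR).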
Modern query engines rely heavily on hash tables for query processing. Overall query performance and memory footprint is often determined by how hash tables and the tuples within them are represented. In this work, we propose three complementary techniques to improve this representation: Domain-Guided Prefix Suppression bit-packs keys and values tightly to reduce hash table record width. Optimistic Splitting decomposes values (and operations on them) into (operations on) frequently-accessed and infrequently-accessed value slices. By removing the infrequently-accessed value slices from the hash table record, it improves cache locality. The Unique Strings Self-aligned Region (USSR) accelerates handling frequently-occurring strings, which are very common in real-world data sets, by creating an on-the-fly dictionary of the most frequent strings. This allows executing many string operations with integer logic and reduces memory pressure. We integrated these techniques into Vectorwise. On the TPC-H benchmark, our approach reduces peak memory consumption by 2–4× and improves performance by up to 1.5×. On a real-world BI workload, we measured a 2× improvement in performance and in micro-benchmarks we observed speedups of up to 25×.
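To make the first idea more concrete, here is a minimal Python sketch of the bit-packing behind Domain-Guided Prefix Suppression. It is an illustration only, not the paper's or Vectorwise's implementation: the names `ColumnDomain`, `pack`, and `unpack` are hypothetical, and the sketch assumes the minimum and maximum of each integer column are already known (for example from statistics). Each value is stored as an offset from its column minimum, using only as many bits as the column's range requires.

```python
from dataclasses import dataclass


@dataclass
class ColumnDomain:
    """Assumed-known [lo, hi] range of one integer column (hypothetical helper)."""
    lo: int
    hi: int

    @property
    def bits(self) -> int:
        # Bits needed for any value in the domain once lo is subtracted.
        return max(1, (self.hi - self.lo).bit_length())


def pack(row, domains):
    """Pack one row of integers into a single integer, column 0 in the lowest bits."""
    packed, shift = 0, 0
    for value, dom in zip(row, domains):
        if not dom.lo <= value <= dom.hi:
            raise ValueError(f"{value} outside domain [{dom.lo}, {dom.hi}]")
        # Subtracting lo removes the shared prefix; only the offset is stored.
        packed |= (value - dom.lo) << shift
        shift += dom.bits
    return packed


def unpack(packed, domains):
    """Reverse of pack: recover the original column values."""
    row = []
    for dom in domains:
        mask = (1 << dom.bits) - 1
        row.append((packed & mask) + dom.lo)
        packed >>= dom.bits
    return row


# Three columns that would occupy 3 x 64 bits if stored as plain 64-bit integers.
domains = [
    ColumnDomain(1992, 1998),            # range 6   -> 3 bits
    ColumnDomain(0, 24),                 # range 24  -> 5 bits
    ColumnDomain(1_000_000, 1_000_255),  # range 255 -> 8 bits
]
row = [1995, 17, 1_000_200]
key = pack(row, domains)
total_bits = sum(d.bits for d in domains)
```

In this example the three columns fit in 16 bits in total, so the whole key can be hashed and compared as one small integer. A real engine would also need to handle values that fall outside the assumed domain, which this sketch simply rejects.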

Paper link: