Paper title

Daisy Bloom Filters

Authors

Ioana O. Bercea, Jakob Bæk Tejs Houen, Rasmus Pagh

Abstract

A filter is a widely used data structure for storing an approximation of a given set $S$ of elements from some universe $U$ (a countable set). It represents a superset $S' \supseteq S$ that is "close to $S$" in the sense that for $x \not\in S$, the probability that $x \in S'$ is bounded by some $\varepsilon > 0$. The advantage of using a Bloom filter, when some false positives are acceptable, is that the space usage becomes smaller than what is required to store $S$ exactly. Though filters are well understood from a worst-case perspective, it is clear that state-of-the-art constructions may not be close to optimal for particular distributions of data and queries. Suppose, for instance, that some elements are in $S$ with probability close to 1. Then it would make sense to always include them in $S'$, saving space by not having to represent these elements in the filter. Questions like this have been raised in the context of Weighted Bloom filters (Bruck, Gao and Jiang, ISIT 2006) and Bloom filter implementations that make use of access to learned components (Vaidya, Knorr, Mitzenmacher, and Kraska, ICLR 2021). In this paper, we present a lower bound for the expected space that such a filter requires. We also show that the lower bound is asymptotically tight by exhibiting a filter construction that executes queries and insertions in worst-case constant time, and has a false positive rate at most $\varepsilon$ with high probability over input sets drawn from a product distribution. We also present a Bloom filter alternative, which we call the $\textit{Daisy Bloom filter}$, that executes operations faster and uses significantly less space than the standard Bloom filter.
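
To make the filter guarantee in the abstract concrete, here is a minimal Python sketch of a standard Bloom filter, the baseline the abstract compares against. It is not the Daisy Bloom filter construction from the paper. The class name `BloomFilter`, the helper `false_positive_rate`, the textbook sizing formulas, and the SHA-256 double-hashing scheme are all assumptions chosen for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Standard Bloom filter (illustrative sketch, not the paper's construction)."""

    def __init__(self, n, eps):
        # Textbook sizing (an assumption of this sketch):
        # m = -n ln(eps) / (ln 2)^2 bits and k = (m / n) ln 2 hash functions.
        self.m = max(1, math.ceil(-n * math.log(eps) / (math.log(2) ** 2)))
        self.k = max(1, round(self.m / n * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        # Set every bit position associated with the item.
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        # True for every inserted item (S' is a superset of S);
        # may also be True for a small fraction of items outside S.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))


def false_positive_rate(bf, non_members):
    """Empirical fraction of non-members that the filter reports as present."""
    non_members = list(non_members)
    return sum(x in bf for x in non_members) / len(non_members)


# Usage: store S = {0, ..., 999} with a target false positive rate of 1%.
S = range(1000)
bf = BloomFilter(n=1000, eps=0.01)
for x in S:
    bf.add(x)
fp = false_positive_rate(bf, range(1000, 11000))
```

Every element of `S` is reported as present, while `fp` estimates how often an element outside `S` is wrongly accepted. This sketch spends the same number of bits on every element regardless of how likely it is to be in the set, which is the inefficiency the abstract's distribution-aware filters aim to avoid.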
