Cuckoo Filter

前言

逛GitHub的时候看到了这个repo，里面提到了Cuckoo Filter，没听说过，学习一下，顺便做个笔记持久化。

简介

实际上源于2014年CMU的一篇论文：Cuckoo Filter: Practically Better Than Bloom，听名字就知道，和布隆过滤器类似。

什么是Cuckoo Filter

论文摘要：

In many networking systems, Bloom filters are used for highspeed set membership tests. They permit a small fraction of false positive answers with very good space efficiency. However, they do not permit deletion of items from the set, and previous attempts to extend “standard” Bloom filters to support deletion all degrade either space or performance.

We propose a new data structure called the cuckoo filter that can replace Bloom filters for approximate set membership tests. Cuckoo filters support adding and removing items dynamically while achieving even higher performance than Bloom filters. For applications that store many items and target moderately low false positive rates, cuckoo filters have lower space overhead than space-optimized Bloom filters. Our experimental results also show that cuckoo filters outperform previous data structures that extend Bloom filters to support deletions substantially in both time and space.

简单来说就是解决了Bloom Filter不能删除，或者删除影响空间或时间上的性能的问题。Cuckoo Filter在解决这一问题的同时，还能够表现出更好的性能。（吹是这么吹的）

在合适的参数配置下，空间利用率可以达到95%。

为什么叫Cuckoo Filter

Cuckoo 的意思是 布谷鸟 ，取这个名字其实跟他的算法实现有关。

看一张论文里的图：

Cuckoo Hashing

可以看到每个item会有两个 candidate buckets ，如果有任意一个是空的，直接插入即可；如果都已经被占了，则会任意挑选出一个bucket，把其中的元素移出来，再将这个 victim 插入到其他可供选择的地方。

这个过程会一直重复到能找到一个空的bucket，或者替换次数达到一个阈值（原文为500次）。

尽管Cuckoo Hashing可能会引起一系列的替换，但是这个操作的均摊时间复杂度是O(1)。

回到名字上来，这个过程其实和布谷鸟下蛋很像。

布谷鸟会把蛋下到别的鸟的巢里，而布谷鸟的幼鸟比别的鸟出生的早，于是会将别的还没出生的鸟蛋挤出巢外，来保证自己今后能独享食物。

懂了之后就会感觉这个命名确实很生动，也让人不禁想到Kafka中一系列有趣的命名。计算机科学家还是很有意思的。

另外，(c) 图中展示的是优化过后的hash table，可以看到一个bucket中有4个entry，很大程度上减少了冲突几率。

优化

如果只根据标准的Cuckoo hashing方法计算冲突时可选的空间，（原文中说）一个很容易想到的方法是：

store each inserted item in its entirety (perhaps externally to the table); given the original item (“key”), calculating its alternate location is easy

我的理解就是，对于每一个插入的item，都维护一个变量（可以看作是这个item的entirety），这个变量记录了这个item所有可能的index值，然后再对key进行计算。重点在于维护这个变量需要的在hash table之外的额外空间，会产生比较大的开销。

为了达到更好的空间上的效率，减少hash table以外的大小，提出了下面的优化。

partial-key cuckoo hashing

每一个item都根据一个常数大小的 fingerprint 进行hash。

为什么这个fingerprint也需要额外空间，刚才的entirety也需要额外空间，就算成优化了呢？其实很简单，因为fingerprint相较于entirety需要的空间开销更小，是常数级别的。并且在后面可以看到，对于大部分数据，4~5bit大小的fingerprint就能够达到非常好的空间占用率。而如果采用entirety，在下面的算法实现中可以看到，对于每一个bucket计算出备选bucket的index时，还需要回溯(retrieve)出该bucket的值，而采用fingerprint就可以避免这一操作。

算法实现

主要是三部分：插入、查询和删除。

插入

先看下原文的伪代码实现：

insert-伪代码

需要注意的是，计算i2的时候用到了i1和hash(f)的异或，这是很方便也很有必要的，因为这样一来，就可以保证i1也可以用同样的方式通过i2和hash(f)计算得出。所以，当我们需要计算出当前bucket的备选bucket时，只需要根据当前bucket的下标i以及存在bucket中的fingerprint得到备选目标j，即：

$$j = i\oplus hash(fingerprint)$$

这样一来，每次插入实际上只用到了表内的信息，不需要追溯到原来的item。

并且可以看到，在每次异或运算前对fingerprint进行hash而不是之后，目的也是为了减少冲突，否则如果先异或后hash，很容易落到临近的bucket中。

此外，如果两个或多个item具有相同的fingerprint，也是可以的，但有一个上限 2b ，b为bucket的大小，此时这些重复的item的两个bucket将会变得超载。

循环中 MaxNumKicks 其实就是最大重试次数，比如500次，如果超过这个次数还没有成功插入，就可以认为这个table满了。

并且也可以看到，通过 partial-key 的优化，现在每次冲突时计算备选空间地址，无论冲突了多少次，只需要根据fingerprint，还是很方便的。

查询

伪代码：

lookup-伪代码

删除

伪代码：

delete-伪代码

删除操作思路很简单，判断一下两个桶有没有相符的fingerprint，有就删除。这里的删除也不是真的把数据删除，而只是简单的标记一下。

效果

简单的看下图：

bucket size为4和8情况下 load factor和fingerprint size关系图

总结

这里就援引论文中的Conclusion：

Cuckoo filters are a new data structure for approximate set membership queries that can be used for many networking problems formerly solved using Bloom filters. Cuckoo filters improve upon Bloom filters in three ways: (1) support for deleting items dynamically; (2) better lookup performance; and (3) better space efficiency for applications requiring low false positive rates (< 3%). A cuckoo filter stores the fingerprints of a set of items based on cuckoo hashing, thus achieving high space occupancy. As a further key contribution, we have applied partial-key cuckoo hashing, which makes cuckoo filters significantly more efficient by allowing relocation based on only the stored fingerprint. Our configuration exploration suggests that the cuckoo filter, which uses buckets of size 4, will perform well for a wide range of applications, although appealingly cuckoo filter parameters can be easily varied for application-dependent tuning.

While we expect that further extensions and optimizations to cuckoo filters are possible and will further provide impetus for their use, the data structure as described is a fast and efficient building block already well-suited to the practical demands of networking and distributed systems.

几个重点也都加粗了，主要是相比于Bloom Filter的三大优势：

支持动态删除
查询性能更好
空间效率更高

并且经过 partial-key 的优化，对于每一个item，只需要额外存储 fingerprint 的值作为hash的依据，节省空间开销。