Abstract
Approximate set membership data structures (ASMDSs) are ubiquitous in computing. They trade a tunable, often small, error rate (ϵ) for large space savings. The canonical ASMDS is the Bloom filter, which supports lookups and insertions but not deletions in its simplest form. Cuckoo filters (CFs), a recently proposed class of ASMDSs, add deletion support and often use fewer bits per item for equal ϵ.
This work introduces the Morton filter (MF), a novel AS-MDS that introduces several key improvements to CFs. Like CFs, MFs support lookups, insertions, and deletions, but improve their respective throughputs by 1.3x to 2.5x, 0.9x to 15.5x, and 1.3x to 1.6x. MFs achieve these improvements by (1) introducing a compressed format that permits a logically sparse filter to be stored compactly in memory, (2) leveraging succinct embedded metadata to prune unnecessary memory accesses, and (3) heavily biasing insertions to use a single hash function. With these optimizations, lookups, insertions, and deletions often only require accessing a single hardware cache line from the filter. These improvements are not at a loss in space efficiency, as MFs typically use comparable to slightly less space than CFs for the same epsis;.
- P. S. Almeida, C. Baquero, N. M. Preguiça, and D. Hutchison. Scalable Bloom filters. Information Processing Letters, 101(6):255--261, 2007. Google ScholarDigital Library
- G. Antoshenkov. Byte-aligned bitmap compression. In Data Compression Conference, 1995. DCC '95. Proceedings, pages 476--, March 1995. Google ScholarDigital Library
- A. Appleby. MurmurHash. https://sites.google.com/site/murmurhash, 2008. Accessed: 2018-02-05.Google Scholar
- Y. Azar, A. Z. Broder, A. R. Karlin, and E. Upfal. Balanced allocations. SIAM Journal of Computing, 29(1):180--200, 1999. Google ScholarDigital Library
- L. A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 5(2):78--101, 1966. Google ScholarDigital Library
- M. A. Bender, M. Farach-Colton, R. Johnson, R. Kraner, B. C. Kuszmaul, D. Medjedovic, P. Montes, P. Shetty, R. P. Spillane, and E. Zadok. Don't thrash: how to cache your hash on flash. PVLDB, 5(11):1627--1637, 2012. Google ScholarDigital Library
- B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422--426, 1970. Google ScholarDigital Library
- P. A. Boncz, S. Manegold, and M. L. Kersten. Database architecture optimized for the new bottleneck: memory access. In Proceedings of the 25th International Conference on Very Large Data Bases, VLDB '99, pages 54--65, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc. Google ScholarDigital Library
- P. A. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: hyper-pipelining query execution. In CIDR 2005, Second Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4--7, 2005, Online Proceedings, pages 225--237, 2005.Google Scholar
- F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese. An improved construction for counting Bloom filters. In ESA, volume 6, pages 684--695. Springer, 2006. Google ScholarDigital Library
- F. Bonomi, M. Mitzenmacher, R. Panigraphy, S. Singh, and G. Varghese. Bloom filters via d-left hashing and dynamic bit reassignment extended abstract. In Forty-Fourth Annual Allerton Conference, Illinois, USA, pages 877--883, 2006.Google Scholar
- K. Bratbergsengen. Hashing methods and relational algebra operations. In Proceedings of the 10th International Conference on Very Large Data Bases, VLDB '84, pages 323--333, San Francisco, CA, USA, 1984. Morgan Kaufmann Publishers Inc. Google ScholarDigital Library
- A. D. Breslow, D. P. Zhang, J. L. Greathouse, N. Jayasena, and D. M. Tullsen. Horton tables: fast hash tables for in-memory data-intensive computing. In A. Gulati and H. Weatherspoon, editors, 2016 USENIX Annual Technical Conference, USENIX ATC 2016, Denver, CO, USA, June 22--24, 2016., pages 281--294. USENIX Association, 2016. Google ScholarDigital Library
- A. Z. Broder and M. Mitzenmacher. Network applications of Bloom filters: a survey. Internet Mathematics, 1(4):485--509, 2003.Google ScholarCross Ref
- L. Carter, R. Floyd, J. Gill, G. Markowsky, and M. Wegman. Exact and approximate membership testers. In Proceedings of the Tenth Annual ACM Symposium on Theory of Computing, STOC '78, pages 59--65, New York, NY, USA, 1978. ACM. Google ScholarDigital Library
- S. Chambi, D. Lemire, O. Kaser, and R. Godin. Better bitmap performance with Roaring bitmaps. Software: Practice and Experience, 46(5):709--719, 2016. Google ScholarDigital Library
- F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. BigTable: a distributed storage system for structured data. ACM Transactions on Computer Systems (TOCS), 26(2):4, 2008. Google ScholarDigital Library
- S. Cohen and Y. Matias. Spectral Bloom filters. In A. Y. Halevy, Z. G. Ives, and A. Doan, editors, Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, San Diego, California, USA, June 9--12, 2003, pages 241--252. ACM, 2003. Google ScholarDigital Library
- A. Colantonio and R. D. Pietro. Concise: compressed 'n' composable integer set. Information Processing Letters, 110(16):644--650, 2010. Google ScholarDigital Library
- J. Dean and S. Ghemawat. LevelDB: a fast persistent key-value store. https://opensource.googleblog.com/2011/07/leveldb-fast-persistent-key-value-store.html, July 27, 2011. Accessed: 2017-01-25.Google Scholar
- F. Deng and D. Rafiei. Approximately detecting duplicates for streaming data using Stable Bloom filters. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, USA, June 27--29, 2006, pages 25--36, 2006. Google ScholarDigital Library
- S. Dong, M. Callaghan, L. Galanis, D. Borthakur, T. Savor, and M. Strum. Optimizing space amplification in RocksDB. In CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research, Chaminade, CA, USA, January 8--11, 2017, Online Proceedings, 2017.Google Scholar
- Dr. Seuss. Horton Hatches the Egg. Random House, 1940.Google Scholar
- G. Einziger and R. Friedman. TinySet - an access efficient self adjusting Bloom filter construction. IEEE/ACM Transactions on Networking, 25(4):2295--2307, 2017. Google ScholarDigital Library
- D. Eppstein, M. T. Goodrich, M. Mitzenmacher, and M. R. Torres. 2--3 cuckoo filters for faster triangle listing and set intersection. In Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS 2017, Chicago, IL, USA, May 14--19, 2017, pages 247--260, 2017. Google ScholarDigital Library
- U. Erlingsson, M. Manasse, and F. McSherry. A cool and practical alternative to traditional hash tables. In Proc. 7th Workshop on Distributed Data and Structures (WDAS06), 2006.Google Scholar
- B. Fan, D. G. Andersen, and M. Kaminsky. MemC3: compact and concurrent memcache with dumber caching and smarter hashing. In N. Feamster and J. C. Mogul, editors, Proceedings of the 10th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2013, Lombard, IL, USA, April 2--5, 2013, pages 371--384. USENIX Association, 2013. Google ScholarDigital Library
- B. Fan, D. G. Andersen, and M. Kaminsky. Cuckoo filter. https://github.com/efficient/cuckoofilter, 2017. Accessed: 2017-11-19.Google Scholar
- B. Fan, D. G. Andersen, M. Kaminsky, and M. Mitzenmacher. Cuckoo filter: practically better than Bloom. In Proceedings of the 10th ACM International on Conference on Emerging Networking Experiments and Technologies, CoNEXT 2014, Sydney, Australia, December 2--5, 2014, pages 75--88, 2014. Google ScholarDigital Library
- L. Fan, P. Cao, J. M. Almeida, and A. Z. Broder. Summary Cache: a scalable wide -area web cache sharing protocol. IEEE/ACM Transactions on Networking, 8(3):281--293, 2000. Google ScholarDigital Library
- M. J. Flynn. Some computer organizations and their effectiveness. IEEE Transactions on Computers, C-21(9):948--960, Sept 1972. Google ScholarDigital Library
- M. L. Fredman, J. Komlós, and E. Szemerédi. Storing a sparse table with 0(1) worst case access time. Journal of the ACM, 31(3):538--544, June 1984. Google ScholarDigital Library
- L. George. HBase: the definitive guide: random access to your planet-size data. O'Reilly Media, Inc., 2011.Google Scholar
- R. González, S. Grabowski, V. Mäkinen, and G. Navarro. Practical implementation of rank and select queries. In Poster Proc. Volume of 4th Workshop on Efficient and Experimental Algorithms (WEA), pages 27--38, 2005.Google Scholar
- J. R. Goodman. Using cache memory to reduce processor-memory traffic. In Proceedings of the 10th Annual International Symposium on Computer Architecture, ISCA '83, pages 124--131, New York, NY, USA, 1983. ACM. Google ScholarDigital Library
- J. L. Greathouse and M. Daga. Efficient sparse matrix-vector multiplication on GPUs using the CSR storage format. In T. Damkroger and J. J. Dongarra, editors, International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2014, New Orleans, LA, USA, November 16--21, 2014, pages 769--780. IEEE Computer Society, 2014. Google ScholarDigital Library
- G. Guzun, G. Canahuate, D. Chiu, and J. Sawin. A tunable compression framework for bitmap indices. In IEEE 30th International Conference on Data Engineering, Chicago, ICDE 2014, IL, USA, March 31 - April 4, 2014, pages 484--495, 2014.Google ScholarCross Ref
- G. Jacobson. Space-efficient static trees and graphs. In 30th Annual Symposium on Foundations of Computer Science, Research Triangle Park, North Carolina, USA, 30 October - 1 November 1989, pages 549--554, 1989. Google ScholarDigital Library
- M. Kandemir, H. Zhao, X. Tang, and M. Karakoy. Memory row reuse distance and its role in optimizing application performance. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '15, pages 137--149, New York, NY, USA, 2015. ACM. Google ScholarDigital Library
- P. M. Kogge and H. S. Stone. A parallel algorithm for the efficient solution of a general class of recurrence equations. IEEE Transactions on Computers, 100(8):786--793, 1973. Google ScholarDigital Library
- M. Kornacker, A. Behm, V. Bittorf, T. Bobrovytsky, C. Ching, A. Choi, J. Erickson, M. Grund, D. Hecht, M. Jacobs, I. Joshi, L. Kuff, D. Kumar, A. Leblang, N. Li, I. Pandis, H. Robinson, D. Rorke, S. Rus, J. Russell, D. Tsirogiannis, S. Wanderman-Milne, and M. Yoder. Impala: a modern, open-source SQL engine for Hadoop. In CIDR 2015, Seventh Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 4--7, 2015, Online Proceedings, 2015.Google Scholar
- J. Kubiatowicz, D. Bindel, Y. Chen, S. E. Czerwinski, P. R. Eaton, D. Geels, R. Gummadi, S. C. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Y. Zhao. OceanStore: an architecture for global-scale persistent storage. In L. Rudolph and A. Gupta, editors, ASPLOS-IX Proceedings of the 9th International Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA, USA, November 12--15, 2000., pages 190--201. ACM Press, 2000. Google ScholarDigital Library
- A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. Operating Systems Review, 44(2):35--40, 2010. Google ScholarDigital Library
- D. Lemire. A fast alternative to the modulo reduction. https://lemire.me/blog/2016/06/27/a-fast-alternative-to-the-modulo-reduction/, June 27, 2016. Accessed: 2017-07-01.Google Scholar
- X. Li, D. G. Andersen, M. Kaminsky, and M. J. Freedman. Algorithmic improvements for fast concurrent cuckoo hashing. In Ninth EuroSys Conference 2014, EuroSys 2014, Amsterdam, The Netherlands, April 13--16, 2014, pages 27:1--27:14, 2014. Google ScholarDigital Library
- C. Lomont. Introduction to Intel advanced vector extensions. Intel White Paper, pages 1--21, 2011.Google Scholar
- D. B. Loveman. Program improvement by source-to-source transformation. Journal of the ACM, 24(1):121--145, Jan. 1977. Google ScholarDigital Library
- L. F. Mackert and G. M. Lohman. R* optimizer validation and performance evaluation for distributed queries. In Proceedings of the 12th International Conference on Very Large Data Bases, VLDB '86, pages 149--159, San Francisco, CA, USA, 1986. Morgan Kaufmann Publishers Inc. Google ScholarDigital Library
- P. Melsted and J. K. Pritchard. Efficient counting of k-mers in DNA sequences using a Bloom filter. BMC Bioinformatics, 12:333, 2011.Google ScholarCross Ref
- M. Mitzenmacher. Compressed Bloom filters. In Proceedings of the Twentieth Annual ACM Symposium on Principles of Distributed Computing, PODC 2001, Newport, Rhode Island, USA, August 26--29, 2001, pages 144--150, 2001. Google ScholarDigital Library
- M. Mitzenmacher. The power of two choices in randomized load balancing. IEEE Transactions on Parallel and Distribributed Systems, 12(10):1094--1104, 2001. Google ScholarDigital Library
- M. Mitzenmacher and E. Upfal. Probability and Computing: randomization and Probabilistic Techniques in Algorithms and Data Analysis. Cambridge University Press, 2017. Google ScholarDigital Library
- W. Mula, N. Kurz, and D. Lemire. Faster population counts using AVX2 instructions. The Computer Journal, 61(1):111--120, 2018.Google ScholarCross Ref
- G. Navarro. Compact data structures: a practical approach. Cambridge University Press, 2016. Google ScholarDigital Library
- D. Okanohara and K. Sadakane. Practical entropy-compressed rank/select dictionary. In Proceedings of the Meeting on Algorithm Engineering & Expermiments, pages 60--70. Society for Industrial and Applied Mathematics, 2007. Google ScholarDigital Library
- P. E. O'Neil, E. Cheng, D. Gawlick, and E. J. O'Neil. The Log-Structured Merge-tree (LSM-tree). Acta Informatica, 33(4):351--385, 1996. Google ScholarDigital Library
- D. A. Padua and M. J. Wolfe. Advanced compiler optimizations for supercomputers. Communications of the ACM, 29(12):1184--1201, Dec. 1986. Google ScholarDigital Library
- R. Pagh and F. F. Rodler. Cuckoo hashing. Journal of Algorithms, 51(2):122--144, 2004. Google ScholarDigital Library
- P. Pandey, M. A. Bender, R. Johnson, and R. Patro. A general-purpose counting filter: making every bit count. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14--19, 2017, pages 775--787, 2017. Google ScholarDigital Library
- P. Pandey and R. Johnson. A general-purpose counting filter: counting quotient filter. https://github.com/splatlab/cqf, 2017. Accessed: 2017-11-09.Google Scholar
- O. Polychroniou, A. Raghavan, and K. A. Ross. Rethinking simd vectorization for in-memory databases. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, SIGMOD '15, pages 1493--1508, New York, NY, USA, 2015. ACM. Google ScholarDigital Library
- F. Putze, P. Sanders, and J. Singler. Cache-, hash-, and space-efficient Bloom filters. ACM Journal of Experimental Algorithmics, 14, 2009. Google ScholarDigital Library
- R. Raman, V. Raman, and S. S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '02, pages 233--242, Philadelphia, PA, USA, 2002. Society for Industrial and Applied Mathematics. Google ScholarDigital Library
- S. K. Raman, V. Pentkovski, and J. Keshava. Implementing streaming SIMD extensions on the Pentium III Processor. IEEE Micro, 20(4):47--57, July 2000. Google ScholarDigital Library
- K. A. Ross. Efficient hash probes on modern processors. In R. Chirkova, A. Dogac, M. T. Özsu, and T. K. Sellis, editors, Proceedings of the 23rd International Conference on Data Engineering, ICDE 2007, The Marmara Hotel, Istanbul, Turkey, April 15--20, 2007, pages 1297--1301. IEEE Computer Society, 2007.Google Scholar
- O. Rottenstreich, Y. Kanizo, and I. Keslassy. The variable-increment counting Bloom filter. IEEE/ACM Transactions on Networing, 22(4):1092--1105, 2014. Google ScholarDigital Library
- R. Sears and R. Ramakrishnan. bLSM: a general purpose Log Structured Merge tree. In K. S. Candan, Y. Chen, R. T. Snodgrass, L. Gravano, and A. Fuxman, editors, Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20--24, 2012, pages 217--228. ACM, 2012. Google ScholarDigital Library
- T. Sigaev, A. Korotkov, and O. Bartunov. PostgreSQL 10 documentation: F.5. bloom. https://www.postgresql.org/docs/10/static/bloom.html, 2017. Accessed: 2018-01-25.Google Scholar
- J. E. Smith. A study of branch prediction strategies. In Proceedings of the 8th Annual Symposium on Computer Architecture, ISCA '81, pages 135--148, Los Alamitos, CA, USA, 1981. IEEE Computer Society Press. Google ScholarDigital Library
- M. Stonebraker, L. A. Rowe, and M. Hirohama. The implementation of POSTGRES. IEEE Transactions on Knowledge and Data Engineering, 2(1):125--142, 1990. Google ScholarDigital Library
- Y. Sun, Y. Hua, S. Jiang, Q. Li, S. Cao, and P. Zuo. SmartCuckoo: a fast and cost-efficient hashing index scheme for cloud storage systems. In 2017 USENIX Annual Technical Conference, USENIX ATC 2017, Santa Clara, CA, USA, July 12--14, 2017., pages 553--565. USENIX Association, 2017. Google ScholarDigital Library
- R. E. Tarjan and A. C. Yao. Storing a sparse table. Communications of the ACM, 22(11):606--611, 1979. Google ScholarDigital Library
- W. F. Tinney and J. W. Walker. Direct solutions of sparse network equations by optimally ordered triangular factorization. Proceedings of the IEEE, 55(11):1801--1809, 1967.Google ScholarCross Ref
- D. M. Tullsen, S. J. Eggers, J. S. Emer, H. M. Levy, J. L. Lo, and R. L. Stamm. Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor. In Proceedings of the 23rd Annual International Symposium on Computer Architecture, Philadelphia, PA, USA, May 22--24, 1996, pages 191--202, 1996. Google ScholarDigital Library
- D. M. Tullsen, S. J. Eggers, and H. M. Levy. Simultaneous multithreading: maximizing on-chip parallelism. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, ISCA '95, Santa Margherita Ligure, Italy, June 22--24, 1995, pages 392--403, 1995. Google ScholarDigital Library
- B. Vöcking. How asymmetry helps load balancing. In 40th Annual Symposium on Foundations of Computer Science, FOCS '99, 17--18 October, 1999, New York, NY, USA, pages 131--141, 1999. Google ScholarDigital Library
- J. Wang, C. Lin, Y. Papakonstantinou, and S. Swanson. An experimental study of bitmap compression vs. inverted list compression. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD '17, pages 993--1008, New York, NY, USA, 2017. ACM. Google ScholarDigital Library
- M. Wolfe. More iteration space tiling. In Proceedings of the 1989 ACM/IEEE Conference on Supercomputing, Supercomputing '89, pages 655--664, New York, NY, USA, 1989. ACM. Google ScholarDigital Library
- K. Wu, E. J. Otoo, and A. Shoshani. Optimizing bitmap indices with efficient compression. ACM Transactions on Database Systems, 31(1):1--38, 2006. Google ScholarDigital Library
- M. Yoon. Aging Bloom filter with two active buffers for dynamic sets. IEEE Transactions on Knowledge and Data Engineering, 22(1):134--138, 2010. Google ScholarDigital Library
- H. Zhang, H. Lim, V. Leis, D. G. Andersen, M. Kaminsky, K. Keeton, and A. Pavlo. SuRF: practical range query filtering with fast succinct tries. In Proceedings of the 2018 ACM SIGMOD International Conference on Management of Data, 2018. Google ScholarDigital Library
- K. Zhang, K. Wang, Y. Yuan, L. Guo, R. Lee, and X. Zhang. Mega-KV: a case for GPUs to maximize the throughput of in-memory key-value stores. PVLDB, 8(11):1226--1237, 2015. Google ScholarDigital Library
- Z. Zhang, Z. Zhu, and X. Zhang. A permutation-based page interleaving scheme to reduce row-buffer conflicts and exploit data locality. In Proceedings 33rd Annual IEEE/ACM International Symposium on Microarchitecture. MICRO-33 2000, pages 32--41, 2000. Google ScholarDigital Library
Recommendations
Adaptive median filters: new algorithms and results
Based on two types of image models corrupted by impulse noise, we propose two new algorithms for adaptive median filters. They have variable window size for removal of impulses while preserving sharpness. The first one, called the ranked-order based ...
Fuzzy anti-grouped filters and fuzzy normal filters in pseudo-BCI algebras
The notions of fuzzy anti-grouped filter and fuzzy normal filter in pseudo-BCI algebra are introduced, some properties and equivalent conditions are presented. A counterexample is constructed to show that a fuzzy anti-grouped filter may be not a fuzzy p-...
Hybrid structures for low-complexity variable fractional-delay FIR filters
This paper proposes a pair of new structures for implementing low-complexity odd-order and even-order variable fractional-delay (VFD) FIR filters using hybrid structures with both even-order and odd-order subfilters. For odd-order VFD filter design, the ...
Comments