skip to main content
research-article

Don't thrash: how to cache your hash on flash

Published:01 July 2012Publication History
Skip Abstract Section

Abstract

This paper presents new alternatives to the well-known Bloom filter data structure. The Bloom filter, a compact data structure supporting set insertion and membership queries, has found wide application in databases, storage systems, and networks. Because the Bloom filter performs frequent random reads and writes, it is used almost exclusively in RAM, limiting the size of the sets it can represent.

This paper first describes the quotient filter, which supports the basic operations of the Bloom filter, achieving roughly comparable performance in terms of space and time, but with better data locality. Operations on the quotient filter require only a small number of contiguous accesses. The quotient filter has other advantages over the Bloom filter: it supports deletions, it can be dynamically resized, and two quotient filters can be efficiently merged.

The paper then gives two data structures, the buffered quotient filter and the cascade filter, which exploit the quotient filter advantages and thus serve as SSD-optimized alternatives to the Bloom filter. The cascade filter has better asymptotic I/O performance than the buffered quotient filter, but the buffered quotient filter outperforms the cascade filter on small to medium data sets. Both data structures significantly outperform recently-proposed SSD-optimized Bloom filter variants, such as the elevator Bloom filter, buffered Bloom filter, and forest-structured Bloom filter. In experiments, the cascade filter and buffered quotient filter performed insertions 8.6--11 times faster than the fastest Bloom filter variant and performed lookups 0.94--2.56 times faster.

References

  1. M. A. Bender, M. Farach-Colton, J. T. Fineman, Y. R. Fogel, B. C. Kuszmaul, and J. Nelson. Cache-oblivious streaming B-trees. In SPAA, pages 81--92, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. B. H. Bloom. Space/time trade-offs in hash coding with allowable errors. Communincations of the ACM, 13(7):422--426, 1970. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. F. Bonomi, M. Mitzenmacher, R. Panigrahy, S. Singh, and G. Varghese. An improved construction for counting Bloom filters. In ECA, pages 684--695, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. A. Broder and M. Mitzenmacher. Network applications of Bloom filters: A survey. Internet Mathematics, 1(4):485--509, 2003.Google ScholarGoogle ScholarCross RefCross Ref
  5. M. Canim, G. A. Mihaila, B. Bhattacharhee, C. A. Lang, and K. A. Ross. Buffered Bloom filters on solid state storage. In VLDB ADMS Workshop, 2010.Google ScholarGoogle Scholar
  6. F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber. Bigtable: A distributed storage system for structured data. In OSDI, pages 205--218, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Y. Chen, B. Schmidt, and D. L. Maskell. A reconfigurable Bloom filter architecture for BLASTN. In ARCS, pages 40--49, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. H. Cheung. TG Video: Fusion io - the power of 1000 hard drives in the palm of your hand. http://www.tgdaily.com/hardware-features/34065-tg-video-fusion-io-the-power-of-1000-hard-drives-in-the-palm-of-your-hand, 2007.Google ScholarGoogle Scholar
  9. J. G. Cleary. Compact hash tables using bidirectional linear probing. IEEE Transactions on Computing, 33(9):828--834, 1984. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. B. Debnath, S. Sengupta, J. Li, D. Lilja, and D. Du. Bloomflash: Bloom filter on flash-based storage. In ICDCS, pages 635--644, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. L. Fan, P. Cao, J. Almeida, and A. Z. Broder. Summary cache: a scalable wide-area web cache sharing protocol. IEEE/ACM Transactions on Networking, 8(3):281--293, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. L. Freeman. How NetApp deduplication works - a primer, Apr. 2010. http://blogs.netapp.com/drdedupe/2010/04/how-netapp-deduplication-works.html.Google ScholarGoogle Scholar
  13. G. J. Holzmann. Design and validation of computer protocols. Prentice-Hall, Inc., 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. K. Keeton, C. B. Morrey, III, C. A. Soules, and A. Veitch. Lazybase: freshness vs. performance in information management. SIGOPS Operating Systems Review, 44(1):15--19, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. D. E. Knuth. The Art of Computer Programming: Sorting and Searching, volume 3. Addison Wesley, 1973.Google ScholarGoogle Scholar
  16. Z. Li, R. Grosu, P. Sehgal, S. A. Smolka, S. D. Stoller, and E. Zadok. On the energy consumption and performance of systems software. In SYSTOR, pages 8:1--8:12, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. G. Lu, B. Debnath, and D. H. C. Du. A forest-structured bloom filter with flash memory. In MSST, pages 1--6, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. K. Malde and B. O'Sullivan. Using Bloom filters for large scale gene sequence analysis in Haskell. In PADL, pages 183--194, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. J. Mullin. Optimal semijoins for distributed database systems. IEEE Transactions on Software Engineering, 16(5):558--560, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. A. Singh, M. Srivatsa, L. Liu, and T. Miller. Apoidea: A decentralized peer-to-peer architecture for crawling the world wide web. In SIGIR Workshop on Distributed Multimedia Information Retrieval, pages 126--142, 2003.Google ScholarGoogle Scholar
  21. Z. Yuan, J. Miao, Y. Jia, and L. Wang. Counting data stream based on improved counting Bloom filter. In WAIM, pages 512--519, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. B. Zhu, K. Li, and H. Patterson. Avoiding the disk bottleneck in the data domain deduplication file system. In FAST, pages 18:1--18:14, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 5, Issue 11
    July 2012
    608 pages

    Publisher

    VLDB Endowment

    Publication History

    • Published: 1 July 2012
    Published in pvldb Volume 5, Issue 11

    Qualifiers

    • research-article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader