skip to main content
10.1145/2452376.2452456acmotherconferencesArticle/Chapter ViewAbstractPublication PagesedbtConference Proceedingsconference-collections
research-article

HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm

Published:18 March 2013Publication History

ABSTRACT

Cardinality estimation has a wide range of applications and is of particular importance in database systems. Various algorithms have been proposed in the past, and the HyperLogLog algorithm is one of them. In this paper, we present a series of improvements to this algorithm that reduce its memory requirements and significantly increase its accuracy for an important range of cardinalities. We have implemented our proposed algorithm for a system at Google and evaluated it empirically, comparing it to the original HyperLogLog algorithm. Like HyperLogLog, our improved algorithm parallelizes perfectly and computes the cardinality estimate in a single pass.

References

  1. K. Aouiche and D. Lemire. A comparison of five probabilistic view-size estimation techniques in OLAP. In Workshop on Data Warehousing and OLAP (DOLAP), pages 17--24, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In Workshop on Randomization and Approximation Techniques (RANDOM), pages 1--10, London, UK, UK, 2002. Springer-Verlag. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. P. Clifford and I. A. Cosma. A statistical analysis of probabilistic counting algorithms. Scandinavian Journal of Statistics, pages 1--14, 2011.Google ScholarGoogle Scholar
  4. M. Durand and P. Flajolet. Loglog counting of large cardinalities. In G. D. Battista and U. Zwick, editors, European Symposium on Algorithms (ESA), volume 2832, pages 605--617, 2003.Google ScholarGoogle Scholar
  5. C. Estan, G. Varghese, and M. Fisk. Bitmap algorithms for counting active flows on high-speed links. IEEE/ACM Transactions on Networking, pages 925--937, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182--209, 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. P. Flajolet, Éric Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In Analysis of Algorithms (AOFA), pages 127--146, 2007.Google ScholarGoogle Scholar
  8. F. Giroire. Order statistics and estimating cardinalities of massive data sets. Discrete Applied Mathematics, 157(2):406--427, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. A. Hall, O. Bachmann, R. Büssow, S. Gănceanu, and M. Nunkesser. Processing a trillion cells per mouse click. In Very Large Databases (VLDB), 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. P. Indyk. Tight lower bounds for the distinct elements problem. In Foundations of Computer Science (FOCS), pages 283--288, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. D. M. Kane, J. Nelson, and D. P. Woodruff. An optimal algorithm for the distinct elements problem. In Principles of database systems (PODS), pages 41--52. ACM, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. J. Lumbroso. An optimal cardinality estimation algorithm based on order statistics and its full analysis. In Analysis of Algorithms (AOFA), pages 489--504, 2010.Google ScholarGoogle Scholar
  13. S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, T. Vassilakis, and G. Inc. Dremel: Interactive analysis of web-scale datasets. In Very Large Databases (VLDB), pages 330--339, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. A. Metwally, D. Agrawal, and A. E. Abbadi. Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic. In Extending database technology (EDBT), pages 618--629, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data, parallel analysis with Sawzall. Journal on Scientific Programming, pages 277--298, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. K.-Y. Whang, B. T. Vander-Zanden, and H. M. Taylor. A linear-time probabilistic counting algorithm for database applications. ACM Transactions on Database Systems, 15:208--229, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Other conferences
        EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology
        March 2013
        793 pages
        ISBN:9781450315975
        DOI:10.1145/2452376

        Copyright © 2013 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 18 March 2013

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        Overall Acceptance Rate7of10submissions,70%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader