ABSTRACT
Cardinality estimation has a wide range of applications and is of particular importance in database systems. Various algorithms have been proposed in the past, and the HyperLogLog algorithm is one of them. In this paper, we present a series of improvements to this algorithm that reduce its memory requirements and significantly increase its accuracy for an important range of cardinalities. We have implemented our proposed algorithm for a system at Google and evaluated it empirically, comparing it to the original HyperLogLog algorithm. Like HyperLogLog, our improved algorithm parallelizes perfectly and computes the cardinality estimate in a single pass.
- K. Aouiche and D. Lemire. A comparison of five probabilistic view-size estimation techniques in OLAP. In Workshop on Data Warehousing and OLAP (DOLAP), pages 17--24, 2007. Google ScholarDigital Library
- Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In Workshop on Randomization and Approximation Techniques (RANDOM), pages 1--10, London, UK, UK, 2002. Springer-Verlag. Google ScholarDigital Library
- P. Clifford and I. A. Cosma. A statistical analysis of probabilistic counting algorithms. Scandinavian Journal of Statistics, pages 1--14, 2011.Google Scholar
- M. Durand and P. Flajolet. Loglog counting of large cardinalities. In G. D. Battista and U. Zwick, editors, European Symposium on Algorithms (ESA), volume 2832, pages 605--617, 2003.Google Scholar
- C. Estan, G. Varghese, and M. Fisk. Bitmap algorithms for counting active flows on high-speed links. IEEE/ACM Transactions on Networking, pages 925--937, 2006. Google ScholarDigital Library
- P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182--209, 1985. Google ScholarDigital Library
- P. Flajolet, Éric Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In Analysis of Algorithms (AOFA), pages 127--146, 2007.Google Scholar
- F. Giroire. Order statistics and estimating cardinalities of massive data sets. Discrete Applied Mathematics, 157(2):406--427, 2009. Google ScholarDigital Library
- A. Hall, O. Bachmann, R. Büssow, S. Gănceanu, and M. Nunkesser. Processing a trillion cells per mouse click. In Very Large Databases (VLDB), 2012. Google ScholarDigital Library
- P. Indyk. Tight lower bounds for the distinct elements problem. In Foundations of Computer Science (FOCS), pages 283--288, 2003. Google ScholarDigital Library
- D. M. Kane, J. Nelson, and D. P. Woodruff. An optimal algorithm for the distinct elements problem. In Principles of database systems (PODS), pages 41--52. ACM, 2010. Google ScholarDigital Library
- J. Lumbroso. An optimal cardinality estimation algorithm based on order statistics and its full analysis. In Analysis of Algorithms (AOFA), pages 489--504, 2010.Google Scholar
- S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, T. Vassilakis, and G. Inc. Dremel: Interactive analysis of web-scale datasets. In Very Large Databases (VLDB), pages 330--339, 2010. Google ScholarDigital Library
- A. Metwally, D. Agrawal, and A. E. Abbadi. Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic. In Extending database technology (EDBT), pages 618--629, 2008. Google ScholarDigital Library
- R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data, parallel analysis with Sawzall. Journal on Scientific Programming, pages 277--298, 2005. Google ScholarDigital Library
- K.-Y. Whang, B. T. Vander-Zanden, and H. M. Taylor. A linear-time probabilistic counting algorithm for database applications. ACM Transactions on Database Systems, 15:208--229, 1990. Google ScholarDigital Library
Index Terms
- HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm
Recommendations
Sliding HyperLogLog: Estimating Cardinality in a Data Stream over a Sliding Window
ICDMW '10: Proceedings of the 2010 IEEE International Conference on Data Mining WorkshopsIn this paper, a new algorithm estimating the number of active flows in a data stream is proposed. This algorithm adapts the HyperLogLog algorithm of Flajolet et al. to data stream processing by adding a sliding window mechanism. It has the advantage to ...
Better Cardinality Estimators for HyperLogLog, PCSA, and Beyond
PODS '23: Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database SystemsCardinality Estimation (aka Distinct Elements) is a classic problem in sketching with many applications in databases, networking, and security. Although sketching algorithms are fairly simple, analyzing the cardinality estimators is notoriously difficult,...
Branch-and-reduce exponential/FPT algorithms in practice
We investigate the gap between theory and practice for exact branching algorithms. In theory, branch-and-reduce algorithms currently have the best time complexity for numerous important problems. On the other hand, in practice, state-of-the-art methods ...
Comments