research-article

HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm

Authors:
Stefan Heule

ETH Zurich and Google, Inc.

ETH Zurich and Google, Inc.
View Profile

,
Marc Nunkesser

Google, Inc.

Google, Inc.
View Profile

,
Alexander Hall

Google, Inc.

Google, Inc.
View Profile

EDBT '13: Proceedings of the 16th International Conference on Extending Database TechnologyMarch 2013Pages 683–692https://doi.org/10.1145/2452376.2452456

Published:18 March 2013Publication History

EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology

Pages 683–692

ABSTRACT

Cardinality estimation has a wide range of applications and is of particular importance in database systems. Various algorithms have been proposed in the past, and the HyperLogLog algorithm is one of them. In this paper, we present a series of improvements to this algorithm that reduce its memory requirements and significantly increase its accuracy for an important range of cardinalities. We have implemented our proposed algorithm for a system at Google and evaluated it empirically, comparing it to the original HyperLogLog algorithm. Like HyperLogLog, our improved algorithm parallelizes perfectly and computes the cardinality estimate in a single pass.

References

K. Aouiche and D. Lemire. A comparison of five probabilistic view-size estimation techniques in OLAP. In Workshop on Data Warehousing and OLAP (DOLAP), pages 17--24, 2007. Google ScholarDigital Library
Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In Workshop on Randomization and Approximation Techniques (RANDOM), pages 1--10, London, UK, UK, 2002. Springer-Verlag. Google ScholarDigital Library
P. Clifford and I. A. Cosma. A statistical analysis of probabilistic counting algorithms. Scandinavian Journal of Statistics, pages 1--14, 2011.Google Scholar
M. Durand and P. Flajolet. Loglog counting of large cardinalities. In G. D. Battista and U. Zwick, editors, European Symposium on Algorithms (ESA), volume 2832, pages 605--617, 2003.Google Scholar
C. Estan, G. Varghese, and M. Fisk. Bitmap algorithms for counting active flows on high-speed links. IEEE/ACM Transactions on Networking, pages 925--937, 2006. Google ScholarDigital Library
P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182--209, 1985. Google ScholarDigital Library
P. Flajolet, Éric Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. In Analysis of Algorithms (AOFA), pages 127--146, 2007.Google Scholar
F. Giroire. Order statistics and estimating cardinalities of massive data sets. Discrete Applied Mathematics, 157(2):406--427, 2009. Google ScholarDigital Library
A. Hall, O. Bachmann, R. Büssow, S. Gănceanu, and M. Nunkesser. Processing a trillion cells per mouse click. In Very Large Databases (VLDB), 2012. Google ScholarDigital Library
P. Indyk. Tight lower bounds for the distinct elements problem. In Foundations of Computer Science (FOCS), pages 283--288, 2003. Google ScholarDigital Library
D. M. Kane, J. Nelson, and D. P. Woodruff. An optimal algorithm for the distinct elements problem. In Principles of database systems (PODS), pages 41--52. ACM, 2010. Google ScholarDigital Library
J. Lumbroso. An optimal cardinality estimation algorithm based on order statistics and its full analysis. In Analysis of Algorithms (AOFA), pages 489--504, 2010.Google Scholar
S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton, T. Vassilakis, and G. Inc. Dremel: Interactive analysis of web-scale datasets. In Very Large Databases (VLDB), pages 330--339, 2010. Google ScholarDigital Library
A. Metwally, D. Agrawal, and A. E. Abbadi. Why go logarithmic if we can go linear? Towards effective distinct counting of search traffic. In Extending database technology (EDBT), pages 618--629, 2008. Google ScholarDigital Library
R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data, parallel analysis with Sawzall. Journal on Scientific Programming, pages 277--298, 2005. Google ScholarDigital Library
K.-Y. Whang, B. T. Vander-Zanden, and H. M. Taylor. A linear-time probabilistic counting algorithm for database applications. ACM Transactions on Database Systems, 15:208--229, 1990. Google ScholarDigital Library

Index Terms

HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm
1. Information systems
  1. Data management systems
    1. Database management system engines
  2. Information systems applications

Recommendations

Sliding HyperLogLog: Estimating Cardinality in a Data Stream over a Sliding Window
ICDMW '10: Proceedings of the 2010 IEEE International Conference on Data Mining Workshops

In this paper, a new algorithm estimating the number of active flows in a data stream is proposed. This algorithm adapts the HyperLogLog algorithm of Flajolet et al. to data stream processing by adding a sliding window mechanism. It has the advantage to ...
Read More
Better Cardinality Estimators for HyperLogLog, PCSA, and Beyond
PODS '23: Proceedings of the 42nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Cardinality Estimation (aka Distinct Elements) is a classic problem in sketching with many applications in databases, networking, and security. Although sketching algorithms are fairly simple, analyzing the cardinality estimators is notoriously difficult,...
Read More
Branch-and-reduce exponential/FPT algorithms in practice

We investigate the gap between theory and practice for exact branching algorithms. In theory, branch-and-reduce algorithms currently have the best time complexity for numerous important problems. On the other hand, in practice, state-of-the-art methods ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology
March 2013
793 pages
ISBN:9781450315975
DOI:10.1145/2452376
General Chair:
Giovanna Guerrini
Università di Genova, Italy
,
Program Chair:
Norman W. Paton
University of Manchester, UK
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 18 March 2013
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate7of10submissions,70%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 204
  Total Citations
  View Citations
- 1,283
  Total Downloads
- Downloads (Last 12 months)191
- Downloads (Last 6 weeks)42
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm

EDBT '13: Proceedings of the 16th International Conference on Extending Database Technology

ABSTRACT

References

Cited By

Index Terms

Recommendations

Sliding HyperLogLog: Estimating Cardinality in a Data Stream over a Sliding Window

Better Cardinality Estimators for HyperLogLog, PCSA, and Beyond

Branch-and-reduce exponential/FPT algorithms in practice