research-article

An Optimal Algorithm for l1-Heavy Hitters in Insertion Streams and Related Problems

Authors:
Arnab Bhattacharyya

Indian Institute of Science, Bangalore, India

Indian Institute of Science, Bangalore, India
View Profile

,
Palash Dey

Indian Institute of Science, Bangalore, India

Indian Institute of Science, Bangalore, India
View Profile

,
David P. Woodruff

IBM Research Almaden, San Jose, CA, USA

IBM Research Almaden, San Jose, CA, USA
View Profile

PODS '16: Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database SystemsJune 2016Pages 385–400https://doi.org/10.1145/2902251.2902284

Published:15 June 2016Publication History

PODS '16: Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Pages 385–400

ABSTRACT

We give the first optimal bounds for returning the l₁-heavy hitters in a data stream of insertions, together with their approximate frequencies, closing a long line of work on this problem. For a stream of m items in {1, 2, ..., n} and parameters 0 < ε < φ ≤ 1, let f_i denote the frequency of item i, i.e., the number of times item i occurs in the stream. With arbitrarily large constant probability, our algorithm returns all items i for which f_i ≥ φ m, returns no items j for which f_j ≤ (φ -ε)m, and returns approximations ~f_i with |~f_i - f_i| ≤ ε m for each item i that it returns. Our algorithm uses O(ε^-1 logφ^-1 + φ^-1 log n + log log m) bits of space, processes each stream update in O(1) worst-case time, and can report its output in time linear in the output size. We also prove a lower bound, which implies that our algorithm is optimal up to a constant factor in its space complexity. A modification of our algorithm can be used to estimate the maximum frequency up to an additive ε m error in the above amount of space, resolving Question 3 in the IITK 2006 Workshop on Algorithms for Data Streams for the case of l₁-heavy hitters. We also introduce several variants of the heavy hitters and maximum frequency problems, inspired by rank aggregation and voting schemes, and show how our techniques can be applied in such settings. Unlike the traditional heavy hitters problem, some of these variants look at comparisons between items rather than numerical values to determine the frequency of an item.

References

G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Trans. Knowl. Data Eng., 17(6):734--749, 2005. Google ScholarDigital Library
R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proc. 20th International Conference on Very Large Data Bases, pages 487--499, 1994. Google ScholarDigital Library
A. Arasu, S. Babu, and J. Widom. CQL: A language for continuous queries over streams and relations. In Proc. 9th International Workshop on Database Programming Languages DBPL, pages 1--19, 2003.Google Scholar
J. A. Aslam and M. Montague. Models for metasearch. In Proc. 24th Annual international ACM SIGIR conference on Research and Development in Information Retrieval, pages 276--284. ACM, 2001. Google ScholarDigital Library
R. Berinde, P. Indyk, G. Cormode, and M. J. Strauss. Space-optimal heavy hitters with strong error bounds. ACM Trans. Database Syst, 35(4):26, 2010. Google ScholarDigital Library
K. S. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. In Proc. ACM SIGMOD International Conference on Management of Data, pages 359--370, 1999. Google ScholarDigital Library
D. K. Blandford and G. E. Blelloch. Compact dictionaries for variable-length keys and data with applications. ACM Trans. Algor., 4(2):17, 2008. Google ScholarDigital Library
P. Bonnet, J. Gehrke, and P. Seshadri. Towards sensor database systems. In Proc. 2nd International Conference on Mobile Data Management, MDM, pages 3--14, 2001. Google ScholarDigital Library
F. Brandt, V. Conitzer, U. Endriss, J. Lang, and A. Procaccia. Handbook of computational social choice, 2015.Google Scholar
V. Braverman, S. R. Chestnut, N. Ivkin, J. Nelson, Z. Wang, and D. P. Woodruff. BPTree: an l2 heavy hitters algorithm using constant memory. 2016. arXiv:1603.00759.Google Scholar
V. Braverman, S. R. Chestnut, N. Ivkin, and D. P. Woodruff. Beating countsketch for heavy hitters in insertion streams. Technical report, 2016. arXiv:1511.00661. To appear in STOC '16. Google ScholarDigital Library
E. J. Candes, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Commun. Pur. Appl. Math., 59:1207--1223, 2006.Google ScholarCross Ref
I. Caragiannis and A. D. Procaccia. Voting almost maximizes social welfare despite limited communication. Artif. Intell., 175(9--10):1655 -- 1671, 2011. Google ScholarDigital Library
B. Carterette and D. Petkova. Learning a ranking from pairwise preferences. In Proc. 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 629--630. ACM, 2006. Google ScholarDigital Library
Y. Chabchoub, C. Fricker, and H. Mohamed. Analysis of a bloom filter algorithm via the supermarket model. In Proc. 21st International Teletraffic Congress, ITC, pages 1--8, 2009.Google Scholar
M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. Theor. Comput. Sci., 312(1):3--15, 2004. Google ScholarDigital Library
V. Conitzer and T. Sandholm. Communication complexity of common voting rules. In Proc. 6th ACM Conference on Electronic Commerce, pages 78--87. ACM, 2005. Google ScholarDigital Library
G. Cormode. Sketch techniques for massive data. In G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine, editors, Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches, volume 4 of Foundations and Trends in Databases, pages 1--294. Now Publishers Inc., Hanover, MA, USA, Jan. 2012. Google ScholarDigital Library
G. Cormode and M. Hadjieleftheriou. Finding frequent items in data streams. Proceedings of the VLDB Endowment, 1(2):1530--1541, 2008. Google ScholarDigital Library
G. Cormode, F. Korn, S. Muthukrishnan, and D. Srivastava. Finding hierarchical heavy hitters in streaming data. ACM Trans. Knowl. Discov. Data, 1(4), 2008. Google ScholarDigital Library
G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 55(1):58--75, 2005. Google ScholarDigital Library
T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley & Sons, 2012.Google ScholarDigital Library
E. D. Demaine, A. López-Ortiz, and J. I. Munro. Frequency estimation of internet packet streams with limited space. In Proc. 10th Annual European Symposium on Algorithms, pages 348--360. Springer, 2002. Google ScholarDigital Library
P. Dey and A. Bhattacharyya. Sample complexity for winner prediction in elections. In Proc. 14th International Conference on Autonomous Systems and Multiagent Systems (AAMAS-15), 2015. Google ScholarDigital Library
M. Dietzfelbinger, T. Hagerup, J. Katajainen, and M. Penttonen. A reliable randomized algorithm for the closest-pair problem. J. Algorithms, 25(1):19--51, 1997. Google ScholarDigital Library
A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Ann. Math. Stat., 27(3):642 -- 669, 1956.Google ScholarCross Ref
C. Estan and G. Varghese. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. Theor. Comput. Syst., 21(3):270--313, 2003. Google ScholarDigital Library
M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. In Proc. 24rd International Conference on Very Large Data Bases, pages 299--310, 1998. Google ScholarDigital Library
P. Flajolet. Approximate counting: a detailed analysis. BIT Numerical Mathematics, 25(1):113--134, 1985. Google ScholarDigital Library
A. C. Gilbert, M. J. Strauss, J. A. Tropp, and R. Vershynin. One sketch for all: fast algorithms for compressed sensing. In Proc. 39th Annual ACM Symposium on Theory of Computing, San Diego, California, USA, June 11--13, 2007, pages 237--246, 2007. Google ScholarDigital Library
J. Han, J. Pei, G. Dong, and K. Wang. Efficient computation of iceberg cubes with complex measures. In Proc. 2001 ACM SIGMOD International Conference on Management of Data, pages 1--12, 2001. Google ScholarDigital Library
J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. 2000 ACM SIGMOD International Conference on Management of Data, pages 1--12, 2000. Google ScholarDigital Library
J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst., 22(1):5--53, 2004. Google ScholarDigital Library
J. Hershberger, N. Shrivastava, S. Suri, and C. D. Tóth. Space complexity of hierarchical heavy hitters in multi-dimensional data streams. In Proc. Twenty-fourth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 338--347, 2005. Google ScholarDigital Library
C. Hidber. Online association rule mining. In SIGMOD 1999, Proc. ACM SIGMOD International Conference on Management of Data, pages 145--156, 1999. Google ScholarDigital Library
T. K. Ho, J. J. Hull, and S. N. Srihari. Decision combination in multiple classifier systems. IEEE Trans. Pattern Anal. Mach. Intell., 16(1):66--75, 1994. Google ScholarDigital Library
A. Jiang, L. S. Marcolino, A. D. Procaccia, T. Sandholm, N. Shah, and M. Tambe. Diverse randomized agents vote to win. In Proc. Annual Conference on Neural Information Processing Systems, pages 2573--2581, 2014.Google Scholar
R. M. Karp, S. Shenker, and C. H. Papadimitriou. A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst., 28(1):51--55, 2003. Google ScholarDigital Library
P. Kellner, J. Twyman, and A. Wells. Polling voting intentions. In Political Communication in Britain, pages 94--108. Springer, 2011.Google ScholarCross Ref
I. Kremer, N. Nisan, and D. Ron. On randomized one-round communication complexity. Comput. Complex., 8(1):21--49, 1999. Google ScholarDigital Library
A. Kumar and J. J. Xu. Sketch guided sampling - using on-line estimates of flow size for adaptive data collection. In Proc. 25th IEEE International Conference on Computer Communications, Joint Conference of the IEEE Computer and Communications Societies, 2006.Google ScholarCross Ref
E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge University Press, New York, NY, USA, 1997. Google ScholarDigital Library
K. G. Larsen, J. Nelson, H. L. Nguyen, and M. Thorup. Heavy hitters via cluster-preserving clustering. arXiv:1604.01357, Apr. 2016.Google Scholar
C. E. Leiserson, R. L. Rivest, C. Stein, and T. H. Cormen. Introduction to algorithms. The MIT press, 5:2--2, 2001.Google Scholar
H. Li. Learning to rank for information retrieval and natural language processing. In Synthesis Lectures on Human Language Technologies, volume 7, pages 1--121. Morgan & Claypool Publishers, 2014.Google Scholar
A. Lumini and L. Nanni. Detector of image orientation based on borda count. Pattern Recogn. Lett., 27(3):180--186, 2006. Google ScholarDigital Library
N. Mamoulis, M. L. Yiu, K. H. Cheng, and D. W. Cheung. Efficient top-phk aggregation of ranked inputs. ACM Trans. Database Syst, 32(3):19, 2007. Google ScholarDigital Library
G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proc. 28th International Conference on Very Large Data Bases, pages 346--357. VLDB Endowment, 2002. Google ScholarDigital Library
A. Mao, A. D. Procaccia, and Y. Chen. Social Choice for Human Computation. In Proc. Fourth Workshop on Human Computation (HCOMP-12), 2012.Google Scholar
A. Mao, A. D. Procaccia, and Y. Chen. Better Human Computation Through Principled Voting. In Proc. 27th Conference on Artificial Intelligence (AAAI'13), 2013. Google ScholarDigital Library
A. Marian, N. Bruno, and L. Gravano. Evaluating top-phk queries over web-accessible databases. ACM Trans. Database Syst., 29(2):319--362, 2004. Google ScholarDigital Library
A. Metwally, D. Agrawal, and A. El Abbadi. Efficient computation of frequent and top-k elements in data streams. In Proc. 10th International Conference on Database Theory, ICDT'05, pages 398--412, Berlin, Heidelberg, 2005. Springer-Verlag. Google ScholarDigital Library
P. B. Miltersen, N. Nisan, S. Safra, and A. Wigderson. On data structures and asymmetric communication complexity. J. Comput. Syst. Sci., 57(1):37--49, 1998. Google ScholarDigital Library
J. Misra and D. Gries. Finding repeated elements. Sci. Comput. Program., 2(2):143--152, 1982.Google ScholarCross Ref
M. M. Monwar and M. L. Gavrilova. Multimodal biometric system using rank-level fusion approach. IEEE Trans. Syst. Man. Cybern. B, Cybern., 39(4):867--878, 2009. Google ScholarDigital Library
J. S. Moore. J. Algorithm, June 1981. p. 208--209.Google Scholar
R. Morris. Counting large numbers of events in small registers. Commun. ACM, 21(10):840--842, 1978. Google ScholarDigital Library
D. Mullen and B. Roth. Decision making: Its logic and practice. Savage, MD: Rowman and Littlefield Publishers, Inc., 1991.Google Scholar
S. Muthukrishnan. Data streams: Algorithms and applications. Now Publishers Inc, 2005.Google Scholar
J. Nelson. Sketching and streaming algorithms for processing massive data. XRDS: Crossroads, The ACM Magazine for Students, 19(1):14--19, 2012. Google ScholarDigital Library
R. Nuray and F. Can. Automatic ranking of information retrieval systems using data fusion. Inf. Process Manag., 42(3):595--614, 2006. Google ScholarDigital Library
A. Prasad, H. Pareek, and P. Ravikumar. Distributional rank aggregation, and an axiomatic analysis. In Proc. 32nd International Conference on Machine Learning (ICML-15), pages 2104--2112, 2015.Google Scholar
P. Resnick and H. R. Varian. Recommender systems. Commun. ACM, 40(3):56--58, 1997. Google ScholarDigital Library
A. Savasere, E. Omiecinski, and S. B. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. 21th International Conference on Very Large Data Bases, pages 432--444, 1995. Google ScholarDigital Library
N. Shrivastava, C. Buragohain, D. Agrawal, and S. Suri. Medians and beyond: new aggregation techniques for sensor networks. In Proc. 2nd International Conference on Embedded Networked Sensor Systems, pages 239--249, 2004. Google ScholarDigital Library
D. Smirnov. Shannon's information methods for lower bounds for probabilistic communication complexity. Master's thesis, Moscow University, 1988.Google Scholar
X. Sun and D. P. Woodruff. Tight bounds for graph problems in insertion streams. In Proc. 18th. International Workshop on Approximation Algorithms for Combinatorial Optimization Problems (APPROX 2015), pages 435--448, 2015.Google Scholar
H. Toivonen. Sampling large databases for association rules. In Proc. 22th International Conference on Very Large Data Bases, pages 134--145, 1996. Google ScholarDigital Library
D. Van Gucht, R. Williams, D. P. Woodruff, and Q. Zhang. The communication complexity of distributed set-joins with applications to matrix multiplication. In Proc. 34th ACM Symposium on Principles of Database Systems, pages 199--212. ACM, 2015. Google ScholarDigital Library
M. N. Volkovs and R. S. Zemel. New learning methods for supervised and unsupervised preference aggregation. J. Mach. Learn. Res., 15(1):1135--1176, 2014. Google ScholarDigital Library
G. Wang and F. H. Lochovsky. Feature selection with conditional mutual information maximin in text categorization. In Proc. 13th ACM International Conference on Information and Knowledge Management, pages 342--349. ACM, 2004. Google ScholarDigital Library
D. P. Woodruff. New Algorithms for Heavy Hitters in Data Streams. 2016. arXiv 1603.01733. Appeared in ICDT '16.Google Scholar
L. Xia. Computing the margin of victory for various voting rules. In Proc. 13th ACM Conference on Electronic Commerce, pages 982--999. ACM, 2012. Google ScholarDigital Library
A. C.-C. Yao. Some complexity questions related to distributive computing (preliminary report). In Proc. 11th annual ACM Symposium on Theory of computing, pages 209--213. ACM, 1979. Google ScholarDigital Library

Index Terms

An Optimal Algorithm for l1-Heavy Hitters in Insertion Streams and Related Problems
1. Theory of computation
  1. Design and analysis of algorithms
    1. Streaming, sublinear and near linear time algorithms

Recommendations

BPTree: An ℓ₂ Heavy Hitters Algorithm Using Constant Memory
PODS '17: Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

The task of finding heavy hitters is one of the best known and well studied problems in the area of data streams. One is given a list i₁,i₂,...,i_m∈[n] and the goal is to identify the items among [n] that appear frequently in the list. In sub-polynomial ...
Read More
Beating CountSketch for heavy hitters in insertion streams
STOC '16: Proceedings of the forty-eighth annual ACM symposium on Theory of Computing

Given a stream p₁, …, p_m of items from a universe U, which, without loss of generality we identify with the set of integers {1, 2, …, n}, we consider the problem of returning all ℓ₂-heavy hitters, i.e., those items j for which f_j ≥ є √F₂, where f_j is ...
Read More
Data Streams with Bounded Deletions
PODS '18: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

Two prevalent models in the data stream literature are the insertion-only and turnstile models. Unfortunately, many important streaming problems require a Θ(log(n)) multiplicative factor more space for turnstile streams than for insertion-only streams. ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
PODS '16: Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
June 2016
504 pages
ISBN:9781450341912
DOI:10.1145/2902251
General Chair:
Tova Milo
Tel Aviv University, Israel
,
Program Chair:
Wang-Chiew Tan
University of California, Santa Cruz, USA
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 15 June 2016
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
algorithms
data streams
frequent items
heavy hitters
Qualifiers
- research-article
Conference

Acceptance Rates
PODS '16 Paper Acceptance Rate31of94submissions,33%Overall Acceptance Rate642of2,707submissions,24%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 12
  Total Citations
  View Citations
- 137
  Total Downloads
- Downloads (Last 12 months)1
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

An Optimal Algorithm for l1-Heavy Hitters in Insertion Streams and Related Problems

PODS '16: Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems

ABSTRACT

References

Cited By

Index Terms

Recommendations

BPTree: An ℓ₂ Heavy Hitters Algorithm Using Constant Memory

Beating CountSketch for heavy hitters in insertion streams

Data Streams with Bounded Deletions