ABSTRACT
We give the first optimal bounds for returning the l1-heavy hitters in a data stream of insertions, together with their approximate frequencies, closing a long line of work on this problem. For a stream of m items in {1, 2, ..., n} and parameters 0 < ε < φ ≤ 1, let fi denote the frequency of item i, i.e., the number of times item i occurs in the stream. With arbitrarily large constant probability, our algorithm returns all items i for which fi ≥ φ m, returns no items j for which fj ≤ (φ -ε)m, and returns approximations ~fi with |~fi - fi| ≤ ε m for each item i that it returns. Our algorithm uses O(ε-1 logφ-1 + φ-1 log n + log log m) bits of space, processes each stream update in O(1) worst-case time, and can report its output in time linear in the output size. We also prove a lower bound, which implies that our algorithm is optimal up to a constant factor in its space complexity. A modification of our algorithm can be used to estimate the maximum frequency up to an additive ε m error in the above amount of space, resolving Question 3 in the IITK 2006 Workshop on Algorithms for Data Streams for the case of l1-heavy hitters. We also introduce several variants of the heavy hitters and maximum frequency problems, inspired by rank aggregation and voting schemes, and show how our techniques can be applied in such settings. Unlike the traditional heavy hitters problem, some of these variants look at comparisons between items rather than numerical values to determine the frequency of an item.
- G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Trans. Knowl. Data Eng., 17(6):734--749, 2005. Google ScholarDigital Library
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proc. 20th International Conference on Very Large Data Bases, pages 487--499, 1994. Google ScholarDigital Library
- A. Arasu, S. Babu, and J. Widom. CQL: A language for continuous queries over streams and relations. In Proc. 9th International Workshop on Database Programming Languages DBPL, pages 1--19, 2003.Google Scholar
- J. A. Aslam and M. Montague. Models for metasearch. In Proc. 24th Annual international ACM SIGIR conference on Research and Development in Information Retrieval, pages 276--284. ACM, 2001. Google ScholarDigital Library
- R. Berinde, P. Indyk, G. Cormode, and M. J. Strauss. Space-optimal heavy hitters with strong error bounds. ACM Trans. Database Syst, 35(4):26, 2010. Google ScholarDigital Library
- K. S. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. In Proc. ACM SIGMOD International Conference on Management of Data, pages 359--370, 1999. Google ScholarDigital Library
- D. K. Blandford and G. E. Blelloch. Compact dictionaries for variable-length keys and data with applications. ACM Trans. Algor., 4(2):17, 2008. Google ScholarDigital Library
- P. Bonnet, J. Gehrke, and P. Seshadri. Towards sensor database systems. In Proc. 2nd International Conference on Mobile Data Management, MDM, pages 3--14, 2001. Google ScholarDigital Library
- F. Brandt, V. Conitzer, U. Endriss, J. Lang, and A. Procaccia. Handbook of computational social choice, 2015.Google Scholar
- V. Braverman, S. R. Chestnut, N. Ivkin, J. Nelson, Z. Wang, and D. P. Woodruff. BPTree: an l2 heavy hitters algorithm using constant memory. 2016. arXiv:1603.00759.Google Scholar
- V. Braverman, S. R. Chestnut, N. Ivkin, and D. P. Woodruff. Beating countsketch for heavy hitters in insertion streams. Technical report, 2016. arXiv:1511.00661. To appear in STOC '16. Google ScholarDigital Library
- E. J. Candes, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Commun. Pur. Appl. Math., 59:1207--1223, 2006.Google ScholarCross Ref
- I. Caragiannis and A. D. Procaccia. Voting almost maximizes social welfare despite limited communication. Artif. Intell., 175(9--10):1655 -- 1671, 2011. Google ScholarDigital Library
- B. Carterette and D. Petkova. Learning a ranking from pairwise preferences. In Proc. 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 629--630. ACM, 2006. Google ScholarDigital Library
- Y. Chabchoub, C. Fricker, and H. Mohamed. Analysis of a bloom filter algorithm via the supermarket model. In Proc. 21st International Teletraffic Congress, ITC, pages 1--8, 2009.Google Scholar
- M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. Theor. Comput. Sci., 312(1):3--15, 2004. Google ScholarDigital Library
- V. Conitzer and T. Sandholm. Communication complexity of common voting rules. In Proc. 6th ACM Conference on Electronic Commerce, pages 78--87. ACM, 2005. Google ScholarDigital Library
- G. Cormode. Sketch techniques for massive data. In G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine, editors, Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches, volume 4 of Foundations and Trends in Databases, pages 1--294. Now Publishers Inc., Hanover, MA, USA, Jan. 2012. Google ScholarDigital Library
- G. Cormode and M. Hadjieleftheriou. Finding frequent items in data streams. Proceedings of the VLDB Endowment, 1(2):1530--1541, 2008. Google ScholarDigital Library
- G. Cormode, F. Korn, S. Muthukrishnan, and D. Srivastava. Finding hierarchical heavy hitters in streaming data. ACM Trans. Knowl. Discov. Data, 1(4), 2008. Google ScholarDigital Library
- G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 55(1):58--75, 2005. Google ScholarDigital Library
- T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley & Sons, 2012.Google ScholarDigital Library
- E. D. Demaine, A. López-Ortiz, and J. I. Munro. Frequency estimation of internet packet streams with limited space. In Proc. 10th Annual European Symposium on Algorithms, pages 348--360. Springer, 2002. Google ScholarDigital Library
- P. Dey and A. Bhattacharyya. Sample complexity for winner prediction in elections. In Proc. 14th International Conference on Autonomous Systems and Multiagent Systems (AAMAS-15), 2015. Google ScholarDigital Library
- M. Dietzfelbinger, T. Hagerup, J. Katajainen, and M. Penttonen. A reliable randomized algorithm for the closest-pair problem. J. Algorithms, 25(1):19--51, 1997. Google ScholarDigital Library
- A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Ann. Math. Stat., 27(3):642 -- 669, 1956.Google ScholarCross Ref
- C. Estan and G. Varghese. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. Theor. Comput. Syst., 21(3):270--313, 2003. Google ScholarDigital Library
- M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. In Proc. 24rd International Conference on Very Large Data Bases, pages 299--310, 1998. Google ScholarDigital Library
- P. Flajolet. Approximate counting: a detailed analysis. BIT Numerical Mathematics, 25(1):113--134, 1985. Google ScholarDigital Library
- A. C. Gilbert, M. J. Strauss, J. A. Tropp, and R. Vershynin. One sketch for all: fast algorithms for compressed sensing. In Proc. 39th Annual ACM Symposium on Theory of Computing, San Diego, California, USA, June 11--13, 2007, pages 237--246, 2007. Google ScholarDigital Library
- J. Han, J. Pei, G. Dong, and K. Wang. Efficient computation of iceberg cubes with complex measures. In Proc. 2001 ACM SIGMOD International Conference on Management of Data, pages 1--12, 2001. Google ScholarDigital Library
- J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. 2000 ACM SIGMOD International Conference on Management of Data, pages 1--12, 2000. Google ScholarDigital Library
- J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst., 22(1):5--53, 2004. Google ScholarDigital Library
- J. Hershberger, N. Shrivastava, S. Suri, and C. D. Tóth. Space complexity of hierarchical heavy hitters in multi-dimensional data streams. In Proc. Twenty-fourth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 338--347, 2005. Google ScholarDigital Library
- C. Hidber. Online association rule mining. In SIGMOD 1999, Proc. ACM SIGMOD International Conference on Management of Data, pages 145--156, 1999. Google ScholarDigital Library
- T. K. Ho, J. J. Hull, and S. N. Srihari. Decision combination in multiple classifier systems. IEEE Trans. Pattern Anal. Mach. Intell., 16(1):66--75, 1994. Google ScholarDigital Library
- A. Jiang, L. S. Marcolino, A. D. Procaccia, T. Sandholm, N. Shah, and M. Tambe. Diverse randomized agents vote to win. In Proc. Annual Conference on Neural Information Processing Systems, pages 2573--2581, 2014.Google Scholar
- R. M. Karp, S. Shenker, and C. H. Papadimitriou. A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst., 28(1):51--55, 2003. Google ScholarDigital Library
- P. Kellner, J. Twyman, and A. Wells. Polling voting intentions. In Political Communication in Britain, pages 94--108. Springer, 2011.Google ScholarCross Ref
- I. Kremer, N. Nisan, and D. Ron. On randomized one-round communication complexity. Comput. Complex., 8(1):21--49, 1999. Google ScholarDigital Library
- A. Kumar and J. J. Xu. Sketch guided sampling - using on-line estimates of flow size for adaptive data collection. In Proc. 25th IEEE International Conference on Computer Communications, Joint Conference of the IEEE Computer and Communications Societies, 2006.Google ScholarCross Ref
- E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge University Press, New York, NY, USA, 1997. Google ScholarDigital Library
- K. G. Larsen, J. Nelson, H. L. Nguyen, and M. Thorup. Heavy hitters via cluster-preserving clustering. arXiv:1604.01357, Apr. 2016.Google Scholar
- C. E. Leiserson, R. L. Rivest, C. Stein, and T. H. Cormen. Introduction to algorithms. The MIT press, 5:2--2, 2001.Google Scholar
- H. Li. Learning to rank for information retrieval and natural language processing. In Synthesis Lectures on Human Language Technologies, volume 7, pages 1--121. Morgan & Claypool Publishers, 2014.Google Scholar
- A. Lumini and L. Nanni. Detector of image orientation based on borda count. Pattern Recogn. Lett., 27(3):180--186, 2006. Google ScholarDigital Library
- N. Mamoulis, M. L. Yiu, K. H. Cheng, and D. W. Cheung. Efficient top-phk aggregation of ranked inputs. ACM Trans. Database Syst, 32(3):19, 2007. Google ScholarDigital Library
- G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proc. 28th International Conference on Very Large Data Bases, pages 346--357. VLDB Endowment, 2002. Google ScholarDigital Library
- A. Mao, A. D. Procaccia, and Y. Chen. Social Choice for Human Computation. In Proc. Fourth Workshop on Human Computation (HCOMP-12), 2012.Google Scholar
- A. Mao, A. D. Procaccia, and Y. Chen. Better Human Computation Through Principled Voting. In Proc. 27th Conference on Artificial Intelligence (AAAI'13), 2013. Google ScholarDigital Library
- A. Marian, N. Bruno, and L. Gravano. Evaluating top-phk queries over web-accessible databases. ACM Trans. Database Syst., 29(2):319--362, 2004. Google ScholarDigital Library
- A. Metwally, D. Agrawal, and A. El Abbadi. Efficient computation of frequent and top-k elements in data streams. In Proc. 10th International Conference on Database Theory, ICDT'05, pages 398--412, Berlin, Heidelberg, 2005. Springer-Verlag. Google ScholarDigital Library
- P. B. Miltersen, N. Nisan, S. Safra, and A. Wigderson. On data structures and asymmetric communication complexity. J. Comput. Syst. Sci., 57(1):37--49, 1998. Google ScholarDigital Library
- J. Misra and D. Gries. Finding repeated elements. Sci. Comput. Program., 2(2):143--152, 1982.Google ScholarCross Ref
- M. M. Monwar and M. L. Gavrilova. Multimodal biometric system using rank-level fusion approach. IEEE Trans. Syst. Man. Cybern. B, Cybern., 39(4):867--878, 2009. Google ScholarDigital Library
- J. S. Moore. J. Algorithm, June 1981. p. 208--209.Google Scholar
- R. Morris. Counting large numbers of events in small registers. Commun. ACM, 21(10):840--842, 1978. Google ScholarDigital Library
- D. Mullen and B. Roth. Decision making: Its logic and practice. Savage, MD: Rowman and Littlefield Publishers, Inc., 1991.Google Scholar
- S. Muthukrishnan. Data streams: Algorithms and applications. Now Publishers Inc, 2005.Google Scholar
- J. Nelson. Sketching and streaming algorithms for processing massive data. XRDS: Crossroads, The ACM Magazine for Students, 19(1):14--19, 2012. Google ScholarDigital Library
- R. Nuray and F. Can. Automatic ranking of information retrieval systems using data fusion. Inf. Process Manag., 42(3):595--614, 2006. Google ScholarDigital Library
- A. Prasad, H. Pareek, and P. Ravikumar. Distributional rank aggregation, and an axiomatic analysis. In Proc. 32nd International Conference on Machine Learning (ICML-15), pages 2104--2112, 2015.Google Scholar
- P. Resnick and H. R. Varian. Recommender systems. Commun. ACM, 40(3):56--58, 1997. Google ScholarDigital Library
- A. Savasere, E. Omiecinski, and S. B. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. 21th International Conference on Very Large Data Bases, pages 432--444, 1995. Google ScholarDigital Library
- N. Shrivastava, C. Buragohain, D. Agrawal, and S. Suri. Medians and beyond: new aggregation techniques for sensor networks. In Proc. 2nd International Conference on Embedded Networked Sensor Systems, pages 239--249, 2004. Google ScholarDigital Library
- D. Smirnov. Shannon's information methods for lower bounds for probabilistic communication complexity. Master's thesis, Moscow University, 1988.Google Scholar
- X. Sun and D. P. Woodruff. Tight bounds for graph problems in insertion streams. In Proc. 18th. International Workshop on Approximation Algorithms for Combinatorial Optimization Problems (APPROX 2015), pages 435--448, 2015.Google Scholar
- H. Toivonen. Sampling large databases for association rules. In Proc. 22th International Conference on Very Large Data Bases, pages 134--145, 1996. Google ScholarDigital Library
- D. Van Gucht, R. Williams, D. P. Woodruff, and Q. Zhang. The communication complexity of distributed set-joins with applications to matrix multiplication. In Proc. 34th ACM Symposium on Principles of Database Systems, pages 199--212. ACM, 2015. Google ScholarDigital Library
- M. N. Volkovs and R. S. Zemel. New learning methods for supervised and unsupervised preference aggregation. J. Mach. Learn. Res., 15(1):1135--1176, 2014. Google ScholarDigital Library
- G. Wang and F. H. Lochovsky. Feature selection with conditional mutual information maximin in text categorization. In Proc. 13th ACM International Conference on Information and Knowledge Management, pages 342--349. ACM, 2004. Google ScholarDigital Library
- D. P. Woodruff. New Algorithms for Heavy Hitters in Data Streams. 2016. arXiv 1603.01733. Appeared in ICDT '16.Google Scholar
- L. Xia. Computing the margin of victory for various voting rules. In Proc. 13th ACM Conference on Electronic Commerce, pages 982--999. ACM, 2012. Google ScholarDigital Library
- A. C.-C. Yao. Some complexity questions related to distributive computing (preliminary report). In Proc. 11th annual ACM Symposium on Theory of computing, pages 209--213. ACM, 1979. Google ScholarDigital Library
Index Terms
- An Optimal Algorithm for l1-Heavy Hitters in Insertion Streams and Related Problems
Recommendations
BPTree: An ℓ2 Heavy Hitters Algorithm Using Constant Memory
PODS '17: Proceedings of the 36th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database SystemsThe task of finding heavy hitters is one of the best known and well studied problems in the area of data streams. One is given a list i1,i2,...,im∈[n] and the goal is to identify the items among [n] that appear frequently in the list. In sub-polynomial ...
Beating CountSketch for heavy hitters in insertion streams
STOC '16: Proceedings of the forty-eighth annual ACM symposium on Theory of ComputingGiven a stream p1, …, pm of items from a universe U, which, without loss of generality we identify with the set of integers {1, 2, …, n}, we consider the problem of returning all ℓ2-heavy hitters, i.e., those items j for which fj ≥ є √F2, where fj is ...
Data Streams with Bounded Deletions
PODS '18: Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database SystemsTwo prevalent models in the data stream literature are the insertion-only and turnstile models. Unfortunately, many important streaming problems require a Θ(log(n)) multiplicative factor more space for turnstile streams than for insertion-only streams. ...
Comments