skip to main content
10.1145/2902251.2902284acmconferencesArticle/Chapter ViewAbstractPublication PagespodsConference Proceedingsconference-collections
research-article

An Optimal Algorithm for l1-Heavy Hitters in Insertion Streams and Related Problems

Published:15 June 2016Publication History

ABSTRACT

We give the first optimal bounds for returning the l1-heavy hitters in a data stream of insertions, together with their approximate frequencies, closing a long line of work on this problem. For a stream of m items in {1, 2, ..., n} and parameters 0 < ε < φ ≤ 1, let fi denote the frequency of item i, i.e., the number of times item i occurs in the stream. With arbitrarily large constant probability, our algorithm returns all items i for which fi ≥ φ m, returns no items j for which fj ≤ (φ -ε)m, and returns approximations ~fi with |~fi - fi| ≤ ε m for each item i that it returns. Our algorithm uses O(ε-1 logφ-1 + φ-1 log n + log log m) bits of space, processes each stream update in O(1) worst-case time, and can report its output in time linear in the output size. We also prove a lower bound, which implies that our algorithm is optimal up to a constant factor in its space complexity. A modification of our algorithm can be used to estimate the maximum frequency up to an additive ε m error in the above amount of space, resolving Question 3 in the IITK 2006 Workshop on Algorithms for Data Streams for the case of l1-heavy hitters. We also introduce several variants of the heavy hitters and maximum frequency problems, inspired by rank aggregation and voting schemes, and show how our techniques can be applied in such settings. Unlike the traditional heavy hitters problem, some of these variants look at comparisons between items rather than numerical values to determine the frequency of an item.

References

  1. G. Adomavicius and A. Tuzhilin. Toward the next generation of recommender systems: A survey of the state-of-the-art and possible extensions. IEEE Trans. Knowl. Data Eng., 17(6):734--749, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proc. 20th International Conference on Very Large Data Bases, pages 487--499, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. A. Arasu, S. Babu, and J. Widom. CQL: A language for continuous queries over streams and relations. In Proc. 9th International Workshop on Database Programming Languages DBPL, pages 1--19, 2003.Google ScholarGoogle Scholar
  4. J. A. Aslam and M. Montague. Models for metasearch. In Proc. 24th Annual international ACM SIGIR conference on Research and Development in Information Retrieval, pages 276--284. ACM, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. R. Berinde, P. Indyk, G. Cormode, and M. J. Strauss. Space-optimal heavy hitters with strong error bounds. ACM Trans. Database Syst, 35(4):26, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. K. S. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. In Proc. ACM SIGMOD International Conference on Management of Data, pages 359--370, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. D. K. Blandford and G. E. Blelloch. Compact dictionaries for variable-length keys and data with applications. ACM Trans. Algor., 4(2):17, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. P. Bonnet, J. Gehrke, and P. Seshadri. Towards sensor database systems. In Proc. 2nd International Conference on Mobile Data Management, MDM, pages 3--14, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. F. Brandt, V. Conitzer, U. Endriss, J. Lang, and A. Procaccia. Handbook of computational social choice, 2015.Google ScholarGoogle Scholar
  10. V. Braverman, S. R. Chestnut, N. Ivkin, J. Nelson, Z. Wang, and D. P. Woodruff. BPTree: an l2 heavy hitters algorithm using constant memory. 2016. arXiv:1603.00759.Google ScholarGoogle Scholar
  11. V. Braverman, S. R. Chestnut, N. Ivkin, and D. P. Woodruff. Beating countsketch for heavy hitters in insertion streams. Technical report, 2016. arXiv:1511.00661. To appear in STOC '16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. E. J. Candes, J. Romberg, and T. Tao. Stable signal recovery from incomplete and inaccurate measurements. Commun. Pur. Appl. Math., 59:1207--1223, 2006.Google ScholarGoogle ScholarCross RefCross Ref
  13. I. Caragiannis and A. D. Procaccia. Voting almost maximizes social welfare despite limited communication. Artif. Intell., 175(9--10):1655 -- 1671, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. B. Carterette and D. Petkova. Learning a ranking from pairwise preferences. In Proc. 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 629--630. ACM, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Y. Chabchoub, C. Fricker, and H. Mohamed. Analysis of a bloom filter algorithm via the supermarket model. In Proc. 21st International Teletraffic Congress, ITC, pages 1--8, 2009.Google ScholarGoogle Scholar
  16. M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. Theor. Comput. Sci., 312(1):3--15, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. V. Conitzer and T. Sandholm. Communication complexity of common voting rules. In Proc. 6th ACM Conference on Electronic Commerce, pages 78--87. ACM, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. G. Cormode. Sketch techniques for massive data. In G. Cormode, M. Garofalakis, P. J. Haas, and C. Jermaine, editors, Synopses for Massive Data: Samples, Histograms, Wavelets, Sketches, volume 4 of Foundations and Trends in Databases, pages 1--294. Now Publishers Inc., Hanover, MA, USA, Jan. 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. G. Cormode and M. Hadjieleftheriou. Finding frequent items in data streams. Proceedings of the VLDB Endowment, 1(2):1530--1541, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. G. Cormode, F. Korn, S. Muthukrishnan, and D. Srivastava. Finding hierarchical heavy hitters in streaming data. ACM Trans. Knowl. Discov. Data, 1(4), 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 55(1):58--75, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. T. M. Cover and J. A. Thomas. Elements of information theory. John Wiley & Sons, 2012.Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. E. D. Demaine, A. López-Ortiz, and J. I. Munro. Frequency estimation of internet packet streams with limited space. In Proc. 10th Annual European Symposium on Algorithms, pages 348--360. Springer, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. P. Dey and A. Bhattacharyya. Sample complexity for winner prediction in elections. In Proc. 14th International Conference on Autonomous Systems and Multiagent Systems (AAMAS-15), 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. M. Dietzfelbinger, T. Hagerup, J. Katajainen, and M. Penttonen. A reliable randomized algorithm for the closest-pair problem. J. Algorithms, 25(1):19--51, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. A. Dvoretzky, J. Kiefer, and J. Wolfowitz. Asymptotic minimax character of the sample distribution function and of the classical multinomial estimator. Ann. Math. Stat., 27(3):642 -- 669, 1956.Google ScholarGoogle ScholarCross RefCross Ref
  27. C. Estan and G. Varghese. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. Theor. Comput. Syst., 21(3):270--313, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. In Proc. 24rd International Conference on Very Large Data Bases, pages 299--310, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. P. Flajolet. Approximate counting: a detailed analysis. BIT Numerical Mathematics, 25(1):113--134, 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. A. C. Gilbert, M. J. Strauss, J. A. Tropp, and R. Vershynin. One sketch for all: fast algorithms for compressed sensing. In Proc. 39th Annual ACM Symposium on Theory of Computing, San Diego, California, USA, June 11--13, 2007, pages 237--246, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. J. Han, J. Pei, G. Dong, and K. Wang. Efficient computation of iceberg cubes with complex measures. In Proc. 2001 ACM SIGMOD International Conference on Management of Data, pages 1--12, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In Proc. 2000 ACM SIGMOD International Conference on Management of Data, pages 1--12, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl. Evaluating collaborative filtering recommender systems. ACM Trans. Inf. Syst., 22(1):5--53, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. J. Hershberger, N. Shrivastava, S. Suri, and C. D. Tóth. Space complexity of hierarchical heavy hitters in multi-dimensional data streams. In Proc. Twenty-fourth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 338--347, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. C. Hidber. Online association rule mining. In SIGMOD 1999, Proc. ACM SIGMOD International Conference on Management of Data, pages 145--156, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. T. K. Ho, J. J. Hull, and S. N. Srihari. Decision combination in multiple classifier systems. IEEE Trans. Pattern Anal. Mach. Intell., 16(1):66--75, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. A. Jiang, L. S. Marcolino, A. D. Procaccia, T. Sandholm, N. Shah, and M. Tambe. Diverse randomized agents vote to win. In Proc. Annual Conference on Neural Information Processing Systems, pages 2573--2581, 2014.Google ScholarGoogle Scholar
  38. R. M. Karp, S. Shenker, and C. H. Papadimitriou. A simple algorithm for finding frequent elements in streams and bags. ACM Trans. Database Syst., 28(1):51--55, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. P. Kellner, J. Twyman, and A. Wells. Polling voting intentions. In Political Communication in Britain, pages 94--108. Springer, 2011.Google ScholarGoogle ScholarCross RefCross Ref
  40. I. Kremer, N. Nisan, and D. Ron. On randomized one-round communication complexity. Comput. Complex., 8(1):21--49, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. A. Kumar and J. J. Xu. Sketch guided sampling - using on-line estimates of flow size for adaptive data collection. In Proc. 25th IEEE International Conference on Computer Communications, Joint Conference of the IEEE Computer and Communications Societies, 2006.Google ScholarGoogle ScholarCross RefCross Ref
  42. E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge University Press, New York, NY, USA, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. K. G. Larsen, J. Nelson, H. L. Nguyen, and M. Thorup. Heavy hitters via cluster-preserving clustering. arXiv:1604.01357, Apr. 2016.Google ScholarGoogle Scholar
  44. C. E. Leiserson, R. L. Rivest, C. Stein, and T. H. Cormen. Introduction to algorithms. The MIT press, 5:2--2, 2001.Google ScholarGoogle Scholar
  45. H. Li. Learning to rank for information retrieval and natural language processing. In Synthesis Lectures on Human Language Technologies, volume 7, pages 1--121. Morgan & Claypool Publishers, 2014.Google ScholarGoogle Scholar
  46. A. Lumini and L. Nanni. Detector of image orientation based on borda count. Pattern Recogn. Lett., 27(3):180--186, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. N. Mamoulis, M. L. Yiu, K. H. Cheng, and D. W. Cheung. Efficient top-phk aggregation of ranked inputs. ACM Trans. Database Syst, 32(3):19, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proc. 28th International Conference on Very Large Data Bases, pages 346--357. VLDB Endowment, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. A. Mao, A. D. Procaccia, and Y. Chen. Social Choice for Human Computation. In Proc. Fourth Workshop on Human Computation (HCOMP-12), 2012.Google ScholarGoogle Scholar
  50. A. Mao, A. D. Procaccia, and Y. Chen. Better Human Computation Through Principled Voting. In Proc. 27th Conference on Artificial Intelligence (AAAI'13), 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. A. Marian, N. Bruno, and L. Gravano. Evaluating top-phk queries over web-accessible databases. ACM Trans. Database Syst., 29(2):319--362, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. A. Metwally, D. Agrawal, and A. El Abbadi. Efficient computation of frequent and top-k elements in data streams. In Proc. 10th International Conference on Database Theory, ICDT'05, pages 398--412, Berlin, Heidelberg, 2005. Springer-Verlag. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. P. B. Miltersen, N. Nisan, S. Safra, and A. Wigderson. On data structures and asymmetric communication complexity. J. Comput. Syst. Sci., 57(1):37--49, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. J. Misra and D. Gries. Finding repeated elements. Sci. Comput. Program., 2(2):143--152, 1982.Google ScholarGoogle ScholarCross RefCross Ref
  55. M. M. Monwar and M. L. Gavrilova. Multimodal biometric system using rank-level fusion approach. IEEE Trans. Syst. Man. Cybern. B, Cybern., 39(4):867--878, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. J. S. Moore. J. Algorithm, June 1981. p. 208--209.Google ScholarGoogle Scholar
  57. R. Morris. Counting large numbers of events in small registers. Commun. ACM, 21(10):840--842, 1978. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. D. Mullen and B. Roth. Decision making: Its logic and practice. Savage, MD: Rowman and Littlefield Publishers, Inc., 1991.Google ScholarGoogle Scholar
  59. S. Muthukrishnan. Data streams: Algorithms and applications. Now Publishers Inc, 2005.Google ScholarGoogle Scholar
  60. J. Nelson. Sketching and streaming algorithms for processing massive data. XRDS: Crossroads, The ACM Magazine for Students, 19(1):14--19, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. R. Nuray and F. Can. Automatic ranking of information retrieval systems using data fusion. Inf. Process Manag., 42(3):595--614, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. A. Prasad, H. Pareek, and P. Ravikumar. Distributional rank aggregation, and an axiomatic analysis. In Proc. 32nd International Conference on Machine Learning (ICML-15), pages 2104--2112, 2015.Google ScholarGoogle Scholar
  63. P. Resnick and H. R. Varian. Recommender systems. Commun. ACM, 40(3):56--58, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. A. Savasere, E. Omiecinski, and S. B. Navathe. An efficient algorithm for mining association rules in large databases. In Proc. 21th International Conference on Very Large Data Bases, pages 432--444, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  65. N. Shrivastava, C. Buragohain, D. Agrawal, and S. Suri. Medians and beyond: new aggregation techniques for sensor networks. In Proc. 2nd International Conference on Embedded Networked Sensor Systems, pages 239--249, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. D. Smirnov. Shannon's information methods for lower bounds for probabilistic communication complexity. Master's thesis, Moscow University, 1988.Google ScholarGoogle Scholar
  67. X. Sun and D. P. Woodruff. Tight bounds for graph problems in insertion streams. In Proc. 18th. International Workshop on Approximation Algorithms for Combinatorial Optimization Problems (APPROX 2015), pages 435--448, 2015.Google ScholarGoogle Scholar
  68. H. Toivonen. Sampling large databases for association rules. In Proc. 22th International Conference on Very Large Data Bases, pages 134--145, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  69. D. Van Gucht, R. Williams, D. P. Woodruff, and Q. Zhang. The communication complexity of distributed set-joins with applications to matrix multiplication. In Proc. 34th ACM Symposium on Principles of Database Systems, pages 199--212. ACM, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  70. M. N. Volkovs and R. S. Zemel. New learning methods for supervised and unsupervised preference aggregation. J. Mach. Learn. Res., 15(1):1135--1176, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  71. G. Wang and F. H. Lochovsky. Feature selection with conditional mutual information maximin in text categorization. In Proc. 13th ACM International Conference on Information and Knowledge Management, pages 342--349. ACM, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  72. D. P. Woodruff. New Algorithms for Heavy Hitters in Data Streams. 2016. arXiv 1603.01733. Appeared in ICDT '16.Google ScholarGoogle Scholar
  73. L. Xia. Computing the margin of victory for various voting rules. In Proc. 13th ACM Conference on Electronic Commerce, pages 982--999. ACM, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  74. A. C.-C. Yao. Some complexity questions related to distributive computing (preliminary report). In Proc. 11th annual ACM Symposium on Theory of computing, pages 209--213. ACM, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. An Optimal Algorithm for l1-Heavy Hitters in Insertion Streams and Related Problems

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      PODS '16: Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
      June 2016
      504 pages
      ISBN:9781450341912
      DOI:10.1145/2902251
      • General Chair:
      • Tova Milo,
      • Program Chair:
      • Wang-Chiew Tan

      Copyright © 2016 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 15 June 2016

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      PODS '16 Paper Acceptance Rate31of94submissions,33%Overall Acceptance Rate642of2,707submissions,24%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader