skip to main content
10.1145/1989284.1989289acmconferencesArticle/Chapter ViewAbstractPublication PagespodsConference Proceedingsconference-collections
research-article

Tight bounds for Lp samplers, finding duplicates in streams, and related problems

Authors Info & Claims
Published:13 June 2011Publication History

ABSTRACT

In this paper, we present near-optimal space bounds for Lp-samplers. Given a stream of updates (additions and subtraction) to the coordinates of an underlying vector x in Rn, a perfect Lp sampler outputs the i-th coordinate with probability xipxpp. In SODA 2010, Monemizadeh and Woodruff showed polylog space upper bounds for approximate Lp-samplers and demonstrated various applications of them. Very recently, Andoni, Krauthgamer and Onak improved the upper bounds and gave a O-plog3n) space ε relative error and constant failure rate Lp-sampler for p є [1,2]. In this work, we give another such algorithm requiring only O(ε-plog2n) space for p є (1,2). For p є (0,1), our space bound is O(ε-1log2n), while for the p=1 case we have an O(log(1/ε)ε-log2n) space algorithm. We also give a O(log2n) bits zero relative error L0-sampler, improving the O(log3n) bits algorithm due to Frahling, Indyk and Sohler.

As an application of our samplers, we give better upper bounds for the problem of finding duplicates in data streams. In case the length of the stream is longer than the alphabet size, L1 sampling gives us an O(log2n) space algorithm, thus improving the previous O(log3n) bound due to Gopalan and Radhakrishnan.

In the second part of our work, we prove an Ω (log2n) lower bound for sampling from 0, ± 1 vectors (in this special case, the parameter p is not relevant for Lp sampling). This matches the space of our sampling algorithms for constant ε>0. We also prove tight space lower bounds for the finding duplicates and heavy hitters problems. We obtain these lower bounds using reductions from the communication complexity problem augmented indexing.

References

  1. Alex Andoni, Robert Krauthgamer, and Krzysztof Onak. Streaming algorithms via precision sampling. Manuscript, 2010.Google ScholarGoogle Scholar
  2. Khanh Do Ba, Piotr Indyk, Eric Price, and David P. Woodruff. Lower bounds for sparse recovery. In SODA, pages 1190--1197, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Brian Babcock, Mayur Datar, and Rajeev Motwani. Sampling from a moving window over streaming data. In SODA, pages 633--634, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Radu Berinde, Graham Cormode, Piotr Indyk, and Martin J. Strauss. Space-optimal heavy hitters with strong error bounds. In PODS, pages 157--166, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Vladimir Braverman, Rafail Ostrovsky, and Carlo Zaniolo. Optimal sampling from sliding windows. In PODS, pages 147--156, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. Theor. Comput. Sci., 312(1):3--15, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Edith Cohen, Nick G. Duffield, Haim Kaplan, Carsten Lund, and Mikkel Thorup. Stream sampling for variance-optimal estimation of subset sums. In SODA, pages 1255--1264, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Graham Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 55(1):58--75, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Graham Cormode, S. Muthukrishnan, and Irina Rozenbaum. Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In VLDB, pages 25--36, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Graham Cormode, S. Muthukrishnan, Ke Yi, and Qin Zhang. Optimal sampling from distributed streams. In PODS, pages 77--86, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Nick G. Duffield, Carsten Lund, and Mikkel Thorup. Priority sampling for estimation of arbitrary subset sums. J. ACM, 54(6), 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Gereon Frahling, Piotr Indyk, and Christian Sohler. Sampling in dynamic data streams and applications. In Proceedings of the twenty-first annual symposium on Computational geometry, SCG '05, pages 142--149, New York, NY, USA, 2005. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Anna Gilbert and Piotr Indyk. Sparse recovery using sparse matrices. In Proceeding of IEEE, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  14. Parikshit Gopalan and Jaikumar Radhakrishnan. Finding duplicates in a data stream. In Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '09, pages 402--411, Philadelphia, PA, USA, 2009. Society for Industrial and Applied Mathematics. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. T. S. Jayram and David P. Woodruff. The data stream space complexity of cascaded norms. In FOCS, pages 765--774, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Daniel M. Kane, Jelani Nelson, Ely Porat, and Woodruff David P. Fast moment estimation in data streams in optimal space. Manuscript, 2010.Google ScholarGoogle Scholar
  17. Daniel M. Kane, Jelani Nelson, and David P. Woodruff. On the exact space complexity of sketching and streaming small norms. In SODA, pages 1161--1178, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems of data, PODS '10, pages 41--52, New York, NY, USA, 2010. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Mauricio Karchmer. A New Approach to Circuit Depth. PhD thesis, MIT, 1989.Google ScholarGoogle ScholarCross RefCross Ref
  20. Mauricio Karchmer and Avi Wigderson. Monotone circuits for connectivity require super-logarithmic depth. In Proceedings of the twentieth annual ACM symposium on Theory of computing, STOC '88, pages 539--550, New York, NY, USA, 1988. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Donald E. Knuth. The Art of Computer Programming, Volume II: Seminumerical Algorithms. Addison-Wesley, 1969.Google ScholarGoogle Scholar
  22. Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Duplicate detection in click streams. In WWW, pages 12--21, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Peter Bro Miltersen, Noam Nisan, Shmuel Safra, and Avi Wigderson. On data structures and asymmetric communication complexity. In Proceedings of the twenty-seventh annual ACM symposium on Theory of computing, STOC '95, pages 103--111, New York, NY, USA, 1995. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Morteza Monemizadeh and David P. Woodruff. 1-pass relative-error lp-sampling with applications. In SODA, pages 1143--1160, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. S. Muthukrishnan. Data Streams: Algorithms and Applications.Google ScholarGoogle Scholar
  26. N. Nisan. Pseudorandom generators for space-bounded computations. In Proceedings of the twenty-second annual ACM symposium on Theory of computing, STOC '90, pages 204--212, New York, NY, USA, 1990. ACM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Gabor Tardos and Uri Zwick. The communication complexity of the universal relation. In Proceedings of the 12th Annual IEEE Conference on Computational Complexity, pages 247--, Washington, DC, USA, 1997. IEEE Computer Society. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Jun Tarui. Finding a duplicate and a missing item in a stream. In Jin-Yi Cai, S. Cooper, and Hong Zhu, editors, Theory and Applications of Models of Computation, volume 4484 of Lecture Notes in Computer Science, pages 128--135. Springer Berlin / Heidelberg, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. David Woodruff and T. S. Jayram. Optimal bounds for johnson-lindenstrauss transforms and streaming problems with low error. In SODA, 2011.Google ScholarGoogle Scholar

Index Terms

  1. Tight bounds for Lp samplers, finding duplicates in streams, and related problems

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in
        • Published in

          cover image ACM Conferences
          PODS '11: Proceedings of the thirtieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
          June 2011
          332 pages
          ISBN:9781450306607
          DOI:10.1145/1989284

          Copyright © 2011 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 13 June 2011

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article

          Acceptance Rates

          PODS '11 Paper Acceptance Rate25of113submissions,22%Overall Acceptance Rate642of2,707submissions,24%

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader