ABSTRACT
In this paper, we present near-optimal space bounds for Lp-samplers. Given a stream of updates (additions and subtraction) to the coordinates of an underlying vector x in Rn, a perfect Lp sampler outputs the i-th coordinate with probability xipxpp. In SODA 2010, Monemizadeh and Woodruff showed polylog space upper bounds for approximate Lp-samplers and demonstrated various applications of them. Very recently, Andoni, Krauthgamer and Onak improved the upper bounds and gave a O(ε-plog3n) space ε relative error and constant failure rate Lp-sampler for p є [1,2]. In this work, we give another such algorithm requiring only O(ε-plog2n) space for p є (1,2). For p є (0,1), our space bound is O(ε-1log2n), while for the p=1 case we have an O(log(1/ε)ε-log2n) space algorithm. We also give a O(log2n) bits zero relative error L0-sampler, improving the O(log3n) bits algorithm due to Frahling, Indyk and Sohler.
As an application of our samplers, we give better upper bounds for the problem of finding duplicates in data streams. In case the length of the stream is longer than the alphabet size, L1 sampling gives us an O(log2n) space algorithm, thus improving the previous O(log3n) bound due to Gopalan and Radhakrishnan.
In the second part of our work, we prove an Ω (log2n) lower bound for sampling from 0, ± 1 vectors (in this special case, the parameter p is not relevant for Lp sampling). This matches the space of our sampling algorithms for constant ε>0. We also prove tight space lower bounds for the finding duplicates and heavy hitters problems. We obtain these lower bounds using reductions from the communication complexity problem augmented indexing.
- Alex Andoni, Robert Krauthgamer, and Krzysztof Onak. Streaming algorithms via precision sampling. Manuscript, 2010.Google Scholar
- Khanh Do Ba, Piotr Indyk, Eric Price, and David P. Woodruff. Lower bounds for sparse recovery. In SODA, pages 1190--1197, 2010. Google ScholarDigital Library
- Brian Babcock, Mayur Datar, and Rajeev Motwani. Sampling from a moving window over streaming data. In SODA, pages 633--634, 2002. Google ScholarDigital Library
- Radu Berinde, Graham Cormode, Piotr Indyk, and Martin J. Strauss. Space-optimal heavy hitters with strong error bounds. In PODS, pages 157--166, 2009. Google ScholarDigital Library
- Vladimir Braverman, Rafail Ostrovsky, and Carlo Zaniolo. Optimal sampling from sliding windows. In PODS, pages 147--156, 2009. Google ScholarDigital Library
- Moses Charikar, Kevin Chen, and Martin Farach-Colton. Finding frequent items in data streams. Theor. Comput. Sci., 312(1):3--15, 2004. Google ScholarDigital Library
- Edith Cohen, Nick G. Duffield, Haim Kaplan, Carsten Lund, and Mikkel Thorup. Stream sampling for variance-optimal estimation of subset sums. In SODA, pages 1255--1264, 2009. Google ScholarDigital Library
- Graham Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 55(1):58--75, 2005. Google ScholarDigital Library
- Graham Cormode, S. Muthukrishnan, and Irina Rozenbaum. Summarizing and mining inverse distributions on data streams via dynamic inverse sampling. In VLDB, pages 25--36, 2005. Google ScholarDigital Library
- Graham Cormode, S. Muthukrishnan, Ke Yi, and Qin Zhang. Optimal sampling from distributed streams. In PODS, pages 77--86, 2010. Google ScholarDigital Library
- Nick G. Duffield, Carsten Lund, and Mikkel Thorup. Priority sampling for estimation of arbitrary subset sums. J. ACM, 54(6), 2007. Google ScholarDigital Library
- Gereon Frahling, Piotr Indyk, and Christian Sohler. Sampling in dynamic data streams and applications. In Proceedings of the twenty-first annual symposium on Computational geometry, SCG '05, pages 142--149, New York, NY, USA, 2005. ACM. Google ScholarDigital Library
- Anna Gilbert and Piotr Indyk. Sparse recovery using sparse matrices. In Proceeding of IEEE, 2010.Google ScholarCross Ref
- Parikshit Gopalan and Jaikumar Radhakrishnan. Finding duplicates in a data stream. In Proceedings of the twentieth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '09, pages 402--411, Philadelphia, PA, USA, 2009. Society for Industrial and Applied Mathematics. Google ScholarDigital Library
- T. S. Jayram and David P. Woodruff. The data stream space complexity of cascaded norms. In FOCS, pages 765--774, 2009. Google ScholarDigital Library
- Daniel M. Kane, Jelani Nelson, Ely Porat, and Woodruff David P. Fast moment estimation in data streams in optimal space. Manuscript, 2010.Google Scholar
- Daniel M. Kane, Jelani Nelson, and David P. Woodruff. On the exact space complexity of sketching and streaming small norms. In SODA, pages 1161--1178, 2010. Google ScholarDigital Library
- Daniel M. Kane, Jelani Nelson, and David P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems of data, PODS '10, pages 41--52, New York, NY, USA, 2010. ACM. Google ScholarDigital Library
- Mauricio Karchmer. A New Approach to Circuit Depth. PhD thesis, MIT, 1989.Google ScholarCross Ref
- Mauricio Karchmer and Avi Wigderson. Monotone circuits for connectivity require super-logarithmic depth. In Proceedings of the twentieth annual ACM symposium on Theory of computing, STOC '88, pages 539--550, New York, NY, USA, 1988. ACM. Google ScholarDigital Library
- Donald E. Knuth. The Art of Computer Programming, Volume II: Seminumerical Algorithms. Addison-Wesley, 1969.Google Scholar
- Ahmed Metwally, Divyakant Agrawal, and Amr El Abbadi. Duplicate detection in click streams. In WWW, pages 12--21, 2005. Google ScholarDigital Library
- Peter Bro Miltersen, Noam Nisan, Shmuel Safra, and Avi Wigderson. On data structures and asymmetric communication complexity. In Proceedings of the twenty-seventh annual ACM symposium on Theory of computing, STOC '95, pages 103--111, New York, NY, USA, 1995. ACM. Google ScholarDigital Library
- Morteza Monemizadeh and David P. Woodruff. 1-pass relative-error lp-sampling with applications. In SODA, pages 1143--1160, 2010. Google ScholarDigital Library
- S. Muthukrishnan. Data Streams: Algorithms and Applications.Google Scholar
- N. Nisan. Pseudorandom generators for space-bounded computations. In Proceedings of the twenty-second annual ACM symposium on Theory of computing, STOC '90, pages 204--212, New York, NY, USA, 1990. ACM. Google ScholarDigital Library
- Gabor Tardos and Uri Zwick. The communication complexity of the universal relation. In Proceedings of the 12th Annual IEEE Conference on Computational Complexity, pages 247--, Washington, DC, USA, 1997. IEEE Computer Society. Google ScholarDigital Library
- Jun Tarui. Finding a duplicate and a missing item in a stream. In Jin-Yi Cai, S. Cooper, and Hong Zhu, editors, Theory and Applications of Models of Computation, volume 4484 of Lecture Notes in Computer Science, pages 128--135. Springer Berlin / Heidelberg, 2007. Google ScholarDigital Library
- David Woodruff and T. S. Jayram. Optimal bounds for johnson-lindenstrauss transforms and streaming problems with low error. In SODA, 2011.Google Scholar
Index Terms
- Tight bounds for Lp samplers, finding duplicates in streams, and related problems
Recommendations
Tight bounds for lp oblivious subspace embeddings
SODA '19: Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete AlgorithmsAn lp oblivious subspace embedding is a distribution over r x n matrices Π such that for any fixed n x d matrix A,
Pr[for all x, ||Ax||p ≤||ΠAx||p≤k||Ax||p] ≥ 9/10,Π
where r is the dimension of the embedding, k is the distortion of the embedding, and ...
Towards Tight Bounds for the Streaming Set Cover Problem
PODS '16: Proceedings of the 35th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database SystemsWe consider the classic Set Cover problem in the data stream model. For n elements and m sets (m ≥ n) we give a O(1/δ)-pass algorithm with a strongly sub-linear ~O(mnδ) space and logarithmic approximation factor. This yields a significant improvement ...
Comparison-based time-space lower bounds for selection
We establish the first nontrivial lower bounds on time-space trade-offs for the selection problem. We prove that any comparison-based randomized algorithm for finding the median requires Ω(nlog logS n) expected time in the RAM model (or more generally ...
Comments