ABSTRACT
In many stream monitoring situations, the data arrival rate is so high that it is not even possible to observe each element of the stream. The most common solution is to sample a small fraction of the data stream and use the sample to infer properties and estimate aggregates of the original stream. However, the quantities that need to be computed on the sampled stream are often different from the original quantities of interest and their estimation requires new algorithms. We present upper and lower bounds (often matching) for estimating frequency moments, support size, entropy, and heavy hitters of the original stream from the data observed in the sampled stream.
- N. Alon, Y. Matias, and M. Szegedy. The Space Complexity of Approximating the Frequency Moments. J. Comput. Syst. Sci., 58(1):137--147, 1999. Google ScholarDigital Library
- Z. Bar-Yossef. The complexity of massive dataset computations. PhD thesis, UC Berkeley, 2002. Google ScholarDigital Library
- Z. Bar-Yossef. Sampling lower bounds via information theory. In STOC, pages 335--344, 2003. Google ScholarDigital Library
- S. Bhattacharyya, A. Madeira, S. Muthukrishnan, and T. Ye. How to scalably and accurately skip past streams. In ICDE Workshops, pages 654--663, 2007. Google ScholarDigital Library
- M. Charikar, S. Chaudhuri, R. Motwani, and V. R. Narasayya. Towards estimation error guarantees for distinct values. In PODS, 2000. Google ScholarDigital Library
- Cisco Systems. Random Sampled NetFlow. http://www.cisco.com/en/US/docs/ios/12_0s/feature/guide/nfstatsa.html.Google Scholar
- G. Cormode and M. Garofalakis. Sketching probabilistic data streams. In Proceedings of the 2007 ACM SIGMOD international conference on Management of data, SIGMOD '07, pages 281--292, 2007. Google ScholarDigital Library
- G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 55(1):58--75, 2005. Google ScholarDigital Library
- N. G. Duffield, C. Lund, and M. Thorup. Properties and prediction of flow statistics from sampled packet streams. In Internet Measurement Workshop, pages 159--171, 2002. Google ScholarDigital Library
- N. G. Duffield, C. Lund, and M. Thorup. Estimating flow distributions from sampled flow statistics. In SIGCOMM, pages 325--336, 2003. Google ScholarDigital Library
- S. Guha and Z. Huang. Revisiting the direct sum theorem and space lower bounds in random order streams. In ICALP (1), pages 513--524, 2009. Google ScholarDigital Library
- N. J. A. Harvey, J. Nelson, and K. Onak. Sketching and streaming entropy via approximation theory. In FOCS, pages 489--498, 2008. Google ScholarDigital Library
- P. Indyk and D. P. Woodruff. Optimal approximations of the frequency moments of data streams. In Proceedings of the 37th Annual ACM Symposium on Theory of Computing (STOC), pages 202--208, 2005. Google ScholarDigital Library
- T. S. Jayram, A. McGregor, S. Muthukrishnan, and E. Vee. Estimating statistical aggregates on probabilistic data streams. ACM Trans. Database Syst., 33:26:1--26:30, December 2008. Google ScholarDigital Library
- D. M. Kane, J. Nelson, and D. P. Woodruff. On the exact space complexity of sketching and streaming small norms. In SODA, pages 1161--1178, 2010. Google ScholarDigital Library
- A. McGregor, editor. Open Problems in Data Streams and Related Topics, 2007. http://www.cse.iitk.ac.in/users/sganguly/data-stream-probs.pdf.Google Scholar
- J. Misra and D. Gries. Finding repeated elements. Science of Computer Programming, 2(2):143--152, 1982.Google ScholarCross Ref
- F. Rusu and A. Dobra. Sketching sampled data streams. In ICDE, pages 381--392, 2009. Google ScholarDigital Library
Index Terms
- Space-efficient estimation of statistics over sub-sampled streams
Recommendations
Space-Efficient Estimation of Statistics Over Sub-Sampled Streams
In many stream monitoring situations, the data arrival rate is so high that it is not even possible to observe each element of the stream. The most common solution is to sub-sample the data stream and use the sample to infer properties and estimate ...
The Value of Multiple Read/Write Streams for Approximating Frequency Moments
We consider the read/write streams model, an extension of the standard data stream model in which an algorithm can create and manipulate multiple read/write streams in addition to its input data stream. Like the data stream model, the most important ...
Adaptive stratified reservoir sampling over heterogeneous data streams
Reservoir sampling is a known technique for maintaining a random sample of a fixed size over a data stream of an unknown size. While reservoir sampling is suitable for applications demanding a sample over the whole data stream, it is not designed for ...
Comments