ABSTRACT
This paper presents algorithms for estimating aggregate functions over a "sliding window" of the N most recent data items in one or more streams. Our results include
For a single stream, we present the first ε-approximation scheme for the number of 1's in a sliding window that is optimal in both worst case time and space. We also present the first ε for the sum of integers in [0..R] in a sliding window that is optimal in both worst case time and space (assuming R is at most polynomial in N). Both algorithms are deterministic and use only logarithmic memory words.
In contrast, we show that an deterministic algorithm that estimates, to within a small constant relative error, the number of 1's (or the sum of integers) in a sliding window over the union of distributed streams requires Ω(N) space.
We present the first randomized (ε,δ)-approximation scheme for the number of 1's in a sliding window over the union of distributed streams that uses only logarithmic memory words. We also present the first (ε,δ)-approximation scheme for the number of distinct values in a sliding window over distributed streams that uses only logarithmic memory words.
Our results are obtained using a novel family of synopsis data structures.
- N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. J. of Computer and System Sciences, 58:137--147, 1999.]] Google ScholarDigital Library
- B. Babcock, M. Datar, and R. Motwani. Sampling from a moving window over streaming data. In Proc. 13th ACM-SIAM Symp. on Discrete Algorithms, pages 633--634, Jan. 2002. Short paper.]] Google ScholarDigital Library
- Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proc. 13th ACM-SIAM Symp. on Discrete Algorithms, pages 623--632, Jan. 2002.]] Google ScholarDigital Library
- M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. In Proc. 13th ACM-SIAM Symp. on Discrete Algorithms, pages 635--644, Jan. 2002.]] Google ScholarDigital Library
- J. Feigenbaum, S. Kannan, M. Strauss, and M. Viswanathan. An approximate L1-difference algorithm for massive data streams. In Proc. 40th IEEE Symp. on Foundations of Computer Science, pages 501--511, Oct. 1999.]] Google ScholarDigital Library
- J. Feigenbaum, S. Kannan, M. Strauss, and M. Viswanathan. Testing and spot-checking of data streams. In Proc. 11th ACM-SIAM Symp. on Discrete Algorithms, pages 165--174, Jan. 2000.]] Google ScholarDigital Library
- P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Computer and System Sciences, 31:182--209, 1985.]] Google ScholarDigital Library
- J. Fong and M. Strauss. An approximate Lp-difference algorithm for massive data streams. In Proc. 17th Symp. on Theoretical Aspects of Computer Science, LNCS 1770, pages 193--204. Springer, Feb. 2000.]] Google ScholarDigital Library
- J. Gehrke, F. Korn, and D. Srivastava. On computing correlated aggregates over continual data streams. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 13--24, May 2001.]] Google ScholarDigital Library
- P. B. Gibbons. Distinct sampling for highly-accurate answers to distinct values queries and event reports. In Proc. 27th International Conf. on Very Large Data Bases, pages 541--550, Sept. 2001.]] Google ScholarDigital Library
- P. B. Gibbons and Y. Matias. New sampling-based summary statistics for improving approximate query answers. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 331--342, June 1998.]] Google ScholarDigital Library
- P. B. Gibbons and Y. Matias. Synopsis data structures for massive data sets. In J. M. Abello and J. S. Vitter, editors, External Memory Algorithms, pages 39--70. AMS, 1999. DIMACS: Series in Discrete Mathematics and Theoretical Computer Science, Vol. 50. A two page summary appeared as a short paper in SODA'99.]] Google ScholarDigital Library
- P. B. Gibbons and S. Tirthapura. Estimating simple functions on the union of data streams. In Proc. 13th ACM Symp. on Parallel Algorithms and Architectures, pages 281--290, July 2001.]] Google ScholarDigital Library
- A. C. Gilbert, Y. Kotidis, and M. J. Strauss. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In Proc. 27th International Conf. on Very Large Data Bases, pages 79--88, Sept. 2001.]] Google ScholarDigital Library
- M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proc. ACM SIGMOD International Conf. on Management of Data, pages 58--66, May 2001.]] Google ScholarDigital Library
- S. Guha, N. Koudas, and K. Shim. Data-streams and histograms. In Proc. 33rd ACM Symp. on Theory of Computing, pages 471--475, July 2001.]] Google ScholarDigital Library
- S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In Proc. 41st IEEE Symp. on Foundations of Computer Science, pages 359--366, Nov. 2000.]] Google ScholarDigital Library
- M. R. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. Technical report, Digital Systems Research Center, Palo Alto, CA, May 1998.]]Google Scholar
- P. Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proc. 41st IEEE Symp. on Foundations of Computer Science, pages 189--197, Nov. 2000.]] Google ScholarDigital Library
- P. Indyk. Personal communication, 2002.]]Google Scholar
- I. Kremer, N. Nisan, and D. Ron. On randomized one-round communication complexity. Computational Complexity, 8(1):21--49, 1999.]] Google ScholarDigital Library
- E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge University Press, Cambridge, UK, 1997.]] Google ScholarDigital Library
- I. Newman. Private vs. common random bits in communication complexity. Information Processing Letters, 39:67--71, 1991.]] Google ScholarDigital Library
- I. Newman and M. Szegedy. Public vs. private coin flips in one round communication games. In Proc. 28th ACM Symp. on the Theory of Computing, pages 561--570, May 1996.]] Google ScholarDigital Library
- L. Trevisan. A note on counting distinct elements in the streaming model. Unpublished manuscript, April 2001.]]Google Scholar
Index Terms
- Distributed streams algorithms for sliding windows
Recommendations
Optimal sampling from sliding windows
PODS '09: Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systemsA sliding windows model is an important case of the streaming model, where only the most "recent" elements remain active and the rest are discarded in a stream. The sliding windows model is important for many applications (see, e.g., Babcock, Babu, ...
Optimal sampling from sliding windows
A sliding windows model is an important case of the streaming model, where only the most ''recent'' elements remain active and the rest are discarded. The sliding windows model is important for many applications (see, e.g., Babcock, Babu, Datar, Motwani ...
Variance estimation over sliding windows
PODS '07: Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systemsCapturing characteristics of large data streams has received considerable attention. The constraints in space and time restrict the data stream processing to only one pass (or a small number of passes). Processing data streams over sliding windows make ...
Comments