ABSTRACT
In many applications from telephone fraud detection to network management, data arrives in a stream, and there is a need to maintain a variety of statistical summary information about a large number of customers in an online fashion. At present, such applications maintain basic aggregates such as running extrema values (MIN, MAX), averages, standard deviations, etc., that can be computed over data streams with limited space in a straightforward way. However, many applications require knowledge of more complex aggregates relating different attributes, so-called correlated aggregates. As an example, one might be interested in computing the percentage of international phone calls that are longer than the average duration of a domestic phone call. Exact computation of this aggregate requires multiple passes over the data stream, which is infeasible.
We propose single-pass techniques for approximate computation of correlated aggregates over both landmark and sliding window views of a data stream of tuples, using a very limited amount of space. We consider both the case where the independent aggregate (average duration in the example above) is an extrema value and the case where it is an average value, with any standard aggregate as the dependent aggregate; these can be used as building blocks for more sophisticated aggregates. We present an extensive experimental study based on some real and a wide variety of synthetic data sets to demonstrate the accuracy of our techniques. We show that this effectiveness is explained by the fact that our techniques exploit monotonicity and convergence properties of aggregates over data streams.
- 1.N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. JCSS: Journal of Computer and System Sciences, 58, 1999.]] Google ScholarDigital Library
- 2.K. Alsabti, S.Ranka, and V. Singh. A one-pass algorithm for accurately estimating quantiles for disk-resident data. In VLDB'97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29, 1997, Athens, Greece, pages 346-355, 1997.]] Google ScholarDigital Library
- 3.R. Avnur and J. M. kHellerstein. Eddies: Continuously adaptive query processing. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA, pages 261-272, 2000.]] Google ScholarDigital Library
- 4.D. Chatziantoniou. Ad hoc OLAP: Expression and evaluation. In Proceedings of the IEEE International Conference on Data Engineering, 1999.]] Google ScholarDigital Library
- 5.D. Chatziantoniou, M.Akinde, T. Johnson, and S.Kim. The MD-join: An operator for complex OLAP. In Proceedings of the IEEE International Conference on Data Engineering, 2001]] Google ScholarDigital Library
- 6.D. Chatziantoniou and K. A. Ross. Querying multiple features of groups in relational databases. In Proceedings of the International Conference on Very Large Databases, pages 295-306, 1996.]] Google ScholarDigital Library
- 7.A. Delis, C. Faloutsos, and S. Ghandeharizadeh, editors. SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadephia, PJjennsylvania, USA. ACM Press, 1999.]]Google Scholar
- 8.P.Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining, pages 71-80, Boston, Ma, August 2000. ACM.]] Google ScholarDigital Library
- 9.J. Feigenbaum, R. Kannan, M. Strauss, and M.Viswanathan. An approximate L1-difference algorithm for massive data streams. In IEEE Symposium on Foundations of Computer Science (FOCS), 1999.]] Google ScholarDigital Library
- 10.J. Feigenbaum, S. Kannan, M. Strauss, and M.Viswanathan. Testing and spot-checking of data streams. In Proceedings of the 11th ACM-SIAM symposium on Discrete Algorithms, 2000.]] Google ScholarDigital Library
- 11.A. Feldmann, A. Gilbert, and W. Willinger. jData networks as cascades: Investigating the multifractal nature of internet wan traffic. In ACM SIGCOMM, pages 42-55, 1998.]] Google ScholarDigital Library
- 12.J.Fong and M. Strauss. An approximate L p-difference algorithm for massive data streams. In STACS: Annual Symposium on Theoretical Aspects of Computer Science, 2000.]] Google ScholarDigital Library
- 13.J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. Boat-optimistic decision tree construction. In Delis et al. {7}, pages 169-180.]] Google Scholar
- 14.P. Gibbons, Y. Mattias, and V. Poosala. Fast Incremental Maintenance of Approximate Histograms. Proceedings of VLDB, Athens Greece, pages 466-475, Aug. 1997.]] Google ScholarDigital Library
- 15.P.B. Gibbons and Y. Matias. Synopsis data structures for massive data sets. In Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 909-910, N.Y., Jan. 17-19 1999. ACM-SIAM.]] Google ScholarDigital Library
- 16.S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In Proceedings of the Annual Symposium on Foundations of Computer Science. IEEE, November 2000.]] Google ScholarDigital Library
- 17.P.J. Haas and J.M. Hellerstein. Ripple joins for online aggregation. In SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadephia, Pennsulvania, USA, pages 287-298, 19999.]] Google ScholarDigital Library
- 18.J.M. Hellerstein, P.J. Haas, and H. Wang. Online aggregation. In J. Peckham, editor, SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13-15, 1997, Tucson, Arizona, USA, pages 717-182. ACM Press, 1997.]] Google ScholarDigital Library
- 19.M.R. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. Technical Report 1998-011, Digital Equipment Corporation, Systems Research Center, May, 1998.]]Google Scholar
- 20.G.S. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. In L.M. Haas and A. Tiwary, editors, SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA. pages 426-435. ACM Press, 1998.]] Google ScholarDigital Library
- 21.G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In kDelis et al. {7}, pages 251- 262.]] Google ScholarDigital Library
- 22.C. Olston and J. Widom. Offering a precision-performance tradeoff for aggregation queries over replicated data. In A. E. Abbadi, M. L.Brodie, S. Chakravarthy, U. Dayal, N. Kamel, G. Schlageter, and K.-Y. Whang, editors, VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, September 10-14, 2000, Cairo, Egypt, pages 144-155. Morgan Kaufmann, 2000.]] Google ScholarDigital Library
- 23.V.Raman, B. Raman, and J. M. Hellerstein. Online dynamic reordering for interactive data processing. In VLDB'99 Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK, pages 709-720, 1999.]] Google ScholarDigital Library
Index Terms
- On computing correlated aggregates over continual data streams
Recommendations
On computing correlated aggregates over continual data streams
In many applications from telephone fraud detection to network management, data arrives in a stream, and there is a need to maintain a variety of statistical summary information about a large number of customers in an online fashion. At present, such ...
Estimating statistical aggregates on probabilistic data streams
The probabilistic stream model was introduced by Jayram et al. [2007]. It is a generalization of the data stream model that is suited to handling probabilistic data, where each item of the stream represents a probability distribution over a set of ...
Time-decayed correlated aggregates over data streams
Best of SDM'09Data stream analysis frequently relies on identifying correlations and posing conditional queries on the data after it has been seen. Correlated aggregates form an important example of such queries, which ask for an aggregation over one dimension of ...
Comments