Article

On computing correlated aggregates over continual data streams

Authors:
Johannes Gehrke

Cornell University

Cornell University
View Profile

,
Flip Korn

AT&T Labs-Research

AT&T Labs-Research
View Profile

,
Divesh Srivastava

AT&T Labs-Research

AT&T Labs-Research
View Profile

SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of dataMay 2001Pages 13–24https://doi.org/10.1145/375663.375665

Published:01 May 2001Publication History

SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data

Pages 13–24

ABSTRACT

In many applications from telephone fraud detection to network management, data arrives in a stream, and there is a need to maintain a variety of statistical summary information about a large number of customers in an online fashion. At present, such applications maintain basic aggregates such as running extrema values (MIN, MAX), averages, standard deviations, etc., that can be computed over data streams with limited space in a straightforward way. However, many applications require knowledge of more complex aggregates relating different attributes, so-called correlated aggregates. As an example, one might be interested in computing the percentage of international phone calls that are longer than the average duration of a domestic phone call. Exact computation of this aggregate requires multiple passes over the data stream, which is infeasible.

We propose single-pass techniques for approximate computation of correlated aggregates over both landmark and sliding window views of a data stream of tuples, using a very limited amount of space. We consider both the case where the independent aggregate (average duration in the example above) is an extrema value and the case where it is an average value, with any standard aggregate as the dependent aggregate; these can be used as building blocks for more sophisticated aggregates. We present an extensive experimental study based on some real and a wide variety of synthetic data sets to demonstrate the accuracy of our techniques. We show that this effectiveness is explained by the fact that our techniques exploit monotonicity and convergence properties of aggregates over data streams.

References

1.N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. JCSS: Journal of Computer and System Sciences, 58, 1999.]] Google ScholarDigital Library
2.K. Alsabti, S.Ranka, and V. Singh. A one-pass algorithm for accurately estimating quantiles for disk-resident data. In VLDB'97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29, 1997, Athens, Greece, pages 346-355, 1997.]] Google ScholarDigital Library
3.R. Avnur and J. M. kHellerstein. Eddies: Continuously adaptive query processing. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA, pages 261-272, 2000.]] Google ScholarDigital Library
4.D. Chatziantoniou. Ad hoc OLAP: Expression and evaluation. In Proceedings of the IEEE International Conference on Data Engineering, 1999.]] Google ScholarDigital Library
5.D. Chatziantoniou, M.Akinde, T. Johnson, and S.Kim. The MD-join: An operator for complex OLAP. In Proceedings of the IEEE International Conference on Data Engineering, 2001]] Google ScholarDigital Library
6.D. Chatziantoniou and K. A. Ross. Querying multiple features of groups in relational databases. In Proceedings of the International Conference on Very Large Databases, pages 295-306, 1996.]] Google ScholarDigital Library
7.A. Delis, C. Faloutsos, and S. Ghandeharizadeh, editors. SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadephia, PJjennsylvania, USA. ACM Press, 1999.]]Google Scholar
8.P.Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining, pages 71-80, Boston, Ma, August 2000. ACM.]] Google ScholarDigital Library
9.J. Feigenbaum, R. Kannan, M. Strauss, and M.Viswanathan. An approximate L1-difference algorithm for massive data streams. In IEEE Symposium on Foundations of Computer Science (FOCS), 1999.]] Google ScholarDigital Library
10.J. Feigenbaum, S. Kannan, M. Strauss, and M.Viswanathan. Testing and spot-checking of data streams. In Proceedings of the 11th ACM-SIAM symposium on Discrete Algorithms, 2000.]] Google ScholarDigital Library
11.A. Feldmann, A. Gilbert, and W. Willinger. jData networks as cascades: Investigating the multifractal nature of internet wan traffic. In ACM SIGCOMM, pages 42-55, 1998.]] Google ScholarDigital Library
12.J.Fong and M. Strauss. An approximate L p-difference algorithm for massive data streams. In STACS: Annual Symposium on Theoretical Aspects of Computer Science, 2000.]] Google ScholarDigital Library
13.J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. Boat-optimistic decision tree construction. In Delis et al. {7}, pages 169-180.]] Google Scholar
14.P. Gibbons, Y. Mattias, and V. Poosala. Fast Incremental Maintenance of Approximate Histograms. Proceedings of VLDB, Athens Greece, pages 466-475, Aug. 1997.]] Google ScholarDigital Library
15.P.B. Gibbons and Y. Matias. Synopsis data structures for massive data sets. In Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 909-910, N.Y., Jan. 17-19 1999. ACM-SIAM.]] Google ScholarDigital Library
16.S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In Proceedings of the Annual Symposium on Foundations of Computer Science. IEEE, November 2000.]] Google ScholarDigital Library
17.P.J. Haas and J.M. Hellerstein. Ripple joins for online aggregation. In SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadephia, Pennsulvania, USA, pages 287-298, 19999.]] Google ScholarDigital Library
18.J.M. Hellerstein, P.J. Haas, and H. Wang. Online aggregation. In J. Peckham, editor, SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13-15, 1997, Tucson, Arizona, USA, pages 717-182. ACM Press, 1997.]] Google ScholarDigital Library
19.M.R. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. Technical Report 1998-011, Digital Equipment Corporation, Systems Research Center, May, 1998.]]Google Scholar
20.G.S. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. In L.M. Haas and A. Tiwary, editors, SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA. pages 426-435. ACM Press, 1998.]] Google ScholarDigital Library
21.G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In kDelis et al. {7}, pages 251- 262.]] Google ScholarDigital Library
22.C. Olston and J. Widom. Offering a precision-performance tradeoff for aggregation queries over replicated data. In A. E. Abbadi, M. L.Brodie, S. Chakravarthy, U. Dayal, N. Kamel, G. Schlageter, and K.-Y. Whang, editors, VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, September 10-14, 2000, Cairo, Egypt, pages 144-155. Morgan Kaufmann, 2000.]] Google ScholarDigital Library
23.V.Raman, B. Raman, and J. M. Hellerstein. Online dynamic reordering for interactive data processing. In VLDB'99 Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK, pages 709-720, 1999.]] Google ScholarDigital Library

Index Terms

On computing correlated aggregates over continual data streams
1. Information systems
  1. Data management systems
    1. Database administration

Recommendations

On computing correlated aggregates over continual data streams

In many applications from telephone fraud detection to network management, data arrives in a stream, and there is a need to maintain a variety of statistical summary information about a large number of customers in an online fashion. At present, such ...
Read More
Estimating statistical aggregates on probabilistic data streams

The probabilistic stream model was introduced by Jayram et al. [2007]. It is a generalization of the data stream model that is suited to handling probabilistic data, where each item of the stream represents a probability distribution over a set of ...
Read More
Time-decayed correlated aggregates over data streams
Best of SDM'09

Data stream analysis frequently relies on identifying correlations and posing conditional queries on the data after it has been seen. Correlated aggregates form an important example of such queries, which ask for an aggregation over one dimension of ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data
May 2001
630 pages
ISBN:1581133324
DOI:10.1145/375663
Editors:
Timos Sellis,
Sharad Mehrotra
ACM SIGMOD Record Volume 30, Issue 2
June 2001
625 pages
ISSN:0163-5808
DOI:10.1145/376284
Editors:
Timos Sellis
National Technical Univ. of Athens
,
Sharad Mehrotra
Univ. of California at Irvine
Issue’s Table of Contents
Copyright © 2001 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 May 2001
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- Article
Conference

Acceptance Rates
SIGMOD '01 Paper Acceptance Rate44of293submissions,15%Overall Acceptance Rate785of4,003submissions,20%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 208
  Total Citations
  View Citations
- 1,158
  Total Downloads
- Downloads (Last 12 months)12
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

On computing correlated aggregates over continual data streams

SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data

ABSTRACT

References

Cited By

Index Terms

Recommendations

On computing correlated aggregates over continual data streams

Estimating statistical aggregates on probabilistic data streams

Time-decayed correlated aggregates over data streams