skip to main content
10.1145/375663.375665acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article

On computing correlated aggregates over continual data streams

Published:01 May 2001Publication History

ABSTRACT

In many applications from telephone fraud detection to network management, data arrives in a stream, and there is a need to maintain a variety of statistical summary information about a large number of customers in an online fashion. At present, such applications maintain basic aggregates such as running extrema values (MIN, MAX), averages, standard deviations, etc., that can be computed over data streams with limited space in a straightforward way. However, many applications require knowledge of more complex aggregates relating different attributes, so-called correlated aggregates. As an example, one might be interested in computing the percentage of international phone calls that are longer than the average duration of a domestic phone call. Exact computation of this aggregate requires multiple passes over the data stream, which is infeasible.

We propose single-pass techniques for approximate computation of correlated aggregates over both landmark and sliding window views of a data stream of tuples, using a very limited amount of space. We consider both the case where the independent aggregate (average duration in the example above) is an extrema value and the case where it is an average value, with any standard aggregate as the dependent aggregate; these can be used as building blocks for more sophisticated aggregates. We present an extensive experimental study based on some real and a wide variety of synthetic data sets to demonstrate the accuracy of our techniques. We show that this effectiveness is explained by the fact that our techniques exploit monotonicity and convergence properties of aggregates over data streams.

References

  1. 1.N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. JCSS: Journal of Computer and System Sciences, 58, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. 2.K. Alsabti, S.Ranka, and V. Singh. A one-pass algorithm for accurately estimating quantiles for disk-resident data. In VLDB'97, Proceedings of 23rd International Conference on Very Large Data Bases, August 25-29, 1997, Athens, Greece, pages 346-355, 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. 3.R. Avnur and J. M. kHellerstein. Eddies: Continuously adaptive query processing. In Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA, pages 261-272, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. 4.D. Chatziantoniou. Ad hoc OLAP: Expression and evaluation. In Proceedings of the IEEE International Conference on Data Engineering, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. 5.D. Chatziantoniou, M.Akinde, T. Johnson, and S.Kim. The MD-join: An operator for complex OLAP. In Proceedings of the IEEE International Conference on Data Engineering, 2001]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. 6.D. Chatziantoniou and K. A. Ross. Querying multiple features of groups in relational databases. In Proceedings of the International Conference on Very Large Databases, pages 295-306, 1996.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. 7.A. Delis, C. Faloutsos, and S. Ghandeharizadeh, editors. SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadephia, PJjennsylvania, USA. ACM Press, 1999.]]Google ScholarGoogle Scholar
  8. 8.P.Domingos and G. Hulten. Mining high-speed data streams. In Proceedings of the Sixth International Conference on Knowledge Discovery and Data Mining, pages 71-80, Boston, Ma, August 2000. ACM.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. 9.J. Feigenbaum, R. Kannan, M. Strauss, and M.Viswanathan. An approximate L1-difference algorithm for massive data streams. In IEEE Symposium on Foundations of Computer Science (FOCS), 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. 10.J. Feigenbaum, S. Kannan, M. Strauss, and M.Viswanathan. Testing and spot-checking of data streams. In Proceedings of the 11th ACM-SIAM symposium on Discrete Algorithms, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. 11.A. Feldmann, A. Gilbert, and W. Willinger. jData networks as cascades: Investigating the multifractal nature of internet wan traffic. In ACM SIGCOMM, pages 42-55, 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. 12.J.Fong and M. Strauss. An approximate L p-difference algorithm for massive data streams. In STACS: Annual Symposium on Theoretical Aspects of Computer Science, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. 13.J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. Boat-optimistic decision tree construction. In Delis et al. {7}, pages 169-180.]] Google ScholarGoogle Scholar
  14. 14.P. Gibbons, Y. Mattias, and V. Poosala. Fast Incremental Maintenance of Approximate Histograms. Proceedings of VLDB, Athens Greece, pages 466-475, Aug. 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. 15.P.B. Gibbons and Y. Matias. Synopsis data structures for massive data sets. In Proceedings of the Tenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 909-910, N.Y., Jan. 17-19 1999. ACM-SIAM.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. 16.S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In Proceedings of the Annual Symposium on Foundations of Computer Science. IEEE, November 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. 17.P.J. Haas and J.M. Hellerstein. Ripple joins for online aggregation. In SIGMOD 1999, Proceedings ACM SIGMOD International Conference on Management of Data, June 1-3, 1999, Philadephia, Pennsulvania, USA, pages 287-298, 19999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. 18.J.M. Hellerstein, P.J. Haas, and H. Wang. Online aggregation. In J. Peckham, editor, SIGMOD 1997, Proceedings ACM SIGMOD International Conference on Management of Data, May 13-15, 1997, Tucson, Arizona, USA, pages 717-182. ACM Press, 1997.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. 19.M.R. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. Technical Report 1998-011, Digital Equipment Corporation, Systems Research Center, May, 1998.]]Google ScholarGoogle Scholar
  20. 20.G.S. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. In L.M. Haas and A. Tiwary, editors, SIGMOD 1998, Proceedings ACM SIGMOD International Conference on Management of Data, June 2-4, 1998, Seattle, Washington, USA. pages 426-435. ACM Press, 1998.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. 21.G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In kDelis et al. {7}, pages 251- 262.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. 22.C. Olston and J. Widom. Offering a precision-performance tradeoff for aggregation queries over replicated data. In A. E. Abbadi, M. L.Brodie, S. Chakravarthy, U. Dayal, N. Kamel, G. Schlageter, and K.-Y. Whang, editors, VLDB 2000, Proceedings of 26th International Conference on Very Large Data Bases, September 10-14, 2000, Cairo, Egypt, pages 144-155. Morgan Kaufmann, 2000.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. 23.V.Raman, B. Raman, and J. M. Hellerstein. Online dynamic reordering for interactive data processing. In VLDB'99 Proceedings of 25th International Conference on Very Large Data Bases, September 7-10, 1999, Edinburgh, Scotland, UK, pages 709-720, 1999.]] Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. On computing correlated aggregates over continual data streams

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data
      May 2001
      630 pages
      ISBN:1581133324
      DOI:10.1145/375663

      Copyright © 2001 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 1 May 2001

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • Article

      Acceptance Rates

      SIGMOD '01 Paper Acceptance Rate44of293submissions,15%Overall Acceptance Rate785of4,003submissions,20%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader