skip to main content
10.1145/1265530.1265565acmconferencesArticle/Chapter ViewAbstractPublication PagespodsConference Proceedingsconference-collections
Article

Estimating statistical aggregates on probabilistic data streams

Published:11 June 2007Publication History

ABSTRACT

The probabilistic-stream model was introduced by Jayram et al. [20].It is a generalization of the data stream model that issuited to handling "probabilistic" data, where each item of the stream represents a probability distribution over a set of possible events. Therefore, a probabilistic stream determines a distribution over apotentially exponential number of classical "deterministic" streams where each item is deterministically one of the domain values.

Designing efficient aggregation algorithms for probabilistic data is crucial for handling uncertainty in data-centric applications such as OLAP. Such algorithms are also useful in a variety of other setting including analyzing search engine traffic and aggregation in sensor networks.

We present algorithms for computing commonly used aggregates ona probabilistic stream. We present the first one pass streaming algorithms for estimating the expected mean of a probabilistic stream, improving upon results in [20]. Next, we consider the problem of estimating frequency moments for probabilistic data. We propose a general approach to obtain unbiased estimators working over probabilistic data by utilizing unbiased estimators designed for standard streams. Applying this approach, we extend a classical data stream algorithm to obtain a one-pass algorithm for estimating F2, the second frequency moment. We present the first known streaming algorithms forestimating F0, the number of distinct items on probabilistic streams.Our work also gives an efficient one-pass algorithm for estimatingthe median of a probabilistic stream.

References

  1. N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137--147, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 1--16, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. M. Balazinska, H. Balakrishnan, and M. Stonebraker. Load management and high availability in the medusa distributed stream processing system. In ACM SIGMOD International Conference on Management of Data, pages 929--930, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Z. Bar-Yossef, T. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In Proc. 6th International Workshop on Randomization and Approximation Techniques in Computer Science, pages 1--10, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. T. Batu, S. Kannan, S. Khanna, and A. McGregor. Reconstructing strings from random traces. In ACM-SIAM Symposium on Discrete Algorithms, pages 910--918, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. D. Burdick, P. Deshpande, T. S. Jayram, R. Ramakrishnan, and S. Vaithyanathan. OLAP over uncertain and imprecise data. In International Conference on Very Large Data Bases (VLDB), pages 970--981, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. D. Burdick, P. M. Deshpande, T. S. Jayram, R. Ramakrishnan, and S. Vaithyanathan. OLAP over uncertain and imprecise data. In International Conference on Very Large Data Bases (VLDB), pages 970--981. VLDB Endowment, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. D. Burdick, P. M. Deshpande, T. S. Jayram, R. Ramakrishnan, and S. Vaithyanathan. Efficient allocation algorithms for OLAP over imprecise data. In International Conference on Very Large Data Bases (VLDB), pages 391--402, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. K. L. Chang and R. Kannan. The space complexity of pass-efficient algorithms for clustering. In ACM-SIAM Symposium on Discrete Algorithms, pages 1157--1166, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. G. Cormode and M. Garofalakis. Sketching probabilistic data streams. In ACM SIGMOD International Conference on Management of Data, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. C. D. Cranor, T. Johnson, O. Spatscheck, and V. Shkapenyuk. Gigascope: A stream database for network applications. In ACM SIGMOD International Conference on Management of Data, pages 647--651, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. N. N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In International Conference on Very Large Data Bases (VLDB), pages 864--875, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang. Graph distances in the streaming model: the value of space. In ACM-SIAM Symposium on Discrete Algorithms, pages 745--754, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang. On graph problems in a semi-streaming model. Theoretical Computer Science, 348(2-3):207--216, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci., 31(2):182--209, 1985. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. N. Fuhr and T. Roelleke. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Inf. Syst., 15(1):32--66, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In ACM SIGMOD International Conference on Management of Data, pages 58--66, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. S. Guha and A. McGregor. Space-Efficient Sampling. In AISTATS, pages 169--176, 2007.Google ScholarGoogle Scholar
  19. P. Indyk. Algorithms for dynamic geometric problems over data streams. In ACM Symposium on Theory of Computing, pages 373--380, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. T. Jayram, S. Kale, and E. Vee. Efficient aggregation algorithms for probabilistic data. In ACM-SIAM Symposium on Discrete Algorithms, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. S. Kannan and A. McGregor. More on reconstructing strings from random traces: Insertions and deletions. In IEEE International Symposium on Information Theory, pages 297--301, 2005.Google ScholarGoogle ScholarCross RefCross Ref
  22. J. I. Munro and M. Paterson. Selection and sorting with limited storage. Theor. Comput. Sci., 12:315--323, 1980.Google ScholarGoogle ScholarCross RefCross Ref
  23. S. Muthukrishnan. Data streams: Algorithms and applications. Now Publishers, 2006.Google ScholarGoogle Scholar

Index Terms

  1. Estimating statistical aggregates on probabilistic data streams

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      PODS '07: Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
      June 2007
      328 pages
      ISBN:9781595936851
      DOI:10.1145/1265530

      Copyright © 2007 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 11 June 2007

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • Article

      Acceptance Rates

      PODS '07 Paper Acceptance Rate28of187submissions,15%Overall Acceptance Rate642of2,707submissions,24%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader