ABSTRACT
The probabilistic-stream model was introduced by Jayram et al. [20].It is a generalization of the data stream model that issuited to handling "probabilistic" data, where each item of the stream represents a probability distribution over a set of possible events. Therefore, a probabilistic stream determines a distribution over apotentially exponential number of classical "deterministic" streams where each item is deterministically one of the domain values.
Designing efficient aggregation algorithms for probabilistic data is crucial for handling uncertainty in data-centric applications such as OLAP. Such algorithms are also useful in a variety of other setting including analyzing search engine traffic and aggregation in sensor networks.
We present algorithms for computing commonly used aggregates ona probabilistic stream. We present the first one pass streaming algorithms for estimating the expected mean of a probabilistic stream, improving upon results in [20]. Next, we consider the problem of estimating frequency moments for probabilistic data. We propose a general approach to obtain unbiased estimators working over probabilistic data by utilizing unbiased estimators designed for standard streams. Applying this approach, we extend a classical data stream algorithm to obtain a one-pass algorithm for estimating F2, the second frequency moment. We present the first known streaming algorithms forestimating F0, the number of distinct items on probabilistic streams.Our work also gives an efficient one-pass algorithm for estimatingthe median of a probabilistic stream.
- N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137--147, 1999. Google ScholarDigital Library
- B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, pages 1--16, 2002. Google ScholarDigital Library
- M. Balazinska, H. Balakrishnan, and M. Stonebraker. Load management and high availability in the medusa distributed stream processing system. In ACM SIGMOD International Conference on Management of Data, pages 929--930, 2004. Google ScholarDigital Library
- Z. Bar-Yossef, T. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In Proc. 6th International Workshop on Randomization and Approximation Techniques in Computer Science, pages 1--10, 2002. Google ScholarDigital Library
- T. Batu, S. Kannan, S. Khanna, and A. McGregor. Reconstructing strings from random traces. In ACM-SIAM Symposium on Discrete Algorithms, pages 910--918, 2004. Google ScholarDigital Library
- D. Burdick, P. Deshpande, T. S. Jayram, R. Ramakrishnan, and S. Vaithyanathan. OLAP over uncertain and imprecise data. In International Conference on Very Large Data Bases (VLDB), pages 970--981, 2005. Google ScholarDigital Library
- D. Burdick, P. M. Deshpande, T. S. Jayram, R. Ramakrishnan, and S. Vaithyanathan. OLAP over uncertain and imprecise data. In International Conference on Very Large Data Bases (VLDB), pages 970--981. VLDB Endowment, 2005. Google ScholarDigital Library
- D. Burdick, P. M. Deshpande, T. S. Jayram, R. Ramakrishnan, and S. Vaithyanathan. Efficient allocation algorithms for OLAP over imprecise data. In International Conference on Very Large Data Bases (VLDB), pages 391--402, 2006. Google ScholarDigital Library
- K. L. Chang and R. Kannan. The space complexity of pass-efficient algorithms for clustering. In ACM-SIAM Symposium on Discrete Algorithms, pages 1157--1166, 2006. Google ScholarDigital Library
- G. Cormode and M. Garofalakis. Sketching probabilistic data streams. In ACM SIGMOD International Conference on Management of Data, 2007. Google ScholarDigital Library
- C. D. Cranor, T. Johnson, O. Spatscheck, and V. Shkapenyuk. Gigascope: A stream database for network applications. In ACM SIGMOD International Conference on Management of Data, pages 647--651, 2003. Google ScholarDigital Library
- N. N. Dalvi and D. Suciu. Efficient query evaluation on probabilistic databases. In International Conference on Very Large Data Bases (VLDB), pages 864--875, 2004. Google ScholarDigital Library
- J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang. Graph distances in the streaming model: the value of space. In ACM-SIAM Symposium on Discrete Algorithms, pages 745--754, 2005. Google ScholarDigital Library
- J. Feigenbaum, S. Kannan, A. McGregor, S. Suri, and J. Zhang. On graph problems in a semi-streaming model. Theoretical Computer Science, 348(2-3):207--216, 2005. Google ScholarDigital Library
- P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci., 31(2):182--209, 1985. Google ScholarDigital Library
- N. Fuhr and T. Roelleke. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Inf. Syst., 15(1):32--66, 1997. Google ScholarDigital Library
- M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In ACM SIGMOD International Conference on Management of Data, pages 58--66, 2001. Google ScholarDigital Library
- S. Guha and A. McGregor. Space-Efficient Sampling. In AISTATS, pages 169--176, 2007.Google Scholar
- P. Indyk. Algorithms for dynamic geometric problems over data streams. In ACM Symposium on Theory of Computing, pages 373--380, 2004. Google ScholarDigital Library
- T. Jayram, S. Kale, and E. Vee. Efficient aggregation algorithms for probabilistic data. In ACM-SIAM Symposium on Discrete Algorithms, 2007. Google ScholarDigital Library
- S. Kannan and A. McGregor. More on reconstructing strings from random traces: Insertions and deletions. In IEEE International Symposium on Information Theory, pages 297--301, 2005.Google ScholarCross Ref
- J. I. Munro and M. Paterson. Selection and sorting with limited storage. Theor. Comput. Sci., 12:315--323, 1980.Google ScholarCross Ref
- S. Muthukrishnan. Data streams: Algorithms and applications. Now Publishers, 2006.Google Scholar
Index Terms
- Estimating statistical aggregates on probabilistic data streams
Recommendations
Estimating statistical aggregates on probabilistic data streams
The probabilistic stream model was introduced by Jayram et al. [2007]. It is a generalization of the data stream model that is suited to handling probabilistic data, where each item of the stream represents a probability distribution over a set of ...
Space-Efficient Estimation of Statistics Over Sub-Sampled Streams
In many stream monitoring situations, the data arrival rate is so high that it is not even possible to observe each element of the stream. The most common solution is to sub-sample the data stream and use the sample to infer properties and estimate ...
Space-efficient estimation of statistics over sub-sampled streams
PODS '12: Proceedings of the 31st ACM SIGMOD-SIGACT-SIGAI symposium on Principles of Database SystemsIn many stream monitoring situations, the data arrival rate is so high that it is not even possible to observe each element of the stream. The most common solution is to sample a small fraction of the data stream and use the sample to infer properties ...
Comments