Abstract
Enterprises often need to assess and manage the risk arising from uncertainty in their data. Such uncertainty is typically modeled as a probability distribution over the uncertain data values, specified by means of a complex (often predictive) stochastic model. The probability distribution over data values leads to a probability distribution over database query results, and risk assessment amounts to exploration of the upper or lower tail of a query-result distribution. In this paper, we extend the Monte Carlo Database System to efficiently obtain a set of samples from the tail of a query-result distribution by adapting recent "Gibbs cloning" ideas from the simulation literature to a database setting.
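To make the notion of a query-result tail concrete, here is a minimal sketch of the naive Monte Carlo baseline. It is not the Gibbs cloning method of the paper. The normal model for the uncertain values, the SUM query, and the names `query_result` and `upper_tail_samples` are all assumptions made for illustration.

```python
import random


def query_result(rng, n_rows=50):
    """One Monte Carlo realization of a hypothetical SUM query.

    Each row's uncertain value is drawn from an assumed normal model
    (mean 100, standard deviation 15); the query sums the realized values.
    """
    return sum(rng.gauss(100.0, 15.0) for _ in range(n_rows))


def upper_tail_samples(n_samples, p, seed=0):
    """Return (quantile_estimate, tail) for the upper tail of mass p.

    Naive approach: realize the query n_samples times, sort the results,
    and keep only the largest fraction p of them.
    """
    rng = random.Random(seed)
    results = sorted(query_result(rng) for _ in range(n_samples))
    k = max(1, round(n_samples * p))  # number of samples kept in the tail
    tail = results[-k:]
    return tail[0], tail  # smallest tail sample estimates the (1 - p)-quantile


if __name__ == "__main__":
    quantile, tail = upper_tail_samples(n_samples=10_000, p=0.01)
    print(f"estimated 0.99-quantile: {quantile:.1f}")
    print(f"tail samples kept: {len(tail)} of 10000")
```

In this baseline, only a fraction p of the generated query results land in the tail of interest, so most of the sampling effort is discarded. That cost is what motivates sampling the tail directly.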
Index Terms
- MCDB-R: risk analysis in the database