
The Monte Carlo database system: Stochastic analysis close to the data

Authors:
Ravi Jampani (University of Florida, Gainesville, FL)
Fei Xu (Microsoft Corporation, Redmond, WA)
Mingxi Wu (Oracle Corporation, Redwood Shores, CA)
Luis Perez (Rice University, Houston, TX)
Chris Jermaine (Rice University, Houston, TX)
Peter J. Haas (IBM Almaden Research Center, Armonk, NY)

ACM Transactions on Database Systems, Volume 36, Issue 3, Article No. 18, pp. 1–41. https://doi.org/10.1145/2000824.2000828

Published: 26 August 2011

Abstract

The application of stochastic models and analysis techniques to large datasets is now commonplace. Unfortunately, in practice this usually means extracting data from a database system into an external tool (such as SAS, R, Arena, or Matlab), and then running the analysis there. This extract-and-model paradigm is typically error-prone, slow, does not support fine-grained modeling, and discourages what-if and sensitivity analyses.

In this article we describe MCDB, a database system that permits a wide spectrum of stochastic models to be used in conjunction with the data stored in a large database, without ever extracting the data. MCDB facilitates in-database execution of tasks such as risk assessment, prediction, and imputation of missing data, as well as management of errors due to data integration, information extraction, and privacy-preserving data anonymization. MCDB allows a user to define “random” relations whose contents are determined by stochastic models. The models can then be queried using standard SQL. Monte Carlo techniques are used to analyze the probability distribution of the result of an SQL query over random relations. Novel “tuple-bundle” processing techniques can effectively control the Monte Carlo overhead, as shown in our experiments.
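
The Monte Carlo idea described above can be illustrated with a minimal Python sketch. It is not MCDB's implementation or syntax: the `CUSTOMERS` table, the Gamma demand model, the shape parameter, and every function name are hypothetical choices made for illustration. The first query regenerates the random relation once per simulated database instance. The second shows one plausible reading of "tuple-bundle" processing, in which each tuple carries all of its sampled values so that the query runs once over bundles; the abstract does not spell out the mechanism, so treat that part as an assumption.

```python
import random
import statistics

# Hypothetical deterministic table: (customer_id, mean_demand) rows.
CUSTOMERS = [(1, 10.0), (2, 25.0), (3, 5.0)]


def sample_demand(rng, mean_demand, shape=2.0):
    """Draw one demand value from a Gamma distribution with the given mean."""
    # random.gammavariate(alpha, beta) has mean alpha * beta.
    return rng.gammavariate(shape, mean_demand / shape)


def instantiate_random_relation(rng, customers):
    """Generate one instance of the random relation (customer_id, demand)."""
    return [(cid, sample_demand(rng, mean)) for cid, mean in customers]


def total_demand(relation):
    """Stand-in for: SELECT SUM(demand) FROM random_demand."""
    return sum(demand for _cid, demand in relation)


def monte_carlo_query(customers, n_worlds=1000, seed=42):
    """Naive approach: rebuild the relation and rerun the query per instance."""
    rng = random.Random(seed)
    return [
        total_demand(instantiate_random_relation(rng, customers))
        for _ in range(n_worlds)
    ]


def tuple_bundle_query(customers, n_worlds=1000, seed=42):
    """Assumed bundle-style approach: one pass over tuples, n_worlds values each."""
    rng = random.Random(seed)
    totals = [0.0] * n_worlds
    for _cid, mean in customers:
        # One "bundle": all sampled values of this tuple's demand attribute.
        bundle = [sample_demand(rng, mean) for _ in range(n_worlds)]
        for i, value in enumerate(bundle):
            totals[i] += value
    return totals


def summarize(samples):
    """Estimate features of the query-result distribution from the samples."""
    ordered = sorted(samples)
    n = len(ordered)
    return {
        "mean": statistics.fmean(ordered),
        "stdev": statistics.stdev(ordered),
        "p05": ordered[int(0.05 * (n - 1))],
        "p95": ordered[int(0.95 * (n - 1))],
    }


if __name__ == "__main__":
    print(summarize(monte_carlo_query(CUSTOMERS)))
    print(summarize(tuple_bundle_query(CUSTOMERS)))
```

Both functions estimate the same distribution of the query result; they differ only in whether the work is organized per generated instance or per tuple. The sampled totals can then be summarized by a mean, a spread, or quantiles, which is the kind of output a risk-assessment query would need.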

Supplemental Material

jampani.zip (173.2 KB): supplemental movie, image, and appendix files for the article.

References

Agrawal, P., Benjelloun, O., Sarma, A. D., Hayworth, C., Nabar, S. U., Sugihara, T., and Widom, J. 2006. Trio: A system for data, uncertainty, and lineage. In Proceedings of the International Conference on Very Large Databases (VLDB'06).
Alur, N. R., Haas, P. J., Momiroska, D., Read, P., Summers, N. H., Totanes, V., and Zuzarte, C. 2002. DB2 UDB's High Function Business Intelligence in e-Business. IBM Redbook Series.
Andritsos, P., Fuxman, A., and Miller, R. J. 2006. Clean answers over dirty databases: A probabilistic approach. In Proceedings of the International Conference on Data Engineering (ICDE'06). 30.
Antova, L., Jansen, T., Koch, C., and Olteanu, D. 2008. Fast and simple relational processing of uncertain data. In Proceedings of the International Conference on Data Engineering (ICDE'08). 983–992.
Antova, L., Koch, C., and Olteanu, D. 2007. MayBMS: Managing incomplete information with probabilistic world-set decompositions. In Proceedings of the International Conference on Data Engineering (ICDE'07). 1479–1480.
Apache Mahout. 2010. Apache Mahout. http://lucene.apache.org/mahout/
Arumugam, S., Jampani, R., Perez, L. L., Xu, F., Jermaine, C. M., and Haas, P. J. 2010. MCDB-R: Risk analysis in the database. In Proceedings of the International Conference on Very Large Databases (VLDB'10). 782–793.
Asmussen, S. and Glynn, P. W. 2007. Stochastic Simulation: Algorithms and Analysis. Springer.
Barbara, D., Garcia-Molina, H., and Porter, D. 1992. The management of probabilistic data. IEEE Trans. Knowl. Data Engin. 4, 5, 487–502.
Benjelloun, O., Sarma, A. D., Halevy, A. Y., Theobald, M., and Widom, J. 2008. Databases with uncertainty and lineage. VLDB J. 17, 2, 243–264.
Biller, B. and Nelson, B. L. 2003. Modeling and generating multivariate time-series input processes using a vector autoregressive technique. ACM Trans. Model. Comput. Simul. 13, 3, 211–237.
Blei, D. M., Griffiths, T. L., Jordan, M. I., and Tenenbaum, J. B. 2003. Hierarchical topic models and the nested Chinese restaurant process. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS'03).
Cheng, R., Singh, S., and Prabhakar, S. 2005. U-DBMS: A database system for managing constantly-evolving data. In Proceedings of the International Conference on Very Large Databases (VLDB'05). 1271–1274.
Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G. R., Ng, A. Y., and Olukotun, K. 2006a. Map-Reduce for machine learning on multicore. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS'06). 281–288.
Chu, D., Deshpande, A., Hellerstein, J. M., and Hong, W. 2006b. Approximate data collection in sensor networks using probabilistic models. In Proceedings of the International Conference on Data Engineering (ICDE'06). 48.
Cohen, J., Dolan, B., Dunlap, M., Hellerstein, J. M., and Welton, C. 2009. MAD skills: New analysis practices for big data. Proc. VLDB 2, 2, 1481–1492.
Cox, D. R. 1952. Estimation by double sampling. Biometrika 39, 3-4, 217–227.
Dalvi, N. N., Re, C., and Suciu, D. 2009. Probabilistic databases: Diamonds in the dirt. Comm. ACM 52, 7, 86–94.
Dalvi, N. N. and Suciu, D. 2007a. The dichotomy of conjunctive queries on probabilistic structures. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'07). 293–302.
Dalvi, N. N. and Suciu, D. 2007b. Efficient query evaluation on probabilistic databases. VLDB J. 16, 4, 523–544.
Dalvi, N. N. and Suciu, D. 2007c. Management of probabilistic data: Foundations and challenges. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'07). 1–12.
Das Sarma, A., Benjelloun, O., Halevy, A. Y., Nabar, S. U., and Widom, J. 2009. Representing uncertain data: Models, properties, and algorithms. VLDB J. 18, 5, 989–1019.
Das Sarma, A., Theobald, M., and Widom, J. 2008. Exploiting lineage for confidence computation in uncertain and probabilistic databases. In Proceedings of the International Conference on Data Engineering (ICDE'08). 1023–1032.
Deshpande, A. and Madden, S. 2006. MauveDB: Supporting model-based user views in database systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 73–84.
Devroye, L. 1986. Non-Uniform Random Variate Generation. Springer.
Dong, X. L., Halevy, A. Y., and Yu, C. 2009. Data integration with uncertainty. VLDB J. 18, 2, 469–500.
Fishman, G. 1996. Monte Carlo: Concepts, Algorithms, and Applications. Springer.
Fuhr, N. and Rolleke, T. 1997. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Inf. Syst. 15, 1, 32–66.
Gentle, J. E. 2003. Random Number Generation and Monte Carlo Methods 2nd Ed. Springer.
Getoor, L. and Taskar, B., Eds. 2007. Introduction to Statistical Relational Learning. MIT Press.
Griffiths, T. and Ghahramani, Z. 2005. Infinite latent feature models and the Indian buffet process. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS'05).
Guha, S. 2010. RHIPE: R and Hadoop integrated processing environment. http://ml.stat.purdue.edu/rhipe/
Gupta, R. and Sarawagi, S. 2006. Creating probabilistic databases from information extraction models. In Proceedings of the International Conference on Very Large Databases (VLDB'06). 965–976.
Henderson, S. G. and Nelson, B. L., Eds. 2006. Simulation. North-Holland.
Jampani, R., Xu, F., Wu, M., Perez, L. L., Jermaine, C. M., and Haas, P. J. 2008. MCDB: A Monte Carlo approach to managing uncertain data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 687–700.
Kennedy, O. and Koch, C. 2010. PIP: A database system for great and small expectations. In Proceedings of the International Conference on Data Engineering (ICDE'10). 157–168.
Kimelfeld, B., Kosharovsky, Y., and Sagiv, Y. 2009. Query evaluation over probabilistic XML. VLDB J. 18, 5, 1117–1140.
Koch, C. and Olteanu, D. 2008. Conditioning probabilistic databases. In Proceedings of the International Conference on Very Large Databases (VLDB'08).
Lehmann, E. L. and Casella, G. 1998. Theory of Point Estimation 2nd Ed. Springer.
Michelakis, E., Krishnamurthy, R., Haas, P. J., and Vaithyanathan, S. 2009. Uncertainty management in rule-based information extraction systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 101–114.
Miller, Jr., R. G. 1986. Beyond ANOVA, Basics of Applied Statistics. Wiley.
Murthy, R. and Widom, J. 2007. Making aggregation work in uncertain and probabilistic databases. In Proceedings of the 1st International VLDB Workshop on Management of Uncertain Data (MUD'07). 76–90.
Nelsen, R. B. 2006. An Introduction to Copulas 1st Ed. Springer Series in Statistics. Springer.
O'Hagan, A. and Forster, J. J. 2004. Bayesian Inference 2nd Ed. Volume 2B of Kendall's Advanced Theory of Statistics. Arnold.
Panneton, F., L'Ecuyer, P., and Matsumoto, M. 2006. Improved long-period generators based on linear recurrences modulo 2. ACM Trans. Math. Softw. 32, 1, 1–16.
Perez, L. L., Arumugam, S., and Jermaine, C. M. 2010. Evaluation of probabilistic threshold queries in MCDB. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 687–698.
Re, C., Dalvi, N. N., and Suciu, D. 2006. Query evaluation on probabilistic databases. IEEE Data Engin. Bull. 29, 1, 25–31.
Re, C., Dalvi, N. N., and Suciu, D. 2007. Efficient top-k query evaluation on probabilistic data. In Proceedings of the International Conference on Data Engineering (ICDE'07). 886–895.
Re, C. and Suciu, D. 2008. Managing probabilistic data with MystiQ: The can-do, the could-do, and the can't-do. In Proceedings of the SUM'08 Conference. 5–18.
Re, C. and Suciu, D. 2009. The trichotomy of HAVING queries on a probabilistic database. VLDB J. 18, 5, 1091–1116.
Robert, C. P. and Casella, G. 2004. Monte Carlo Statistical Methods 2nd Ed. Springer.
Sen, P., Deshpande, A., and Getoor, L. 2009. PrDB: Managing and exploiting rich correlations in probabilistic databases. VLDB J. 18, 5, 1065–1090.
Singh, S., Mayfield, C., Mittal, S., Prabhakar, S., Hambrusch, S. E., and Shah, R. 2008. Orion 2.0: Native support for uncertain data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1239–1242.
Srinivasan, A., Ceperley, D. M., and Mascagni, M. 1997. Random number generators for parallel applications. In Monte Carlo Methods in Chemical Physics, Wiley, 13–36.
Stonebraker, M., Becla, J., DeWitt, D. J., Lim, K.-T., Maier, D., Ratzesberger, O., and Zdonik, S. B. 2009. Requirements for science data bases and SciDB. In Proceedings of the Conference on Innovative Data Systems Research (CIDR'09). 26.
Tan, C. J. K. 2002. The PLFG parallel pseudo-random number generator. Fut. Gener. Comput. Syst. 18, 693–698.
Teh, Y., Jordan, M., Beal, M., and Blei, D. 2003. Hierarchical Dirichlet processes. Tech. rep. 653, Department of Statistics, University of California, Berkeley.
Thiagarajan, A. and Madden, S. 2008. Querying continuous functions in a database system. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 791–804.
Wang, D. Z., Michelakis, E., Garofalakis, M., and Hellerstein, J. 2008a. BayesStore: Managing large, uncertain data repositories with probabilistic graphical models. In Proceedings of the International Conference on Very Large Databases (VLDB'08).
Wang, T.-Y., Re, C., and Suciu, D. 2008b. Implementing NOT EXISTS predicates over a probabilistic database. In Proceedings of the MUD/QDB Conference. 73–86.
Xu, F., Beyer, K., Ercegovac, V., Haas, P. J., and Shekita, E. J. 2009. E = MC3: Managing uncertain enterprise data in a cluster-computing environment. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 441–454.

Index Terms

The Monte Carlo database system: Stochastic analysis close to the data
1. Computing methodologies
  1. Modeling and simulation
    1. Simulation support systems
2. Information systems
  1. Data management systems
    1. Database design and models
      1. Relational database model
    2. Query languages
      1. Relational database query languages


Reviews

Reviewer: Nuno M Garcia

The paper's main subject is the use of Monte Carlo simulation on databases to create future scenarios flexible enough to support what-if hypotheses. It combines database theory with Monte Carlo methods to answer questions such as, "What would be the expected outcome of customers' orders if the price of component X were increased by 20 percent?" The paper proceeds to explain the methodology for answering such questions, and shows interesting results and conclusions.

Yet some of the content is somewhat disappointing, first and foremost because the paper is not clear on some issues. For example, the Monte Carlo simulations are done using the Gamma function. But why not use other functions more directly related to the simulation of such phenomena? Furthermore, at some point, the paper claims that simulation using aggregated data (for example, clumping together records of clients in a database) results in losing predictive power. This seems to contradict the law of large numbers and basic probability theory.

Another disappointing aspect relates to the first footnote of the paper, on page 5. While evaluating related work, the authors claim: "Indeed, MCDB [Monte Carlo Database System] is the first DBMS [database management system] for which the Monte Carlo approach is fundamental to the entire system design." The footnote associated with this sentence states, "The recent PIP system of Kennedy and Koch (2010) combines PrDB [probabilistic databases] and Monte Carlo techniques, and can yield superior performance for certain MCDB-style queries." It is not clear why this is not discussed in the main paper and, furthermore, why a system that has superior performance is relegated to a footnote.

Another disappointing aspect relates to the second footnote: it is not true that pseudorandom number generators are statistically indistinguishable from truly independent and identically distributed (i.i.d.) uniform random numbers. Self-similarity analysis of pseudorandom number sequences changes when the pseudorandom number generators change.

As to the core of the paper, and with the limitations expressed above, the idea of using Monte Carlo in database simulation is well explained and the authors show clearly how the MCDB system works.

Online Computing Reviews Service


Published in

ACM Transactions on Database Systems Volume 36, Issue 3
August 2011
207 pages
ISSN: 0362-5915
EISSN: 1557-4644
DOI: 10.1145/2000824

Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 26 August 2011
- Accepted: 1 March 2011
- Revised: 1 January 2011
- Received: 1 July 2009
Author Tags
MCDB
relational database systems
uncertainty
Qualifiers
- research-article
- Research
- Refereed
Article Metrics
- Total citations: 47
- Total downloads: 1,454
- Downloads (last 12 months): 17
- Downloads (last 6 weeks): 5