skip to main content
research-article

The monte carlo database system: Stochastic analysis close to the data

Published:26 August 2011Publication History
Skip Abstract Section

Abstract

The application of stochastic models and analysis techniques to large datasets is now commonplace. Unfortunately, in practice this usually means extracting data from a database system into an external tool (such as SAS, R, Arena, or Matlab), and then running the analysis there. This extract-and-model paradigm is typically error-prone, slow, does not support fine-grained modeling, and discourages what-if and sensitivity analyses.

In this article we describe MCDB, a database system that permits a wide spectrum of stochastic models to be used in conjunction with the data stored in a large database, without ever extracting the data. MCDB facilitates in-database execution of tasks such as risk assessment, prediction, and imputation of missing data, as well as management of errors due to data integration, information extraction, and privacy-preserving data anonymization. MCDB allows a user to define “random” relations whose contents are determined by stochastic models. The models can then be queried using standard SQL. Monte Carlo techniques are used to analyze the probability distribution of the result of an SQL query over random relations. Novel “tuple-bundle” processing techniques can effectively control the Monte Carlo overhead, as shown in our experiments.

Skip Supplemental Material Section

Supplemental Material

References

  1. Agrawal, P., Benjelloun, O., Sarma, A. D., Hayworth, C., Nabar, S. U., Sugihara, T., and Widom, J. 2006. Trio: A system for data, uncertainty, and lineage. In Proceedings of the International Conference on Very Large Databases (VLDB'06). Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Alur, N. R., Haas, P. J., Momiroska, D., Read, P., Summers, N. H., Totanes, V., and Zuzarte, C. 2002. DB2 UDB's High Function Business Intelligence in e-Business. IBM Redbook Series.Google ScholarGoogle Scholar
  3. Andritsos, P., Fuxman, A., and Miller, R. J. 2006. Clean answers over dirty databases: A probabilistic approach. In Proceedings of the International Conference on Data Engineering (ICDE'06). 30. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Antova, L., Jansen, T., Koch, C., and Olteanu, D. 2008. Fast and simple relational processing of uncertain data. In Proceedings of the International Conference on Data Engineering (ICDE'08). 983--992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Antova, L., Koch, C., and Olteanu, D. 2007. MayBMS: Managing incomplete information with probabilistic world-set decompositions. In Proceedings of the International Conference on Data Engineering (ICDE'07). 1479--1480.Google ScholarGoogle Scholar
  6. ApacheMahout. 2010. Apache mahout. http://lucene.apache.org/mahout/Google ScholarGoogle Scholar
  7. Arumugam, S., Jampani, R., Perez, L. L., Xu, F., Jermaine, C. M., and Haas, P. J. 2010. MCDB-R: Risk analysis in the database. In Proceedings of the International Conference on Very Large Databases (VLDB'10). 782--793. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Asmussen, S. and Glynn, P. W. 2007. Stochastic Simulation: Algorithms and Analysis. Springer.Google ScholarGoogle ScholarCross RefCross Ref
  9. Barbara, D., Garcia-Molina, H., and Porter, D. 1992. The management of probabilistic data. IEEE Trans. Knowl. Data Engin. 4, 5, 487--502. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Benjelloun, O., Sarma, A. D., Halevy, A. Y., Theobald, M., and Widom, J. 2008. Databases with uncertainty and lineage. VLDB J. 17, 2, 243--264. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Biller, B. and Nelson, B. L. 2003. Modeling and generating multivariate time-series input processes using a vector autoregressive technique. ACM Trans. Model. Comput. Simul. 13, 3, 211--237. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Blei, D. M., Griffiths, T. L., Jordan, M. I., and Tenenbaum, J. B. 2003. Hierarchical topic models and the nested chinese restaurant process. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS'03).Google ScholarGoogle Scholar
  13. Cheng, R., Singh, S., and Prabhakar, S. 2005. U-DBMS: A database system for managing constantly-evolving data. In Proceedings of the International Conference on Very Large Databases (VLDB'05). 1271--1274. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G. R., Ng, A. Y., and Olukotun, K. 2006a. Map-Reduce for machine learning on multicore. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS'06). 281--288.Google ScholarGoogle Scholar
  15. Chu, D., Deshpande, A., Hellerstein, J. M., and Hong, W. 2006b. Approximate data collection in sensor networks using probabilistic models. In Proceedings of the International Conference on Data Engineering (ICDE'06). 48. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Cohen, J., Dolan, B., Dunlap, M., Hellerstein, J. M., and Welton, C. 2009. MAD skills: New analysis practices for big data. Proc. VLDB 2, 2, 1481--1492. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Cox, D. R. 1952. Estimation by double sampling. Biometrika 39, 3-4, 217--227.Google ScholarGoogle ScholarCross RefCross Ref
  18. Dalvi, N. N., Re, C., and Suciu, D. 2009. Probabilistic databases: Diamonds in the dirt. Comm. ACM 52, 7, 86--94. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Dalvi, N. N. and Suciu, D. 2007a. The dichotomy of conjunctive queries on probabilistic structures. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Princeiples of Database Systems (PODS'07). 293--302. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Dalvi, N. N. and Suciu, D. 2007b. Efficient query evaluation on probabilistic databases. VLDB J. 16, 4, 523--544. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Dalvi, N. N. and Suciu, D. 2007c. Management of probabilistic data: Foundations and challenges. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'07). 1--12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Das Sarma, A., Benjelloun, O., Halevy, A. Y., Nabar, S. U., and Widom, J. 2009. Representing uncertain data: Models, properties, and algorithms. VLDB J. 18, 5, 989--1019. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Das Sarma, A., Theobald, M., and Widom, J. 2008. Exploiting lineage for confidence computation in uncertain and probabilistic databases. In Proceedings of the International Conference on Data Engineering (ICDE'08). 1023--1032. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Deshpande, A. and Madden, S. 2006. MauveDB: Supporting model-based user views in database systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 73--84. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Devroye, L. 1986. Non-Uniform Random Variate Generation. Springer.Google ScholarGoogle Scholar
  26. Dong, X. L., Halevy, A. Y., and Yu, C. 2009. Data integration with uncertainty. VLDB J. 18, 2, 469--500. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Fishman, G. 1996. Monte Carlo: Concepts, Algorithms, and Applications. Springer.Google ScholarGoogle ScholarCross RefCross Ref
  28. Fuhr, N. and Rolleke, T. 1997. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Inf. Syst. 15, 1, 32--66. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Gentle, J. E. 2003. Random Number Generation and Monte Carlo Methods 2nd Ed. Springer.Google ScholarGoogle Scholar
  30. Getoor, L. and Taskar, B., Eds. 2007. Introduction to Statistical Relational Learning. MIT Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Griffiths, T. and Ghahramani, Z. 2005. Infinite latent feature models and the indian buffet process. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS'05).Google ScholarGoogle Scholar
  32. Guha, S. 2010. RHIPE -- R and hadoop integrated processing environment. http://ml.stat.purdue.edu/rhipe/Google ScholarGoogle Scholar
  33. Gupta, R. and Sarawagi, S. 2006. Creating probabilistic databases from information extraction models. In Proceedings of the International Conference on Very Large Databases (VLDB'06). 965--976. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Henderson, S. G. and Nelson, B. L., Eds. 2006. Simulation. North-Holland.Google ScholarGoogle Scholar
  35. Jampani, R., Xu, F., Wu, M., Perez, L. L., Jermaine, C. M., and Haas, P. J. 2008. MCDB: A Monte Carlo approach to managing uncertain data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 687--700. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Kennedy, O. and Koch, C. 2010. PIP: A database system for great and small expectations. In Proceedings of the International Conference on Data Engineering (ICDE'10). 157--168.Google ScholarGoogle Scholar
  37. Kimelfeld, B., Kosharovsky, Y., and Sagiv, Y. 2009. Query evaluation over probabilistic XML. VLDB J. 18, 5, 1117--1140. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Koch, C. and Olteanu, D. 2008. Conditioning probabilistic databases. In Proceedings of the International Conference on Very Large Databases (VLDB'08). Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Lehmann, E. L. and Casella, G. 1998. Theory of Point Estimation 2nd Ed. Springer.Google ScholarGoogle Scholar
  40. Michelakis, E., Krishnamurthy, R., Haas, P. J., and Vaithyanathan, S. 2009. Uncertainty management in rule-based information extraction systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 101--114. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Miller, Jr., R. G. 1986. Beyond ANOVA, Basics of Applied Statistics. Wiley.Google ScholarGoogle Scholar
  42. Murthy, R. and Widom, J. 2007. Making aggregation work in uncertain and probabilistic databases. In Proceedings of the 1st International VLDB Workshop on Management of Uncertain Data (MUD'07). 76--90.Google ScholarGoogle Scholar
  43. Nelsen, R. B. 2006. An Introduction to Copulas 1st Ed. Springer Series in Statistics. Springer.Google ScholarGoogle Scholar
  44. O'Hagan, A. and Forster, J. J. 2004. Bayesian Inference 2nd Ed. Volume 2B of Kendal l's Advanced Theory of Statistics. Arnold.Google ScholarGoogle Scholar
  45. Panneton, F., L'Ecuyer, P., and Matsumoto, M. 2006. Improved long-period generators based on linear recurrences modulo 2. ACM Trans. Math. Softw. 32, 1, 1--16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Perez, L. L., Arumugam, S., and Jermaine, C. M. 2010. Evaluation of probabilistic threshold queries in MCDB. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 687--698. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Re, C., Dalvi, N. N., and Suciu, D. 2006. Query evaluation on probabilistic databases. IEEE Data Engin. Bull. 29, 1, 25--31.Google ScholarGoogle Scholar
  48. Re, C., Dalvi, N. N., and Suciu, D. 2007. Efficient top-k query evaluation on probabilistic data. In Proceedings of the International Conference on Data Engineering (ICDE'07). 886--895.Google ScholarGoogle Scholar
  49. Re, C. and Suciu, D. 2008. Managing probabilistic data with MystiQ: The can-do, the could-do, and the can't-do. In Proceedings of the SUM'08 Conference. 5--18. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Re, C. and Suciu, D. 2009. The trichotomy of HAVING queries on a probabilistic database. VLDB J. 18, 5, 1091--1116. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Robert, C. P. and Casella, G. 2004. Monte Carlo Statistical Methods 2nd Ed. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Sen, P., Deshpande, A., and Getoor, L. 2009. PrDB: Managing and exploiting rich correlations in probabilistic databases. VLDB J. 18, 5, 1065--1090. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Singh, S., Mayfield, C., Mittal, S., Prabhakar, S., Hambrusch, S. E., and Shah, R. 2008. Orion 2.0: Native support for uncertain data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1239--1242. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Srinivasan, A., Ceperley, D. M., and Mascagni, M. 1997. Random number generators for parallel applications. In Monte Carlo Methods in Chemical Physics, Wiley, 13--36.Google ScholarGoogle Scholar
  55. Stonebraker, M., Becla, J., DeWitt, D. J., Lim, K.-T., Maier, D., Ratzesberger, O., and Zdonik, S. B. 2009. Requirements for science data bases and scidb. In Proceedings of the Conference on Innovative Data Systems Research (CIDR'09). 26.Google ScholarGoogle Scholar
  56. Tan, C. J. K. 2002. The PLFG parallel pseudo-random number generator. Fut. Gener. Comput. Syst. 18, 693--698.Google ScholarGoogle ScholarCross RefCross Ref
  57. Teh, Y., Jordan, M., Beal, M., and Blei, D. 2003. Hierarchical dirichlet processes. Tech. rep. 653, Department of Statistics, University of California, Berkeley.Google ScholarGoogle Scholar
  58. Thiagarajan, A. and Madden, S. 2008. Querying continuous functions in a database system. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 791--804. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. Wang, D. Z., Michelakis, E., Garofalakis, M., and Hellerstein, J. 2008a. BayesStore: Managing large, uncertain data repositories with probabilistic graphical models. In Proceedings of the International Conference on Very Large Databases (VLDB'08). Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. Wang, T.-Y., Re, C., and Suciu, D. 2008b. Implementing NOT EXISTS predicates over a probabilistic database. In Proceedings of the MUD/QDB Conference. 73--86.Google ScholarGoogle Scholar
  61. Xu, F., Beyer, K., Ercegovac, V., Haas, P. J., and Shekita, E. J. 2009. E = M C 3: Managing uncertain enterprise data in a cluster-computing environment. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 441--454. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. The monte carlo database system: Stochastic analysis close to the data

        Recommendations

        Reviews

        Nuno M Garcia

        The main subject of the paper relates to using Monte Carlo simulation on databases to allow the creation of future scenarios flexible enough to allow a what-if hypothesis. It combines database theory with Monte Carlo methods; for example, "What would be the expected outcome of customers' orders if the price of component X were increased by 20 percent__?__" The paper proceeds to explain the methodology to answer such formulations, and shows interesting results and conclusions. Yet some of the content is somehow disappointing, first and foremost because the paper is not clear on some issues. For example, the Monte Carlo simulations are done using the Gamma function. But why not use other functions more directly related to the simulation of such phenomena__?__ Furthermore, at some point, the paper claims that simulation using aggregated data (for example, clumping together records of clients in a database) results in losing predictive power. This seems to contradict the law of large numbers and basic probability theory. Another aspect that is disappointing is related to the first footnote of the paper on page 5. While the authors are evaluating related work, it is claimed: "Indeed, MCDB [Monte Carlo Database System] is the first DBMS [database management system] for which the Monte Carlo approach is fundamental to the entire system design." The footnote associated with this sentence states, "The recent PIP system of Kennedy and Koch (2010) combines PrDB [probabilistic databases] and Monte Carlo techniques, and can yield superior performance for certain MCDB-style queries." It is not clear why this is not discussed in the main paper and, furthermore, why a system that has superior performance is relegated to a footnote. Another disappointing aspect is related to the second footnote: it is not true that pseudorandom number generators are statistically indistinguishable from truly independent and identically distributed (i.i.d.) uniform random numbers. Self-similarity analysis of pseudorandom number sequences changes when the pseudorandom number generators change. As to the core of the paper, and with the limitations expressed above, the idea of using Monte Carlo in database simulation is well explained and the authors show clearly how the MCDB system works. Online Computing Reviews Service

        Access critical reviews of Computing literature here

        Become a reviewer for Computing Reviews.

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image ACM Transactions on Database Systems
          ACM Transactions on Database Systems  Volume 36, Issue 3
          August 2011
          207 pages
          ISSN:0362-5915
          EISSN:1557-4644
          DOI:10.1145/2000824
          Issue’s Table of Contents

          Copyright © 2011 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 26 August 2011
          • Accepted: 1 March 2011
          • Revised: 1 January 2011
          • Received: 1 July 2009
          Published in tods Volume 36, Issue 3

          Permissions

          Request permissions about this article.

          Request Permissions

          Check for updates

          Qualifiers

          • research-article
          • Research
          • Refereed

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader