Abstract
The application of stochastic models and analysis techniques to large datasets is now commonplace. Unfortunately, in practice this usually means extracting data from a database system into an external tool (such as SAS, R, Arena, or Matlab), and then running the analysis there. This extract-and-model paradigm is typically error-prone, slow, does not support fine-grained modeling, and discourages what-if and sensitivity analyses.
In this article we describe MCDB, a database system that permits a wide spectrum of stochastic models to be used in conjunction with the data stored in a large database, without ever extracting the data. MCDB facilitates in-database execution of tasks such as risk assessment, prediction, and imputation of missing data, as well as management of errors due to data integration, information extraction, and privacy-preserving data anonymization. MCDB allows a user to define “random” relations whose contents are determined by stochastic models. The models can then be queried using standard SQL. Monte Carlo techniques are used to analyze the probability distribution of the result of an SQL query over random relations. Novel “tuple-bundle” processing techniques can effectively control the Monte Carlo overhead, as shown in our experiments.
Supplemental Material
Available for Download
Supplemental movie, image and appendix files for The monte carlo database system: Stochastic analysis close to the data
- Agrawal, P., Benjelloun, O., Sarma, A. D., Hayworth, C., Nabar, S. U., Sugihara, T., and Widom, J. 2006. Trio: A system for data, uncertainty, and lineage. In Proceedings of the International Conference on Very Large Databases (VLDB'06). Google ScholarDigital Library
- Alur, N. R., Haas, P. J., Momiroska, D., Read, P., Summers, N. H., Totanes, V., and Zuzarte, C. 2002. DB2 UDB's High Function Business Intelligence in e-Business. IBM Redbook Series.Google Scholar
- Andritsos, P., Fuxman, A., and Miller, R. J. 2006. Clean answers over dirty databases: A probabilistic approach. In Proceedings of the International Conference on Data Engineering (ICDE'06). 30. Google ScholarDigital Library
- Antova, L., Jansen, T., Koch, C., and Olteanu, D. 2008. Fast and simple relational processing of uncertain data. In Proceedings of the International Conference on Data Engineering (ICDE'08). 983--992. Google ScholarDigital Library
- Antova, L., Koch, C., and Olteanu, D. 2007. MayBMS: Managing incomplete information with probabilistic world-set decompositions. In Proceedings of the International Conference on Data Engineering (ICDE'07). 1479--1480.Google Scholar
- ApacheMahout. 2010. Apache mahout. http://lucene.apache.org/mahout/Google Scholar
- Arumugam, S., Jampani, R., Perez, L. L., Xu, F., Jermaine, C. M., and Haas, P. J. 2010. MCDB-R: Risk analysis in the database. In Proceedings of the International Conference on Very Large Databases (VLDB'10). 782--793. Google ScholarDigital Library
- Asmussen, S. and Glynn, P. W. 2007. Stochastic Simulation: Algorithms and Analysis. Springer.Google ScholarCross Ref
- Barbara, D., Garcia-Molina, H., and Porter, D. 1992. The management of probabilistic data. IEEE Trans. Knowl. Data Engin. 4, 5, 487--502. Google ScholarDigital Library
- Benjelloun, O., Sarma, A. D., Halevy, A. Y., Theobald, M., and Widom, J. 2008. Databases with uncertainty and lineage. VLDB J. 17, 2, 243--264. Google ScholarDigital Library
- Biller, B. and Nelson, B. L. 2003. Modeling and generating multivariate time-series input processes using a vector autoregressive technique. ACM Trans. Model. Comput. Simul. 13, 3, 211--237. Google ScholarDigital Library
- Blei, D. M., Griffiths, T. L., Jordan, M. I., and Tenenbaum, J. B. 2003. Hierarchical topic models and the nested chinese restaurant process. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS'03).Google Scholar
- Cheng, R., Singh, S., and Prabhakar, S. 2005. U-DBMS: A database system for managing constantly-evolving data. In Proceedings of the International Conference on Very Large Databases (VLDB'05). 1271--1274. Google ScholarDigital Library
- Chu, C.-T., Kim, S. K., Lin, Y.-A., Yu, Y., Bradski, G. R., Ng, A. Y., and Olukotun, K. 2006a. Map-Reduce for machine learning on multicore. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS'06). 281--288.Google Scholar
- Chu, D., Deshpande, A., Hellerstein, J. M., and Hong, W. 2006b. Approximate data collection in sensor networks using probabilistic models. In Proceedings of the International Conference on Data Engineering (ICDE'06). 48. Google ScholarDigital Library
- Cohen, J., Dolan, B., Dunlap, M., Hellerstein, J. M., and Welton, C. 2009. MAD skills: New analysis practices for big data. Proc. VLDB 2, 2, 1481--1492. Google ScholarDigital Library
- Cox, D. R. 1952. Estimation by double sampling. Biometrika 39, 3-4, 217--227.Google ScholarCross Ref
- Dalvi, N. N., Re, C., and Suciu, D. 2009. Probabilistic databases: Diamonds in the dirt. Comm. ACM 52, 7, 86--94. Google ScholarDigital Library
- Dalvi, N. N. and Suciu, D. 2007a. The dichotomy of conjunctive queries on probabilistic structures. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Princeiples of Database Systems (PODS'07). 293--302. Google ScholarDigital Library
- Dalvi, N. N. and Suciu, D. 2007b. Efficient query evaluation on probabilistic databases. VLDB J. 16, 4, 523--544. Google ScholarDigital Library
- Dalvi, N. N. and Suciu, D. 2007c. Management of probabilistic data: Foundations and challenges. In Proceedings of the ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS'07). 1--12. Google ScholarDigital Library
- Das Sarma, A., Benjelloun, O., Halevy, A. Y., Nabar, S. U., and Widom, J. 2009. Representing uncertain data: Models, properties, and algorithms. VLDB J. 18, 5, 989--1019. Google ScholarDigital Library
- Das Sarma, A., Theobald, M., and Widom, J. 2008. Exploiting lineage for confidence computation in uncertain and probabilistic databases. In Proceedings of the International Conference on Data Engineering (ICDE'08). 1023--1032. Google ScholarDigital Library
- Deshpande, A. and Madden, S. 2006. MauveDB: Supporting model-based user views in database systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 73--84. Google ScholarDigital Library
- Devroye, L. 1986. Non-Uniform Random Variate Generation. Springer.Google Scholar
- Dong, X. L., Halevy, A. Y., and Yu, C. 2009. Data integration with uncertainty. VLDB J. 18, 2, 469--500. Google ScholarDigital Library
- Fishman, G. 1996. Monte Carlo: Concepts, Algorithms, and Applications. Springer.Google ScholarCross Ref
- Fuhr, N. and Rolleke, T. 1997. A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Inf. Syst. 15, 1, 32--66. Google ScholarDigital Library
- Gentle, J. E. 2003. Random Number Generation and Monte Carlo Methods 2nd Ed. Springer.Google Scholar
- Getoor, L. and Taskar, B., Eds. 2007. Introduction to Statistical Relational Learning. MIT Press. Google ScholarDigital Library
- Griffiths, T. and Ghahramani, Z. 2005. Infinite latent feature models and the indian buffet process. In Proceedings of the Conference on Advances in Neural Information Processing Systems (NIPS'05).Google Scholar
- Guha, S. 2010. RHIPE -- R and hadoop integrated processing environment. http://ml.stat.purdue.edu/rhipe/Google Scholar
- Gupta, R. and Sarawagi, S. 2006. Creating probabilistic databases from information extraction models. In Proceedings of the International Conference on Very Large Databases (VLDB'06). 965--976. Google ScholarDigital Library
- Henderson, S. G. and Nelson, B. L., Eds. 2006. Simulation. North-Holland.Google Scholar
- Jampani, R., Xu, F., Wu, M., Perez, L. L., Jermaine, C. M., and Haas, P. J. 2008. MCDB: A Monte Carlo approach to managing uncertain data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 687--700. Google ScholarDigital Library
- Kennedy, O. and Koch, C. 2010. PIP: A database system for great and small expectations. In Proceedings of the International Conference on Data Engineering (ICDE'10). 157--168.Google Scholar
- Kimelfeld, B., Kosharovsky, Y., and Sagiv, Y. 2009. Query evaluation over probabilistic XML. VLDB J. 18, 5, 1117--1140. Google ScholarDigital Library
- Koch, C. and Olteanu, D. 2008. Conditioning probabilistic databases. In Proceedings of the International Conference on Very Large Databases (VLDB'08). Google ScholarDigital Library
- Lehmann, E. L. and Casella, G. 1998. Theory of Point Estimation 2nd Ed. Springer.Google Scholar
- Michelakis, E., Krishnamurthy, R., Haas, P. J., and Vaithyanathan, S. 2009. Uncertainty management in rule-based information extraction systems. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 101--114. Google ScholarDigital Library
- Miller, Jr., R. G. 1986. Beyond ANOVA, Basics of Applied Statistics. Wiley.Google Scholar
- Murthy, R. and Widom, J. 2007. Making aggregation work in uncertain and probabilistic databases. In Proceedings of the 1st International VLDB Workshop on Management of Uncertain Data (MUD'07). 76--90.Google Scholar
- Nelsen, R. B. 2006. An Introduction to Copulas 1st Ed. Springer Series in Statistics. Springer.Google Scholar
- O'Hagan, A. and Forster, J. J. 2004. Bayesian Inference 2nd Ed. Volume 2B of Kendal l's Advanced Theory of Statistics. Arnold.Google Scholar
- Panneton, F., L'Ecuyer, P., and Matsumoto, M. 2006. Improved long-period generators based on linear recurrences modulo 2. ACM Trans. Math. Softw. 32, 1, 1--16. Google ScholarDigital Library
- Perez, L. L., Arumugam, S., and Jermaine, C. M. 2010. Evaluation of probabilistic threshold queries in MCDB. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 687--698. Google ScholarDigital Library
- Re, C., Dalvi, N. N., and Suciu, D. 2006. Query evaluation on probabilistic databases. IEEE Data Engin. Bull. 29, 1, 25--31.Google Scholar
- Re, C., Dalvi, N. N., and Suciu, D. 2007. Efficient top-k query evaluation on probabilistic data. In Proceedings of the International Conference on Data Engineering (ICDE'07). 886--895.Google Scholar
- Re, C. and Suciu, D. 2008. Managing probabilistic data with MystiQ: The can-do, the could-do, and the can't-do. In Proceedings of the SUM'08 Conference. 5--18. Google ScholarDigital Library
- Re, C. and Suciu, D. 2009. The trichotomy of HAVING queries on a probabilistic database. VLDB J. 18, 5, 1091--1116. Google ScholarDigital Library
- Robert, C. P. and Casella, G. 2004. Monte Carlo Statistical Methods 2nd Ed. Springer. Google ScholarDigital Library
- Sen, P., Deshpande, A., and Getoor, L. 2009. PrDB: Managing and exploiting rich correlations in probabilistic databases. VLDB J. 18, 5, 1065--1090. Google ScholarDigital Library
- Singh, S., Mayfield, C., Mittal, S., Prabhakar, S., Hambrusch, S. E., and Shah, R. 2008. Orion 2.0: Native support for uncertain data. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 1239--1242. Google ScholarDigital Library
- Srinivasan, A., Ceperley, D. M., and Mascagni, M. 1997. Random number generators for parallel applications. In Monte Carlo Methods in Chemical Physics, Wiley, 13--36.Google Scholar
- Stonebraker, M., Becla, J., DeWitt, D. J., Lim, K.-T., Maier, D., Ratzesberger, O., and Zdonik, S. B. 2009. Requirements for science data bases and scidb. In Proceedings of the Conference on Innovative Data Systems Research (CIDR'09). 26.Google Scholar
- Tan, C. J. K. 2002. The PLFG parallel pseudo-random number generator. Fut. Gener. Comput. Syst. 18, 693--698.Google ScholarCross Ref
- Teh, Y., Jordan, M., Beal, M., and Blei, D. 2003. Hierarchical dirichlet processes. Tech. rep. 653, Department of Statistics, University of California, Berkeley.Google Scholar
- Thiagarajan, A. and Madden, S. 2008. Querying continuous functions in a database system. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 791--804. Google ScholarDigital Library
- Wang, D. Z., Michelakis, E., Garofalakis, M., and Hellerstein, J. 2008a. BayesStore: Managing large, uncertain data repositories with probabilistic graphical models. In Proceedings of the International Conference on Very Large Databases (VLDB'08). Google ScholarDigital Library
- Wang, T.-Y., Re, C., and Suciu, D. 2008b. Implementing NOT EXISTS predicates over a probabilistic database. In Proceedings of the MUD/QDB Conference. 73--86.Google Scholar
- Xu, F., Beyer, K., Ercegovac, V., Haas, P. J., and Shekita, E. J. 2009. E = M C 3: Managing uncertain enterprise data in a cluster-computing environment. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 441--454. Google ScholarDigital Library
Index Terms
- The monte carlo database system: Stochastic analysis close to the data
Recommendations
MCDB: a monte carlo approach to managing uncertain data
SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of dataTo deal with data uncertainty, existing probabilistic database systems augment tuples with attribute-level or tuple-level probability values, which are loaded into the database along with the data itself. This approach can severely limit the system's ...
IBM Relational Database Systems: The Early Years
The relational data model, proposed by E.F. Codd in 1970, inspired several research projects at IBM and elsewhere. Among these was System R, which demonstrated the commercial viability of relational database systems. This article describes the research ...
Relational Database Systems with Zero Information Loss
Transaction time is used for time stamping object values to record their database history and formulate a zero information loss model for database transactions. The model consists of three components, a data history store, an update store, and a query ...
Comments