skip to main content
10.1145/2463676.2465283acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Simulation of database-valued markov chains using SimSQL

Published:22 June 2013Publication History

ABSTRACT

This paper describes the SimSQL system, which allows for SQLbased specification, simulation, and querying of database-valued Markov chains, i.e., chains whose value at any time step comprises the contents of an entire database. SimSQL extends the earlier Monte Carlo database system (MCDB), which permitted Monte Carlo simulation of static database-valued random variables. Like MCDB, SimSQL uses user-specified "VG functions" to generate the simulated data values that are the building blocks of a simulated database. The enhanced functionality of SimSQL is enabled by the ability to parametrize VG functions using stochastic tables, so that one stochastic database can be used to parametrize the generation of another stochastic database, which can parametrize another, and so on. Other key extensions include the ability to explicitly define recursive versions of a stochastic table and the ability to execute the simulation in a MapReduce environment. We focus on applying SimSQL to Bayesian machine learning.

References

  1. 20Newsgroups Dataset. Available at: http://people.csail.mit.edu/jrennie/20newsgroups/.Google ScholarGoogle Scholar
  2. A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. PVLDB, 2(1):922--933, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. B. Bahmani, K. Chakrabarti, and D. Xin. Fast Personalized Pagerank on MapReduce. In SIGMOD, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. D. M. Blei, A. N. Ng, and M. I. Jordan. Latent Dirichlet Allocation. JMLR, 3:993--1022, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. Haloop: efficient iterative data processing on large clusters. PVLDB, 3(1-2):285--296, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. R. Chaiken, B. Jenkins, P. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: easy and efficient parallel processing of massive data sets. PVLDB, 1(2):1265--1276, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Henry, R. Bradshaw, and N. Weizenbaum. Flumejava: easy, efficient data-parallel pipelines. In PLDA, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. B. Chattopadhyay, L. Lin, W. Liu, S. Mittal, P. Aragonda, V. Lychagina, Y. Kwon, and M. Wong. Tenzing a sql implementation on the mapreduce framework. PVLDB, 4(12):1318--1327, 2011.Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In NIPS, 2006.Google ScholarGoogle Scholar
  11. T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms. McGraw-Hill, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson. Ricardo: Integrating r and hadoop. In SIGMOD, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. D. Deutch, C. Koch, and T. Milo. On probabilistic fixpoint and markov chain query languages. In PODS, pages 215--226, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. A. Ghoting, R. Krishnamurthy, E. Pednault, B. Reinwald, V. Sindhwani, S. Tatikonda, Y. Tian, and S. Vaithyanathan. SystemML: Declarative machine learning on mapreduce. In ICDE, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. J. Goldstein and P. Larson. Optimizing Queries Using Materialized Views: A practical, scalable solution. In SIGMOD, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. T. J. Green and V. Tannen. Models for incomplete and probabilistic information. IEEE Data Eng. Bull., 29(1):17--24, 2006.Google ScholarGoogle Scholar
  17. Y. E. Ioannidis and S. Christodoulakis. Optimal histograms for limiting worst-case error propagation in the size of join results. TODS, 18(4):709--748, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. R. Jampani, F. Xu, M. Wu, L. L. Perez, C. Jermaine, and P. J. Haas. The Monte Carlo Database System: Stochastic analysis close to the data. TODS, 36(3):1--41, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system - implementation and observations. In ICDM, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. O. Kennedy and S. Nath. Jigsaw: Efficient optimization over uncertain enterprise data. In SIGMOD, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. C. Koch. On query algebras for probabilistic databases. SIGMOD Record, 37(4):78--85, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Y. Kwon, D. Nunley, J. D. Gardner, M. Balazinska, B. Howe, and S. Loebman. Scalable clustering for n-body simulations in a shared-nothing cluster. In SSDBM, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Z. Liu, Y. Zhang, E. Y. Chang, and M. Sun. Plda+: Parallel Latent Dirichlet Allocation with Data Placement and Pipeline Processing. ACM TIST, 2(3):26:1--26:18, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A New Parallel Framework for Machine Learning. In UAI, 2010.Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. A. Mahout. Available at: http://mahout.apache.org/.Google ScholarGoogle Scholar
  26. T. Neumann. Efficiently compiling efficient query plans for modern hardware. PVLDB, 4(9):539--550, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo. Planet: Massively parallel learning of tree ensembles with mapreduce. PVLDB, 2(2):1426--1437, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. S. Papadimitriou and J. Sun. DisCo: Distributed co-clustering with map-reduce: A case study towards petabyte-scale end-to-end mining. In ICDM, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. I. Porteous, D. Newman, A. T. Ihler, A. Asuncion, P. Smyth, and M. Welling. Fast collapsed Gibbs sampling for Latent Dirichlet Allocation. In SIGKDD, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. RHIPE: R and Hadoop Integrated Programming Environment. Available at: http://ml.stat.purdue.edu/rhipe/.Google ScholarGoogle Scholar
  31. C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. S. Singh, A. Subramanya, F. Pereira, and A. McCallum. Distributed MAP inference for undirected graphical models. In NIPS Workshops, 2010.Google ScholarGoogle Scholar
  33. A. J. Smola and S. Narayanamurthy. An Architecture for Parallel Topic models. PVLDB, 3(1):703--710, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: a warehousing solution over a map-reduce framework. PVLDB, 2(2):1626--1629, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. G. Wang, M. V. Salles, B. Sowell, X. Wang, T. Cao, A. Demers, J. Gehrke, and W. White. Behavioral simulations in mapreduce. PVLDB, 3(1-2):952--963, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. M. Wick, A. McCallum, and G. Miklau. Scalable probabilistic databases with factor graphs and MCMC. In VLDB, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. Dryadlinq: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. D. Z. Zhang, M. J. Franklin, M. Garofalakis, J. M. Hellerstein, and M. L. Wick. Hybrid in-database inference for declarative information extraction. In SIGMOD, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Simulation of database-valued markov chains using SimSQL

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
      June 2013
      1322 pages
      ISBN:9781450320375
      DOI:10.1145/2463676

      Copyright © 2013 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 22 June 2013

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      SIGMOD '13 Paper Acceptance Rate76of372submissions,20%Overall Acceptance Rate785of4,003submissions,20%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader