research-article

Simulation of database-valued markov chains using SimSQL

Authors:
Zhuhua Cai

Rice University, Houston, TX, USA

Rice University, Houston, TX, USA
View Profile

,
Zografoula Vagena

LogicBlox, Inc., Atlanta, GA, USA

LogicBlox, Inc., Atlanta, GA, USA
View Profile

,
Luis Perez

Rice University, Houston, TX, USA

Rice University, Houston, TX, USA
View Profile

,
Subramanian Arumugam

Rice University, Houston, TX, USA

Rice University, Houston, TX, USA
View Profile

,
Peter J. Haas

IBM Almaden, San Jose, CA, USA

IBM Almaden, San Jose, CA, USA
View Profile

,
Christopher Jermaine

Rice University, Houston, TX, USA

Rice University, Houston, TX, USA
View Profile

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of DataJune 2013Pages 637–648https://doi.org/10.1145/2463676.2465283

Published:22 June 2013Publication History

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

Pages 637–648

ABSTRACT

This paper describes the SimSQL system, which allows for SQLbased specification, simulation, and querying of database-valued Markov chains, i.e., chains whose value at any time step comprises the contents of an entire database. SimSQL extends the earlier Monte Carlo database system (MCDB), which permitted Monte Carlo simulation of static database-valued random variables. Like MCDB, SimSQL uses user-specified "VG functions" to generate the simulated data values that are the building blocks of a simulated database. The enhanced functionality of SimSQL is enabled by the ability to parametrize VG functions using stochastic tables, so that one stochastic database can be used to parametrize the generation of another stochastic database, which can parametrize another, and so on. Other key extensions include the ability to explicitly define recursive versions of a stochastic table and the ability to execute the simulation in a MapReduce environment. We focus on applying SimSQL to Bayesian machine learning.

References

20Newsgroups Dataset. Available at: http://people.csail.mit.edu/jrennie/20newsgroups/.Google Scholar
A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. PVLDB, 2(1):922--933, 2009. Google ScholarDigital Library
B. Bahmani, K. Chakrabarti, and D. Xin. Fast Personalized Pagerank on MapReduce. In SIGMOD, 2011. Google ScholarDigital Library
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. Google ScholarDigital Library
D. M. Blei, A. N. Ng, and M. I. Jordan. Latent Dirichlet Allocation. JMLR, 3:993--1022, 2003. Google ScholarDigital Library
Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. Haloop: efficient iterative data processing on large clusters. PVLDB, 3(1-2):285--296, 2010. Google ScholarDigital Library
R. Chaiken, B. Jenkins, P. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: easy and efficient parallel processing of massive data sets. PVLDB, 1(2):1265--1276, 2008. Google ScholarDigital Library
C. Chambers, A. Raniwala, F. Perry, S. Adams, R. R. Henry, R. Bradshaw, and N. Weizenbaum. Flumejava: easy, efficient data-parallel pipelines. In PLDA, 2010. Google ScholarDigital Library
B. Chattopadhyay, L. Lin, W. Liu, S. Mittal, P. Aragonda, V. Lychagina, Y. Kwon, and M. Wong. Tenzing a sql implementation on the mapreduce framework. PVLDB, 4(12):1318--1327, 2011.Google ScholarDigital Library
C.-T. Chu, S. K. Kim, Y.-A. Lin, Y. Yu, G. Bradski, A. Y. Ng, and K. Olukotun. Map-reduce for machine learning on multicore. In NIPS, 2006.Google Scholar
T. Cormen, C. Leiserson, R. Rivest, and C. Stein. Introduction to Algorithms. McGraw-Hill, 2003. Google ScholarDigital Library
S. Das, Y. Sismanis, K. S. Beyer, R. Gemulla, P. J. Haas, and J. McPherson. Ricardo: Integrating r and hadoop. In SIGMOD, 2010. Google ScholarDigital Library
D. Deutch, C. Koch, and T. Milo. On probabilistic fixpoint and markov chain query languages. In PODS, pages 215--226, 2010. Google ScholarDigital Library
A. Ghoting, R. Krishnamurthy, E. Pednault, B. Reinwald, V. Sindhwani, S. Tatikonda, Y. Tian, and S. Vaithyanathan. SystemML: Declarative machine learning on mapreduce. In ICDE, 2011. Google ScholarDigital Library
J. Goldstein and P. Larson. Optimizing Queries Using Materialized Views: A practical, scalable solution. In SIGMOD, 2001. Google ScholarDigital Library
T. J. Green and V. Tannen. Models for incomplete and probabilistic information. IEEE Data Eng. Bull., 29(1):17--24, 2006.Google Scholar
Y. E. Ioannidis and S. Christodoulakis. Optimal histograms for limiting worst-case error propagation in the size of join results. TODS, 18(4):709--748, 1993. Google ScholarDigital Library
R. Jampani, F. Xu, M. Wu, L. L. Perez, C. Jermaine, and P. J. Haas. The Monte Carlo Database System: Stochastic analysis close to the data. TODS, 36(3):1--41, 2011. Google ScholarDigital Library
U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system - implementation and observations. In ICDM, 2009. Google ScholarDigital Library
O. Kennedy and S. Nath. Jigsaw: Efficient optimization over uncertain enterprise data. In SIGMOD, 2011. Google ScholarDigital Library
C. Koch. On query algebras for probabilistic databases. SIGMOD Record, 37(4):78--85, 2008. Google ScholarDigital Library
Y. Kwon, D. Nunley, J. D. Gardner, M. Balazinska, B. Howe, and S. Loebman. Scalable clustering for n-body simulations in a shared-nothing cluster. In SSDBM, 2010. Google ScholarDigital Library
Z. Liu, Y. Zhang, E. Y. Chang, and M. Sun. Plda+: Parallel Latent Dirichlet Allocation with Data Placement and Pipeline Processing. ACM TIST, 2(3):26:1--26:18, 2011. Google ScholarDigital Library
Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A New Parallel Framework for Machine Learning. In UAI, 2010.Google ScholarDigital Library
A. Mahout. Available at: http://mahout.apache.org/.Google Scholar
T. Neumann. Efficiently compiling efficient query plans for modern hardware. PVLDB, 4(9):539--550, 2011. Google ScholarDigital Library
B. Panda, J. S. Herbach, S. Basu, and R. J. Bayardo. Planet: Massively parallel learning of tree ensembles with mapreduce. PVLDB, 2(2):1426--1437, 2009. Google ScholarDigital Library
S. Papadimitriou and J. Sun. DisCo: Distributed co-clustering with map-reduce: A case study towards petabyte-scale end-to-end mining. In ICDM, 2008. Google ScholarDigital Library
I. Porteous, D. Newman, A. T. Ihler, A. Asuncion, P. Smyth, and M. Welling. Fast collapsed Gibbs sampling for Latent Dirichlet Allocation. In SIGKDD, 2008. Google ScholarDigital Library
RHIPE: R and Hadoop Integrated Programming Environment. Available at: http://ml.stat.purdue.edu/rhipe/.Google Scholar
C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer, 2010. Google ScholarDigital Library
S. Singh, A. Subramanya, F. Pereira, and A. McCallum. Distributed MAP inference for undirected graphical models. In NIPS Workshops, 2010.Google Scholar
A. J. Smola and S. Narayanamurthy. An Architecture for Parallel Topic models. PVLDB, 3(1):703--710, 2010. Google ScholarDigital Library
A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: a warehousing solution over a map-reduce framework. PVLDB, 2(2):1626--1629, 2009. Google ScholarDigital Library
G. Wang, M. V. Salles, B. Sowell, X. Wang, T. Cao, A. Demers, J. Gehrke, and W. White. Behavioral simulations in mapreduce. PVLDB, 3(1-2):952--963, 2010. Google ScholarDigital Library
M. Wick, A. McCallum, and G. Miklau. Scalable probabilistic databases with factor graphs and MCMC. In VLDB, 2010. Google ScholarDigital Library
Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. Dryadlinq: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI, 2008. Google ScholarDigital Library
D. Z. Zhang, M. J. Franklin, M. Garofalakis, J. M. Hellerstein, and M. L. Wick. Hybrid in-database inference for declarative information extraction. In SIGMOD, 2011. Google ScholarDigital Library

Index Terms

Simulation of database-valued markov chains using SimSQL
1. Information systems

Recommendations

Measure-Valued Differentiation for Stationary Markov Chains

We study general state-space Markov chains that depend on a parameter, say, <italic/>. Sufficient conditions are established for the stationary performance of such a Markov chain to be differentiable with respect to <italic/>. Specifically, we study the ...
Read More
Markov chains aggregation using discrete event optimization via simulation
SummerSim '19: Proceedings of the 2019 Summer Simulation Conference

Markov chains are an important form of stochastic system representation. Recent modeling techniques supporting discrete-event Markov model composition make it easy to build large Markov chains that are difficult to analyze due to state space explosion. ...
Read More
Variance reduction for additive functionals of Markov chains via martingale representations
Abstract
In this paper, we propose an efficient variance reduction approach for additive functionals of Markov chains relying on a novel discrete-time martingale representation. Our approach is fully non-asymptotic and does not require the knowledge of the ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
June 2013
1322 pages
ISBN:9781450320375
DOI:10.1145/2463676
General Chairs:
Kenneth Ross
Columbia University
,
Divesh Srivastava
AT&T Research
,
Program Chair:
Dimitris Papadias
HKUST
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 22 June 2013
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
databases
machine learning
markov chains
Qualifiers
- research-article
Conference

Acceptance Rates
SIGMOD '13 Paper Acceptance Rate76of372submissions,20%Overall Acceptance Rate785of4,003submissions,20%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 51
  Total Citations
  View Citations
- 742
  Total Downloads
- Downloads (Last 12 months)20
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Simulation of database-valued markov chains using SimSQL

SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Measure-Valued Differentiation for Stationary Markov Chains

Markov chains aggregation using discrete event optimization via simulation

Variance reduction for additive functionals of Markov chains via martingale representations