ABSTRACT
We describe an extensive benchmark of platforms available to a user who wants to run a machine learning (ML) inference algorithm over a very large data set, but cannot find an existing implementation and thus must "roll her own" ML code. We carefully chose a set of five ML implementation tasks that involve learning relatively complex, hierarchical models. We completed those tasks on four different computational platforms and, using 70,000 hours of Amazon EC2 compute time, carefully compared the running times, tuning requirements, and ease of programming of each.