ABSTRACT
We describe an extensive benchmark of platforms available to a user who wants to run a machine learning (ML) inference algorithm over a very large data set, but cannot find an existing implementation and thus must "roll her own" ML code. We carefully chose a set of five ML implementation tasks that involve learning relatively complex, hierarchical models. We completed those tasks on four different computational platforms and, using 70,000 hours of Amazon EC2 compute time, carefully compared the running times, tuning requirements, and ease of programming of each.