Abstract
The rising need for custom machine learning (ML) algorithms and growing data sizes, which require distributed, data-parallel frameworks such as MapReduce or Spark, pose significant productivity challenges to data scientists. Apache SystemML addresses these challenges through declarative ML by (1) increasing the productivity of data scientists, who can express custom algorithms in a familiar domain-specific language covering linear algebra primitives and statistical functions, and (2) transparently running these ML algorithms on distributed, data-parallel frameworks by applying cost-based compilation techniques to generate efficient, low-level execution plans with in-memory single-node and large-scale distributed operations. This paper describes SystemML on Apache Spark end to end, including insights into various optimizer and runtime techniques as well as performance characteristics. We also share lessons learned from porting SystemML to Spark and from declarative ML in general. Finally, SystemML is open source, which allows the database community to use it as a testbed for further research.