research-article

SystemML: declarative machine learning on spark

Authors:
Matthias Boehm

IBM Research --- Almaden

IBM Research --- Almaden
View Profile

,
Michael W. Dusenberry

IBM Spark Technology Center

IBM Spark Technology Center
View Profile

,
Deron Eriksson

IBM Spark Technology Center

IBM Spark Technology Center
View Profile

,
Alexandre V. Evfimievski

IBM Research --- Almaden

IBM Research --- Almaden
View Profile

,
Faraz Makari Manshadi

IBM Research --- Almaden

IBM Research --- Almaden
View Profile

,
Niketan Pansare

IBM Research --- Almaden

IBM Research --- Almaden
View Profile

,
Berthold Reinwald

IBM Research --- Almaden

IBM Research --- Almaden
View Profile

,
Frederick R. Reiss

IBM Research --- Almaden and IBM Spark Technology Center

IBM Research --- Almaden and IBM Spark Technology Center
View Profile

,
Prithviraj Sen

IBM Research --- Almaden

IBM Research --- Almaden
View Profile

,
Arvind C. Surve

IBM Spark Technology Center

IBM Spark Technology Center
View Profile

,
Shirish Tatikonda

IBM Research --- Almaden

IBM Research --- Almaden
View Profile

Proceedings of the VLDB Endowment Volume 9 Issue 13pp 1425–1436https://doi.org/10.14778/3007263.3007279

Published:01 September 2016Publication History

Proceedings of the VLDB Endowment

Abstract

The rising need for custom machine learning (ML) algorithms and the growing data sizes that require the exploitation of distributed, data-parallel frameworks such as MapReduce or Spark, pose significant productivity challenges to data scientists. Apache SystemML addresses these challenges through declarative ML by (1) increasing the productivity of data scientists as they are able to express custom algorithms in a familiar domain-specific language covering linear algebra primitives and statistical functions, and (2) transparently running these ML algorithms on distributed, data-parallel frameworks by applying cost-based compilation techniques to generate efficient, low-level execution plans with in-memory single-node and large-scale distributed operations. This paper describes SystemML on Apache Spark, end to end, including insights into various optimizer and runtime techniques as well as performance characteristics. We also share lessons learned from porting SystemML to Spark and declarative ML in general. Finally, SystemML is open-source, which allows the database community to leverage it as a testbed for further research.

References

M. Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. CoRR, abs/1603.04467, 2016.Google Scholar
A. Alexandrov et al. The Stratosphere Platform for Big Data Analytics. VLDB J., 23(6), 2014. Google ScholarDigital Library
A. Alexandrov et al. Implicit Parallelism through Deep Language Embedding. In SIGMOD, 2015. Google ScholarDigital Library
Apache. Mahout, 2016. http://mahout.apache.org.Google Scholar
Apache. Spark, 2016. http://spark.apache.org.Google Scholar
Apache. SystemML (incubating), 2016. http://systemml.apache.org.Google Scholar
A. Ashari et al. On Optimizing Machine Learning Workloads via Kernel Fusion. In PPoPP, 2015. Google ScholarDigital Library
M. Boehm et al. Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML. PVLDB, 7(7), 2014. Google ScholarDigital Library
M. Boehm et al. SystemML's Optimizer: Plan Generation for Large-Scale Machine Learning Programs. IEEE Data Eng. Bull., 37(3), 2014.Google Scholar
M. Boehm et al. Declarative Machine Learning --- A Classification of Basic Properties and Types. CoRR, abs/1605.05826, 2016.Google Scholar
Z. Cai et al. Simulation of Database-Valued Markov Chains Using SimSQL. In SIGMOD, 2013. Google ScholarDigital Library
A. Crotty et al. An Architecture for Compiling UDF-centric Workflows. PVLDB, 8(12), 2015. Google ScholarDigital Library
R. R. Curtin et al. MLPACK: A Scalable C++ Machine Learning Library. JMLR, 14(1), 2013. Google ScholarDigital Library
J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, 2004. Google ScholarDigital Library
A. Ghoting et al. SystemML: Declarative Machine Learning on MapReduce. In ICDE, 2011. Google ScholarDigital Library
H2O. H2O. http://h2o.ai/product/.Google Scholar
J. M. Hellerstein et al. The MADlib Analytics Library or MAD Skills, the SQL. PVLDB, 5(12), 2012. Google ScholarDigital Library
B. Huang, S. Babu, and J. Yang. Cumulon: Optimizing Statistical Data Analysis in the Cloud. In SIGMOD, 2013. Google ScholarDigital Library
B. Huang et al. Resource Elasticity for Large-Scale Machine Learning. In SIGMOD, 2015. Google ScholarDigital Library
T. Kraska et al. MLbase: A Distributed Machine-learning System. In CIDR, 2013.Google Scholar
J. Langford, L. Li, and A. Strehl. Vowpal Wabbit Online Learning Project, 2007.Google Scholar
C. Liu et al. Distributed Nonnegative Matrix Factorization for Web-Scale Dyadic Data Analysis on MapReduce. In WWW, 2010. Google ScholarDigital Library
Y. Low et al. Distributed GraphLab: A Framework for Machine Learning in the Cloud. PVLDB, 5(8), 2012. Google ScholarDigital Library
D. Lyubimov. Mahout Scala Bindings and Mahout Spark Bindings for Linear Algebra Subroutines. Apache, 2016.Google Scholar
X. Meng et al. MLlib: Machine Learning in Apache Spark. CoRR, abs/1505.06807, 2015.Google Scholar
Microsoft. Distributed Machine Learning Toolkit.Google Scholar
R-project. CRAN Task View: High-Performance and Parallel Computing with R, 2016.Google Scholar
Revolution Analytics. Revolution R Enterprise ScaleR, 2016.Google Scholar
Skytree. Skytree Machine Learning Software.Google Scholar
Sonnenburg et al. The SHOGUN Machine Learning Toolbox. JMLR, 11, 2010. Google ScholarDigital Library
E. R. Sparks et al. MLI: An API for Distributed Machine Learning. In ICDM, 2013.Google ScholarCross Ref
E. R. Sparks et al. Automating Model Search for Large Scale Machine Learning. In SOCC, 2015. Google ScholarDigital Library
M. Stonebraker et al. The Architecture of SciDB. In SSDBM, 2011. Google ScholarDigital Library
A. K. Sujeeth et al. OptiML: An Implicitly Parallel Domain-Specific Language for Machine Learning. In ICML, 2011.Google Scholar
Y. Tian, S. Tatikonda, and B. Reinwald. Scalable and Numerically Stable Descriptive Statistics in SystemML. In ICDE, 2012. Google ScholarDigital Library
V. K. Vavilapalli et al. Apache Hadoop YARN: Yet Another Resource Negotiator. In SOCC, 2013. Google ScholarDigital Library
S. Venkataraman et al. Presto: Distributed Machine Learning and Graph Processing with Sparse Matrices. In Eurosys, 2013. Google ScholarDigital Library
M. Weimer et. al. REEF: Retainable Evaluator Execution Framework. In SIGMOD, 2015. Google ScholarDigital Library
L. Yu, Y. Shao, and B. Cui. Exploiting Matrix Dependency for Efficient Distributed Matrix Computation. In SIGMOD, 2015. Google ScholarDigital Library
R. B. Zadeh et al. linalg: Matrix Computations in Apache Spark. CoRR, abs/1509.02256, 2015.Google Scholar
M. Zaharia et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. In NSDI, 2012. Google ScholarDigital Library
C. Zhang, A. Kumar, and C. Ré. Materialization Optimizations for Feature Selection Workloads. In SIGMOD, 2014. Google ScholarDigital Library
Y. Zhou et al. Large-Scale Parallel Collaborative Filtering for the Netflix Prize. In AAIM, 2008. Google ScholarDigital Library

Recommendations

SystemML: Declarative machine learning on MapReduce
ICDE '11: Proceedings of the 2011 IEEE 27th International Conference on Data Engineering

MapReduce is emerging as a generic parallel programming paradigm for large clusters of machines. This trend combined with the growing need to run machine learning (ML) algorithms on massive datasets has led to an increased interest in implementing ML ...
Read More
Compiling machine learning algorithms with SystemML
SOCC '13: Proceedings of the 4th annual Symposium on Cloud Computing

Analytics on big data range from passenger volume prediction in transportation to customer satisfaction in automotive diagnostic systems, and from correlation analysis in social media data to log analysis in manufacturing. Expressing and running these ...
Read More
Scalable and Numerically Stable Descriptive Statistics in SystemML
ICDE '12: Proceedings of the 2012 IEEE 28th International Conference on Data Engineering

With the exponential growth in the amount of data that is being generated in recent years, there is a pressing need for applying machine learning algorithms to large data sets. SystemML is a framework that employs a declarative approach for large scale ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 9, Issue 13
September 2016
378 pages
ISSN:2150-8097
Editor:
Surajit Chaudhuri
Microsoft Research
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 September 2016
Published in pvldb Volume 9, Issue 13
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 86
  Total Citations
  View Citations
- 872
  Total Downloads
- Downloads (Last 12 months)63
- Downloads (Last 6 weeks)8
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

SystemML: declarative machine learning on spark

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

SystemML: Declarative machine learning on MapReduce

Compiling machine learning algorithms with SystemML

Scalable and Numerically Stable Descriptive Statistics in SystemML

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

SystemML: declarative machine learning on spark

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

SystemML: Declarative machine learning on MapReduce

Compiling machine learning algorithms with SystemML

Scalable and Numerically Stable Descriptive Statistics in SystemML

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media