SystemML: Declarative Machine Learning on Spark

Published: 01 September 2016

Abstract

The rising need for custom machine learning (ML) algorithms and the growing data sizes that require the use of distributed, data-parallel frameworks such as MapReduce or Spark pose significant productivity challenges for data scientists. Apache SystemML addresses these challenges through declarative ML by (1) increasing the productivity of data scientists, who can express custom algorithms in a familiar domain-specific language covering linear algebra primitives and statistical functions, and (2) transparently running these ML algorithms on distributed, data-parallel frameworks by applying cost-based compilation techniques to generate efficient, low-level execution plans with in-memory single-node and large-scale distributed operations. This paper describes SystemML on Apache Spark end to end, including insights into various optimizer and runtime techniques as well as performance characteristics. We also share lessons learned from porting SystemML to Spark and from declarative ML in general. Finally, SystemML is open source, which allows the database community to leverage it as a testbed for further research.
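To make point (1) concrete, below is a minimal sketch of a custom algorithm in SystemML's R-like scripting language, DML: batch gradient descent for linear regression, written entirely in the linear algebra primitives the abstract mentions. This is an illustrative example rather than code from the paper; the input/output bindings ($X, $Y, $W), the step size, and the iteration budget are assumptions chosen for brevity.

    # Hypothetical DML script: batch gradient descent for linear regression.
    X = read($X)                            # n x m feature matrix
    y = read($Y)                            # n x 1 label vector
    w = matrix(0, rows = ncol(X), cols = 1) # initial weights

    max_iter = 100                          # assumed iteration budget
    step = 0.0001                           # assumed fixed step size

    i = 0
    while (i < max_iter) {
      grad = t(X) %*% (X %*% w - y)         # gradient of the squared loss
      w = w - step * grad
      i = i + 1
    }

    write(w, $W)                            # persist the learned weights

Per point (2), a script like this contains no execution details: SystemML's cost-based compiler decides, based on data and cluster characteristics, whether each operation runs as an in-memory single-node operation or as a distributed Spark job, with no change to the script itself.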


  • Published in

    Proceedings of the VLDB Endowment, Volume 9, Issue 13 (September 2016), 378 pages
    ISSN: 2150-8097
    Publisher: VLDB Endowment
    Published: 1 September 2016
    Qualifier: research-article
