DOI: 10.1145/2926534.2926540

Bridging the gap: towards optimization across linear and relational algebra

Published: 26 June 2016

ABSTRACT

Advanced data analysis typically requires some form of pre-processing in order to extract and transform data before processing it with machine learning and statistical analysis techniques. Pre-processing pipelines are naturally expressed in dataflow APIs (e.g., MapReduce, Flink, etc.), while machine learning is expressed in linear algebra with iterations. Programmers therefore perform end-to-end data analysis utilizing multiple programming paradigms and systems. This impedance mismatch not only hinders productivity but also prevents optimization opportunities, such as sharing of physical data layouts (e.g., partitioning) and data structures among different parts of a data analysis program.

The goal of this work is twofold. First, it aims to alleviate the impedance mismatch by allowing programmers to author complete end-to-end programs in one engine-independent language that is automatically parallelized. Second, it aims to enable joint optimizations over both relational and linear algebra. To achieve this goal, we present the design of Lara, a deeply embedded language in Scala which enables authoring scalable programs using two abstract data types (DataBag and Matrix) and control flow constructs. Programs written in Lara are compiled to an intermediate representation (IR) which enables optimizations across linear and relational algebra. The IR is finally used to compile code for different execution engines.
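
As a rough illustration of the program shape described above (relational pre-processing over a DataBag, a conversion step into a Matrix, and an iterative linear-algebra loop driven by ordinary control flow), the following self-contained Scala sketch mimics that structure with toy stand-ins. The DataBag and Click types, the gradient-descent loop, and all identifiers below are illustrative assumptions and do not reflect the actual Lara API.

    object LaraStyleSketch {

      // Toy stand-in for a DataBag: just enough relational-style operators
      // (withFilter/map) to drive a for-comprehension. Not the real Lara API.
      final case class DataBag[A](elems: Vector[A]) {
        def withFilter(p: A => Boolean): DataBag[A] = DataBag(elems.filter(p))
        def map[B](f: A => B): DataBag[B]           = DataBag(elems.map(f))
      }

      // A raw record produced by the pre-processing pipeline (hypothetical schema).
      final case class Click(features: Array[Double], label: Double)

      def main(args: Array[String]): Unit = {
        val raw = DataBag(Vector(
          Click(Array(1.0, 2.0), 5.0),
          Click(Array(2.0, 0.5), 3.0),
          Click(Array(-1.0, 4.0), -1.0) // invalid label, dropped below
        ))

        // Relational part: filter and project with a for-comprehension.
        val training = for (c <- raw if c.label >= 0.0) yield (c.features, c.label)

        // Bridge: materialise the bag as a dense matrix X and a label vector y.
        val X: Array[Array[Double]] = training.elems.map(_._1).toArray
        val y: Array[Double]        = training.elems.map(_._2).toArray

        // Linear-algebra part: batch gradient descent for least squares,
        // written with ordinary Scala control flow.
        val n  = X.length
        val d  = X.head.length
        val w  = Array.fill(d)(0.0)
        val lr = 0.05
        for (_ <- 0 until 200) {
          val grad = Array.fill(d)(0.0)
          for (i <- 0 until n) {
            val err = X(i).indices.map(j => X(i)(j) * w(j)).sum - y(i) // x_i . w - y_i
            for (j <- 0 until d) grad(j) += err * X(i)(j)
          }
          for (j <- 0 until d) w(j) -= lr * grad(j) / n
        }

        println(s"learned weights: ${w.mkString(", ")}")
      }
    }

In Lara itself, both halves of such a program would be captured in a single intermediate representation, so that decisions such as physical data layout and partitioning can be shared between the relational and the linear-algebra parts.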


  • Published in

    BeyondMR '16: Proceedings of the 3rd ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond
    June 2016
    70 pages
    ISBN: 9781450343114
    DOI: 10.1145/2926534

    Copyright © 2016 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Qualifiers

    • short-paper

    Acceptance Rates

    BeyondMR '16 Paper Acceptance Rate: 10 of 19 submissions, 53%. Overall Acceptance Rate: 19 of 36 submissions, 53%.
