short-paper

Bridging the gap: towards optimization across linear and relational algebra

Authors:
Andreas Kunft

TU Berlin

TU Berlin
View Profile

,
Alexander Alexandrov

TU Berlin

TU Berlin
View Profile

,
Asterios Katsifodimos

TU Berlin

TU Berlin
View Profile

,
Volker Markl

TU Berlin

TU Berlin
View Profile

BeyondMR '16: Proceedings of the 3rd ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and BeyondJune 2016Article No.: 1Pages 1–4https://doi.org/10.1145/2926534.2926540

Published:26 June 2016Publication History

BeyondMR '16: Proceedings of the 3rd ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond

Pages 1–4

ABSTRACT

Advanced data analysis typically requires some form of pre-processing in order to extract and transform data before processing it with machine learning and statistical analysis techniques. Pre-processing pipelines are naturally expressed in dataflow APIs (e.g., MapReduce, Flink, etc.), while machine learning is expressed in linear algebra with iterations. Programmers therefore perform end-to-end data analysis utilizing multiple programming paradigms and systems. This impedance mismatch not only hinders productivity but also prevents optimization opportunities, such as sharing of physical data layouts (e.g., partitioning) and data structures among different parts of a data analysis program.

The goal of this work is twofold. First, it aims to alleviate the impedance mismatch by allowing programmers to author complete end-to-end programs in one engine-independent language that is automatically parallelized. Second, it aims to enable joint optimizations over both relational and linear algebra. To achieve this goal, we present the design of Lara, a deeply embedded language in Scala which enables authoring scalable programs using two abstract data types (DataBag and Matrix) and control flow constructs. Programs written in Lara are compiled to an intermediate representation (IR) which enables optimizations across linear and relational algebra. The IR is finally used to compile code for different execution engines.

References

A. Alexandrov, A. Kunft, A. Katsifodimos, F. Schüler, L. Thamsen, O. Kao, T. Herb, and V. Markl. Implicit parallelism through deep language embedding. In ACM SIGMOD, 2015. Google ScholarDigital Library
A. Alexandrov, A. Salzmann, G. Krastev, A. Katsifodimos, and V. Markl. Emma in action: Declarative dataflows for scalable data analysis. In ACM SIGMOD, 2016. Google ScholarDigital Library
M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia. Spark SQL: relational data processing in spark. In ACM SIGMOD, 2015. Google ScholarDigital Library
M. Boehm, D. R. Burdick, A. V. Evfimievski, B. Reinwald, F. R. Reiss, P. Sen, S. Tatikonda, and Y. Tian. SystemML's Optimizer: Plan Generation for Large-Scale Machine Learning Programs. IEEE Data Engineering Bulletin, 2014.Google Scholar
P. G. Brown. Overview of scidb: Large scale array storage, processing and analysis. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010. Google ScholarDigital Library
H. Chafi, A. K. Sujeeth, K. J. Brown, H. Lee, A. R. Atreya, and K. Olukotun. A domain-specific approach to heterogeneous parallelism. ACM SIGPLAN Notices, 2011. Google ScholarDigital Library
S. Chaudhuri and K. Shim. Including group-by in query optimization. In VLDB, 1994. Google ScholarDigital Library
H. Ehrig and B. Mahr. Fundamentals of Algebraic Specification 1: Equations und Initial Semantics, volume 6 of EATCS Monographs on Theoretical Computer Science. Springer, 1985. Google ScholarDigital Library
L. Fegaras and D. Maier. Optimizing object queries using an effective calculus. ACM TODS, 2000. Google ScholarDigital Library
A. Ghoting, R. Krishnamurthy, E. Pednault, B. Reinwald, V. Sindhwani, S. Tatikonda, Y. Tian, and S. Vaithyanathan. SystemML: Declarative machine learning on MapReduce. ICDE, 2011. Google ScholarDigital Library
T. Kraska, A. Talwalkar, J. Duchi, R. Griffith, M. J. Franklin, and M. Jordan. MLbase: A Distributed Machine-learning System. In CIDR, 2013.Google Scholar
A. Kumar, J. Naughton, and J. M. Patel. Learning Generalized Linear Models Over Normalized Data. ACM SIMGOD, 2015. Google ScholarDigital Library
D. Maier and B. Vance. A call to order. In ACM PODS, 1993. Google ScholarDigital Library
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In ACM SIGMOD, 2008. Google ScholarDigital Library
A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony, H. Liu, and R. Murthy. Hive - a petabyte scale data warehouse using hadoop. In ICDE. IEEE, 2010.Google ScholarCross Ref

Recommendations

Bridging the Gap Between Object-Oriented and Logic Programming

A description is given of an interface that was developed between Loops and Xerox Quintus Prolog. Loops is an extension to the Xerox AI environment to support object-oriented programming; Xerox Quintus Prolog is a version of Prolog that runs on Xerox ...
Read More
Bridging the gulf: a common intermediate language for ML and Haskell
POPL '98: Proceedings of the 25th ACM SIGPLAN-SIGACT symposium on Principles of programming languages

Compilers for ML and Haskell use intermediate languages that incorporate deeply-embedded assumptions about order of evaluation and side effects. We propose an intermediate language into which one can compile both ML and Haskell, thereby facilitating the ...
Read More
Bridging the language gap in scientific computing: the Chasm approach
Computational Frameworks

Chasm is a toolkit providing seamless language interoperability between Fortran 95 and C++. Language interoperability is important to scientific programmers because scientific applications are predominantly written in Fortran, while software tools are ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
BeyondMR '16: Proceedings of the 3rd ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond
June 2016
70 pages
ISBN:9781450343114
DOI:10.1145/2926534
Program Chairs:
Foto Afrati
National Technical University of Athens
,
Jacek Sroka
University of Warsaw, Poland
,
Jan Hidders
Vrije Universiteit Brussel, Belgium
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 26 June 2016
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- short-paper
Conference

Acceptance Rates
BeyondMR '16 Paper Acceptance Rate10of19submissions,53%Overall Acceptance Rate19of36submissions,53%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 28
  Total Citations
  View Citations
- 375
  Total Downloads
- Downloads (Last 12 months)16
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Bridging the gap: towards optimization across linear and relational algebra

BeyondMR '16: Proceedings of the 3rd ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond

ABSTRACT

References

Cited By

Recommendations

Bridging the Gap Between Object-Oriented and Logic Programming

Bridging the gulf: a common intermediate language for ML and Haskell

Bridging the language gap in scientific computing: the Chasm approach