ABSTRACT
Advanced data analysis typically requires some form of pre-processing in order to extract and transform data before processing it with machine learning and statistical analysis techniques. Pre-processing pipelines are naturally expressed in dataflow APIs (e.g., MapReduce, Flink, etc.), while machine learning is expressed in linear algebra with iterations. Programmers therefore perform end-to-end data analysis utilizing multiple programming paradigms and systems. This impedance mismatch not only hinders productivity but also prevents optimization opportunities, such as sharing of physical data layouts (e.g., partitioning) and data structures among different parts of a data analysis program.
The goal of this work is twofold. First, it aims to alleviate the impedance mismatch by allowing programmers to author complete end-to-end programs in one engine-independent language that is automatically parallelized. Second, it aims to enable joint optimizations over both relational and linear algebra. To achieve this goal, we present the design of Lara, a deeply embedded language in Scala which enables authoring scalable programs using two abstract data types (DataBag and Matrix) and control flow constructs. Programs written in Lara are compiled to an intermediate representation (IR) which enables optimizations across linear and relational algebra. The IR is finally used to compile code for different execution engines.
- A. Alexandrov, A. Kunft, A. Katsifodimos, F. Schüler, L. Thamsen, O. Kao, T. Herb, and V. Markl. Implicit parallelism through deep language embedding. In ACM SIGMOD, 2015. Google ScholarDigital Library
- A. Alexandrov, A. Salzmann, G. Krastev, A. Katsifodimos, and V. Markl. Emma in action: Declarative dataflows for scalable data analysis. In ACM SIGMOD, 2016. Google ScholarDigital Library
- M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A. Ghodsi, and M. Zaharia. Spark SQL: relational data processing in spark. In ACM SIGMOD, 2015. Google ScholarDigital Library
- M. Boehm, D. R. Burdick, A. V. Evfimievski, B. Reinwald, F. R. Reiss, P. Sen, S. Tatikonda, and Y. Tian. SystemML's Optimizer: Plan Generation for Large-Scale Machine Learning Programs. IEEE Data Engineering Bulletin, 2014.Google Scholar
- P. G. Brown. Overview of scidb: Large scale array storage, processing and analysis. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data, 2010. Google ScholarDigital Library
- H. Chafi, A. K. Sujeeth, K. J. Brown, H. Lee, A. R. Atreya, and K. Olukotun. A domain-specific approach to heterogeneous parallelism. ACM SIGPLAN Notices, 2011. Google ScholarDigital Library
- S. Chaudhuri and K. Shim. Including group-by in query optimization. In VLDB, 1994. Google ScholarDigital Library
- H. Ehrig and B. Mahr. Fundamentals of Algebraic Specification 1: Equations und Initial Semantics, volume 6 of EATCS Monographs on Theoretical Computer Science. Springer, 1985. Google ScholarDigital Library
- L. Fegaras and D. Maier. Optimizing object queries using an effective calculus. ACM TODS, 2000. Google ScholarDigital Library
- A. Ghoting, R. Krishnamurthy, E. Pednault, B. Reinwald, V. Sindhwani, S. Tatikonda, Y. Tian, and S. Vaithyanathan. SystemML: Declarative machine learning on MapReduce. ICDE, 2011. Google ScholarDigital Library
- T. Kraska, A. Talwalkar, J. Duchi, R. Griffith, M. J. Franklin, and M. Jordan. MLbase: A Distributed Machine-learning System. In CIDR, 2013.Google Scholar
- A. Kumar, J. Naughton, and J. M. Patel. Learning Generalized Linear Models Over Normalized Data. ACM SIMGOD, 2015. Google ScholarDigital Library
- D. Maier and B. Vance. A call to order. In ACM PODS, 1993. Google ScholarDigital Library
- C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In ACM SIGMOD, 2008. Google ScholarDigital Library
- A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Anthony, H. Liu, and R. Murthy. Hive - a petabyte scale data warehouse using hadoop. In ICDE. IEEE, 2010.Google ScholarCross Ref
Recommendations
Bridging the Gap Between Object-Oriented and Logic Programming
A description is given of an interface that was developed between Loops and Xerox Quintus Prolog. Loops is an extension to the Xerox AI environment to support object-oriented programming; Xerox Quintus Prolog is a version of Prolog that runs on Xerox ...
Bridging the gulf: a common intermediate language for ML and Haskell
POPL '98: Proceedings of the 25th ACM SIGPLAN-SIGACT symposium on Principles of programming languagesCompilers for ML and Haskell use intermediate languages that incorporate deeply-embedded assumptions about order of evaluation and side effects. We propose an intermediate language into which one can compile both ML and Haskell, thereby facilitating the ...
Bridging the language gap in scientific computing: the Chasm approach
Computational FrameworksChasm is a toolkit providing seamless language interoperability between Fortran 95 and C++. Language interoperability is important to scientific programmers because scientific applications are predominantly written in Fortran, while software tools are ...
Comments