Abstract
Providing machine learning (ML) over relational data is a mainstream requirement for data analytics systems. While almost all ML tools require the input data to be presented as a single table, many datasets are multi-table. This forces data scientists to join those tables first, which often leads to data redundancy and runtime waste. Recent works on "factorized" ML mitigate this issue for a few specific ML algorithms by pushing ML through joins. But their approaches require a manual rewrite of ML implementations. Such piecemeal methods create a massive development overhead when extending such ideas to other ML algorithms. In this paper, we show that it is possible to mitigate this overhead by leveraging a popular formal algebra to represent the computations of many ML algorithms: linear algebra. We introduce a new logical data type to represent normalized data and devise a framework of algebraic rewrite rules to convert a large set of linear algebra operations over denormalized data into operations over normalized data. We show how this enables us to automatically "factorize" several popular ML algorithms, thus unifying and generalizing several prior works. We prototype our framework in the popular ML environment R and an industrial R-over-RDBMS tool. Experiments with both synthetic and real normalized data show that our framework also yields significant speed-ups, up to 36x on real data.
- Matlab mmtimes function. mathworks.com/matlabcentral/fileexchange/27950-mmtimes-matrix-chain-product.Google Scholar
- Oracle R Enterprise.Google Scholar
- R. r-project.org.Google Scholar
- SparkR. spark.apache.org/R.Google Scholar
- M. Abadi et al. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, 2016. Google ScholarDigital Library
- E. Anderson et al. LAPACK Users' Guide. SIAM, 1999. Google ScholarDigital Library
- N. Bakibayev et al. FDB: A Query Engine for Factorised Relational Databases. In VLDB, 2012. Google ScholarDigital Library
- M. Boehm et al. Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML. In VLDB, 2014. Google ScholarDigital Library
- M. Boehm et al. SystemML: Declarative Machine Learning on Spark. In VLDB, 2016. Google ScholarDigital Library
- Z. Cai et al. Simulation of Database-valued Markov Chains Using SimSQL. In SIGMOD, 2013. Google ScholarDigital Library
- Z. Cai et al. A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms. In SIGMOD, 2014. Google ScholarDigital Library
- S. Chaudhuri and K. Shim. Including Group-By in Query Optimization. In VLDB, 1994. Google ScholarDigital Library
- L. Chen et al. Towards Linear Algebra over Normalized Data. https://arxiv.org/abs/1612.07448.Google Scholar
- P. Cudré-Mauroux et al. A demonstration of SciDB: A science-oriented DBMS. PVLDB, 2(2):1534--1537, 2009. Google ScholarDigital Library
- J. Dongarra et al. An Extended Set of FORTRAN Basic Linear Algebra Subprograms. ACM Trans. Math. Softw., 14(1):1--17, 1988. Google ScholarDigital Library
- L. Eldén. Matrix Methods in Data Mining and Pattern Recognition. SIAM, 2007.Google Scholar
- T. Elgamal et al. SPOOF: Sum-Product Optimization and Operator Fusion for Large-Scale ML. In CIDR, 2017.Google Scholar
- A. Elgohary et al. Compressed Linear Algebra for Large-scale Machine Learning. In VLDB, 2016. Google ScholarDigital Library
- X. Feng, A. Kumar, B. Recht, and C. Ré. Towards a Unified Architecture for in-RDBMS Analytics. In SIGMOD, 2012. Google ScholarDigital Library
- J. Hellerstein et al. The MADlib Analytics Library or MAD Skills, the SQL. In VLDB, 2012. Google ScholarDigital Library
- R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, New York, NY, USA, 2nd edition, 2012. Google ScholarDigital Library
- T. C. Hu and M. T. Shing. Computation of Matrix Chain Products. Part I. SIAM J. Comput., 11(2):362--373, 1982.Google Scholar
- T. Kraska et al. MLbase: A Distributed Machine-learning System. In CIDR, 2013.Google Scholar
- A. Kumar et al. Demonstration of Santoku: Optimizing Machine Learning over Normalized Data. In VLDB, 2015. Google ScholarDigital Library
- A. Kumar et al. Learning Generalized Linear Models Over Normalized Data. In SIGMOD, 2015. Google ScholarDigital Library
- A. Kumar et al. Model Selection Management Systems: The Next Frontier of Advanced Analytics. ACM SIGMOD Rec., Dec. 2015. Google ScholarDigital Library
- A. Kumar et al. To Join or Not to Join? Thinking Twice about Joins before Feature Selection. In SIGMOD, 2016. Google ScholarDigital Library
- A. Kunft et al. Bridging the Gap: Towards Optimization Across Linear and Relational Algebra. In SIGMOD Beyond M R Workshop, 2016. Google ScholarDigital Library
- C. L. Lawson et al. Basic Linear Algebra Subprograms for Fortran Usage. ACM Trans. Math. Softw., 5(3):308--323, 1979. Google ScholarDigital Library
- M. Nikolic et al. LINVIEW: incremental view maintenance for complex analytical queries. In SIGMOD, 2014. Google ScholarDigital Library
- D. Olteanu and M. Schleich. F: Regression Models over Factorized Views. PVLDB, 9(13):1573--1576, 2016. Google ScholarDigital Library
- R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill, Inc., 2003. Google ScholarDigital Library
- S. Rendle. Scaling Factorization Machines to Relational Data. In VLDB, 2013. Google ScholarDigital Library
- M. Schleich et al. Learning Linear Regression Models over Factorized Joins. In SIGMOD, 2016. Google ScholarDigital Library
- T. K. Sellis. Multiple-Query Optimization. ACM TODS, 13(1):23--52, Mar. 1988. Google ScholarDigital Library
- W. P. Yan and P.-Å. Larson. Eager Aggregation and Lazy Aggregation. In VLDB, 1995. Google ScholarDigital Library
- Y. Zhang et al. I/O-Efficient Statistical Computing with RIOT. In ICDE, 2010.Google Scholar
Recommendations
Learning Generalized Linear Models Over Normalized Data
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of DataEnterprise data analytics is a booming area in the data management industry. Many companies are racing to develop toolkits that closely integrate statistical and machine learning techniques with data management systems. Almost all such toolkits assume ...
Accelerating Linear Algebra over Normalized Data
SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of DataTowards a typed omega algebra
RAMICS'11: Proceedings of the 12th international conference on Relational and algebraic methods in computer scienceWe propose axioms for 1-free omega algebra, typed 1-free omega algebra and typed omega algebra. They are based on Kozen's axioms for 1-free and typed Kleene algebra and Cohen's axioms for the omega operation. In contrast to Kleene algebra, several laws ...
Comments