research-article

Towards linear algebra over normalized data

Authors:
Lingjiao Chen

University of Wisconsin-Madison

University of Wisconsin-Madison
View Profile

,
Arun Kumar

University of California

University of California
View Profile

,
Jeffrey Naughton

Google

Google
View Profile

,
Jignesh M. Patel

University of Wisconsin-Madison

University of Wisconsin-Madison
View Profile

Proceedings of the VLDB Endowment Volume 10 Issue 11pp 1214–1225https://doi.org/10.14778/3137628.3137633

Published:01 August 2017Publication History

Proceedings of the VLDB Endowment

Abstract

Providing machine learning (ML) over relational data is a mainstream requirement for data analytics systems. While almost all ML tools require the input data to be presented as a single table, many datasets are multi-table. This forces data scientists to join those tables first, which often leads to data redundancy and runtime waste. Recent works on "factorized" ML mitigate this issue for a few specific ML algorithms by pushing ML through joins. But their approaches require a manual rewrite of ML implementations. Such piecemeal methods create a massive development overhead when extending such ideas to other ML algorithms. In this paper, we show that it is possible to mitigate this overhead by leveraging a popular formal algebra to represent the computations of many ML algorithms: linear algebra. We introduce a new logical data type to represent normalized data and devise a framework of algebraic rewrite rules to convert a large set of linear algebra operations over denormalized data into operations over normalized data. We show how this enables us to automatically "factorize" several popular ML algorithms, thus unifying and generalizing several prior works. We prototype our framework in the popular ML environment R and an industrial R-over-RDBMS tool. Experiments with both synthetic and real normalized data show that our framework also yields significant speed-ups, up to 36x on real data.

References

Matlab mmtimes function. mathworks.com/matlabcentral/fileexchange/27950-mmtimes-matrix-chain-product.Google Scholar
Oracle R Enterprise.Google Scholar
R. r-project.org.Google Scholar
SparkR. spark.apache.org/R.Google Scholar
M. Abadi et al. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, 2016. Google ScholarDigital Library
E. Anderson et al. LAPACK Users' Guide. SIAM, 1999. Google ScholarDigital Library
N. Bakibayev et al. FDB: A Query Engine for Factorised Relational Databases. In VLDB, 2012. Google ScholarDigital Library
M. Boehm et al. Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML. In VLDB, 2014. Google ScholarDigital Library
M. Boehm et al. SystemML: Declarative Machine Learning on Spark. In VLDB, 2016. Google ScholarDigital Library
Z. Cai et al. Simulation of Database-valued Markov Chains Using SimSQL. In SIGMOD, 2013. Google ScholarDigital Library
Z. Cai et al. A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms. In SIGMOD, 2014. Google ScholarDigital Library
S. Chaudhuri and K. Shim. Including Group-By in Query Optimization. In VLDB, 1994. Google ScholarDigital Library
L. Chen et al. Towards Linear Algebra over Normalized Data. https://arxiv.org/abs/1612.07448.Google Scholar
P. Cudré-Mauroux et al. A demonstration of SciDB: A science-oriented DBMS. PVLDB, 2(2):1534--1537, 2009. Google ScholarDigital Library
J. Dongarra et al. An Extended Set of FORTRAN Basic Linear Algebra Subprograms. ACM Trans. Math. Softw., 14(1):1--17, 1988. Google ScholarDigital Library
L. Eldén. Matrix Methods in Data Mining and Pattern Recognition. SIAM, 2007.Google Scholar
T. Elgamal et al. SPOOF: Sum-Product Optimization and Operator Fusion for Large-Scale ML. In CIDR, 2017.Google Scholar
A. Elgohary et al. Compressed Linear Algebra for Large-scale Machine Learning. In VLDB, 2016. Google ScholarDigital Library
X. Feng, A. Kumar, B. Recht, and C. Ré. Towards a Unified Architecture for in-RDBMS Analytics. In SIGMOD, 2012. Google ScholarDigital Library
J. Hellerstein et al. The MADlib Analytics Library or MAD Skills, the SQL. In VLDB, 2012. Google ScholarDigital Library
R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, New York, NY, USA, 2nd edition, 2012. Google ScholarDigital Library
T. C. Hu and M. T. Shing. Computation of Matrix Chain Products. Part I. SIAM J. Comput., 11(2):362--373, 1982.Google Scholar
T. Kraska et al. MLbase: A Distributed Machine-learning System. In CIDR, 2013.Google Scholar
A. Kumar et al. Demonstration of Santoku: Optimizing Machine Learning over Normalized Data. In VLDB, 2015. Google ScholarDigital Library
A. Kumar et al. Learning Generalized Linear Models Over Normalized Data. In SIGMOD, 2015. Google ScholarDigital Library
A. Kumar et al. Model Selection Management Systems: The Next Frontier of Advanced Analytics. ACM SIGMOD Rec., Dec. 2015. Google ScholarDigital Library
A. Kumar et al. To Join or Not to Join? Thinking Twice about Joins before Feature Selection. In SIGMOD, 2016. Google ScholarDigital Library
A. Kunft et al. Bridging the Gap: Towards Optimization Across Linear and Relational Algebra. In SIGMOD Beyond M R Workshop, 2016. Google ScholarDigital Library
C. L. Lawson et al. Basic Linear Algebra Subprograms for Fortran Usage. ACM Trans. Math. Softw., 5(3):308--323, 1979. Google ScholarDigital Library
M. Nikolic et al. LINVIEW: incremental view maintenance for complex analytical queries. In SIGMOD, 2014. Google ScholarDigital Library
D. Olteanu and M. Schleich. F: Regression Models over Factorized Views. PVLDB, 9(13):1573--1576, 2016. Google ScholarDigital Library
R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill, Inc., 2003. Google ScholarDigital Library
S. Rendle. Scaling Factorization Machines to Relational Data. In VLDB, 2013. Google ScholarDigital Library
M. Schleich et al. Learning Linear Regression Models over Factorized Joins. In SIGMOD, 2016. Google ScholarDigital Library
T. K. Sellis. Multiple-Query Optimization. ACM TODS, 13(1):23--52, Mar. 1988. Google ScholarDigital Library
W. P. Yan and P.-Å. Larson. Eager Aggregation and Lazy Aggregation. In VLDB, 1995. Google ScholarDigital Library
Y. Zhang et al. I/O-Efficient Statistical Computing with RIOT. In ICDE, 2010.Google Scholar

Recommendations

Learning Generalized Linear Models Over Normalized Data
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Enterprise data analytics is a booming area in the data management industry. Many companies are racing to develop toolkits that closely integrate statistical and machine learning techniques with data management systems. Almost all such toolkits assume ...
Read More
Accelerating Linear Algebra over Normalized Data
SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data
Read More
Towards a typed omega algebra
RAMICS'11: Proceedings of the 12th international conference on Relational and algebraic methods in computer science

We propose axioms for 1-free omega algebra, typed 1-free omega algebra and typed omega algebra. They are based on Kozen's axioms for 1-free and typed Kleene algebra and Cohen's axioms for the omega operation. In contrast to Kleene algebra, several laws ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 10, Issue 11
August 2017
432 pages
ISSN:2150-8097
Editors:
Peter Boncz
CWI
,
Ken Salem
University of Waterloo
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 August 2017
Published in pvldb Volume 10, Issue 11
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 40
  Total Citations
  View Citations
- 137
  Total Downloads
- Downloads (Last 12 months)14
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Towards linear algebra over normalized data

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Learning Generalized Linear Models Over Normalized Data

Accelerating Linear Algebra over Normalized Data

Towards a typed omega algebra

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Towards linear algebra over normalized data

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Learning Generalized Linear Models Over Normalized Data

Accelerating Linear Algebra over Normalized Data

Towards a typed omega algebra

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media