skip to main content
research-article

Towards linear algebra over normalized data

Published:01 August 2017Publication History
Skip Abstract Section

Abstract

Providing machine learning (ML) over relational data is a mainstream requirement for data analytics systems. While almost all ML tools require the input data to be presented as a single table, many datasets are multi-table. This forces data scientists to join those tables first, which often leads to data redundancy and runtime waste. Recent works on "factorized" ML mitigate this issue for a few specific ML algorithms by pushing ML through joins. But their approaches require a manual rewrite of ML implementations. Such piecemeal methods create a massive development overhead when extending such ideas to other ML algorithms. In this paper, we show that it is possible to mitigate this overhead by leveraging a popular formal algebra to represent the computations of many ML algorithms: linear algebra. We introduce a new logical data type to represent normalized data and devise a framework of algebraic rewrite rules to convert a large set of linear algebra operations over denormalized data into operations over normalized data. We show how this enables us to automatically "factorize" several popular ML algorithms, thus unifying and generalizing several prior works. We prototype our framework in the popular ML environment R and an industrial R-over-RDBMS tool. Experiments with both synthetic and real normalized data show that our framework also yields significant speed-ups, up to 36x on real data.

References

  1. Matlab mmtimes function. mathworks.com/matlabcentral/fileexchange/27950-mmtimes-matrix-chain-product.Google ScholarGoogle Scholar
  2. Oracle R Enterprise.Google ScholarGoogle Scholar
  3. R. r-project.org.Google ScholarGoogle Scholar
  4. SparkR. spark.apache.org/R.Google ScholarGoogle Scholar
  5. M. Abadi et al. TensorFlow: A System for Large-Scale Machine Learning. In OSDI, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. E. Anderson et al. LAPACK Users' Guide. SIAM, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. N. Bakibayev et al. FDB: A Query Engine for Factorised Relational Databases. In VLDB, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. M. Boehm et al. Hybrid Parallelization Strategies for Large-Scale Machine Learning in SystemML. In VLDB, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. M. Boehm et al. SystemML: Declarative Machine Learning on Spark. In VLDB, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Z. Cai et al. Simulation of Database-valued Markov Chains Using SimSQL. In SIGMOD, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Z. Cai et al. A Comparison of Platforms for Implementing and Running Very Large Scale Machine Learning Algorithms. In SIGMOD, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. S. Chaudhuri and K. Shim. Including Group-By in Query Optimization. In VLDB, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. L. Chen et al. Towards Linear Algebra over Normalized Data. https://arxiv.org/abs/1612.07448.Google ScholarGoogle Scholar
  14. P. Cudré-Mauroux et al. A demonstration of SciDB: A science-oriented DBMS. PVLDB, 2(2):1534--1537, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. J. Dongarra et al. An Extended Set of FORTRAN Basic Linear Algebra Subprograms. ACM Trans. Math. Softw., 14(1):1--17, 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. L. Eldén. Matrix Methods in Data Mining and Pattern Recognition. SIAM, 2007.Google ScholarGoogle Scholar
  17. T. Elgamal et al. SPOOF: Sum-Product Optimization and Operator Fusion for Large-Scale ML. In CIDR, 2017.Google ScholarGoogle Scholar
  18. A. Elgohary et al. Compressed Linear Algebra for Large-scale Machine Learning. In VLDB, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. X. Feng, A. Kumar, B. Recht, and C. Ré. Towards a Unified Architecture for in-RDBMS Analytics. In SIGMOD, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. J. Hellerstein et al. The MADlib Analytics Library or MAD Skills, the SQL. In VLDB, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, New York, NY, USA, 2nd edition, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. T. C. Hu and M. T. Shing. Computation of Matrix Chain Products. Part I. SIAM J. Comput., 11(2):362--373, 1982.Google ScholarGoogle Scholar
  23. T. Kraska et al. MLbase: A Distributed Machine-learning System. In CIDR, 2013.Google ScholarGoogle Scholar
  24. A. Kumar et al. Demonstration of Santoku: Optimizing Machine Learning over Normalized Data. In VLDB, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. A. Kumar et al. Learning Generalized Linear Models Over Normalized Data. In SIGMOD, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. A. Kumar et al. Model Selection Management Systems: The Next Frontier of Advanced Analytics. ACM SIGMOD Rec., Dec. 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. A. Kumar et al. To Join or Not to Join? Thinking Twice about Joins before Feature Selection. In SIGMOD, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. A. Kunft et al. Bridging the Gap: Towards Optimization Across Linear and Relational Algebra. In SIGMOD Beyond M R Workshop, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. C. L. Lawson et al. Basic Linear Algebra Subprograms for Fortran Usage. ACM Trans. Math. Softw., 5(3):308--323, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. M. Nikolic et al. LINVIEW: incremental view maintenance for complex analytical queries. In SIGMOD, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. D. Olteanu and M. Schleich. F: Regression Models over Factorized Views. PVLDB, 9(13):1573--1576, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. R. Ramakrishnan and J. Gehrke. Database Management Systems. McGraw-Hill, Inc., 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. S. Rendle. Scaling Factorization Machines to Relational Data. In VLDB, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. M. Schleich et al. Learning Linear Regression Models over Factorized Joins. In SIGMOD, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. T. K. Sellis. Multiple-Query Optimization. ACM TODS, 13(1):23--52, Mar. 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. W. P. Yan and P.-Å. Larson. Eager Aggregation and Lazy Aggregation. In VLDB, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Y. Zhang et al. I/O-Efficient Statistical Computing with RIOT. In ICDE, 2010.Google ScholarGoogle Scholar

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 10, Issue 11
    August 2017
    432 pages
    ISSN:2150-8097
    Issue’s Table of Contents

    Publisher

    VLDB Endowment

    Publication History

    • Published: 1 August 2017
    Published in pvldb Volume 10, Issue 11

    Qualifiers

    • research-article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader