DOI: 10.1145/3318464.3386137
research-article
Open Access

Vertica-ML: Distributed Machine Learning in Vertica Database

Published: 31 May 2020

ABSTRACT

A growing number of companies rely on machine learning as a key element for gaining a competitive edge from their collected Big Data. An in-database machine learning system can provide many advantages in this scenario, e.g., eliminating the overhead of data transfer, avoiding the maintenance costs of a separate analytical system, and addressing data security and provenance concerns. In this paper, we present our distributed machine learning subsystem within the Vertica database. This subsystem, Vertica-ML, includes machine learning functionality with a SQL API, covering a complete data science workflow as well as model management. We treat machine learning models in Vertica as first-class database objects, like tables and views; therefore, they benefit from similar mechanisms for archiving and management. We explain the architecture of the subsystem and present a set of experiments that evaluate the performance of the machine learning algorithms implemented on top of it.

Supplemental Material

3318464.3386137.mp4 (MP4, 117 MB)

Published in

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
June 2020
2925 pages
ISBN: 9781450367356
DOI: 10.1145/3318464

Copyright © 2020 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 785 of 4,003 submissions, 20%
