research-article

Open Access

Vertica-ML: Distributed Machine Learning in Vertica Database

Authors:
Arash Fard, Anh Le, George Larionov, Waqas Dhillon, Chuck Bear

Vertica, Cambridge, MA, USA

SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, June 2020, Pages 755–768. https://doi.org/10.1145/3318464.3386137

Published: 31 May 2020

ABSTRACT

A growing number of companies rely on machine learning as a key element for gaining a competitive edge from their collected Big Data. An in-database machine learning system can provide many advantages in this scenario, e.g., eliminating the overhead of data transfer, avoiding the maintenance costs of a separate analytical system, and addressing data security and provenance concerns. In this paper, we present our distributed machine learning subsystem within the Vertica database. This subsystem, Vertica-ML, provides machine learning functionality through a SQL API that covers a complete data science workflow as well as model management. We treat machine learning models in Vertica as first-class database objects, like tables and views; they therefore benefit from similar mechanisms for archiving and management. We explain the architecture of the subsystem and present a set of experiments that evaluate the performance of the machine learning algorithms implemented on top of it.

Index Terms

Vertica-ML: Distributed Machine Learning in Vertica Database
1. Computing methodologies
  1. Distributed computing methodologies
    1. Distributed algorithms
  2. Machine learning
2. Information systems
  1. Information systems applications
    1. Data mining
      1. Clustering
    2. Decision support systems
      1. Data analytics
      2. Data warehouses

Published in
SIGMOD '20: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
June 2020
2925 pages
ISBN: 9781450367356
DOI: 10.1145/3318464
General Chairs: David Maier (Portland State University, USA), Rachel Pottinger (University of British Columbia, Canada)
Program Chairs: AnHai Doan (University of Wisconsin, USA), Wang-Chiew Tan (Megagon Labs, USA)
Publications Chairs: Abdussalam Alawini (University of Illinois at Urbana-Champaign, USA), Hung Q. Ngo (RelationalAI, USA)
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
big data
database
distributed computing
machine learning
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%