ABSTRACT
A growing number of companies rely on machine learning to gain a competitive edge from the Big Data they collect. An in-database machine learning system offers several advantages in this scenario: it eliminates the overhead of data transfer, avoids the maintenance cost of a separate analytical system, and addresses data security and provenance concerns. In this paper, we present our distributed machine learning subsystem within the Vertica database. This subsystem, Vertica-ML, provides machine learning functions through a SQL API that covers a complete data science workflow as well as model management. We treat machine learning models in Vertica as first-class database objects, like tables and views, so they are archived and managed through similar mechanisms. We explain the architecture of the subsystem and present a set of experiments that evaluate the performance of the machine learning algorithms implemented on top of it.
Index Terms
- Vertica-ML: Distributed Machine Learning in Vertica Database