research-article

Data Civilizer 2.0: a holistic framework for data preparation and analytics

Authors:
El Kindi Rezig (MIT CSAIL)
Lei Cao (MIT CSAIL)
Michael Stonebraker (MIT CSAIL)
Giovanni Simonini (MIT CSAIL)
Wenbo Tao (MIT CSAIL)
Samuel Madden (MIT CSAIL)
Mourad Ouzzani (HBKU)
Nan Tang (HBKU)
Ahmed K. Elmagarmid (HBKU)

Proceedings of the VLDB Endowment, Volume 12, Issue 12, pp. 1954–1957. https://doi.org/10.14778/3352063.3352108

Published: 01 August 2019

Abstract

Data scientists spend over 80% of their time (1) parameter-tuning machine learning models and (2) iterating between data cleaning and machine learning model execution. While there are existing efforts to support the first requirement, there is currently no integrated workflow system that couples data cleaning and machine learning development. The previous version of Data Civilizer was geared towards data cleaning and discovery using a set of pre-defined tools. In this paper, we introduce Data Civilizer 2.0, an end-to-end workflow system satisfying both requirements. The system also supports a sophisticated data debugger and a workflow visualization system. In this demo, we will show how we used Data Civilizer 2.0 to help scientists at the Massachusetts General Hospital build their cleaning and machine learning pipeline on their 30 TB brain activity dataset.


Index Terms

1. Computing methodologies
  1. Machine learning



Published in

Proceedings of the VLDB Endowment, Volume 12, Issue 12
August 2019
547 pages
ISSN: 2150-8097
Editors: Lei Chen, Fatma Özcan
Publisher: VLDB Endowment
