short-paper

Data Cleaning for Accurate, Fair, and Robust Models: A Big Data - AI Integration Approach

Authors:
Ki Hyun Tae (KAIST), Yuji Roh (KAIST), Young Hun Oh (KAIST), Hyunsu Kim (KAIST), Steven Euijong Whang (KAIST)

DEEM'19: Proceedings of the 3rd International Workshop on Data Management for End-to-End Machine Learning, June 2019, Article No.: 5, Pages 1–4, https://doi.org/10.1145/3329486.3329493

Published: 30 June 2019

ABSTRACT

The wide use of machine learning is fundamentally changing the software development paradigm (a.k.a. Software 2.0), where data becomes a first-class citizen, on par with code. As machine learning is used in sensitive applications, it becomes imperative that the trained model is accurate, fair, and robust to attacks. While many techniques have been proposed to improve the model training process (in-processing) or the trained model itself (post-processing), we argue that the most effective method is to clean the root cause of error: the data the model is trained on (pre-processing). Historically, at least three research communities have been studying this problem separately: data management, machine learning (model fairness), and security. Although each community has done a significant amount of research, ultimately the same datasets must be preprocessed, and there is little understanding of how the techniques relate to each other and whether they can be integrated. We contend that it is time to extend the notion of data cleaning for modern machine learning needs. We identify dependencies among the data preprocessing techniques and propose MLClean, a unified data cleaning framework that integrates the techniques and helps train accurate and fair models. This work is part of a broader trend of Big Data - Artificial Intelligence (AI) integration.

Published in

DEEM'19: Proceedings of the 3rd International Workshop on Data Management for End-to-End Machine Learning
June 2019
72 pages
ISBN: 9781450367974
DOI: 10.1145/3329486

Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- short-paper
- Research
- Refereed limited
Acceptance Rates
Overall Acceptance Rate: 44 of 67 submissions, 66%