ABSTRACT
The wide use of machine learning is fundamentally changing the software development paradigm (a.k.a. Software 2.0), in which data becomes a first-class citizen, on par with code. As machine learning is used in sensitive applications, it becomes imperative that the trained model is accurate, fair, and robust to attacks. While many techniques have been proposed to improve the model training process (in-processing) or the trained model itself (post-processing), we argue that the most effective method is to address the root cause of error by cleaning the data the model is trained on (pre-processing). Historically, at least three research communities have studied this problem separately: data management, machine learning (model fairness), and security. Although each community has done a significant amount of research, ultimately the same datasets must be preprocessed, and there is little understanding of how the techniques relate to each other or how they might be integrated. We contend that it is time to extend the notion of data cleaning to meet the needs of modern machine learning. We identify dependencies among the data preprocessing techniques and propose MLClean, a unified data cleaning framework that integrates the techniques and helps train accurate and fair models. This work is part of a broader trend toward integrating big data and artificial intelligence (AI).