research-article

Data Cleaning: Overview and Emerging Challenges

Authors:
Xu Chu

University of Waterloo, Waterloo, ON, Canada

University of Waterloo, Waterloo, ON, Canada
View Profile

,
Ihab F. Ilyas

University of Waterloo, Waterloo, ON, Canada

University of Waterloo, Waterloo, ON, Canada
View Profile

,
Sanjay Krishnan

UC Berkeley, Berkeley, CA, USA

UC Berkeley, Berkeley, CA, USA
View Profile

,
Jiannan Wang

Simon Fraser University, Burnaby, BC, Canada

Simon Fraser University, Burnaby, BC, Canada
View Profile

SIGMOD '16: Proceedings of the 2016 International Conference on Management of DataJune 2016Pages 2201–2206https://doi.org/10.1145/2882903.2912574

Published:26 June 2016Publication History

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Pages 2201–2206

ABSTRACT

Detecting and repairing dirty data is one of the perennial challenges in data analytics, and failure to do so can result in inaccurate analytics and unreliable decisions. Over the past few years, there has been a surge of interest from both industry and academia on data cleaning problems including new abstractions, interfaces, approaches for scalability, and statistical techniques. To better understand the new advances in the field, we will first present a taxonomy of the data cleaning literature in which we highlight the recent interest in techniques that use constraints, rules, or patterns to detect errors, which we call qualitative data cleaning. We will describe the state-of-the-art techniques and also highlight their limitations with a series of illustrative examples. While traditionally such approaches are distinct from quantitative approaches such as outlier detection, we also discuss recent work that casts such approaches into a statistical estimation framework including: using Machine Learning to improve the efficiency and accuracy of data cleaning and considering the effects of data cleaning on statistical analysis.

References

Trifacta. http://www.trifacta.com.Google Scholar
C. C. Aggarwal. Outlier Analysis. Springer, 2013. Google ScholarCross Ref
Y. Altowim, D. V. Kalashnikov, and S. Mehrotra. Progressive approach to relational entity resolution. PVLDB, 7(11), 2014. Google ScholarDigital Library
H. Altwaijry, S. Mehrotra, and D. V. Kalashnikov. Query: A framework for integrating entity resolution with query processing. PVLDB, 9(3):120--131, 2015. Google ScholarDigital Library
R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In PVLDB, pages 586--597, 2002. Google ScholarDigital Library
M. Balazinska, A. Deshpande, M. J. Franklin, P. B. Gibbons, J. Gray, M. H. Hansen, M. Liebhold, S. Nath, A. S. Szalay, and V. Tao. Data management in the worldwide sensor web. IEEE Pervasive Computing, 6(2):30--40, 2007. Google ScholarDigital Library
M. Bergman, T. Milo, S. Novgorodov, and W. C. Tan. Query-oriented data cleaning with oracles. In SIGMOD, 2015. Google ScholarDigital Library
L. Berti-Equille, T. Dasu, and D. Srivastava. Discovery of complex glitch patterns: A novel approach to quantitative data cleaning. In ICDE, pages 733--744, 2011. Google ScholarDigital Library
L. E. Bertossi. Consistent query answering in databases. SIGMOD Record, 35(2):68--76, 2006. Google ScholarDigital Library
G. Beskales, I. F. Ilyas, and L. Golab. Sampling the repairs of functional dependency violations under hard constraints. PVLDB, 3(1--2):197--207, 2010. Google ScholarDigital Library
G. Beskales, I. F. Ilyas, L. Golab, and A. Galiullin. On the relative trust between inconsistent data and inaccurate constraints. In ICDE, pages 541--552, 2013. Google ScholarDigital Library
G. Beskales, M. A. Soliman, I. F. Ilyas, and S. Ben-David. Modeling and querying possible repairs in duplicate detection. PVLDB, pages 598--609, 2009. Google ScholarDigital Library
P. Bohannon, W. Fan, M. Flaster, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD, pages 143--154. ACM, 2005. Google ScholarDigital Library
P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In ICDE, pages 746--755, 2007.Google ScholarCross Ref
L. Cao, D. Yang, Q. Wang, Y. Yu, J. Wang, and E. A. Rundensteiner. Scalable distance-based outlier detection over high-volume data streams. In ICDE, pages 76--87, 2014.Google ScholarCross Ref
A. Chalamalla, I. F. Ilyas, M. Ouzzani, and P. Papotti. Descriptive and prescriptive data cleaning. In SIGMOD, pages 445--456, 2014. Google ScholarDigital Library
S. Chawla and P. Sun. Outlier detection: Principles, techniques and applications. In PAKDD, 2006.Google Scholar
Z. Chen and M. Cafarella. Integrating spreadsheet data via accurate and low-effort extraction. In KDD. ACM, 2014. Google ScholarDigital Library
F. Chiang and R. J. Miller. A unified model for data and constraint repair. In ICDE, pages 446--457, 2011. Google ScholarDigital Library
X. Chu, I. F. Ilyas, and P. Koutris. Distributed Data Deduplication. Technical Report CS-2016-02, University of Waterloo, 2016.Google Scholar
X. Chu, I. F. Ilyas, and P. Papotti. Discovering denial constraints. PVLDB, 6(13):1498--1509, 2013. Google ScholarDigital Library
X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458--469, 2013. Google ScholarDigital Library
X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye. KATARA: A data cleaning system powered by knowledge bases and crowdsourcing. In SIGMOD, pages 1247--1261, 2015. Google ScholarDigital Library
Y. Chung, M. L. Mortensen, C. Binnig, and T. Kraska. Estimating the impact of unknown unknowns on aggregate query results. CoRR, abs/1507.05591, 2015.Google Scholar
G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: Consistency and accuracy. In PVLDB, pages 315--326. VLDB Endowment, 2007. Google ScholarDigital Library
M. Dallachiesa, A. Ebaid, A. Eldawy, A. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. Nadeef: a commodity data cleaning system. In SIGMOD, pages 541--552, 2013. Google ScholarDigital Library
A. Deligiannakis, Y. Kotidis, V. Vassalos, V. Stoumpos, and A. Delis. Another outlier bites the dust: Computing meaningful aggregates in sensor networks. In ICDE, pages 988--999, 2009. Google ScholarDigital Library
A. Deshpande, C. Guestrin, S. Madden, J. M. Hellerstein, and W. Hong. Model-driven data acquisition in sensor networks. In PVLDB, pages 588--599, 2004. Google ScholarDigital Library
C. Dwork, V. Feldman, M. Hardt, T. Pitassi, O. Reingold, and A. L. Roth. Preserving statistical validity in adaptive data analysis. In STOC, pages 117--126, 2015. Google ScholarDigital Library
W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. PVLDB, 3(1--2):173--184, 2010. Google ScholarDigital Library
W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Interaction between record matching and data repairing. In SIGMOD, pages 469--480. ACM, 2011. Google ScholarDigital Library
Gartner. Forecast: The internet of things, worldwide. https://www.gartner.com/doc/2625419/forecast-internet-things-worldwide-.Google Scholar
F. Geerts, G. Mecca, P. Papotti, and D. Santoro. The llunatic data-cleaning framework. PVLDB, 6(9):625--636, 2013. Google ScholarDigital Library
D. Georgiadis, M. Kontaki, A. Gounaris, A. N. Papadopoulos, K. Tsichlas, and Y. Manolopoulos. Continuous outlier detection in data streams: an extensible framework and state-of-the-art algorithms. In SIGMOD, pages 1061--1064, 2013. Google ScholarDigital Library
C. Gokhale, S. Das, A. Doan, J. F. Naughton, N. Rampalli, J. Shavlik, and X. Zhu. Corleone: Hands-off crowdsourcing for entity matching. In SIGMOD, 2014. Google ScholarDigital Library
L. Golab, H. Karloff, F. Korn, D. Srivastava, and B. Yu. On generating near-optimal tableaux for conditional functional dependencies. PVLDB, 1(1):376--390, 2008. Google ScholarDigital Library
D. Haas, S. Krishnan, J. Wang, M. J. Franklin, and E. Wu. Wisteria: Nurturing scalable data cleaning infrastructure. PVLDB, 8(12), 2015. Google ScholarDigital Library
D. Haas, J. Wang, E. Wu, and M. J. Franklin. Clamshell: Speeding up crowds for low-latency data labeling. PVLDB, 9(4):372--383, Dec. 2015. Google ScholarDigital Library
A. Heise, G. Kasneci, and F. Naumann. Estimating the number and sizes of fuzzy-duplicate clusters. In CIKM Conference, 2014. Google ScholarDigital Library
J. M. Hellerstein. Quantitative data cleaning for large databases. United Nations Economic Commission for Europe (UNECE), 2008.Google Scholar
I. F. Ilyas and X. Chu. Trends in cleaning relational data: Consistency and deduplication. Foundations and Trends in Databases, 5(4):281--393, 2015. Google ScholarDigital Library
S. R. Jeffery, G. Alonso, M. J. Franklin, W. Hong, and J. Widom. A pipelined framework for online cleaning of sensor data streams. In ICDE, 2006. Google ScholarDigital Library
S. R. Jeffery, M. N. Garofalakis, and M. J. Franklin. Adaptive cleaning for RFID data streams. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12--15, 2006, pages 163--174, 2006. Google ScholarDigital Library
T. Johnson and T. Dasu. Data quality and data cleaning: An overview. In SIGMOD, page 681, 2003. Google ScholarDigital Library
Z. Khayyat, I. F. Ilyas, A. Jindal, S. Madden, M. Ouzzani, P. Papotti, J. Quiané-Ruiz, N. Tang, and S. Yin. Bigdansing: A system for big data cleansing. pages 1215--1230, 2015. Google ScholarDigital Library
S. Kolahi and L. V. S. Lakshmanan. On approximating optimum repairs for functional dependency violations. In ICDT, pages 53--62, 2009. Google ScholarDigital Library
L. Kolb, A. Thor, and E. Rahm. Dedoop: efficient deduplication with hadoop. PVLDB, 5(12):1878--1881, 2012. Google ScholarDigital Library
H.-P. Kriegel, P. Kröger, and A. Zimek. Outlier detection techniques. In Tutorial at SIGKDD, 2010.Google Scholar
S. Krishnan, J. Patel, M. J. Franklin, and K. Goldberg. A methodology for learning, analyzing, and mitigating social influence bias in recommender systems. In RecSys, 2014. Google ScholarDigital Library
S. Krishnan, J. Wang, M. J. Franklin, K. Goldberg, and T. Kraska. Stale view cleaning: Getting fresh answers from stale materialized views. PVLDB, 8(12), 2015. Google ScholarDigital Library
S. Krishnan, J. Wang, M. J. Franklin, K. Goldberg, T. Kraska, T. Milo, and E. Wu. Sampleclean: Fast and reliable analytics on dirty data. IEEE Data Eng. Bull., 38(3):59--75, 2015.Google Scholar
S. Krishnan, J. Wang, E. Wu, M. J. Franklin, and K. Goldberg. Activeclean: Interactive data cleaning while learning convex loss models. In Arxiv: http://arxiv.org/pdf/1601.03797.pdf, 2015.Google Scholar
Z. Li, S. Shang, Q. Xie, and X. Zhang. Cost reduction for web-based data imputation. In Database Systems for Advanced Applications, pages 438--452. Springer, 2014.Google ScholarCross Ref
S. Madden. Database abstractions for managing sensor network data. Proceedings of the IEEE, 98(11):1879--1886, 2010.Google ScholarCross Ref
J. Mahler, S. Krishnan, M. Laskey, S. Sen, A. Murali, B. Kehoe, S. Patil, J. Wang, M. Franklin, P. Abbeel, and K. Y. Goldberg. Learning accurate kinematic control of cable-driven surgical robots using data cleaning and gaussian process regression. In CASE, 2014.Google Scholar
A. Marcus and A. Parameswaran. Crowdsourced data management: Industry and academic perspectives. Foundations and Trends in Databases, 6(1--2):1--161, 2013. Google ScholarDigital Library
C. Mayfield, J. Neville, and S. Prabhakar. ERACER: a database approach for statistical inference and data cleaning. In SIGMOD, 2010. Google ScholarDigital Library
A. Meliou, W. Gatterbauer, S. Nath, and D. Suciu. Tracing data errors with view-conditioned causality. In SIGMOD, pages 505--516, 2011. Google ScholarDigital Library
B. Mozafari, P. Sarkar, M. J. Franklin, M. I. Jordan, and S. Madden. Scaling up crowd-sourcing to very large datasets: A case for active learning. PVLDB, 8(2), 2014. Google ScholarDigital Library
A. Parameswaran, H. Garcia-Molina, H. Park, N. Polyzotis, A. Ramesh, and J. Widom. Crowdscreen: Algorithms for filtering data with humans.Google Scholar
A. Parameswaran and N. Polyzotis. Answering queries using humans, algorithms and databases. 2011.Google Scholar
H. Park and J. Widom. Crowdfill: collecting structured data from the crowd. In SIGMOD, 2014. Google ScholarDigital Library
E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4), 2000.Google Scholar
V. Raman and J. M. Hellerstein. Potter's wheel: An interactive data cleaning system. In VLDB, pages 381--390, 2001. Google ScholarDigital Library
P. J. Rousseeuw and A. M. Leroy. Robust regression and outlier detection, volume 589. John Wiley & Sons, 2005.Google Scholar
D. Russo and J. Zou. Controlling bias in adaptive data analysis using information theory. CoRR, abs/1511.05219, 2015.Google Scholar
G. Simoes, H. Galhardas, and L. Gravano. When speed has a price: Fast information extraction using approximate algorithms. PVLDB, 6(13):1462--1473, 2013. Google ScholarDigital Library
E. H. Simpson. The interpretation of interaction in contingency tables. Journal of the Royal Statistical Society. Series B (Methodological), 1951.Google Scholar
M. Stonebraker, D. Bruckner, I. F. Ilyas, G. Beskales, M. Cherniack, S. B. Zdonik, A. Pagan, and S. Xu. Data curation at scale: The data tamer system. In CIDR, 2013.Google Scholar
S. Subramaniam, T. Palpanas, D. Papadopoulos, V. Kalogeraki, and D. Gunopulos. Online outlier detection in sensor data using non-parametric models. In PVLDB, pages 187--198, 2006. Google ScholarDigital Library
Y. Tong, C. C. Cao, C. J. Zhang, Y. Li, and L. Chen. Crowdcleaner: Data cleaning for multi-version data on the web via crowdsourcing. In ICDE, pages 1182--1185, 2014.Google ScholarCross Ref
R. Verborgh and M. De Wilde. Using OpenRefine. Packt Publishing Ltd, 2013. Google ScholarDigital Library
M. Volkovs, F. Chiang, J. Szlichta, and R. J. Miller. Continuous data cleaning. In ICDE, pages 244--255, 2014.Google ScholarCross Ref
J. Wang, T. Kraska, M. J. Franklin, and J. Feng. Crowder: Crowdsourcing entity resolution. PVLDB, 5(11), 2012. Google ScholarDigital Library
J. Wang, S. Krishnan, M. J. Franklin, K. Goldberg, T. Kraska, and T. Milo. A sample-and-clean framework for fast and accurate query processing on dirty data. In SIGMOD, 2014. Google ScholarDigital Library
J. Wang, G. Li, T. Kraska, M. J. Franklin, and J. Feng. Leveraging transitive relations for crowdsourced joins. In SIGMOD, pages 229--240, 2013. Google ScholarDigital Library
J. Wang and N. Tang. Towards dependable data repairing with fixing rules. In SIGMOD, pages 457--468. ACM, 2014. Google ScholarDigital Library
S. E. Whang, P. Lofgren, and H. Garcia-Molina. Question selection for crowd entity resolution. PVLDB, 6(6):349--360, Apr. 2013. Google ScholarDigital Library
E. Wu and S. Madden. Scorpion: Explaining away outliers in aggregate queries. PVLDB, 6(8):553--564, 2013. Google ScholarDigital Library
H. Xiao, B. Biggio, G. Brown, G. Fumera, C. Eckert, and F. Roli. Is feature selection secure against training data poisoning? In ICML, 2015.Google ScholarDigital Library
M. Yakout, L. Berti-Equille, and A. K. Elmagarmid. Don't be scared: use scalable automatic repairing with maximal likelihood and bounded changes. In SIGMOD, 2013. Google ScholarDigital Library
M. Yakout, A. K. Elmagarmid, J. Neville, M. Ouzzani, and I. F. Ilyas. Guided data repair. PVLDB, 4(5):279--289, 2011. Google ScholarDigital Library

Index Terms

Data Cleaning: Overview and Emerging Challenges
1. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory

Recommendations

An Enhanced Technique to Clean Data in the Data Warehouse
DESE '11: Proceedings of the 2011 Developments in E-systems Engineering

Data quality is a critical factor for the success of data warehousing projects. Improving the quality of data is important in data warehouse, because it is used in the process of decision support, which requires accurate data. There are many errors and ...
Read More
A Comparative Study of Data Cleaning Tools

In the information era, data is crucial in decision making. Most data sets contain impurities that need to be weeded out before any meaningful decision can be made from the data. Hence, data cleaning is essential and often takes more than 80 percent ...
Read More
An Ontology-based Methodology for Reusing Data Cleaning Knowledge
IC3K 2015: Proceedings of the International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management

The organizations' demand to integrate several heterogeneous data sources and an ever-increasing volume of data is revealing the presence of quality problems in data. Currently, most of the data cleaning approaches (for detection and correction of data ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
June 2016
2300 pages
ISBN:9781450335317
DOI:10.1145/2882903
General Chairs:
Fatma Özcan
IBM Research, USA
,
Georgia Koutrika
HP Labs, USA
,
Program Chair:
Sam Madden
Massachusetts Institute of Technology, USA
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 26 June 2016
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
data cleaning
data quality
integrity constraints
sampling
statistical cleaning
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate785of4,003submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 215
  Total Citations
  View Citations
- 7,255
  Total Downloads
- Downloads (Last 12 months)1,846
- Downloads (Last 6 weeks)295
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Data Cleaning: Overview and Emerging Challenges

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

An Enhanced Technique to Clean Data in the Data Warehouse

A Comparative Study of Data Cleaning Tools

An Ontology-based Methodology for Reusing Data Cleaning Knowledge

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Data Cleaning: Overview and Emerging Challenges

SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

An Enhanced Technique to Clean Data in the Data Warehouse

A Comparative Study of Data Cleaning Tools

An Ontology-based Methodology for Reusing Data Cleaning Knowledge

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media