research-article

Detecting data errors: where are we and what needs to be done?

Authors:
Ziawasch Abedjan (MIT CSAIL)
Xu Chu (University of Waterloo)
Dong Deng (Tsinghua University)
Raul Castro Fernandez (MIT CSAIL)
Ihab F. Ilyas (University of Waterloo)
Mourad Ouzzani (Qatar Computing Research Institute, HBKU)
Paolo Papotti (Arizona State University)
Michael Stonebraker (MIT CSAIL)
Nan Tang (Qatar Computing Research Institute, HBKU)

Proceedings of the VLDB Endowment, Volume 9, Issue 12, pp 993–1004
https://doi.org/10.14778/2994509.2994518

Published: 01 August 2016

Abstract

Data cleaning has played a critical role in ensuring data quality for enterprise applications. Naturally, there has been extensive research in this area, and many data cleaning algorithms have been translated into tools to detect and to possibly repair certain classes of errors such as outliers, duplicates, missing values, and violations of integrity constraints. Since different types of errors may coexist in the same data set, we often need to run more than one kind of tool. In this paper, we investigate two pragmatic questions: (1) are these tools robust enough to capture most errors in real-world data sets? and (2) what is the best strategy to holistically run multiple tools to optimize the detection effort? To answer these two questions, we obtained multiple data cleaning tools that utilize a variety of error detection techniques. We also collected five real-world data sets, for which we could obtain both the raw data and the ground truth on existing errors. In this paper, we report our experimental findings on the errors detected by the tools we tested. First, we show that the coverage of each tool is well below 100%. Second, we show that the order in which multiple tools are run makes a big difference. Hence, we propose a holistic multi-tool strategy that orders the invocations of the available tools to maximize their benefit, while minimizing human effort in verifying results. Third, since this holistic approach still does not lead to acceptable error coverage, we discuss two simple strategies that have the potential to improve the situation, namely domain specific tools and data enrichment. We close this paper by reasoning about the errors that are not detectable by any of the tools we tested.
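
The two quantities the abstract turns on, per-tool coverage and the effect of run order on human verification effort, can be illustrated with a small sketch. This is not the paper's algorithm: the tool names, the toy data, and the "most precise tool first" ordering heuristic are all assumptions made for illustration. In practice the ground truth is unknown, so precision would have to be estimated from a human-labelled sample.

```python
from typing import Dict, List, Set, Tuple

# A detected error is identified by its cell: (row id, column name).
Cell = Tuple[int, str]


def coverage(detected: Set[Cell], truth: Set[Cell]) -> float:
    """Fraction of ground-truth errors that appear in `detected` (recall)."""
    return len(detected & truth) / len(truth) if truth else 1.0


def precision(detected: Set[Cell], truth: Set[Cell]) -> float:
    """Fraction of detected cells that are real errors."""
    return len(detected & truth) / len(detected) if detected else 0.0


def order_tools(outputs: Dict[str, Set[Cell]], truth: Set[Cell]) -> List[str]:
    """Assumed heuristic (not from the paper): run the most precise tools first."""
    return sorted(outputs, key=lambda t: precision(outputs[t], truth), reverse=True)


def review_effort(order: List[str], outputs: Dict[str, Set[Cell]]) -> List[Tuple[str, int]]:
    """Cells a human must newly verify at each step, skipping cells already seen."""
    seen: Set[Cell] = set()
    effort = []
    for tool in order:
        new_cells = outputs[tool] - seen
        effort.append((tool, len(new_cells)))
        seen |= new_cells
    return effort


# Hypothetical ground truth and tool outputs for a toy data set.
truth: Set[Cell] = {(1, "zip"), (2, "city"), (3, "name"), (4, "phone")}
outputs: Dict[str, Set[Cell]] = {
    "outlier_tool": {(1, "zip"), (5, "age")},
    "rule_tool": {(1, "zip"), (2, "city")},
    "dedup_tool": {(3, "name"), (6, "name"), (7, "name")},
}

order = order_tools(outputs, truth)
effort = review_effort(order, outputs)
union: Set[Cell] = set().union(*outputs.values())

for tool in order:
    print(tool, "coverage:", coverage(outputs[tool], truth))
print("run order:", order)
print("new cells to verify per step:", effort)
print("combined coverage:", coverage(union, truth))
```

In this toy setting no single tool finds every error, and even the union of all three misses one, which mirrors the abstract's observations in miniature. Running precise tools first means the early verification steps contain fewer false positives for a reviewer to reject.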

Published in
Proceedings of the VLDB Endowment, Volume 9, Issue 12
August 2016
345 pages
ISSN: 2150-8097
Editors: Surajit Chaudhuri (Microsoft Research), Jayant Haritsa (I.I.Sc. Bangalore)
Publisher: VLDB Endowment
