research-article

Detecting data errors: where are we and what needs to be done?

Published: 01 August 2016

Abstract

Data cleaning has played a critical role in ensuring data quality for enterprise applications. Naturally, there has been extensive research in this area, and many data cleaning algorithms have been translated into tools to detect and to possibly repair certain classes of errors such as outliers, duplicates, missing values, and violations of integrity constraints. Since different types of errors may coexist in the same data set, we often need to run more than one kind of tool. In this paper, we investigate two pragmatic questions: (1) are these tools robust enough to capture most errors in real-world data sets? and (2) what is the best strategy to holistically run multiple tools to optimize the detection effort? To answer these two questions, we obtained multiple data cleaning tools that utilize a variety of error detection techniques. We also collected five real-world data sets, for which we could obtain both the raw data and the ground truth on existing errors. In this paper, we report our experimental findings on the errors detected by the tools we tested. First, we show that the coverage of each tool is well below 100%. Second, we show that the order in which multiple tools are run makes a big difference. Hence, we propose a holistic multi-tool strategy that orders the invocations of the available tools to maximize their benefit, while minimizing human effort in verifying results. Third, since this holistic approach still does not lead to acceptable error coverage, we discuss two simple strategies that have the potential to improve the situation, namely domain-specific tools and data enrichment. We close this paper by reasoning about the errors that are not detectable by any of the tools we tested.



Published in

Proceedings of the VLDB Endowment, Volume 9, Issue 12
August 2016
345 pages
ISSN: 2150-8097

Publisher

VLDB Endowment
