DOI: 10.1145/2882903.2912574
research-article

Data Cleaning: Overview and Emerging Challenges

Published: 26 June 2016

ABSTRACT

Detecting and repairing dirty data is one of the perennial challenges in data analytics, and failure to do so can result in inaccurate analytics and unreliable decisions. Over the past few years, there has been a surge of interest from both industry and academia in data cleaning problems, including new abstractions, interfaces, approaches for scalability, and statistical techniques. To better understand the new advances in the field, we will first present a taxonomy of the data cleaning literature in which we highlight the recent interest in techniques that use constraints, rules, or patterns to detect errors, which we call qualitative data cleaning. We will describe the state-of-the-art techniques and also highlight their limitations with a series of illustrative examples. While such approaches have traditionally been distinct from quantitative approaches such as outlier detection, we also discuss recent work that casts them into a statistical estimation framework, including using machine learning to improve the efficiency and accuracy of data cleaning and considering the effects of data cleaning on statistical analysis.


          • Published in

            SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data
            June 2016
            2300 pages
            ISBN:9781450335317
            DOI:10.1145/2882903

            Copyright © 2016 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States


            Acceptance Rates

            Overall Acceptance Rate: 785 of 4,003 submissions, 20%
