skip to main content
research-article

HoloClean: holistic data repairs with probabilistic inference

Published:01 August 2017Publication History
Skip Abstract Section

Abstract

We introduce HoloClean, a framework for holistic data repairing driven by probabilistic inference. HoloClean unifies qualitative data repairing, which relies on integrity constraints or external data sources, with quantitative data repairing methods, which leverage statistical properties of the input data. Given an inconsistent dataset as input, HoloClean automatically generates a probabilistic program that performs data repairing. Inspired by recent theoretical advances in probabilistic inference, we introduce a series of optimizations which ensure that inference over HoloClean's probabilistic model scales to instances with millions of tuples. We show that HoloClean finds data repairs with an average precision of ∼ 90% and an average recall of above ∼ 76% across a diverse array of datasets exhibiting different types of errors. This yields an average F1 improvement of more than 2× against state-of-the-art methods.

References

  1. https://data.cityofchicago.org.Google ScholarGoogle Scholar
  2. https://alchemy.cs.washington.edu/.Google ScholarGoogle Scholar
  3. https://data.medicare.gov/data/physician-compare.Google ScholarGoogle Scholar
  4. Z. Abedjan, X. Chu, D. Deng, R. C. Fernandez, I. F. Ilyas, M. Ouzzani, P. Papotti, M. Stonebraker, and N. Tang. Detecting data errors: Where are we and what needs to be done? Proc. VLDB Endow., 9(12):993--1004, Aug. 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. F. N. Afrati and P. G. Kolaitis. Repair checking in inconsistent databases: algorithms and complexity. In ICDT, pages 31--41, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. S. H. Bach, M. Broecheler, B. Huang, and L. Getoor. Hinge-loss markov random fields and probabilistic soft logic. CoRR, abs/1505.04406, 2015.Google ScholarGoogle Scholar
  7. L. E. Bertossi, S. Kolahi, and L. V. S. Lakshmanan. Data cleaning and query answering with matching dependencies and matching functions. In ICDT, pages 268--279, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. G. Beskales, I. F. Ilyas, and L. Golab. Sampling the repairs of functional dependency violations under hard constraints. PVLDB, 3(1--2):197--207, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. G. Beskales, I. F. Ilyas, L. Golab, and A. Galiullin. On the relative trust between inconsistent data and inaccurate constraints. In ICDE, pages 541--552, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. P. Bohannon, W. Fan, M. Flaster, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD, pages 143--154. ACM, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In ICDE, pages 746--755, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  12. V. Chandrasekaran and M. I. Jordan. Computational and statistical tradeoffs via convex relaxation. PNAS, 110(13):1181--1190, 2013.Google ScholarGoogle ScholarCross RefCross Ref
  13. J. Chomicki and J. Marcinkowski. Minimal-change integrity maintenance using tuple deletions. Information and Computation, 197(1):90--121, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. X. Chu, I. F. Ilyas, and P. Papotti. Discovering denial constraints. PVLDB, 6(13):1498--1509, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458--469, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye. KATARA: A data cleaning system powered by knowledge bases and crowdsourcing. In SIGMOD, pages 1247--1261, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: Consistency and accuracy. In PVLDB, pages 315--326. VLDB Endowment, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. M. Dallachiesa, A. Ebaid, A. Eldawy, A. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. Nadeef: a commodity data cleaning system. In SIGMOD, pages 541--552, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. K. Das and J. Schneider. Detecting anomalous records In categorical datasets. KDD, pages 220--229, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. S. Ermon, C. P. Gomes, and B. Selman. Uniform solution sampling using a constraint solver as an oracle. In UAI, pages 255--264, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. R. Fagin, B. Kimelfeld, and P. G. Kolaitis. Dichotomies in the complexity of preferred repairs. PODS, pages 3--15, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. W. Fan. Dependencies revisited for improving data quality. In SIGMOD, pages 159--170, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. W. Fan and F. Geerts. Foundations of Data Quality Management. Synthesis Lectures on Data Management. 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. W. Fan, X. Jia, J. Li, and S. Ma. Reasoning about record matching rules. PVLDB, 2(1):407--418, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. PVLDB, 3(1--2):173--184, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. F. Geerts, G. Mecca, P. Papotti, and D. Santoro. The llunatic data-cleaning framework. PVLDB, 6(9):625--636, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. L. Getoor and A. Machanavajjhala. Entity resolution: theory, practice & open challenges. PVLDB, 5(12):2018--2019, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. T. P. Hayes and A. Sinclair. A general lower bound for mixing of single-site dynamics on graphs. The Annals of Applied Probability, 17(3):931--952, 2007.Google ScholarGoogle ScholarCross RefCross Ref
  29. J. M. Hellerstein. Quantitative data cleaning for large databases. United Nations Economic Commission for Europe (UNECE), 2008.Google ScholarGoogle Scholar
  30. I. F. Ilyas. Effective data cleaning with continuous evaluation. IEEE Data Eng. Bull., 39:38--46, 2016.Google ScholarGoogle Scholar
  31. I. F. Ilyas and X. Chu. Trends in cleaning relational data: Consistency and deduplication. Foundations and Trends in Databases, 5(4):281--393, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. M. Interlandi and N. Tang. Proof positive and negative in data cleaning. In ICDE, 2015.Google ScholarGoogle ScholarCross RefCross Ref
  33. S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. Wrangler: Interactive visual specification of data transformation scripts. In CHI, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. S. Kolahi and L. V. S. Lakshmanan. On approximating optimum repairs for functional dependency violations. In ICDT, pages 53--62, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. N. Koudas, A. Saha, D. Srivastava, and S. Venkatasubramanian. Metric functional dependencies. In ICDE, pages 1275--1278, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In SIGMOD, pages 802--803, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. X. Li, X. L. Dong, K. Lyons, W. Meng, and D. Srivastava. Truth finding on the deep web: Is the problem solved? PVLDB, pages 97--108, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. C. Mayfield, J. Neville, and S. Prabhakar. Eracer: A database approach for statistical inference and data cleaning. SIGMOD, pages 75--86, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. F. Naumann and M. Herschel. An Introduction to Duplicate Detection. Synthesis Lectures on Data Management. 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. F. Niu, C. Ré, A. Doan, and J. Shavlik. Tuffy: Scaling up statistical inference in markov logic networks using an rdbms. Proc. VLDB Endow., 4(6):373--384, Mar. 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3--13, 2000.Google ScholarGoogle Scholar
  43. A. J. Ratner, C. D. Sa, S. Wu, D. Selsam, and C. Ré. Data programming: Creating large training sets, quickly. In NIPS, pages 3567--3575, 2016.Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. T. Rekatsinas, M. Joglekar, H. Garcia-Molina, A. G. Parameswaran, and C. Ré. SLiMFast: Guaranteed results for data fusion and source reliability. In SIGMOD, 2017. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. C. D. Sa, C. Zhang, K. Olukotun, and C. Ré. Rapidly mixing gibbs sampling for a class of factor graphs using hierarchy width. In NIPS, pages 3097--3105, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. J. Shin, S. Wu, F. Wang, C. De Sa, C. Zhang, and C. Ré. Incremental knowledge base construction using DeepDive. Proc. VLDB Endow., 8(11):1310--1321, July 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. M. Stonebraker, D. Bruckner, I. F. Ilyas, G. Beskales, M. Cherniack, S. B. Zdonik, A. Pagan, and S. Xu. Data curation at scale: The data tamer system. In CIDR, 2013.Google ScholarGoogle Scholar
  48. J. Wang and N. Tang. Towards dependable data repairing with fixing rules. In SIGMOD, pages 457--468. ACM, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. M. Yakout, L. Berti-Équille, and A. K. Elmagarmid. Don't be scared: use scalable automatic repairing with maximal likelihood and bounded changes. In SIGMOD, pages 553--564, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. C. Zhang and C. Ré. Dimmwitted: A study of main-memory statistical analytics. PVLDB, pages 1283--1294, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. HoloClean: holistic data repairs with probabilistic inference
    Index terms have been assigned to the content through auto-classification.

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image Proceedings of the VLDB Endowment
      Proceedings of the VLDB Endowment  Volume 10, Issue 11
      August 2017
      432 pages
      ISSN:2150-8097
      Issue’s Table of Contents

      Publisher

      VLDB Endowment

      Publication History

      • Published: 1 August 2017
      Published in pvldb Volume 10, Issue 11

      Qualifiers

      • research-article

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader