research-article

HoloClean: holistic data repairs with probabilistic inference

Authors:
Theodoros Rekatsinas

Stanford University

Stanford University
View Profile

,
Xu Chu

University of Waterloo

University of Waterloo
View Profile

,
Ihab F. Ilyas

University of Waterloo

University of Waterloo
View Profile

,
Christopher Ré

Stanford University

Stanford University
View Profile

Proceedings of the VLDB Endowment Volume 10 Issue 11pp 1190–1201https://doi.org/10.14778/3137628.3137631

Published:01 August 2017Publication History

Proceedings of the VLDB Endowment

Abstract

We introduce HoloClean, a framework for holistic data repairing driven by probabilistic inference. HoloClean unifies qualitative data repairing, which relies on integrity constraints or external data sources, with quantitative data repairing methods, which leverage statistical properties of the input data. Given an inconsistent dataset as input, HoloClean automatically generates a probabilistic program that performs data repairing. Inspired by recent theoretical advances in probabilistic inference, we introduce a series of optimizations which ensure that inference over HoloClean's probabilistic model scales to instances with millions of tuples. We show that HoloClean finds data repairs with an average precision of ∼ 90% and an average recall of above ∼ 76% across a diverse array of datasets exhibiting different types of errors. This yields an average F1 improvement of more than 2× against state-of-the-art methods.

References

https://data.cityofchicago.org.Google Scholar
https://alchemy.cs.washington.edu/.Google Scholar
https://data.medicare.gov/data/physician-compare.Google Scholar
Z. Abedjan, X. Chu, D. Deng, R. C. Fernandez, I. F. Ilyas, M. Ouzzani, P. Papotti, M. Stonebraker, and N. Tang. Detecting data errors: Where are we and what needs to be done? Proc. VLDB Endow., 9(12):993--1004, Aug. 2016. Google ScholarDigital Library
F. N. Afrati and P. G. Kolaitis. Repair checking in inconsistent databases: algorithms and complexity. In ICDT, pages 31--41, 2009. Google ScholarDigital Library
S. H. Bach, M. Broecheler, B. Huang, and L. Getoor. Hinge-loss markov random fields and probabilistic soft logic. CoRR, abs/1505.04406, 2015.Google Scholar
L. E. Bertossi, S. Kolahi, and L. V. S. Lakshmanan. Data cleaning and query answering with matching dependencies and matching functions. In ICDT, pages 268--279, 2011. Google ScholarDigital Library
G. Beskales, I. F. Ilyas, and L. Golab. Sampling the repairs of functional dependency violations under hard constraints. PVLDB, 3(1--2):197--207, 2010. Google ScholarDigital Library
G. Beskales, I. F. Ilyas, L. Golab, and A. Galiullin. On the relative trust between inconsistent data and inaccurate constraints. In ICDE, pages 541--552, 2013. Google ScholarDigital Library
P. Bohannon, W. Fan, M. Flaster, and R. Rastogi. A cost-based model and effective heuristic for repairing constraints by value modification. In SIGMOD, pages 143--154. ACM, 2005. Google ScholarDigital Library
P. Bohannon, W. Fan, F. Geerts, X. Jia, and A. Kementsietsidis. Conditional functional dependencies for data cleaning. In ICDE, pages 746--755, 2007.Google ScholarCross Ref
V. Chandrasekaran and M. I. Jordan. Computational and statistical tradeoffs via convex relaxation. PNAS, 110(13):1181--1190, 2013.Google ScholarCross Ref
J. Chomicki and J. Marcinkowski. Minimal-change integrity maintenance using tuple deletions. Information and Computation, 197(1):90--121, 2005. Google ScholarDigital Library
X. Chu, I. F. Ilyas, and P. Papotti. Discovering denial constraints. PVLDB, 6(13):1498--1509, 2013. Google ScholarDigital Library
X. Chu, I. F. Ilyas, and P. Papotti. Holistic data cleaning: Putting violations into context. In ICDE, pages 458--469, 2013. Google ScholarDigital Library
X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye. KATARA: A data cleaning system powered by knowledge bases and crowdsourcing. In SIGMOD, pages 1247--1261, 2015. Google ScholarDigital Library
G. Cong, W. Fan, F. Geerts, X. Jia, and S. Ma. Improving data quality: Consistency and accuracy. In PVLDB, pages 315--326. VLDB Endowment, 2007. Google ScholarDigital Library
M. Dallachiesa, A. Ebaid, A. Eldawy, A. Elmagarmid, I. F. Ilyas, M. Ouzzani, and N. Tang. Nadeef: a commodity data cleaning system. In SIGMOD, pages 541--552, 2013. Google ScholarDigital Library
K. Das and J. Schneider. Detecting anomalous records In categorical datasets. KDD, pages 220--229, 2007. Google ScholarDigital Library
S. Ermon, C. P. Gomes, and B. Selman. Uniform solution sampling using a constraint solver as an oracle. In UAI, pages 255--264, 2012. Google ScholarDigital Library
R. Fagin, B. Kimelfeld, and P. G. Kolaitis. Dichotomies in the complexity of preferred repairs. PODS, pages 3--15, 2015. Google ScholarDigital Library
W. Fan. Dependencies revisited for improving data quality. In SIGMOD, pages 159--170, 2008. Google ScholarDigital Library
W. Fan and F. Geerts. Foundations of Data Quality Management. Synthesis Lectures on Data Management. 2012. Google ScholarDigital Library
W. Fan, X. Jia, J. Li, and S. Ma. Reasoning about record matching rules. PVLDB, 2(1):407--418, 2009. Google ScholarDigital Library
W. Fan, J. Li, S. Ma, N. Tang, and W. Yu. Towards certain fixes with editing rules and master data. PVLDB, 3(1--2):173--184, 2010. Google ScholarDigital Library
F. Geerts, G. Mecca, P. Papotti, and D. Santoro. The llunatic data-cleaning framework. PVLDB, 6(9):625--636, 2013. Google ScholarDigital Library
L. Getoor and A. Machanavajjhala. Entity resolution: theory, practice & open challenges. PVLDB, 5(12):2018--2019, 2012. Google ScholarDigital Library
T. P. Hayes and A. Sinclair. A general lower bound for mixing of single-site dynamics on graphs. The Annals of Applied Probability, 17(3):931--952, 2007.Google ScholarCross Ref
J. M. Hellerstein. Quantitative data cleaning for large databases. United Nations Economic Commission for Europe (UNECE), 2008.Google Scholar
I. F. Ilyas. Effective data cleaning with continuous evaluation. IEEE Data Eng. Bull., 39:38--46, 2016.Google Scholar
I. F. Ilyas and X. Chu. Trends in cleaning relational data: Consistency and deduplication. Foundations and Trends in Databases, 5(4):281--393, 2015. Google ScholarDigital Library
M. Interlandi and N. Tang. Proof positive and negative in data cleaning. In ICDE, 2015.Google ScholarCross Ref
S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. Wrangler: Interactive visual specification of data transformation scripts. In CHI, 2011. Google ScholarDigital Library
S. Kolahi and L. V. S. Lakshmanan. On approximating optimum repairs for functional dependency violations. In ICDT, pages 53--62, 2009. Google ScholarDigital Library
D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques - Adaptive Computation and Machine Learning. The MIT Press, 2009. Google ScholarDigital Library
N. Koudas, A. Saha, D. Srivastava, and S. Venkatasubramanian. Metric functional dependencies. In ICDE, pages 1275--1278, 2009. Google ScholarDigital Library
N. Koudas, S. Sarawagi, and D. Srivastava. Record linkage: similarity measures and algorithms. In SIGMOD, pages 802--803, 2006. Google ScholarDigital Library
X. Li, X. L. Dong, K. Lyons, W. Meng, and D. Srivastava. Truth finding on the deep web: Is the problem solved? PVLDB, pages 97--108, 2013. Google ScholarDigital Library
C. Mayfield, J. Neville, and S. Prabhakar. Eracer: A database approach for statistical inference and data cleaning. SIGMOD, pages 75--86, 2010. Google ScholarDigital Library
F. Naumann and M. Herschel. An Introduction to Duplicate Detection. Synthesis Lectures on Data Management. 2010. Google ScholarDigital Library
F. Niu, C. Ré, A. Doan, and J. Shavlik. Tuffy: Scaling up statistical inference in markov logic networks using an rdbms. Proc. VLDB Endow., 4(6):373--384, Mar. 2011. Google ScholarDigital Library
E. Rahm and H. H. Do. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 23(4):3--13, 2000.Google Scholar
A. J. Ratner, C. D. Sa, S. Wu, D. Selsam, and C. Ré. Data programming: Creating large training sets, quickly. In NIPS, pages 3567--3575, 2016.Google ScholarDigital Library
T. Rekatsinas, M. Joglekar, H. Garcia-Molina, A. G. Parameswaran, and C. Ré. SLiMFast: Guaranteed results for data fusion and source reliability. In SIGMOD, 2017. Google ScholarDigital Library
C. D. Sa, C. Zhang, K. Olukotun, and C. Ré. Rapidly mixing gibbs sampling for a class of factor graphs using hierarchy width. In NIPS, pages 3097--3105, 2015. Google ScholarDigital Library
J. Shin, S. Wu, F. Wang, C. De Sa, C. Zhang, and C. Ré. Incremental knowledge base construction using DeepDive. Proc. VLDB Endow., 8(11):1310--1321, July 2015. Google ScholarDigital Library
M. Stonebraker, D. Bruckner, I. F. Ilyas, G. Beskales, M. Cherniack, S. B. Zdonik, A. Pagan, and S. Xu. Data curation at scale: The data tamer system. In CIDR, 2013.Google Scholar
J. Wang and N. Tang. Towards dependable data repairing with fixing rules. In SIGMOD, pages 457--468. ACM, 2014. Google ScholarDigital Library
M. Yakout, L. Berti-Équille, and A. K. Elmagarmid. Don't be scared: use scalable automatic repairing with maximal likelihood and bounded changes. In SIGMOD, pages 553--564, 2013. Google ScholarDigital Library
C. Zhang and C. Ré. Dimmwitted: A study of main-memory statistical analytics. PVLDB, pages 1283--1294, 2014. Google ScholarDigital Library

Index Terms

HoloClean: holistic data repairs with probabilistic inference
1. Information systems
  1. Data management systems
    1. Database management system engines

Index terms have been assigned to the content through auto-classification.

Recommendations

Hierarchical hidden conditional random fields for information extraction
LION'05: Proceedings of the 5th international conference on Learning and Intelligent Optimization

Hidden Markov Models (HMMs) are very popular generative models for time series data. Recent work, however, has shown that for many tasks Conditional Random Fields (CRFs), a type of discriminative model, perform better than HMMs. Information extraction ...
Read More
Endpoint detection of sio₂ plasma etching using expanded hidden markov model
ISNN'10: Proceedings of the 7th international conference on Advances in Neural Networks - Volume Part II

In this paper, extended Hidden Markov Model (eHMM) is employed to resolve transition detection problems in plasma etch processes using optical emission spectroscopy (OES) data The proposed eHMM framework is a one of various semi-Markov models: a ...
Read More
The infinite hidden Markov random field model

Hidden Markov random field (HMRF) models are widely used for image segmentation, as they appear naturally in problems where a spatially constrained clustering scheme is asked for. A major limitation of HMRF models concerns the automatic selection of the ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 10, Issue 11
August 2017
432 pages
ISSN:2150-8097
Editors:
Peter Boncz
CWI
,
Ken Salem
University of Waterloo
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 August 2017
Published in pvldb Volume 10, Issue 11
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 141
  Total Citations
  View Citations
- 407
  Total Downloads
- Downloads (Last 12 months)62
- Downloads (Last 6 weeks)19
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HoloClean: holistic data repairs with probabilistic inference

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Hierarchical hidden conditional random fields for information extraction

Endpoint detection of sio₂ plasma etching using expanded hidden markov model

The infinite hidden Markov random field model

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

HoloClean: holistic data repairs with probabilistic inference

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Hierarchical hidden conditional random fields for information extraction

Endpoint detection of sio2 plasma etching using expanded hidden markov model

The infinite hidden Markov random field model

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media

Endpoint detection of sio₂ plasma etching using expanded hidden markov model