research-article

Reasoning about record matching rules

Authors:
Wenfei Fan

University of Edinburgh and Bell Laboratories and Harbin Institute of Technology

University of Edinburgh and Bell Laboratories and Harbin Institute of Technology
View Profile

,
Xibei Jia

University of Edinburgh

University of Edinburgh
View Profile

,
Jianzhong Li

Harbin Institute of Technology

Harbin Institute of Technology
View Profile

,
Shuai Ma

University of Edinburgh

University of Edinburgh
View Profile

Proceedings of the VLDB Endowment Volume 2 Issue 1pp 407–418https://doi.org/10.14778/1687627.1687674

Published:01 August 2009Publication History

Proceedings of the VLDB Endowment

Abstract

To accurately match records it is often necessary to utilize the semantics of the data. Functional dependencies (FDs) have proven useful in identifying tuples in a clean relation, based on the semantics of the data. For all the reasons that FDs and their inference are needed, it is also important to develop dependencies and their reasoning techniques for matching tuples from unreliable data sources. This paper investigates dependencies and their reasoning for record matching. (a) We introduce a class of matching dependencies (MDs) for specifying the semantics of data in unreliable relations, defined in terms of similarity metrics and a dynamic semantics. (b) We identify a special case of MDs, referred to as relative candidate keys (RCKs), to determine what attributes to compare and how to compare them when matching records across possibly different relations. (c) We propose a mechanism for inferring MDs, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. (d) We provide an O(n²) time algorithm for inferring MDs, and an effective algorithm for deducing a set of RCKs from MDs. (e) We experimentally verify that the algorithms help matching tools efficiently identify keys at compile time for matching, blocking or windowing, and that the techniques effectively improve both the quality and efficiency of various record matching methods.

References

http://www.sas.com/industry/fsi/fraud/.Google Scholar
S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995. Google ScholarDigital Library
R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In VLDB, 2002. Google ScholarDigital Library
J. W. Anish Das Sarma, Jeffrey Ullman. Schema design for uncertain databases. In Proceedings of the 3rd Alberto Mendelzon Workshop on Foundations of Data Management, 2009.Google Scholar
A. Arasu, S. Chaudhuri, and R. Kaushik. Transformation-based framework for record matching. In ICDE, 2008. Google ScholarDigital Library
A. Arasu, C. Re, and D. Suciu. Large-scale deduplication with constraints using Dedupalog. In ICDE, 2009. Google ScholarDigital Library
C. Batini and M. Scannapieco. Data Quality: Concepts, Methodologies and Techniques. Springer, 2006. Google ScholarDigital Library
C. Beeri and P. A. Bernstein. Computational problems related to the design of normal form relational schemas. TODS, 4(1):30--59, 1979. Google ScholarDigital Library
R. Belohlávek and V. Vychodil. Data tables with similarity relations: Functional dependencies, complete rules and non-redundant bases. In DASFAA, 2006. Google ScholarDigital Library
S. Chaudhuri, B.-C. Chen, V. Ganti, and R. Kaushik. Example-driven design of efficient record matching queries. In VLDB, 2007. Google ScholarDigital Library
S. Chaudhuri, A. D. Sarma, V. Ganti, and R. Kaushik. Leveraging aggregate constraints for deduplication. In SIGMOD, 2007. Google ScholarDigital Library
W. W. Cohen and J. Richman. Learning to match and cluster large high-dimensional data sets for data integration. In KDD, 2002. Google ScholarDigital Library
R. Dhamankar, Y. Lee, A. Doan, A. Y. Halevy, and P. Domingos. iMAP: Discovering complex mappings between database schemas. In SIGMOD, 2004. Google ScholarDigital Library
A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. TKDE, 19(1):1--16, 2007. Google ScholarDigital Library
W. Fan. Dependencies revisited for improving data quality. In PODS, 2008. Google ScholarDigital Library
I. Fellegi and D. Holt. A systematic approach to automatic edit and imputation. J. American Statistical Association, 71(353):17--35, 1976.Google ScholarCross Ref
I. Fellegi and A. B. Sunter. A theory for record linkage. J. American Statistical Association, 64(328):1183--1210, 1969.Google ScholarCross Ref
H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. Saita. Declarative data cleaning: Language, model and algorithms. In VLDB, 2001. Google ScholarDigital Library
S. Guha, N. Koudas, A. Marathe, and D. Srivastava. Merging the results of approximate match operations. In VLDB, 2004. Google ScholarDigital Library
M. A. Hernndez and S. J. Stolfo. The merge/purge problem for large databases. In SIGMOD, 1995. Google ScholarDigital Library
M. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa Florida. J. American Statistical Association, 89:414--420, 1989.Google ScholarCross Ref
N. Koudas, A. Saha, D. Srivastava, and S. Venkatasubramanian. Metric functional dependencies. In ICDE, 2009. Google ScholarDigital Library
E.-P. Lim, J. Srivastava, S. Prabhakar, and J. Richardson. Entity identification in database integration. Inf. Sci., 89(1--2):1--38, 1996. Google ScholarDigital Library
C. L. Lucchesi and S. L. Osborn. Candidate keys for relations. JCSS, 17(2):270--279, 1978.Google ScholarCross Ref
D. Maier. The Theory of Relational Databases. Computer Science Press, 1983. Google ScholarDigital Library
J. Radcliffe and A. White. Key issues for master data management. Technical report, Gartner, 2008.Google Scholar
S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In KDD, 2002. Google ScholarDigital Library
W. Shen, X. Li, and A. Doan. Constraint-based entity matching. In AAAI, 2005. Google ScholarDigital Library
P. Singla and P. Domingos. Object identification with attributemediated dependences. In PKDD, 2005. Google ScholarDigital Library
V. S. Verykios, A. K. Elmagarmid, and E. Houstis. Automating the approximate record-matching process. Inf. Sci., 126(1--4):83--89, 2002. Google ScholarDigital Library
M. Weis, F. Naumann, U. Jehle, J. Lufter, and H. Schuster. Industry-scale duplicate detection. In VLDB, 2008. Google ScholarDigital Library
W. Winkler. Methods for record linkage and bayesian networks. Technical Report RRS2002/05, U.S. Census Bureau, 2002.Google Scholar
W. E. Winkler. Methods for evaluating and creating data quality. Information Systems, 29(7):531--550, 2004. Google ScholarDigital Library
W. Yancey. BigMatch: A program for extracting probable matches from a large file. Technical Report Computing 2007/01, U.S. Census Bureau, 2007.Google Scholar

Index Terms

Reasoning about record matching rules
1. Information systems
  1. Information retrieval
    1. Document representation
  2. Information systems applications
2. Theory of computation
  1. Semantics and reasoning
    1. Program reasoning
      1. Abstraction

Recommendations

Dynamic constraints for record matching

This paper investigates constraints for matching records from unreliable data sources. (a) We introduce a class of matching dependencies (mds) for specifying the semantics of unreliable data. As opposed to static constraints for schema design, mds are ...
Read More
Example-driven design of efficient record matching queries
VLDB '07: Proceedings of the 33rd international conference on Very large data bases

Record matching is the task of identifying records that match the same real world entity. This is a problem of great significance for a variety of business intelligence applications. Implementations of record matching rely on exact as well as ...
Read More
Interaction between Record Matching and Data Repairing

Central to a data cleaning system are record matching and data repairing. Matching aims to identify tuples that refer to the same real-world object, and repairing is to make a database consistent by fixing errors in the data by using integrity ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 2, Issue 1
August 2009
1293 pages
ISSN:2150-8097
Editors:
Serge Abiteboul,
Tova Milo,
Jignesh Patel,
Philippe Rigaux
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 August 2009
Published in pvldb Volume 2, Issue 1
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 77
  Total Citations
  View Citations
- 516
  Total Downloads
- Downloads (Last 12 months)29
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Reasoning about record matching rules

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Dynamic constraints for record matching

Example-driven design of efficient record matching queries

Interaction between Record Matching and Data Repairing

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Reasoning about record matching rules

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Dynamic constraints for record matching

Example-driven design of efficient record matching queries

Interaction between Record Matching and Data Repairing

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media