Abstract
To accurately match records it is often necessary to utilize the semantics of the data. Functional dependencies (FDs) have proven useful in identifying tuples in a clean relation, based on the semantics of the data. For all the reasons that FDs and their inference are needed, it is also important to develop dependencies and their reasoning techniques for matching tuples from unreliable data sources. This paper investigates dependencies and their reasoning for record matching. (a) We introduce a class of matching dependencies (MDs) for specifying the semantics of data in unreliable relations, defined in terms of similarity metrics and a dynamic semantics. (b) We identify a special case of MDs, referred to as relative candidate keys (RCKs), to determine what attributes to compare and how to compare them when matching records across possibly different relations. (c) We propose a mechanism for inferring MDs, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. (d) We provide an O(n2) time algorithm for inferring MDs, and an effective algorithm for deducing a set of RCKs from MDs. (e) We experimentally verify that the algorithms help matching tools efficiently identify keys at compile time for matching, blocking or windowing, and that the techniques effectively improve both the quality and efficiency of various record matching methods.
- http://www.sas.com/industry/fsi/fraud/.Google Scholar
- S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995. Google ScholarDigital Library
- R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In VLDB, 2002. Google ScholarDigital Library
- J. W. Anish Das Sarma, Jeffrey Ullman. Schema design for uncertain databases. In Proceedings of the 3rd Alberto Mendelzon Workshop on Foundations of Data Management, 2009.Google Scholar
- A. Arasu, S. Chaudhuri, and R. Kaushik. Transformation-based framework for record matching. In ICDE, 2008. Google ScholarDigital Library
- A. Arasu, C. Re, and D. Suciu. Large-scale deduplication with constraints using Dedupalog. In ICDE, 2009. Google ScholarDigital Library
- C. Batini and M. Scannapieco. Data Quality: Concepts, Methodologies and Techniques. Springer, 2006. Google ScholarDigital Library
- C. Beeri and P. A. Bernstein. Computational problems related to the design of normal form relational schemas. TODS, 4(1):30--59, 1979. Google ScholarDigital Library
- R. Belohlávek and V. Vychodil. Data tables with similarity relations: Functional dependencies, complete rules and non-redundant bases. In DASFAA, 2006. Google ScholarDigital Library
- S. Chaudhuri, B.-C. Chen, V. Ganti, and R. Kaushik. Example-driven design of efficient record matching queries. In VLDB, 2007. Google ScholarDigital Library
- S. Chaudhuri, A. D. Sarma, V. Ganti, and R. Kaushik. Leveraging aggregate constraints for deduplication. In SIGMOD, 2007. Google ScholarDigital Library
- W. W. Cohen and J. Richman. Learning to match and cluster large high-dimensional data sets for data integration. In KDD, 2002. Google ScholarDigital Library
- R. Dhamankar, Y. Lee, A. Doan, A. Y. Halevy, and P. Domingos. iMAP: Discovering complex mappings between database schemas. In SIGMOD, 2004. Google ScholarDigital Library
- A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. TKDE, 19(1):1--16, 2007. Google ScholarDigital Library
- W. Fan. Dependencies revisited for improving data quality. In PODS, 2008. Google ScholarDigital Library
- I. Fellegi and D. Holt. A systematic approach to automatic edit and imputation. J. American Statistical Association, 71(353):17--35, 1976.Google ScholarCross Ref
- I. Fellegi and A. B. Sunter. A theory for record linkage. J. American Statistical Association, 64(328):1183--1210, 1969.Google ScholarCross Ref
- H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. Saita. Declarative data cleaning: Language, model and algorithms. In VLDB, 2001. Google ScholarDigital Library
- S. Guha, N. Koudas, A. Marathe, and D. Srivastava. Merging the results of approximate match operations. In VLDB, 2004. Google ScholarDigital Library
- M. A. Hernndez and S. J. Stolfo. The merge/purge problem for large databases. In SIGMOD, 1995. Google ScholarDigital Library
- M. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa Florida. J. American Statistical Association, 89:414--420, 1989.Google ScholarCross Ref
- N. Koudas, A. Saha, D. Srivastava, and S. Venkatasubramanian. Metric functional dependencies. In ICDE, 2009. Google ScholarDigital Library
- E.-P. Lim, J. Srivastava, S. Prabhakar, and J. Richardson. Entity identification in database integration. Inf. Sci., 89(1--2):1--38, 1996. Google ScholarDigital Library
- C. L. Lucchesi and S. L. Osborn. Candidate keys for relations. JCSS, 17(2):270--279, 1978.Google ScholarCross Ref
- D. Maier. The Theory of Relational Databases. Computer Science Press, 1983. Google ScholarDigital Library
- J. Radcliffe and A. White. Key issues for master data management. Technical report, Gartner, 2008.Google Scholar
- S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In KDD, 2002. Google ScholarDigital Library
- W. Shen, X. Li, and A. Doan. Constraint-based entity matching. In AAAI, 2005. Google ScholarDigital Library
- P. Singla and P. Domingos. Object identification with attributemediated dependences. In PKDD, 2005. Google ScholarDigital Library
- V. S. Verykios, A. K. Elmagarmid, and E. Houstis. Automating the approximate record-matching process. Inf. Sci., 126(1--4):83--89, 2002. Google ScholarDigital Library
- M. Weis, F. Naumann, U. Jehle, J. Lufter, and H. Schuster. Industry-scale duplicate detection. In VLDB, 2008. Google ScholarDigital Library
- W. Winkler. Methods for record linkage and bayesian networks. Technical Report RRS2002/05, U.S. Census Bureau, 2002.Google Scholar
- W. E. Winkler. Methods for evaluating and creating data quality. Information Systems, 29(7):531--550, 2004. Google ScholarDigital Library
- W. Yancey. BigMatch: A program for extracting probable matches from a large file. Technical Report Computing 2007/01, U.S. Census Bureau, 2007.Google Scholar
Index Terms
- Reasoning about record matching rules
Recommendations
Dynamic constraints for record matching
This paper investigates constraints for matching records from unreliable data sources. (a) We introduce a class of matching dependencies (mds) for specifying the semantics of unreliable data. As opposed to static constraints for schema design, mds are ...
Example-driven design of efficient record matching queries
VLDB '07: Proceedings of the 33rd international conference on Very large data basesRecord matching is the task of identifying records that match the same real world entity. This is a problem of great significance for a variety of business intelligence applications. Implementations of record matching rely on exact as well as ...
Interaction between Record Matching and Data Repairing
Central to a data cleaning system are record matching and data repairing. Matching aims to identify tuples that refer to the same real-world object, and repairing is to make a database consistent by fixing errors in the data by using integrity ...
Comments