skip to main content
research-article

Reasoning about record matching rules

Published:01 August 2009Publication History
Skip Abstract Section

Abstract

To accurately match records it is often necessary to utilize the semantics of the data. Functional dependencies (FDs) have proven useful in identifying tuples in a clean relation, based on the semantics of the data. For all the reasons that FDs and their inference are needed, it is also important to develop dependencies and their reasoning techniques for matching tuples from unreliable data sources. This paper investigates dependencies and their reasoning for record matching. (a) We introduce a class of matching dependencies (MDs) for specifying the semantics of data in unreliable relations, defined in terms of similarity metrics and a dynamic semantics. (b) We identify a special case of MDs, referred to as relative candidate keys (RCKs), to determine what attributes to compare and how to compare them when matching records across possibly different relations. (c) We propose a mechanism for inferring MDs, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. (d) We provide an O(n2) time algorithm for inferring MDs, and an effective algorithm for deducing a set of RCKs from MDs. (e) We experimentally verify that the algorithms help matching tools efficiently identify keys at compile time for matching, blocking or windowing, and that the techniques effectively improve both the quality and efficiency of various record matching methods.

References

  1. http://www.sas.com/industry/fsi/fraud/.Google ScholarGoogle Scholar
  2. S. Abiteboul, R. Hull, and V. Vianu. Foundations of Databases. Addison-Wesley, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In VLDB, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. J. W. Anish Das Sarma, Jeffrey Ullman. Schema design for uncertain databases. In Proceedings of the 3rd Alberto Mendelzon Workshop on Foundations of Data Management, 2009.Google ScholarGoogle Scholar
  5. A. Arasu, S. Chaudhuri, and R. Kaushik. Transformation-based framework for record matching. In ICDE, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. A. Arasu, C. Re, and D. Suciu. Large-scale deduplication with constraints using Dedupalog. In ICDE, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. C. Batini and M. Scannapieco. Data Quality: Concepts, Methodologies and Techniques. Springer, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. C. Beeri and P. A. Bernstein. Computational problems related to the design of normal form relational schemas. TODS, 4(1):30--59, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. R. Belohlávek and V. Vychodil. Data tables with similarity relations: Functional dependencies, complete rules and non-redundant bases. In DASFAA, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. S. Chaudhuri, B.-C. Chen, V. Ganti, and R. Kaushik. Example-driven design of efficient record matching queries. In VLDB, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. Chaudhuri, A. D. Sarma, V. Ganti, and R. Kaushik. Leveraging aggregate constraints for deduplication. In SIGMOD, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. W. W. Cohen and J. Richman. Learning to match and cluster large high-dimensional data sets for data integration. In KDD, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. R. Dhamankar, Y. Lee, A. Doan, A. Y. Halevy, and P. Domingos. iMAP: Discovering complex mappings between database schemas. In SIGMOD, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. A. K. Elmagarmid, P. G. Ipeirotis, and V. S. Verykios. Duplicate record detection: A survey. TKDE, 19(1):1--16, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. W. Fan. Dependencies revisited for improving data quality. In PODS, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. I. Fellegi and D. Holt. A systematic approach to automatic edit and imputation. J. American Statistical Association, 71(353):17--35, 1976.Google ScholarGoogle ScholarCross RefCross Ref
  17. I. Fellegi and A. B. Sunter. A theory for record linkage. J. American Statistical Association, 64(328):1183--1210, 1969.Google ScholarGoogle ScholarCross RefCross Ref
  18. H. Galhardas, D. Florescu, D. Shasha, E. Simon, and C. Saita. Declarative data cleaning: Language, model and algorithms. In VLDB, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. S. Guha, N. Koudas, A. Marathe, and D. Srivastava. Merging the results of approximate match operations. In VLDB, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. M. A. Hernndez and S. J. Stolfo. The merge/purge problem for large databases. In SIGMOD, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. M. Jaro. Advances in record-linkage methodology as applied to matching the 1985 census of Tampa Florida. J. American Statistical Association, 89:414--420, 1989.Google ScholarGoogle ScholarCross RefCross Ref
  22. N. Koudas, A. Saha, D. Srivastava, and S. Venkatasubramanian. Metric functional dependencies. In ICDE, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. E.-P. Lim, J. Srivastava, S. Prabhakar, and J. Richardson. Entity identification in database integration. Inf. Sci., 89(1--2):1--38, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. C. L. Lucchesi and S. L. Osborn. Candidate keys for relations. JCSS, 17(2):270--279, 1978.Google ScholarGoogle ScholarCross RefCross Ref
  25. D. Maier. The Theory of Relational Databases. Computer Science Press, 1983. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. J. Radcliffe and A. White. Key issues for master data management. Technical report, Gartner, 2008.Google ScholarGoogle Scholar
  27. S. Sarawagi and A. Bhamidipaty. Interactive deduplication using active learning. In KDD, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. W. Shen, X. Li, and A. Doan. Constraint-based entity matching. In AAAI, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. P. Singla and P. Domingos. Object identification with attributemediated dependences. In PKDD, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. V. S. Verykios, A. K. Elmagarmid, and E. Houstis. Automating the approximate record-matching process. Inf. Sci., 126(1--4):83--89, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. M. Weis, F. Naumann, U. Jehle, J. Lufter, and H. Schuster. Industry-scale duplicate detection. In VLDB, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. W. Winkler. Methods for record linkage and bayesian networks. Technical Report RRS2002/05, U.S. Census Bureau, 2002.Google ScholarGoogle Scholar
  33. W. E. Winkler. Methods for evaluating and creating data quality. Information Systems, 29(7):531--550, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. W. Yancey. BigMatch: A program for extracting probable matches from a large file. Technical Report Computing 2007/01, U.S. Census Bureau, 2007.Google ScholarGoogle Scholar

Index Terms

  1. Reasoning about record matching rules

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader