skip to main content
research-article

Efficient Discovery of Matching Dependencies

Published:26 August 2020Publication History
Skip Abstract Section

Abstract

Matching dependencies (MDs) are data profiling results that are often used for data integration, data cleaning, and entity matching. They are a generalization of functional dependencies (FDs) matching similar rather than same elements. As their discovery is very difficult, existing profiling algorithms find either only small subsets of all MDs or their scope is limited to only small datasets.

We focus on the efficient discovery of all interesting MDs in real-world datasets. For this purpose, we propose HyMD, a novel MD discovery algorithm that finds all minimal, non-trivial MDs within given similarity boundaries. The algorithm extracts the exact similarity thresholds for the individual MDs from the data instead of using predefined similarity thresholds. For this reason, it is the first approach to solve the MD discovery problem in an exact and truly complete way. If needed, the algorithm can, however, enforce certain properties on the reported MDs, such as disjointness and minimum support, to focus the discovery on such results that are actually required by downstream use cases. HyMD is technically a hybrid approach that combines the two most popular dependency discovery strategies in related work: lattice traversal and inference from record pairs. Despite the additional effort of finding exact similarity thresholds for all MD candidates, the algorithm is still able to efficiently process large datasets, e.g., datasets larger than 3 GB.

References

  1. Zeinab Bahmani, Leopoldo Bertossi, and Nikolaos Vasiloglou. 2015. ERBlox: Combining matching dependencies with machine learning for entity resolution. In Proceedings of the International Conference on Scalable Uncertainty Management. Springer, 399--414.Google ScholarGoogle ScholarCross RefCross Ref
  2. Zeinab Bahmani, Leopoldo E. Bertossi, Solmaz Kolahi, and Laks V. S. Lakshmanan. 2012. Declarative entity resolution via matching dependencies and answer set programs. In Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning (KR’12). Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Mikhail Bilenko, Raymond Mooney, William Cohen, Pradeep Ravikumar, and Stephen Fienberg. 2003. Adaptive name matching in information integration. IEEE Intell. Syst. 18, 5 (2003), 16--23. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Leo Breiman. 2001. Random forests. Mach. Learn. 45, 1 (2001), 5--32. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2016. Relaxed functional dependencies—A survey of approaches. IEEE Trans. Knowl. Data Eng. 28, 1 (2016), 147--165. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Peter Christen. 2012. Data Matching. Springer.Google ScholarGoogle Scholar
  7. Edgar F. Codd. 1970. A relational model of data for large shared data banks. Commun. ACM 13, 6 (1970), 377--387. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Stavros S. Cosmadakis, Paris C. Kanellakis, and Nicolas Spyratos. 1986. Partition semantics for relations. J. Comput. Syst. Sci. 33, 2 (1986), 203--233. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. 2007. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng. 19, 1 (2007), 1--16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Wenfei Fan. 2008. Dependencies revisited for improving data quality. In Proceedings of the Symposium on Principles of Database Systems (PODS’08). ACM, 159--170. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Wenfei Fan, Hong Gao, Xibei Jia, Jianzhong Li, and Shuai Ma. 2011. Dynamic constraints for record matching. VLDB J. 20, 4 (2011), 495--520. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Wenfei Fan, Xibei Jia, Jianzhong Li, and Shuai Ma. 2009. Reasoning about record matching rules. Proc. VLDB Endow. 2, 1 (2009), 407--418. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Wenfei Fan, Shuai Ma, Nan Tang, and Wenyuan Yu. 2014. Interaction between record matching and data repairing. J. Data Inf. Qual. 4, 4 (2014), 16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Peter A. Flach and Iztok Savnik. 1999. Database dependency discovery: A machine learning approach. AI Commun. 12, 3 (1999), 139--160. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Jaffer Gardezi, Leopoldo Bertossi, and Iluju Kiringa. 2011. Matching dependencies with arbitrary attribute values: Semantics, query answering, and integrity constraints. In Proceedings of the International Workshop on Logic in Databases. ACM, 23--30. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Jaffer Gardezi, Leopoldo Bertossi, and Iluju Kiringa. 2012. Matching dependencies: Semantics and query answering. Front. Comput. Sci. 6 (2012), 278--292.Google ScholarGoogle Scholar
  17. Mauricio A. Hernández and Salvatore J. Stolfo. 1995. The merge/purge problem for large databases. In Proceedings of the International Conference on Management of Data (SIGMOD’95). 127--138. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Ykä Huhtala, Juha Kärkkäinen, Pasi Porkka, and Hannu Toivonen. 1999. TANE: An efficient algorithm for discovering functional and approximate dependencies. Comput. J. 42, 2 (1999), 100--111.Google ScholarGoogle ScholarCross RefCross Ref
  19. Ron Kohavi. 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’95), Vol. 14. 1137--1145. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Ioannis Koumarelas, Thorsten Papenbrock, and Felix Naumann. 2020. MDedup: Duplicate detection with matching dependencies. Proc. VLDB Endow. 13, 5 (2020), 712--725.Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Michael Levandowsky and David Winter. 1971. Distance between sets. Nature 234, 5323 (1971), 34--35.Google ScholarGoogle Scholar
  22. Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, Vol. 10. 707--710.Google ScholarGoogle Scholar
  23. Guoliang Li, Dong Deng, Jiannan Wang, and Jianhua Feng. 2011. Pass-join: A partition-based method for similarity joins. Proc. VLDB Endow. 5, 3 (2011), 253--264. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Renée J. Miller, Mauricio A. Hernández, Laura M. Haas, Ling-Ling Yan, C. T. Howard Ho, Ronald Fagin, and Lucian Popa. 2001. The Clio project: Managing heterogeneity. SIGMOD Rec. 30, 1 (2001), 78--83. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Alvaro E. Monge and Charles Elkan. 1996. The field matching problem: Algorithms and applications. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (SIGKDD’96). 267--270. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Thorsten Papenbrock, Tanja Bergmann, Moritz Finke, Jakob Zwiener, and Felix Naumann. 2015. Data profiling with Metanome. Proc. VLDB Endow. 8, 12 (2015), 1860--1871. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Thorsten Papenbrock and Felix Naumann. 2016. A hybrid approach to functional dependency discovery. In Proceedings of the International Conference on Management of Data (SIGMOD’16). 821--833. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Thorsten Papenbrock and Felix Naumann. 2017. Data-driven schema normalization. In Proceedings of the International Conference on Extending Database Technology (EDBT’17). 342--353.Google ScholarGoogle Scholar
  29. Glenn Norman Paulley. 2000. Exploiting Functional Dependence in Query Optimization. Ph.D. Dissertation. Waterloo, Ont., Canada.Google ScholarGoogle Scholar
  30. Erhard Rahm and Philip A. Bernstein. 2001. A survey of approaches to automatic schema matching. VLDB J. 10, 4 (2001), 334--350. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Khalid Sayood. 2017. Introduction to Data Compression (5th ed.). Morgan Kaufmann. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Rohit Singh, Venkata Vamsikrishna Meduri, Ahmed K. Elmagarmid, Samuel Madden, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Armando Solar-Lezama, and Nan Tang. 2017. Synthesizing entity matching rules by examples. Proc. VLDB Endow. 11, 2 (2017), 189--202. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Shaoxu Song and Lei Chen. 2009. Discovering matching dependencies. In Proceedings of the International Conference on Information and Knowledge Management (CIKM’09). ACM, 1421--1424. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Shaoxu Song and Lei Chen. 2013. Efficient discovery of similarity constraints for matching dependencies. Data Knowl. Eng. 87 (2013), 146--166. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Umair ul Hassan, Sean O’Riain, and Edward Curry. 2012. Leveraging matching dependencies for guided user feedback in linked data applications. In Proceedings of the International Workshop on Information Integration on the Web (IIWeb’12). 1--6. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Jiannan Wang, Guoliang Li, and Jianhua Feng. 2012. Can we beat the prefix filtering?: An adaptive framework for similarity join and search. In Proceedings of the International Conference on Management of Data (SIGMOD’12). ACM, 85--96. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Jiannan Wang, Guoliang Li, Jeffrey Xu Yu, and Jianhua Feng. 2011. Entity matching: How similar is similar. Proc. VLDB Endow. 4, 10 (2011), 622--633. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Yihan Wang, Shaoxu Song, Lei Chen, Jeffrey Xu Yu, and Hong Cheng. 2017. Discovering conditional matching rules. ACM Trans. Knowl. Disc. Data 11, 4 (2017), 46. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Chuan Xiao, Wei Wang, and Xuemin Lin. 2008. Ed-Join: An efficient algorithm for similarity joins with edit distance constraints. Proc. VLDB Endow. 1, 1 (2008), 933--944. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Efficient Discovery of Matching Dependencies

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          • Published in

            cover image ACM Transactions on Database Systems
            ACM Transactions on Database Systems  Volume 45, Issue 3
            Best of ICDT 2019 and Regular Papers
            September 2020
            213 pages
            ISSN:0362-5915
            EISSN:1557-4644
            DOI:10.1145/3420008
            Issue’s Table of Contents

            Copyright © 2020 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 26 August 2020
            • Online AM: 7 May 2020
            • Accepted: 1 April 2020
            • Revised: 1 December 2019
            • Received: 1 February 2019
            Published in tods Volume 45, Issue 3

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • research-article
            • Research
            • Refereed

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader

          HTML Format

          View this article in HTML Format .

          View HTML Format