research-article

Efficient Discovery of Matching Dependencies

Authors:
Philipp Schirmer

Hasso Plattner Institute, University of Potsdam, Germany

Hasso Plattner Institute, University of Potsdam, Germany
View Profile

,
Thorsten Papenbrock

Hasso Plattner Institute, University of Potsdam, Germany

Hasso Plattner Institute, University of Potsdam, Germany
View Profile

,
Ioannis Koumarelas

Hasso Plattner Institute, University of Potsdam, Germany

Hasso Plattner Institute, University of Potsdam, Germany
View Profile

,
Felix Naumann

Hasso Plattner Institute, University of Potsdam, Germany

Hasso Plattner Institute, University of Potsdam, Germany
View Profile

Authors Info & Claims

ACM Transactions on Database Systems Volume 45 Issue 3Article No.: 13pp 1–33https://doi.org/10.1145/3392778

Published:26 August 2020Publication History

ACM Transactions on Database Systems

Abstract

Matching dependencies (MDs) are data profiling results that are often used for data integration, data cleaning, and entity matching. They are a generalization of functional dependencies (FDs) matching similar rather than same elements. As their discovery is very difficult, existing profiling algorithms find either only small subsets of all MDs or their scope is limited to only small datasets.

We focus on the efficient discovery of all interesting MDs in real-world datasets. For this purpose, we propose HyMD, a novel MD discovery algorithm that finds all minimal, non-trivial MDs within given similarity boundaries. The algorithm extracts the exact similarity thresholds for the individual MDs from the data instead of using predefined similarity thresholds. For this reason, it is the first approach to solve the MD discovery problem in an exact and truly complete way. If needed, the algorithm can, however, enforce certain properties on the reported MDs, such as disjointness and minimum support, to focus the discovery on such results that are actually required by downstream use cases. HyMD is technically a hybrid approach that combines the two most popular dependency discovery strategies in related work: lattice traversal and inference from record pairs. Despite the additional effort of finding exact similarity thresholds for all MD candidates, the algorithm is still able to efficiently process large datasets, e.g., datasets larger than 3 GB.

References

Zeinab Bahmani, Leopoldo Bertossi, and Nikolaos Vasiloglou. 2015. ERBlox: Combining matching dependencies with machine learning for entity resolution. In Proceedings of the International Conference on Scalable Uncertainty Management. Springer, 399--414.Google ScholarCross Ref
Zeinab Bahmani, Leopoldo E. Bertossi, Solmaz Kolahi, and Laks V. S. Lakshmanan. 2012. Declarative entity resolution via matching dependencies and answer set programs. In Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning (KR’12). Google ScholarDigital Library
Mikhail Bilenko, Raymond Mooney, William Cohen, Pradeep Ravikumar, and Stephen Fienberg. 2003. Adaptive name matching in information integration. IEEE Intell. Syst. 18, 5 (2003), 16--23. Google ScholarDigital Library
Leo Breiman. 2001. Random forests. Mach. Learn. 45, 1 (2001), 5--32. Google ScholarDigital Library
Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2016. Relaxed functional dependencies—A survey of approaches. IEEE Trans. Knowl. Data Eng. 28, 1 (2016), 147--165. Google ScholarDigital Library
Peter Christen. 2012. Data Matching. Springer.Google Scholar
Edgar F. Codd. 1970. A relational model of data for large shared data banks. Commun. ACM 13, 6 (1970), 377--387. Google ScholarDigital Library
Stavros S. Cosmadakis, Paris C. Kanellakis, and Nicolas Spyratos. 1986. Partition semantics for relations. J. Comput. Syst. Sci. 33, 2 (1986), 203--233. Google ScholarDigital Library
Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. 2007. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng. 19, 1 (2007), 1--16. Google ScholarDigital Library
Wenfei Fan. 2008. Dependencies revisited for improving data quality. In Proceedings of the Symposium on Principles of Database Systems (PODS’08). ACM, 159--170. Google ScholarDigital Library
Wenfei Fan, Hong Gao, Xibei Jia, Jianzhong Li, and Shuai Ma. 2011. Dynamic constraints for record matching. VLDB J. 20, 4 (2011), 495--520. Google ScholarDigital Library
Wenfei Fan, Xibei Jia, Jianzhong Li, and Shuai Ma. 2009. Reasoning about record matching rules. Proc. VLDB Endow. 2, 1 (2009), 407--418. Google ScholarDigital Library
Wenfei Fan, Shuai Ma, Nan Tang, and Wenyuan Yu. 2014. Interaction between record matching and data repairing. J. Data Inf. Qual. 4, 4 (2014), 16. Google ScholarDigital Library
Peter A. Flach and Iztok Savnik. 1999. Database dependency discovery: A machine learning approach. AI Commun. 12, 3 (1999), 139--160. Google ScholarDigital Library
Jaffer Gardezi, Leopoldo Bertossi, and Iluju Kiringa. 2011. Matching dependencies with arbitrary attribute values: Semantics, query answering, and integrity constraints. In Proceedings of the International Workshop on Logic in Databases. ACM, 23--30. Google ScholarDigital Library
Jaffer Gardezi, Leopoldo Bertossi, and Iluju Kiringa. 2012. Matching dependencies: Semantics and query answering. Front. Comput. Sci. 6 (2012), 278--292.Google Scholar
Mauricio A. Hernández and Salvatore J. Stolfo. 1995. The merge/purge problem for large databases. In Proceedings of the International Conference on Management of Data (SIGMOD’95). 127--138. Google ScholarDigital Library
Ykä Huhtala, Juha Kärkkäinen, Pasi Porkka, and Hannu Toivonen. 1999. TANE: An efficient algorithm for discovering functional and approximate dependencies. Comput. J. 42, 2 (1999), 100--111.Google ScholarCross Ref
Ron Kohavi. 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’95), Vol. 14. 1137--1145. Google ScholarDigital Library
Ioannis Koumarelas, Thorsten Papenbrock, and Felix Naumann. 2020. MDedup: Duplicate detection with matching dependencies. Proc. VLDB Endow. 13, 5 (2020), 712--725.Google ScholarDigital Library
Michael Levandowsky and David Winter. 1971. Distance between sets. Nature 234, 5323 (1971), 34--35.Google Scholar
Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, Vol. 10. 707--710.Google Scholar
Guoliang Li, Dong Deng, Jiannan Wang, and Jianhua Feng. 2011. Pass-join: A partition-based method for similarity joins. Proc. VLDB Endow. 5, 3 (2011), 253--264. Google ScholarDigital Library
Renée J. Miller, Mauricio A. Hernández, Laura M. Haas, Ling-Ling Yan, C. T. Howard Ho, Ronald Fagin, and Lucian Popa. 2001. The Clio project: Managing heterogeneity. SIGMOD Rec. 30, 1 (2001), 78--83. Google ScholarDigital Library
Alvaro E. Monge and Charles Elkan. 1996. The field matching problem: Algorithms and applications. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (SIGKDD’96). 267--270. Google ScholarDigital Library
Thorsten Papenbrock, Tanja Bergmann, Moritz Finke, Jakob Zwiener, and Felix Naumann. 2015. Data profiling with Metanome. Proc. VLDB Endow. 8, 12 (2015), 1860--1871. Google ScholarDigital Library
Thorsten Papenbrock and Felix Naumann. 2016. A hybrid approach to functional dependency discovery. In Proceedings of the International Conference on Management of Data (SIGMOD’16). 821--833. Google ScholarDigital Library
Thorsten Papenbrock and Felix Naumann. 2017. Data-driven schema normalization. In Proceedings of the International Conference on Extending Database Technology (EDBT’17). 342--353.Google Scholar
Glenn Norman Paulley. 2000. Exploiting Functional Dependence in Query Optimization. Ph.D. Dissertation. Waterloo, Ont., Canada.Google Scholar
Erhard Rahm and Philip A. Bernstein. 2001. A survey of approaches to automatic schema matching. VLDB J. 10, 4 (2001), 334--350. Google ScholarDigital Library
Khalid Sayood. 2017. Introduction to Data Compression (5th ed.). Morgan Kaufmann. Google ScholarDigital Library
Rohit Singh, Venkata Vamsikrishna Meduri, Ahmed K. Elmagarmid, Samuel Madden, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Armando Solar-Lezama, and Nan Tang. 2017. Synthesizing entity matching rules by examples. Proc. VLDB Endow. 11, 2 (2017), 189--202. Google ScholarDigital Library
Shaoxu Song and Lei Chen. 2009. Discovering matching dependencies. In Proceedings of the International Conference on Information and Knowledge Management (CIKM’09). ACM, 1421--1424. Google ScholarDigital Library
Shaoxu Song and Lei Chen. 2013. Efficient discovery of similarity constraints for matching dependencies. Data Knowl. Eng. 87 (2013), 146--166. Google ScholarDigital Library
Umair ul Hassan, Sean O’Riain, and Edward Curry. 2012. Leveraging matching dependencies for guided user feedback in linked data applications. In Proceedings of the International Workshop on Information Integration on the Web (IIWeb’12). 1--6. Google ScholarDigital Library
Jiannan Wang, Guoliang Li, and Jianhua Feng. 2012. Can we beat the prefix filtering?: An adaptive framework for similarity join and search. In Proceedings of the International Conference on Management of Data (SIGMOD’12). ACM, 85--96. Google ScholarDigital Library
Jiannan Wang, Guoliang Li, Jeffrey Xu Yu, and Jianhua Feng. 2011. Entity matching: How similar is similar. Proc. VLDB Endow. 4, 10 (2011), 622--633. Google ScholarDigital Library
Yihan Wang, Shaoxu Song, Lei Chen, Jeffrey Xu Yu, and Hong Cheng. 2017. Discovering conditional matching rules. ACM Trans. Knowl. Disc. Data 11, 4 (2017), 46. Google ScholarDigital Library
Chuan Xiao, Wei Wang, and Xuemin Lin. 2008. Ed-Join: An efficient algorithm for similarity joins with edit distance constraints. Proc. VLDB Endow. 1, 1 (2008), 933--944. Google ScholarDigital Library

Index Terms

Efficient Discovery of Matching Dependencies
1. Information systems
  1. Data management systems
    1. Information integration
      1. Entity resolution
  2. Information systems applications
    1. Data mining
      1. Association rules
      2. Data cleaning
    2. Decision support systems
      1. Data analytics

Recommendations

A Hybrid Approach to Functional Dependency Discovery
SIGMOD '16: Proceedings of the 2016 International Conference on Management of Data

Functional dependencies are structural metadata that can be used for schema normalization, data integration, data cleansing, and many other data management tasks. Despite their importance, the functional dependencies of a specific dataset are usually ...
Read More
Efficient distributed discovery of bidirectional order dependencies
Abstract
Bidirectional order dependencies (bODs) capture order relationships between lists of attributes in a relational table. They can express that, for example, sorting books by publication date in ascending order also sorts them by age in descending ...
Read More
Approximate Discovery of Functional Dependencies for Large Datasets
CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management

Functional dependencies (FDs) are an important prerequisite for various data management tasks, such as schema normalization, query optimization, and data cleansing. However, automatic FD discovery entails an exponentially growing search and solution ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Transactions on Database Systems Volume 45, Issue 3
Best of ICDT 2019 and Regular Papers
September 2020
213 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/3420008
Editor:
Christian S. Jensen
Aalborg University, Denmark
Issue’s Table of Contents
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 26 August 2020
- Online AM: 7 May 2020
- Accepted: 1 April 2020
- Revised: 1 December 2019
- Received: 1 February 2019
Published in tods Volume 45, Issue 3

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Matching dependencies
data matching
data profiling
dependency discovery
entity resolution
functional dependencies
similarity measures
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 22
  Total Citations
  View Citations
- 361
  Total Downloads
- Downloads (Last 12 months)52
- Downloads (Last 6 weeks)7
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

HTML Format

View this article in HTML Format .

View HTML Format

Efficient Discovery of Matching Dependencies

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

A Hybrid Approach to Functional Dependency Discovery

Efficient distributed discovery of bidirectional order dependencies

Approximate Discovery of Functional Dependencies for Large Datasets

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

HTML Format

Caption

Efficient Discovery of Matching Dependencies

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

A Hybrid Approach to Functional Dependency Discovery

Efficient distributed discovery of bidirectional order dependencies

Approximate Discovery of Functional Dependencies for Large Datasets

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

HTML Format

Share this Publication link

Share on Social Media