Abstract
Matching dependencies (MDs) are data profiling results that are often used for data integration, data cleaning, and entity matching. They are a generalization of functional dependencies (FDs) matching similar rather than same elements. As their discovery is very difficult, existing profiling algorithms find either only small subsets of all MDs or their scope is limited to only small datasets.
We focus on the efficient discovery of all interesting MDs in real-world datasets. For this purpose, we propose HyMD, a novel MD discovery algorithm that finds all minimal, non-trivial MDs within given similarity boundaries. The algorithm extracts the exact similarity thresholds for the individual MDs from the data instead of using predefined similarity thresholds. For this reason, it is the first approach to solve the MD discovery problem in an exact and truly complete way. If needed, the algorithm can, however, enforce certain properties on the reported MDs, such as disjointness and minimum support, to focus the discovery on such results that are actually required by downstream use cases. HyMD is technically a hybrid approach that combines the two most popular dependency discovery strategies in related work: lattice traversal and inference from record pairs. Despite the additional effort of finding exact similarity thresholds for all MD candidates, the algorithm is still able to efficiently process large datasets, e.g., datasets larger than 3 GB.
- Zeinab Bahmani, Leopoldo Bertossi, and Nikolaos Vasiloglou. 2015. ERBlox: Combining matching dependencies with machine learning for entity resolution. In Proceedings of the International Conference on Scalable Uncertainty Management. Springer, 399--414.Google ScholarCross Ref
- Zeinab Bahmani, Leopoldo E. Bertossi, Solmaz Kolahi, and Laks V. S. Lakshmanan. 2012. Declarative entity resolution via matching dependencies and answer set programs. In Proceedings of the International Conference on Principles of Knowledge Representation and Reasoning (KR’12). Google ScholarDigital Library
- Mikhail Bilenko, Raymond Mooney, William Cohen, Pradeep Ravikumar, and Stephen Fienberg. 2003. Adaptive name matching in information integration. IEEE Intell. Syst. 18, 5 (2003), 16--23. Google ScholarDigital Library
- Leo Breiman. 2001. Random forests. Mach. Learn. 45, 1 (2001), 5--32. Google ScholarDigital Library
- Loredana Caruccio, Vincenzo Deufemia, and Giuseppe Polese. 2016. Relaxed functional dependencies—A survey of approaches. IEEE Trans. Knowl. Data Eng. 28, 1 (2016), 147--165. Google ScholarDigital Library
- Peter Christen. 2012. Data Matching. Springer.Google Scholar
- Edgar F. Codd. 1970. A relational model of data for large shared data banks. Commun. ACM 13, 6 (1970), 377--387. Google ScholarDigital Library
- Stavros S. Cosmadakis, Paris C. Kanellakis, and Nicolas Spyratos. 1986. Partition semantics for relations. J. Comput. Syst. Sci. 33, 2 (1986), 203--233. Google ScholarDigital Library
- Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios. 2007. Duplicate record detection: A survey. IEEE Trans. Knowl. Data Eng. 19, 1 (2007), 1--16. Google ScholarDigital Library
- Wenfei Fan. 2008. Dependencies revisited for improving data quality. In Proceedings of the Symposium on Principles of Database Systems (PODS’08). ACM, 159--170. Google ScholarDigital Library
- Wenfei Fan, Hong Gao, Xibei Jia, Jianzhong Li, and Shuai Ma. 2011. Dynamic constraints for record matching. VLDB J. 20, 4 (2011), 495--520. Google ScholarDigital Library
- Wenfei Fan, Xibei Jia, Jianzhong Li, and Shuai Ma. 2009. Reasoning about record matching rules. Proc. VLDB Endow. 2, 1 (2009), 407--418. Google ScholarDigital Library
- Wenfei Fan, Shuai Ma, Nan Tang, and Wenyuan Yu. 2014. Interaction between record matching and data repairing. J. Data Inf. Qual. 4, 4 (2014), 16. Google ScholarDigital Library
- Peter A. Flach and Iztok Savnik. 1999. Database dependency discovery: A machine learning approach. AI Commun. 12, 3 (1999), 139--160. Google ScholarDigital Library
- Jaffer Gardezi, Leopoldo Bertossi, and Iluju Kiringa. 2011. Matching dependencies with arbitrary attribute values: Semantics, query answering, and integrity constraints. In Proceedings of the International Workshop on Logic in Databases. ACM, 23--30. Google ScholarDigital Library
- Jaffer Gardezi, Leopoldo Bertossi, and Iluju Kiringa. 2012. Matching dependencies: Semantics and query answering. Front. Comput. Sci. 6 (2012), 278--292.Google Scholar
- Mauricio A. Hernández and Salvatore J. Stolfo. 1995. The merge/purge problem for large databases. In Proceedings of the International Conference on Management of Data (SIGMOD’95). 127--138. Google ScholarDigital Library
- Ykä Huhtala, Juha Kärkkäinen, Pasi Porkka, and Hannu Toivonen. 1999. TANE: An efficient algorithm for discovering functional and approximate dependencies. Comput. J. 42, 2 (1999), 100--111.Google ScholarCross Ref
- Ron Kohavi. 1995. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI’95), Vol. 14. 1137--1145. Google ScholarDigital Library
- Ioannis Koumarelas, Thorsten Papenbrock, and Felix Naumann. 2020. MDedup: Duplicate detection with matching dependencies. Proc. VLDB Endow. 13, 5 (2020), 712--725.Google ScholarDigital Library
- Michael Levandowsky and David Winter. 1971. Distance between sets. Nature 234, 5323 (1971), 34--35.Google Scholar
- Vladimir I. Levenshtein. 1966. Binary codes capable of correcting deletions, insertions, and reversals. In Soviet Physics Doklady, Vol. 10. 707--710.Google Scholar
- Guoliang Li, Dong Deng, Jiannan Wang, and Jianhua Feng. 2011. Pass-join: A partition-based method for similarity joins. Proc. VLDB Endow. 5, 3 (2011), 253--264. Google ScholarDigital Library
- Renée J. Miller, Mauricio A. Hernández, Laura M. Haas, Ling-Ling Yan, C. T. Howard Ho, Ronald Fagin, and Lucian Popa. 2001. The Clio project: Managing heterogeneity. SIGMOD Rec. 30, 1 (2001), 78--83. Google ScholarDigital Library
- Alvaro E. Monge and Charles Elkan. 1996. The field matching problem: Algorithms and applications. In Proceedings of the International Conference on Knowledge Discovery and Data Mining (SIGKDD’96). 267--270. Google ScholarDigital Library
- Thorsten Papenbrock, Tanja Bergmann, Moritz Finke, Jakob Zwiener, and Felix Naumann. 2015. Data profiling with Metanome. Proc. VLDB Endow. 8, 12 (2015), 1860--1871. Google ScholarDigital Library
- Thorsten Papenbrock and Felix Naumann. 2016. A hybrid approach to functional dependency discovery. In Proceedings of the International Conference on Management of Data (SIGMOD’16). 821--833. Google ScholarDigital Library
- Thorsten Papenbrock and Felix Naumann. 2017. Data-driven schema normalization. In Proceedings of the International Conference on Extending Database Technology (EDBT’17). 342--353.Google Scholar
- Glenn Norman Paulley. 2000. Exploiting Functional Dependence in Query Optimization. Ph.D. Dissertation. Waterloo, Ont., Canada.Google Scholar
- Erhard Rahm and Philip A. Bernstein. 2001. A survey of approaches to automatic schema matching. VLDB J. 10, 4 (2001), 334--350. Google ScholarDigital Library
- Khalid Sayood. 2017. Introduction to Data Compression (5th ed.). Morgan Kaufmann. Google ScholarDigital Library
- Rohit Singh, Venkata Vamsikrishna Meduri, Ahmed K. Elmagarmid, Samuel Madden, Paolo Papotti, Jorge-Arnulfo Quiané-Ruiz, Armando Solar-Lezama, and Nan Tang. 2017. Synthesizing entity matching rules by examples. Proc. VLDB Endow. 11, 2 (2017), 189--202. Google ScholarDigital Library
- Shaoxu Song and Lei Chen. 2009. Discovering matching dependencies. In Proceedings of the International Conference on Information and Knowledge Management (CIKM’09). ACM, 1421--1424. Google ScholarDigital Library
- Shaoxu Song and Lei Chen. 2013. Efficient discovery of similarity constraints for matching dependencies. Data Knowl. Eng. 87 (2013), 146--166. Google ScholarDigital Library
- Umair ul Hassan, Sean O’Riain, and Edward Curry. 2012. Leveraging matching dependencies for guided user feedback in linked data applications. In Proceedings of the International Workshop on Information Integration on the Web (IIWeb’12). 1--6. Google ScholarDigital Library
- Jiannan Wang, Guoliang Li, and Jianhua Feng. 2012. Can we beat the prefix filtering?: An adaptive framework for similarity join and search. In Proceedings of the International Conference on Management of Data (SIGMOD’12). ACM, 85--96. Google ScholarDigital Library
- Jiannan Wang, Guoliang Li, Jeffrey Xu Yu, and Jianhua Feng. 2011. Entity matching: How similar is similar. Proc. VLDB Endow. 4, 10 (2011), 622--633. Google ScholarDigital Library
- Yihan Wang, Shaoxu Song, Lei Chen, Jeffrey Xu Yu, and Hong Cheng. 2017. Discovering conditional matching rules. ACM Trans. Knowl. Disc. Data 11, 4 (2017), 46. Google ScholarDigital Library
- Chuan Xiao, Wei Wang, and Xuemin Lin. 2008. Ed-Join: An efficient algorithm for similarity joins with edit distance constraints. Proc. VLDB Endow. 1, 1 (2008), 933--944. Google ScholarDigital Library
Index Terms
- Efficient Discovery of Matching Dependencies
Recommendations
A Hybrid Approach to Functional Dependency Discovery
SIGMOD '16: Proceedings of the 2016 International Conference on Management of DataFunctional dependencies are structural metadata that can be used for schema normalization, data integration, data cleansing, and many other data management tasks. Despite their importance, the functional dependencies of a specific dataset are usually ...
Efficient distributed discovery of bidirectional order dependencies
AbstractBidirectional order dependencies (bODs) capture order relationships between lists of attributes in a relational table. They can express that, for example, sorting books by publication date in ascending order also sorts them by age in descending ...
Approximate Discovery of Functional Dependencies for Large Datasets
CIKM '16: Proceedings of the 25th ACM International on Conference on Information and Knowledge ManagementFunctional dependencies (FDs) are an important prerequisite for various data management tasks, such as schema normalization, query optimization, and data cleansing. However, automatic FD discovery entails an exponentially growing search and solution ...
Comments