ABSTRACT
Creating semantic matches between disparate data sources is fundamental to numerous data sharing efforts. Manually creating matches is extremely tedious and error-prone. Hence many recent works have focused on automating the matching process. To date, however, virtually all of these works deal only with one-to-one (1-1) matches, such as address = location. They do not consider the important class of more complex matches, such as address = concat(city, state) and room-price = room-rate * (1 + tax-rate). We describe the iMAP system, which semi-automatically discovers both 1-1 and complex matches. iMAP reformulates schema matching as a search in an often very large or infinite match space. To search effectively, it employs a set of searchers, each discovering specific types of complex matches. To further improve matching accuracy, iMAP exploits a variety of domain knowledge, including past complex matches, domain integrity constraints, and overlap data. Finally, iMAP introduces a novel feature that generates explanations of predicted matches, to provide insights into the matching process and suggest actions to converge on correct matches quickly. We apply iMAP to several real-world domains to match relational tables, and show that it discovers both 1-1 and complex matches with high accuracy.
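To make the idea of searching a match space concrete, the following is a minimal sketch of one hypothetical searcher that looks only for concatenation matches such as address = concat(city, state). It is not iMAP's actual algorithm: the function names, the ", " separator, the two-column limit, and the exact-overlap scoring are all assumptions made for illustration, and the scoring presumes the two sources describe some of the same entities (overlap data).

```python
from itertools import permutations

def concat_candidates(source, max_cols=2, sep=", "):
    """Yield (column names, derived values) for concatenations of source columns."""
    for k in range(1, max_cols + 1):
        for cols in permutations(source, k):
            # Join the chosen columns row by row with the assumed separator.
            values = [sep.join(parts) for parts in zip(*(source[c] for c in cols))]
            yield cols, values

def overlap_score(candidate_values, target_values):
    """Fraction of distinct target values that the candidate reproduces exactly."""
    target = set(target_values)
    if not target:
        return 0.0
    return len(target & set(candidate_values)) / len(target)

def best_concat_match(source, target_values, max_cols=2):
    """Return the best-scoring (columns, score) pair among the enumerated candidates."""
    scored = ((cols, overlap_score(vals, target_values))
              for cols, vals in concat_candidates(source, max_cols))
    return max(scored, key=lambda pair: pair[1])

# Toy data (hypothetical): the source schema has city and state, the target has address.
source = {
    "city": ["Seattle", "Urbana"],
    "state": ["WA", "IL"],
}
address = ["Seattle, WA", "Urbana, IL"]

cols, score = best_concat_match(source, address)
print(cols, score)
```

On this toy data the candidate over (city, state) reproduces every address value, so it is the one returned. Exhaustive enumeration is only feasible here because the candidate space is capped at two columns; a realistic space of such expressions grows far too quickly for this, which is the motivation the abstract gives for specialized searchers.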
- iMAP: discovering complex semantic matches between database schemas