Automatic complex schema matching across Web query interfaces: A correlation mining approach

Authors:
Bin He, University of Illinois at Urbana-Champaign, Urbana, IL
Kevin Chen-Chuan Chang, University of Illinois at Urbana-Champaign, Urbana, IL

ACM Transactions on Database Systems, Volume 31, Issue 1, pp. 346–395. https://doi.org/10.1145/1132863.1132872

Published: 01 March 2006

Abstract

To enable information integration, schema matching is a critical step for discovering semantic correspondences of attributes across heterogeneous sources. While complex matchings are common, because of their far more complex search space, most existing techniques focus on simple 1:1 matchings. To tackle this challenge, this article takes a conceptually novel approach by viewing schema matching as correlation mining, for our task of matching Web query interfaces to integrate the myriad databases on the Internet. On this “deep Web,” query interfaces generally form complex matchings between attribute groups (e.g., {author} corresponds to {first name, last name} in the Books domain). We observe that the co-occurrence patterns across query interfaces often reveal such complex semantic relationships: grouping attributes (e.g., {first name, last name}) tend to be co-present in query interfaces and are thus positively correlated. In contrast, synonym attributes are negatively correlated because they rarely co-occur. This insight enables us to discover complex matchings by a correlation mining approach. In particular, we develop the DCM framework, which consists of data preprocessing, dual mining of positive and negative correlations, and finally matching construction. We evaluate the DCM framework on manually extracted interfaces, and the results show good accuracy for discovering complex matchings. Further, to automate the entire matching process, we incorporate automatic techniques for interface extraction. Executing the DCM framework on automatically extracted interfaces, we find that the inevitable errors in automatic interface extraction may significantly affect the matching result. To make the DCM framework robust against such “noisy” schemas, we integrate it with a novel “ensemble” approach, which creates an ensemble of DCM matchers by randomizing the schema data into many trials and aggregating their ranked results by majority voting. As a principled basis, we provide analytic justification of the robustness of the ensemble approach. Empirically, our experiments show that the “ensemblization” indeed significantly boosts the matching accuracy over automatically extracted and thus noisy schema data. By employing the DCM framework with the ensemble approach, we thus complete an automatic process of matching Web query interfaces.
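
The co-occurrence observation in the abstract can be illustrated with a small sketch. This is not the DCM framework itself: the abstract does not give the correlation measures the article uses, so the plain Jaccard coefficient is swapped in as a stand-in, and the toy schemas, the function names (`jaccard`, `mine_pairs`), and all thresholds are assumptions made up for illustration. The sketch only shows the idea that attribute pairs which nearly always co-occur are grouping candidates, while frequent attributes that almost never co-occur are synonym candidates.

```python
from itertools import combinations

# Toy query-interface schemas for a hypothetical Books domain (made-up data).
SCHEMAS = [
    {"author", "title", "isbn"},
    {"first name", "last name", "title"},
    {"author", "title", "subject"},
    {"first name", "last name", "title", "isbn"},
    {"author", "category"},
    {"first name", "last name", "subject"},
]


def jaccard(a, b, schemas):
    """Share of schemas containing both attributes among those containing either."""
    both = sum(1 for s in schemas if a in s and b in s)
    either = sum(1 for s in schemas if a in s or b in s)
    return both / either if either else 0.0


def mine_pairs(schemas, min_support=3, pos_threshold=0.8, neg_threshold=0.1):
    """Return (grouping candidates, synonym candidates) among frequent attributes.

    All three thresholds are illustrative assumptions, not values from the article.
    """
    # Count how many schemas each attribute appears in.
    counts = {}
    for schema in schemas:
        for attr in schema:
            counts[attr] = counts.get(attr, 0) + 1

    # Rare attributes trivially "never co-occur", so keep only frequent ones.
    frequent = sorted(a for a, c in counts.items() if c >= min_support)

    grouping, synonyms = [], []
    for a, b in combinations(frequent, 2):
        score = jaccard(a, b, schemas)
        if score >= pos_threshold:
            grouping.append((a, b))   # positively correlated: co-present
        elif score <= neg_threshold:
            synonyms.append((a, b))   # negatively correlated: rarely co-occur
    return grouping, synonyms


grouping, synonyms = mine_pairs(SCHEMAS)
print("grouping candidates:", grouping)
print("synonym candidates:", synonyms)
```

On this toy data, `first name` and `last name` come out as a grouping candidate, and `author` comes out as a synonym candidate for each of them, which mirrors the {author} versus {first name, last name} example above. Assembling such pairs into n-ary complex matchings, and making the result robust to extraction noise through the ensemble step, are parts of the article that this sketch does not attempt.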

Index Terms

1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems
2. Information systems
  1. Data management systems
  2. Information systems applications
    1. Data mining

Published in

ACM Transactions on Database Systems, Volume 31, Issue 1
March 2006
438 pages
ISSN: 0362-5915
EISSN: 1557-4644
DOI: 10.1145/1132863

Copyright © 2006 ACM

Publisher
Association for Computing Machinery
New York, NY, United States

Author Tags
Data integration
bagging predictors
correlation mining
deep Web
ensemble
schema matching