skip to main content
article

Automatic complex schema matching across Web query interfaces: A correlation mining approach

Published:01 March 2006Publication History
Skip Abstract Section

Abstract

To enable information integration, schema matching is a critical step for discovering semantic correspondences of attributes across heterogeneous sources. While complex matchings are common, because of their far more complex search space, most existing techniques focus on simple 1:1 matchings. To tackle this challenge, this article takes a conceptually novel approach by viewing schema matching as correlation mining, for our task of matching Web query interfaces to integrate the myriad databases on the Internet. On this “deep Web ” query interfaces generally form complex matchings between attribute groups (e.g., {author} corresponds to {first name, last name} in the Books domain). We observe that the co-occurrences patterns across query interfaces often reveal such complex semantic relationships: grouping attributes (e.g., {first name, last name}) tend to be co-present in query interfaces and thus positively correlated. In contrast, synonym attributes are negatively correlated because they rarely co-occur. This insight enables us to discover complex matchings by a correlation mining approach. In particular, we develop the DCM framework, which consists of data preprocessing, dual mining of positive and negative correlations, and finally matching construction. We evaluate the DCM framework on manually extracted interfaces and the results show good accuracy for discovering complex matchings. Further, to automate the entire matching process, we incorporate automatic techniques for interface extraction. Executing the DCM framework on automatically extracted interfaces, we find that the inevitable errors in automatic interface extraction may significantly affect the matching result. To make the DCM framework robust against such “noisy” schemas, we integrate it with a novel “ensemble” approach, which creates an ensemble of DCM matchers, by randomizing the schema data into many trials and aggregating their ranked results by taking majority voting. As a principled basis, we provide analytic justification of the robustness of the ensemble approach. Empirically, our experiments show that the “ensemblization” indeed significantly boosts the matching accuracy, over automatically extracted and thus noisy schema data. By employing the DCM framework with the ensemble approach, we thus complete an automatic process of matchings Web query interfaces.

References

  1. Agrawal, R., Imielinski, T., and Swami, A. N. 1993. Mining association rules between sets of items in large databases. In Proceedings of the SIGMOD 1993 Conference. ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Anderson, D. R., Sweeney, D. J., and Williams, T. A. 1984. Statistics for Business and Economics (Second Edition). West Publishing Company.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Batini, C., Lenzerini, M., and Navathe, S. B. 1986. A comparative analysis of methodologies for database schema integration. ACM Comput. Surv. 18, 4, 323--364.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Bergman, M. K. 2000. The deep web: Surfacing hidden value. Tech. rep., BrightPlanet LLC. Dec.]]Google ScholarGoogle Scholar
  5. Borda, J. C. 1781. Mémoire sur les élections au scrutin. Histoire de l'Académie Royale des Sciences.]]Google ScholarGoogle Scholar
  6. Breiman, L. 1996. Bagging predictors. Machine Learning 24, 2, 123--140.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Brin, S., Motwani, R., and Silverstein, C. 1997. Beyond market baskets: Generalizing association rules to correlations. In Proceedings of the SIGMOD 1997 Conference. ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Brunk, H. D. 1965. An Introduction to Mathematical Statistics. Blaisdell Publishing Company, New York.]]Google ScholarGoogle Scholar
  9. Chang, K. C.-C., He, B., Li, C., Patel, M., and Zhang, Z. 2004. Structured databases on the web: Observations and implications. SIGMOD Record 33, 3, 61--70.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Chang, K. C.-C., He, B., Li, C., and Zhang, Z. 2003. The UIUC web integration repository. Computer Science Department, University of Illinois at Urbana-Champaign. http://metaquerier.cs.uiuc.edu/repository.]]Google ScholarGoogle Scholar
  11. Chang, K. C.-C., He, B., and Zhang, Z. 2005. Toward large scale integration: Building a metaquerier over databases on the web. In Proceedings of the CIDR 2005 Conference.]]Google ScholarGoogle Scholar
  12. Chaudhuri, S., Ganjam, K., Ganti, V., and Motwani, R. 2003. Robust and efficient fuzzy match for online data cleaning. In Proceedings of the SIGMOD 2003 Conference. ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Dhamankar, R., Lee, Y., Doan, A., Halevy, A., and Domingos, P. 2004. imap: Discovering complex semantic matches between database schemas. In Proceedings of the SIGMOD 2004 Conference. ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Diaconis, P. and Graham, R. 1977. Spearman's footrule as a measure of disarray. J. Roy. Statis. Soc. Ser. B 39, 2, 262--268.]]Google ScholarGoogle ScholarCross RefCross Ref
  15. Doan, A., Domingos, P., and Halevy, A. Y. 2001. Reconciling schemas of disparate data sources: A machine-learning approach. In Proceedings of the SIGMOD 2001 Conference. ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Dwork, C., Kumar, R., Naor, M., and Sivakumar, D. 2001. Rank aggregation methods for the web. In Proceedings of the WWW 2001 Conference.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Fagin, R., Kumar, R., and Sivakumar, D. 2003. Efficient similarity search and classification via rank aggregation. In Proceedings of the SIGMOD 2003 Conference. ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Goodman, L. and Kruskal, W. 1979. Measures of Association for Cross Classification. Springer-Verlag, New York.]]Google ScholarGoogle Scholar
  19. He, B. and Chang, K. C.-C. 2003. Statistical schema matching across web query interfaces. In Proceedings of the SIGMOD 2003 Conference. ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. He, B. and Chang, K. C.-C. 2005. Making holistic schema matching robust: An ensemble approach. In Proceedings of the SIGKDD 2005 Conference. ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. He, B., Chang, K. C.-C., and Han, J. 2004. Discovering complex matchings across web query interfaces: A correlation mining approach. In Proceedings of the SIGKDD 2004 Conference. ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. He, H., Meng, W., Yu, C., and Wu, Z. 2003. Wise-integrator: An automatic integrator of web search interfaces for e-commerce. In Proceedings of the VLDB 2003 Conference.]]Google ScholarGoogle Scholar
  23. Ipeirotis, P. G., Gravano, L., and Sahami, M. 2001. Probe, count, and classify: Categorizing hidden web databases. In Proceedings of the SIGMOD 2001 Conference. ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Kemeny, J. G. 1959. Mathematics without numbers. Daedalus 88, 571--591.]]Google ScholarGoogle Scholar
  25. Langley, P. 1995. Elements of Machine Learning. Morgan-Kaufmann. Ban Francisco, CA.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Lee, Y.-K., Kim, W.-Y., Cai, Y. D., and Han, J. 2003. Comine: Efficient mining of correlated patterns. In Proceedings of the 2003 International Conference Data Mining.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Madhavan, J., Bernstein, P. A., and Rahm, E. 2001. Generic schema matching with cupid. In Proceedings of the VLDB 2001 Conference. 49--58.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Melnik, S., Garcia-Molina, H., and Rahm, E. 2002. Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In Proceedings of the ICDE 2002 Conference.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Omiecinski, E. 2003. Alternative interest measures for mining associations. IEEE Trans. Knowl. Data Eng. 15, 57--69.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Porter, M. The porter stemming algorithm. Accessible at http://www.tartarus.org/~martin/Porter Stemmer.]]Google ScholarGoogle Scholar
  31. Rahm, E. and Bernstein, P. A. 2001. A survey of approaches to automatic schema matching. VLDB J. 10, 4, 334--350.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Seligman, L., Rosenthal, A., Lehner, P., and Smith, A. 2002. Data integration: Where does the time go? Bull. Tech. Comm. Data Engr. 25, 3.]]Google ScholarGoogle Scholar
  33. Tan, P., Kumar, V., and Srivastava, J. 2002. Selecting the right interestingness measure for association patterns. In Proceedings of the SIGKDD 2002 Conference.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Wang, J., Wen, J.-R., Lochovsky, F., and Ma, W.-Y. 2004. Instance-based schema matching for web databases by domain-specific query probing. In Proceedings of the VLDB 2004 Conference.]]Google ScholarGoogle Scholar
  35. Wu, W., Yu, C. T., Doan, A., and Meng, W. 2004. An interactive clustering-based approach to integrating source query interfaces on the deep web. In Proceedings of the SIGMOD 2004 Conference. ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Young, H. P. 1974. An axiomatization of borda's rule. J. Economic Theory 9, 43--52.]]Google ScholarGoogle ScholarCross RefCross Ref
  37. Young, H. P. 1988. Condorcet's theory of voting. American Political Science Review 82, 1231--1244.]]Google ScholarGoogle ScholarCross RefCross Ref
  38. Zhang, Z., He, B., and Chang, K. C.-C. 2004. Understanding web query interfaces: Best-effort parsing with hidden syntax. In Proceedings of the SIGMOD 2004 Conference. ACM, New York.]] Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Zhang, Z., He, B., and Chang, K. C.-C. 2005. Light-weight domain-based form assistant: Querying web databases on the fly. In Proceedings of the VLDB 2005 Conference.]] Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Automatic complex schema matching across Web query interfaces: A correlation mining approach

            Recommendations

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in

            Full Access

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader