DOI: 10.1145/375663.375731
Article

Reconciling schemas of disparate data sources: a machine-learning approach

Published: 01 May 2001

ABSTRACT

A data-integration system provides access to a multitude of data sources through a single mediated schema. A key bottleneck in building such systems has been the laborious manual construction of semantic mappings between the source schemas and the mediated schema. We describe LSD, a system that employs and extends current machine-learning techniques to semi-automatically find such mappings. LSD first asks the user to provide the semantic mappings for a small set of data sources, then uses these mappings together with the sources to train a set of learners. Each learner exploits a different type of information either in the source schemas or in their data. Once the learners have been trained, LSD finds semantic mappings for a new data source by applying the learners, then combining their predictions using a meta-learner. To further improve matching accuracy, we extend machine learning techniques so that LSD can incorporate domain constraints as an additional source of knowledge, and develop a novel learner that utilizes the structural information in XML documents. Our approach thus is distinguished in that it incorporates multiple types of knowledge. Importantly, its architecture is extensible to additional learners that may exploit new kinds of information. We describe a set of experiments on several real-world domains, and show that LSD proposes semantic mappings with a high degree of accuracy.
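The train-then-combine flow described in the abstract can be illustrated with a small sketch. This is not LSD's implementation: the two toy learners, the fixed combination weights, the label names, and the sample data are all assumptions made for illustration. In particular, the abstract says a meta-learner combines the predictions, whereas this sketch uses hand-set weights in place of learned ones.

```python
from collections import Counter, defaultdict


def tokens(text):
    """Lowercase alphanumeric tokens; a deliberately crude tokenizer."""
    return "".join(c.lower() if c.isalnum() else " " for c in text).split()


def normalize(scores):
    """Rescale non-negative scores so they sum to 1."""
    total = sum(scores.values())
    return {label: s / total for label, s in scores.items()}


class NameLearner:
    """Hypothetical learner that looks only at element names (schema information)."""

    def fit(self, examples):
        self.seen = defaultdict(set)
        for name, _value, label in examples:
            self.seen[label].update(tokens(name))
        return self

    def predict(self, name, _value):
        toks = set(tokens(name))
        # Token overlap with names seen for each label; tiny floor avoids all-zero scores.
        return normalize({lab: len(toks & s) + 1e-6 for lab, s in self.seen.items()})


class ContentLearner:
    """Hypothetical learner that looks only at data values (naive-Bayes style, add-one smoothing)."""

    def fit(self, examples):
        self.counts = defaultdict(Counter)
        for _name, value, label in examples:
            self.counts[label].update(tokens(value))
        self.vocab_size = len({t for c in self.counts.values() for t in c})
        return self

    def predict(self, _name, value):
        scores = {}
        for lab, c in self.counts.items():
            total = sum(c.values())
            p = 1.0
            for t in tokens(value):
                p *= (c[t] + 1) / (total + self.vocab_size)
            scores[lab] = p
        return normalize(scores)


def combine(predictions, weights):
    """Stand-in for the meta-learner: a weighted sum of base-learner scores.
    The weights are fixed by hand here (an assumption), not learned."""
    labels = predictions[0].keys()
    return normalize(
        {lab: sum(w * p[lab] for w, p in zip(weights, predictions)) for lab in labels}
    )


# User-provided mappings for a training source: (element name, sample value, mediated label).
training = [
    ("listed-price", "$250,000", "PRICE"),
    ("contact-phone", "(206) 555 0199", "PHONE"),
    ("house-addr", "12 Pine St, Seattle WA", "ADDRESS"),
]
learners = [NameLearner().fit(training), ContentLearner().fit(training)]

# An element from a new, unmapped source.
new_name, new_value = "asking-price", "$310,000"
combined = combine([l.predict(new_name, new_value) for l in learners], weights=[0.5, 0.5])
best = max(combined, key=combined.get)
print(best)  # PRICE
```

Each learner sees a different kind of evidence (element names versus data values), so one can still contribute a useful score when the other has little to go on. A further learner could be added to the `learners` list without changing `combine`, provided a matching weight is supplied.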

Published in

SIGMOD '01: Proceedings of the 2001 ACM SIGMOD international conference on Management of data
May 2001
630 pages
ISBN: 1581133324
DOI: 10.1145/375663

Copyright © 2001 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Permissions may be requested from ACM.

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

SIGMOD '01 Paper Acceptance Rate: 44 of 293 submissions, 15%
Overall Acceptance Rate: 785 of 4,003 submissions, 20%