ABSTRACT
A data-integration system provides access to a multitude of data sources through a single mediated schema. A key bottleneck in building such systems has been the laborious manual construction of semantic mappings between the source schemas and the mediated schema. We describe LSD, a system that employs and extends current machine-learning techniques to semi-automatically find such mappings. LSD first asks the user to provide the semantic mappings for a small set of data sources, then uses these mappings together with the sources to train a set of learners. Each learner exploits a different type of information, either in the source schemas or in their data. Once the learners have been trained, LSD finds semantic mappings for a new data source by applying the learners, then combining their predictions using a meta-learner. To further improve matching accuracy, we extend machine-learning techniques so that LSD can incorporate domain constraints as an additional source of knowledge, and we develop a novel learner that utilizes the structural information in XML documents. Our approach is thus distinguished by its use of multiple types of knowledge. Importantly, its architecture is extensible to additional learners that may exploit new kinds of information. We describe a set of experiments on several real-world domains, and show that LSD proposes semantic mappings with a high degree of accuracy.
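The train-then-combine workflow described above can be illustrated with a minimal Python sketch. It is not LSD's implementation: the class `TokenOverlapLearner`, the function `combine`, the toy labels (PHONE, PRICE, ADDRESS), the sample elements, and the fixed equal weights are all assumptions made for illustration. A simple weighted average stands in for the meta-learner, and domain constraints and XML structure are left out.

```python
import re
from collections import Counter, defaultdict


def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class TokenOverlapLearner:
    """Hypothetical base learner: scores a mediated-schema label by how many
    tokens of one field ("name" or "values") were seen with that label."""

    def __init__(self, field):
        self.field = field
        self.vocab = defaultdict(Counter)

    def train(self, examples):
        # examples: list of (source_element, mediated_label) pairs
        # taken from the manually mapped sources.
        for element, label in examples:
            self.vocab[label].update(tokens(element[self.field]))

    def predict(self, element):
        toks = tokens(element[self.field])
        # Add-one smoothing so that every label keeps a nonzero score.
        raw = {label: 1 + sum(1 for t in toks if t in counts)
               for label, counts in self.vocab.items()}
        total = sum(raw.values())
        return {label: score / total for label, score in raw.items()}


def combine(predictions, weights):
    """Stand-in meta-learner (assumption): weighted average of the
    base learners' score distributions."""
    labels = predictions[0].keys()
    combined = {label: sum(w * p[label] for w, p in zip(weights, predictions))
                for label in labels}
    return max(combined, key=combined.get), combined


# Toy "user-provided" mappings for an already-mapped source (made up).
training = [
    ({"name": "contact-phone", "values": "(206) 555 0101"}, "PHONE"),
    ({"name": "listed-price", "values": "$ 250,000"}, "PRICE"),
    ({"name": "house-addr", "values": "12 Main St Seattle WA"}, "ADDRESS"),
]

# One learner looks at schema information, the other at data values.
learners = [TokenOverlapLearner("name"), TokenOverlapLearner("values")]
for learner in learners:
    learner.train(training)

# An element from a new, unmapped source.
new_element = {"name": "agent-phone", "values": "(206) 555 0199"}
best, scores = combine([l.predict(new_element) for l in learners],
                       weights=[0.5, 0.5])
print(best, scores)
```

In this toy run both learners favor PHONE for the new element, so the combined prediction is PHONE. Adding a further learner only requires another object with `train` and `predict` methods and one more weight, which mirrors the extensibility the abstract emphasizes.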