ABSTRACT
It is increasingly important to develop scalable integration techniques for the growing number of XML data sources. A practical starting point for integrating large numbers of Document Type Definitions (DTDs) of XML sources is to first find clusters of DTDs that are similar in structure and semantics. Reconciling similar DTDs within such a cluster is easier than reconciling DTDs that differ in structure and semantics, because the latter requires more restructuring. We introduce XClust, a novel integration strategy that involves the clustering of DTDs. We develop a matching algorithm based on the semantics, immediate descendants, and leaf-context similarity of DTD elements. Our experiments on integrating real-world DTDs demonstrate the effectiveness of the XClust approach.
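The clustering step can be illustrated with a minimal sketch. It assumes that pairwise DTD similarity scores have already been computed; the abstract does not state how the semantic, immediate-descendant, and leaf-context components are combined, so the scores below are invented for illustration. The function name `cluster_by_similarity`, the average-link merge rule, and the threshold are likewise assumptions, not necessarily the procedure used by XClust.

```python
from itertools import combinations


def cluster_by_similarity(names, sim, threshold):
    """Greedy agglomerative clustering over a pairwise similarity table.

    names:     list of DTD identifiers (hypothetical labels)
    sim:       dict mapping frozenset({a, b}) to a similarity in [0, 1]
    threshold: merging stops once the best average-link similarity
               between any two clusters falls below this value
    """
    clusters = [[n] for n in names]  # start with one cluster per DTD

    def avg_sim(a, b):
        # Average-link similarity: mean over all cross-cluster pairs.
        total = sum(sim[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the most similar pair of clusters (i < j always holds).
        (i, j), best = max(
            (((i, j), avg_sim(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair_score: pair_score[1],
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Invented similarity scores for four hypothetical DTDs.
names = ["hotel_a", "hotel_b", "bib_a", "bib_b"]
sim = {
    frozenset(("hotel_a", "hotel_b")): 0.9,
    frozenset(("bib_a", "bib_b")): 0.8,
    frozenset(("hotel_a", "bib_a")): 0.1,
    frozenset(("hotel_a", "bib_b")): 0.1,
    frozenset(("hotel_b", "bib_a")): 0.1,
    frozenset(("hotel_b", "bib_b")): 0.1,
}

print(cluster_by_similarity(names, sim, threshold=0.5))
```

With these made-up scores and a threshold of 0.5, the two hotel DTDs end up in one cluster and the two bibliography DTDs in another. Each cluster could then be integrated separately, which is the motivation given above.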
Index Terms
- XClust: clustering XML schemas for effective integration