DOI: 10.1145/584792.584841
Article

XClust: clustering XML schemas for effective integration

Published: 04 November 2002

ABSTRACT

It is increasingly important to develop scalable integration techniques for the growing number of XML data sources. A practical starting point for integrating large numbers of Document Type Definitions (DTDs) of XML sources is to first find clusters of DTDs that are similar in structure and semantics. Reconciling similar DTDs within such a cluster is an easier task than reconciling DTDs that differ in structure and semantics, as the latter involves more restructuring. We introduce XClust, a novel integration strategy that involves the clustering of DTDs. A matching algorithm based on the semantics, immediate descendants and leaf-context similarity of DTD elements is developed. Our experiments to integrate real-world DTDs demonstrate the effectiveness of the XClust approach.
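The two ideas in the abstract, scoring similarity from several components and then grouping similar DTDs, can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not say how the three component scores are combined or which clustering method is used, so the weighted sum, the equal weights, the average-linkage merging, the threshold, and all names and similarity values below are assumptions made for illustration.

```python
from itertools import combinations


def element_similarity(semantic, descendant, leaf_context,
                       weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine three component scores in [0, 1] into one score.

    The weighted sum and the equal default weights are hypothetical;
    the abstract only names the three components.
    """
    w_sem, w_desc, w_leaf = weights
    return w_sem * semantic + w_desc * descendant + w_leaf * leaf_context


def cluster_dtds(names, sim, threshold):
    """Greedy agglomerative clustering with average linkage (an assumption).

    `sim` maps frozenset({a, b}) to a pairwise DTD similarity in [0, 1].
    The two most similar clusters are merged repeatedly until the best
    available similarity falls below `threshold`.
    """
    clusters = [[name] for name in names]

    def average_similarity(a, b):
        total = sum(sim[frozenset((x, y))] for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the pair of clusters with the highest average similarity.
        (i, j), best = max(
            (((i, j), average_similarity(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair_and_score: pair_and_score[1],
        )
        if best < threshold:
            break
        # i < j always holds here, so deleting index j leaves index i valid.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Made-up pairwise similarities between three hypothetical DTDs.
example_sim = {
    frozenset(("hotelA", "hotelB")): 0.8,
    frozenset(("hotelA", "bib")): 0.1,
    frozenset(("hotelB", "bib")): 0.2,
}
example_clusters = cluster_dtds(["hotelA", "hotelB", "bib"], example_sim, 0.5)
print(example_clusters)  # [['hotelA', 'hotelB'], ['bib']]
```

With these made-up values the two hotel DTDs end up in one cluster and the bibliography DTD stays on its own, so integration would start by reconciling only the two similar schemas.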

          • Published in

            CIKM '02: Proceedings of the eleventh international conference on Information and knowledge management
            November 2002
            704 pages
            ISBN:1581134924
            DOI:10.1145/584792

            Copyright © 2002 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Acceptance Rates

            Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
