XML schema clustering with semantic and hierarchical similarity measures

https://doi.org/10.1016/j.knosys.2006.08.006

Abstract

With the growing popularity of XML as a data representation language, collections of XML data have exploded in number. Methods are required to manage these collections and to discover useful information from them for improved document handling. We present a schema clustering process that organises heterogeneous XML schemas into groups. The methodology considers not only the linguistic and contextual similarity of the elements but also their hierarchical structural similarity. We support our findings with experiments and analysis.

Introduction

XML has become a standard for information exchange and retrieval [34]. With the continuous growth of XML data, the ability to manage massive collections of XML data and to discover knowledge from them has become essential for Web-based information systems [15], [25]. A possible solution is to group similar XML data based on their context and structure. Clustering XML data facilitates a number of advanced applications such as improved information retrieval, data and schema integration, document classification analysis, structure summary and indexing, and query processing and optimization [6], [23].

The clustering data mining process categorizes XML data based on their similarity without prior knowledge of the taxonomy. A number of clustering methods exist for (unstructured) database objects and text data [3], [36]. XML data is different: it is semistructured and hierarchical [34]. There are two types of XML data: XML documents and XML schemas. An XML schema describes the structure of an XML document. Usually, an XML document's schema can be obtained separately without scanning the whole document. Therefore, a method for clustering XML documents should take advantage of their schemas.

The similarity of corresponding elements between XML documents can be computed efficiently using the relevant XML schemas. The document schema provides a definitive description of the XML document, while document instances only give a snapshot of what the XML document may contain. The document definition outlined in a schema holds true for all document instances of that schema. The result of clustering schemas therefore holds true for all document instances of those schemas and can be reused for any other instances. In contrast, the result of clustering document instances holds true only for the instances included, and the clustering process must be repeated for any other document instances.

This paper presents the XMine methodology, which quantitatively determines the similarity between heterogeneous XML schemas by considering the semantic as well as the hierarchical structural similarity of elements. Similar schemas are clustered into separate meaningful classes. Whilst several XML document and schema clustering techniques are available [4], [6], [9], [11], [24], [26], this paper enhances the task by adding hierarchical similarity to clustering, taking into account the hierarchical positions of individual elements. The XMine methodology can deal with varying schema structures and with varying semantic differences among schema elements.

The contributions of this paper are (1) combining the semantic and syntactic relationships to calculate the linguistic similarity between two element names; (2) calculating the structural similarity between two elements by considering the ancestor–child relationship along with the parent–child relationship in maximal similar paths; and then (3) generalizing a suitable schema class hierarchy to determine the relationships between the discovered schemas in the XMine methodology.
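As a rough illustration of contribution (1), the sketch below combines a semantic score and a syntactic score for two element names into one linguistic similarity value. It is not the XMine formula, which this excerpt does not give. The synonym table, the use of Python's `difflib` ratio as the syntactic measure, the weighted-sum combination and the weight `w_sem` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical synonym table standing in for a thesaurus lookup;
# the entries are illustrative only.
SYNONYMS = {
    frozenset({"author", "writer"}),
    frozenset({"price", "cost"}),
}


def syntactic_sim(a: str, b: str) -> float:
    """String-level similarity in [0, 1], using difflib's matching ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def semantic_sim(a: str, b: str) -> float:
    """1.0 for identical names or listed synonyms, otherwise 0.0."""
    a, b = a.lower(), b.lower()
    if a == b or frozenset({a, b}) in SYNONYMS:
        return 1.0
    return 0.0


def linguistic_sim(a: str, b: str, w_sem: float = 0.5) -> float:
    """Weighted sum of the two scores; the weight w_sem is an assumed parameter."""
    return w_sem * semantic_sim(a, b) + (1.0 - w_sem) * syntactic_sim(a, b)
```

With these assumptions, `linguistic_sim("author", "writer")` is at least 0.5 because the pair is in the synonym table, while two unrelated names score only on their shared characters.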

The performance of XMine is demonstrated using a number of heterogeneous schemas drawn from several application domains. The empirical results demonstrate that the semantic, syntactic and hierarchical relationships of schema elements play important roles in producing good-quality clustering results. Most importantly, the results show that the syntactic similarity measure is more useful than the semantic similarity measure.

The resulting schema class composition hierarchy can serve as a basis for a number of XML application processes. The clusters of schemas provide a hint for building an index structure, and indexing based on structural similarity supports many applications. For example, in the information retrieval field, XML-based search engines can improve the speed and accuracy of retrieving the relevant portions of XML data by using efficient indexes. Moreover, several database tools developed to deliver, store, integrate and query XML data [5], [12], [21], [33] require indexing based on structural similarity to support effective document storage and retrieval.

Moreover, the schema class composition hierarchy can be viewed as a generalization of the training sets of schemas to a super-class that is useful for further XML document classification analysis. Schemas from a number of heterogeneous sources can be classified into this set of predefined schema classes. This process improves XML document handling and makes searches for relevant XML documents more effective and efficient.

Association rule mining can also be applied to find interesting correlations among the metadata available in schemas belonging to the same schema class. The element tags that frequently occur together within a schema class can be used to maximally distinguish one class of schemas from others. This yields a set of association rules for each schema class. Such tag-based association analysis is also useful for discovering common XML structures for a specific domain.
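The first step of such an analysis, finding element tags that frequently occur together within one schema class, might be sketched as follows. This covers only frequent tag pairs, not full association rules, and it is not taken from the paper. The function name, the representation of each schema as a set of tag names and the `min_support` threshold are assumptions.

```python
from collections import Counter
from itertools import combinations


def frequent_tag_pairs(schemas, min_support=0.5):
    """Return {(tag_a, tag_b): support} for tag pairs found in enough schemas.

    `schemas` is a list of tag collections, one per schema in the class.
    Support is the fraction of schemas that contain both tags.
    """
    counts = Counter()
    for tags in schemas:
        # Sorting makes each pair appear in one canonical order.
        counts.update(combinations(sorted(set(tags)), 2))
    total = len(schemas)
    return {
        pair: count / total
        for pair, count in counts.items()
        if count / total >= min_support
    }


# Hypothetical schema class with three schemas.
book_class = [
    {"title", "author", "price"},
    {"title", "author"},
    {"title", "isbn"},
]
pairs = frequent_tag_pairs(book_class, min_support=0.6)
```

Here only the pair (`author`, `title`) clears the threshold, since it appears in two of the three schemas. Rules with confidence values could then be derived from such frequent sets.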

In addition, the schema class hierarchies can facilitate the difficult task of integrating heterogeneous schemas. Integrating similar schemas within each schema class is easier than reconciling schemas that differ in structure and semantics, which would involve a complex restructuring process.

The similarity between two structures is also tied to the challenging task of reusing XML or semi-structured documents. In XML document content reuse, a document (or a part of a document) structured under one schema must be restructured into an instance of a different schema. Identifying the common paths between two schema instances helps with this restructuring.
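A minimal sketch of identifying common paths is given below. It assumes each schema is represented as a nested dictionary of element names and compares root-to-leaf paths by exact name match. The function names and sample schemas are hypothetical, and exact matching is a simplification: the maximal similar paths mentioned among the contributions would also admit similar, not only identical, element names.

```python
def root_to_leaf_paths(tree, prefix=()):
    """Collect every root-to-leaf path of a nested-dict tree as a tuple of names."""
    paths = set()
    for name, children in tree.items():
        path = prefix + (name,)
        if children:
            paths |= root_to_leaf_paths(children, path)
        else:
            paths.add(path)
    return paths


def common_paths(schema_a, schema_b):
    """Paths present in both trees (exact name match, an assumed simplification)."""
    return root_to_leaf_paths(schema_a) & root_to_leaf_paths(schema_b)


# Hypothetical schemas; an empty dict marks a leaf element.
schema_a = {"book": {"title": {}, "author": {"name": {}}, "price": {}}}
schema_b = {"book": {"title": {}, "author": {"name": {}, "email": {}}}}
shared = common_paths(schema_a, schema_b)
```

For these two trees, `shared` contains `("book", "title")` and `("book", "author", "name")`, the parts of a document that could be carried over without restructuring.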

Section snippets

Background knowledge on the XML data

XML is a flexible representation language. There are two varieties of XML data: XML documents and XML schemas. An XML schema provides the data definitions and structure of the XML document [1], while XML documents are instances of a schema that give a snapshot of what the document may contain. A schema specifies which elements are (not) allowed, which attributes an element may have, the number of occurrences of elements, and so on. A schema for a document may be included as both internally and

The XMine methodology

Fig. 3 illustrates the overall architecture of the XMine methodology. This is deployed in three phases, namely preprocessing, data mining, and postprocessing.

The focus of the preprocessing phase is to determine the common and similar features between various schemas in an automated manner to effectively facilitate the clustering process. It includes four stages to address various issues involved in measuring the similarity of schemas. Firstly, the structure analyser analyses the structure of a

Empirical evaluation and discussion

Dataset: Table 6 summarizes the major characteristics of the schema collection used in experiments. Each domain consists of a number of different domain categories that have the structural and semantic differences. The schemas from the same domain also vary in structures and semantics and might not be considered similar enough to be grouped into the same clusters. Fig. 6 illustrates the average similarity degree (using the schemaSim measurement) between schemas in the seven subject domain

Related work

Research on measuring the structural similarity and clustering of XML data is gaining momentum. We show a taxonomy of these approaches in Fig. 15 as broadly classified into structure level and element level based similarity approaches.

The structure-level similarity approaches can be divided into three different research directions: (1) detecting and measuring the structure and content similarities between data; (2) detecting and measuring the structural similarity between data and schema;

Conclusions and future work

The potential benefits of the rich semantics of XML have been recognized widely for enhancing document handling. A schema clustering process improves the document handling process in the digital libraries and XML repositories by organising the heterogeneous schemas into groups. This paper presented the XMine methodology that accurately clusters the schemas by considering both structural and semantic information of elements. The element structural similarity is the hierarchical position of the

References (36)

  • E. Bertino et al.

    A matching algorithm for measuring the structural similarity between an XML document and a DTD and its applications

    Information Systems

    (2004)
  • S. Abiteboul et al.

    Data on the Web: From Relations to Semistructured Data and XML

    (2000)
  • R. Agrawal, R. Srikant, 1996, Mining Sequential Patterns: Generalizations and Performance Improvements. Paper presented...
  • P. Berkhin, 2002. Survey of Clustering Data Mining Techniques: Technical Report, Accrue Software, San Jose,...
  • S. Boag, D. Chamberlin, M. Fernández, D. Florescu, J. Robie, J. Siméon, XQuery 1.0: An XML query language. Retrieved...
  • A. Boukottaya, C. Vanoirbeek, 2005, November 02–04, Schema matching for transforming structured documents. Paper...
  • Y. Chi et al.

    Frequent subtree mining – an overview

    Fundamenta Informaticae Special Issue on Graph and Tree Mining

    (2005)
  • H.H. Do, E. Rahm, 2002 August, COMA – a system for flexible combination of schema matching approaches. Paper presented...
  • A. Doan, P. Domingos, A.Y. Halevy, 2001, Reconciling schemas of disparate sources: a machine-learning approach. Paper...
  • C. Fellbaum

    WordNet: An Electronic Lexical Database

    (1998)
  • S. Flesca et al.

    Fast detection of XML structural similarities

    IEEE Transactions on Knowledge and Data Engineering

    (2005)
  • G. Guardalben, Integrating XML and relational database technologies: a position paper. Retrieved May 1st, 2005, <...
  • Introduction to XML Schema by Refsnes Data, <http://www.w3schools.com/schema/schema_intro.asp>, 2005, April...
  • E. Jeong, C.-N. Hsu, 2001, Induction of integrated view for XML data with heterogeneous DTDs. Paper presented at the...
  • G. Koloniari et al.

    Peer-to-peer management of XML data: issues and research challenges

    SIGMOD Record

    (2005)
  • L. Kurgan, W. Swiercz, K. Cios, 2002, Semantic mapping of XML tags using inductive machine learning. Paper presented at...
  • J.W. Lee, S.S. Park, 2004, October 20–24. Finding maximal similar paths between XML documents using sequential patterns....
  • M.L. Lee, L.H. Yang, W. Hsu, X. Yang, 2002, November, XClust: clustering XML schemas for effective integration. Paper...
Cited by (54)

    • Coreference detection in an XML schema

      2015, Information Sciences
    • XML matchers: Approaches and challenges

      2014, Knowledge-Based Systems
      Citation Excerpt :

      As observed in Section 5.4, XClust and the approach of [100] use the tree-based representation of DTDs to compute their similarity degree. A further example of these approaches is XMine [73]. XMine exploits WordNet, in conjunction with a user-defined dictionary, to find semantic matchings.

    • Exploring dictionary-based semantic relatedness in labeled tree data

      2013, Information Sciences
      Citation Excerpt :

      While all the approaches above have their strengths in detecting structural similarity in XML data, they all lack one key aspect: to represent the structural characteristics of XML data by coupling the syntactic hierarchical information with the semantic meanings underlying the descriptive markup tags. To the best of our knowledge, there is a relatively smaller corpus of studies that take somehow into account semantic aspects in XML retrieval, data management or mining tasks, including schema similarity and matching [17,18,44,25,4], keyword search [24,36], classification [58], and clustering [55,44,20]. Particularly, schema matching has been an important research theme in (semistructured) data and knowledge management, due to its centrality in many application domains, including data integration, ontology mapping, and Semantic Web services [51,32].

    • Minimizing user effort in XML grammar matching

      2012, Information Sciences
      Citation Excerpt :

      Here, our main goal is to develop an effective XML grammar matching method minimizing the amount of manual work needed to perform the match task. This requires: (i) considering the various characteristics and constraints of the XML grammars being matched, in comparison with existing ‘grammar simplifying’ approaches, e.g., [15,29], (ii) allowing a flexible and extensible combination of different matching criteria, adaptable to various application scenarios, in comparison with existing static methods, e.g., [40,59], and (iii) effectively considering the semi-structured nature of XML, as the most prominent and distinctive feature of an XML grammar [2,45], in comparison with existing heuristic or generic approaches, e.g., [18,36], in order to produce more accurate results. Hence, the contributions of our study can be summarized as follows.

    • S-Trans: Semantic transformation of XML healthcare data into OWL ontology

      2012, Knowledge-Based Systems
      Citation Excerpt :

      Yang and Powers [20] used linguistic taxonomy based on concept definitions in WordNet [21] to gain the most accurate semantics for element names. Recently, some researchers [22–24] have employed additional functions to calculate the similarity of a particular feature of a given schema, such as the similarities of leaf nodes, root nodes, data types, and constraints. All of the partial results are then combined into a final similarity value using a weighted sum function.
