XML schema clustering with semantic and hierarchical similarity measures

doi:10.1016/j.knosys.2006.08.006

Knowledge-Based Systems

Volume 20, Issue 4, May 2007, Pages 336-349

https://doi.org/10.1016/j.knosys.2006.08.006

Abstract

With the growing popularity of XML as a data representation language, collections of XML data have exploded in number. Methods are required to manage these collections and to discover useful information from them for improved document handling. We present a schema clustering process that organises heterogeneous XML schemas into groups. The methodology considers not only the linguistic and contextual similarity of the elements but also their hierarchical structural similarity. We support our findings with experiments and analysis.

Introduction

XML has become a standard for information exchange and retrieval [34]. With the continuous growth of XML data, the ability to manage massive collections of XML data and to discover knowledge from them has become essential for Web-based information systems [15], [25]. A possible solution is to group similar XML data based on their context and structure. Clustering XML data facilitates a number of advanced applications such as improved information retrieval, data and schema integration, document classification analysis, structure summary and indexing, and query processing and optimization [6], [23].

Clustering is a data mining process that categorizes XML data based on their similarity without prior knowledge of the taxonomy. A number of clustering methods exist for (unstructured) database objects and text data [3], [36]. XML data is different: it is semistructured and hierarchical [34]. There are two types of XML data: XML documents and XML schemas. An XML schema describes the structure of an XML document. Usually, the schema can be obtained separately without scanning the whole document. Therefore, a method for clustering XML documents should take advantage of their schemas.

The similarity of corresponding elements between XML documents can be computed efficiently using the relevant XML schemas. The document schema provides a definitive description of the XML document, while document instances only give a snapshot of what the XML document may contain. The document definition outlined in a schema holds true for all document instances of that schema. The result of clustering schemas therefore holds true for all document instances of those schemas and can be reused for any other instances. By contrast, the result of clustering document instances holds true only for the instances that were included, and the clustering process must be repeated for any other document instances.

This paper presents the XMine methodology, which quantitatively determines the similarity between heterogeneous XML schemas by considering the semantic as well as the hierarchical structural similarity of elements. Similar schemas are clustered into separate, meaningful classes. Whilst several XML document and schema clustering techniques are available [4], [6], [9], [11], [24], [26], this paper enhances the task by adding hierarchical similarity to the clustering, taking into account the hierarchical positions of individual elements. The XMine methodology can deal with varying schema structures and with varying kinds of semantic differences in the schema elements.
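The grouping step can be pictured with a minimal Python sketch, assuming that pairwise schema similarities in [0, 1] have already been computed. It merges schemas whenever one pair reaches a threshold (single-link grouping). The function name `cluster_by_threshold`, the threshold, and the toy similarity values are illustrative assumptions, not the clustering algorithm used by XMine.

```python
from itertools import combinations

def cluster_by_threshold(names, sim, threshold=0.5):
    """Group items whose pairwise similarity reaches the threshold
    (single-link: one sufficiently similar pair merges two groups)."""
    parent = {n: n for n in names}  # union-find forest

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(names, 2):
        # Missing pairs are treated as completely dissimilar.
        score = sim.get((a, b), sim.get((b, a), 0.0))
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical similarity scores between four schemas.
schemas = ["travel1", "travel2", "publication1", "publication2"]
similarity = {
    ("travel1", "travel2"): 0.8,
    ("publication1", "publication2"): 0.7,
    ("travel1", "publication1"): 0.1,
}
clusters = cluster_by_threshold(schemas, similarity)
```

With these toy values the two travel schemas end up in one group and the two publication schemas in another. Raising the threshold produces more, smaller groups.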

The contributions of this paper are (1) combining the semantic and syntactic relationships to calculate the linguistic similarity between two element names; (2) calculating the structural similarity between two elements by considering the ancestor–child relationship along with the parent–child relationship in maximal similar paths; and then (3) generalizing a suitable schema class hierarchy to determine the relationships between the discovered schemas in the XMine methodology.
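To make contribution (1) concrete, the following Python sketch combines a syntactic (string-based) score with a semantic (synonym-based) score for two element names. It is only an illustration under stated assumptions: the synonym table `SYNONYMS` is a hypothetical stand-in for a lexical resource such as WordNet, and taking the maximum of the two scores is an assumed combination rule, not the formula defined in the paper.

```python
from difflib import SequenceMatcher

# Hypothetical synonym groups standing in for a lexical resource.
SYNONYMS = [{"author", "writer"}, {"price", "cost"}]

def syntactic_sim(a, b):
    """String similarity in [0, 1] based on matching character blocks."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def semantic_sim(a, b):
    """1.0 if the names are identical or listed as synonyms, else 0.0."""
    a, b = a.lower(), b.lower()
    if a == b:
        return 1.0
    return 1.0 if any(a in g and b in g for g in SYNONYMS) else 0.0

def linguistic_sim(a, b):
    """Take the stronger of the two signals (an assumed combination rule)."""
    return max(syntactic_sim(a, b), semantic_sim(a, b))
```

Here `linguistic_sim("author", "writer")` is driven by the synonym table, while a pair such as `"authorName"` and `"author"` is matched by the string measure alone, which is the kind of case where a syntactic measure helps when no dictionary entry exists.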

The performance of XMine is demonstrated using a number of heterogeneous schemas drawn from several application domains. The empirical results demonstrate that the semantic, syntactic and hierarchical relationships of schema elements play important roles in producing good-quality clustering results. Most importantly, the results show that the syntactic similarity measure is more useful than the semantic similarity measure.

The resulting schema class composition hierarchy can serve as a basis for a number of XML application processes. The clusters of schemas provide a hint for building an index structure, and indexing based on structural similarity supports many applications. For example, in the information retrieval field, XML-based search engines can use efficient indexes to retrieve the relevant portions of XML data faster and more accurately. Moreover, several database tools developed to deliver, store, integrate and query XML data [5], [12], [21], [33] require indexing based on structural similarity to support effective document storage and retrieval.

Moreover, the schema class composition hierarchy can be viewed as a generalization of the training sets of schemas to a super-class that is useful for further XML document classification analysis. Schemas from a number of heterogeneous sources can be classified into this set of predefined schema classes. This process will improve XML document handling and make searches for relevant XML documents more effective and efficient.

Association rule mining can also be applied to find interesting correlations among the metadata available in schemas belonging to the same schema class. The element tags that frequently occur together within a schema class can be used to maximally distinguish one class of schemas from the others, yielding a set of association rules for each schema class. This tag-based association analysis is also useful for discovering common XML structures for a specific domain.
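As a rough illustration of the first step of such an analysis, the Python sketch below counts how often pairs of element tags occur together across the schemas of one class and keeps the pairs that meet a minimum support. It covers frequent pairs only, not full rule generation; the function name `frequent_tag_pairs` and the example tags are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations

def frequent_tag_pairs(schemas, min_support=0.5):
    """Return tag pairs that co-occur in at least min_support of the schemas."""
    counts = Counter()
    for tags in schemas:
        # Each unordered pair is counted once per schema.
        for pair in combinations(sorted(set(tags)), 2):
            counts[pair] += 1
    n = len(schemas)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

# Hypothetical element tags from three schemas in one schema class.
schema_class = [
    ["book", "title", "author", "price"],
    ["book", "title", "author"],
    ["book", "title", "publisher"],
]
pairs = frequent_tag_pairs(schema_class, min_support=1.0)
```

With a support of 1.0 only the pair that appears in every schema of the class survives; lowering `min_support` admits pairs such as `("author", "book")` that appear in two of the three schemas.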

In addition, the schema class hierarchies can facilitate the difficult task of integrating heterogeneous schemas. Integrating the similar schemas within each schema class is easier than reconciling schemas that differ in structure and semantics, which would involve a complex restructuring process.

The similarity between two structures is also tied to the challenging task of reusing XML or semi-structured documents. In XML document content reuse, a document (or part of a document) structured under one schema must be restructured into an instance of a different schema. Identifying the common paths between two schema instances helps to enable this restructuring.
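The notion of common paths can be sketched in Python with the standard `xml.etree.ElementTree` module. The example below collects every root-to-element tag path from two small XML fragments and intersects the two sets. It matches tag names exactly, which is a simplifying assumption; the fragments and the helper name `element_paths` are hypothetical.

```python
import xml.etree.ElementTree as ET

def element_paths(xml_text):
    """Return the set of root-to-element tag paths in an XML fragment."""
    paths = set()

    def walk(node, prefix):
        path = prefix + (node.tag,)
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(ET.fromstring(xml_text), ())
    return paths

# Two hypothetical structures describing the same kind of data.
a = "<book><title/><author><name/></author></book>"
b = "<book><title/><author><name/><email/></author><price/></book>"
common = element_paths(a) & element_paths(b)
```

The shared paths, such as `("book", "author", "name")`, mark the parts of a document that could be carried over unchanged, while paths present in only one structure indicate where restructuring is needed.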

Section snippets

Background knowledge on the XML data

XML is a flexible representation language. There are two varieties of XML data: XML documents and XML schemas. An XML schema provides the data definitions and structure of the XML document [1], while XML documents are instances of a schema that give a snapshot of what the document may contain. A schema specifies which elements are (not) allowed, which attributes an element may have, the number of occurrences of elements, and so on. A schema for a document may be included as both internally and

The XMine methodology

Fig. 3 illustrates the overall architecture of the XMine methodology. This is deployed in three phases, namely preprocessing, data mining, and postprocessing.

The focus of the preprocessing phase is to determine the common and similar features between various schemas in an automated manner to effectively facilitate the clustering process. It includes four stages to address various issues involved in measuring the similarity of schemas. Firstly, the structure analyser analyses the structure of a

Empirical evaluation and discussion

Dataset: Table 6 summarizes the major characteristics of the schema collection used in the experiments. Each domain consists of a number of different domain categories that have structural and semantic differences. Schemas from the same domain also vary in structure and semantics and might not be considered similar enough to be grouped into the same clusters. Fig. 6 illustrates the average similarity degree (using the schemaSim measurement) between schemas in the seven subject domain

Related work

Research on measuring the structural similarity and clustering of XML data is gaining momentum. We show a taxonomy of these approaches in Fig. 15 as broadly classified into structure level and element level based similarity approaches.

The structure-level similarity approaches can be divided into three different research directions: (1) detecting and measuring the structure and content similarities between data; (2) detecting and measuring the structural similarity between data and schema;

Conclusions and future work

The potential benefits of the rich semantics of XML have been recognized widely for enhancing document handling. A schema clustering process improves the document handling process in the digital libraries and XML repositories by organising the heterogeneous schemas into groups. This paper presented the XMine methodology that accurately clusters the schemas by considering both structural and semantic information of elements. The element structural similarity is the hierarchical position of the

References (36)

E. Bertino et al.
A matching algorithm for measuring the structural similarity between an XML document and a DTD and its applications
Information Systems
(2004)
S. Abiteboul et al.
Data on the Web: From Relations to Semistructured Data and XML
(2000)
R. Agrawal, R. Srikant, 1996, Mining Sequential Patterns: Generalizations and Performance Improvements. Paper presented...
P. Berkhin, 2002. Survey of Clustering Data Mining Techniques: Technical Report, Accrue Software, San Jose,...
S. Boag, D. Chamberlin, M. Fernández, D. Florescu, J. Robie, J. Siméon, XQuery 1.0: An XML query language. Retrieved...
A. Boukottaya, C. Vanoirbeek, 2005, November 02–04, Schema matching for transforming structured documents. Paper...
Y. Chi et al.
Frequent subtree mining – an overview
Fundamenta Informaticae Special Issue on Graph and Tree Mining
(2005)
H.H. Do, E. Rahm, 2002 August, COMA – a system for flexible combination of schema matching approaches. Paper presented...
A. Doan, P. Domingos, A.Y. Halevy, 2001, Reconciling schemas of disparate sources: a machine-learning approach. Paper...
C. Fellbaum
WordNet: An Electronic Lexical Database
(1998)

S. Flesca et al.
Fast detection of XML structural similarities
IEEE Transactions on Knowledge and Data Engineering
(2005)
G. Guardalben, Integrating XML and relational database technologies: a position paper. Retrieved May 1st, 2005, <...
Introduction to XML Schema by Refsnes Data, <http://www.w3schools.com/schema/schema_intro.asp>, 2005, April...
E. Jeong, C.-N. Hsu, 2001, Induction of integrated view for XML data with heterogeneous DTDs. Paper presented at the...
G. Koloniari et al.
Peer-to-peer management of XML data: issues and research challenges
SIGMOD Record
(2005)
L. Kurgan, W. Swiercz, K. Cios, 2002, Semantic mapping of XML tags using inductive machine learning. Paper presented at...
J.W. Lee, S.S. Park, 2004, October 20–24. Finding maximal similar paths between XML documents using sequential patterns....
L.M. Lee, L.H. Yang, W. Hsu, X. Yang, 2002, November, XClust: clustering XML schemas for effective integration. Paper...

