Data & Knowledge Engineering

Volume 69, Issue 12, December 2010, Pages 1254-1273

Schema label normalization for improving schema matching

https://doi.org/10.1016/j.datak.2010.10.004

Abstract

Schema matching is the problem of finding relationships among concepts across data sources that are heterogeneous in format and in structure. Starting from the “hidden meaning” associated with schema labels (i.e. class/attribute names), it is possible to discover relationships among the elements of different schemata. Lexical annotation (i.e. annotation w.r.t. a thesaurus/lexical resource) helps in associating a “meaning” with schema labels. However, the performance of semi-automatic lexical annotation methods on real-world schemata suffers from the abundance of non-dictionary words such as compound nouns, abbreviations, and acronyms. We address this problem by proposing a schema label normalization method which increases the number of comparable labels. The method semi-automatically expands abbreviations/acronyms and annotates compound nouns, with minimal manual effort. We empirically show that our normalization method helps in the identification of similarities among schema elements of different data sources, thus improving schema matching results.

Introduction

Schema matching is a critical step in many applications, including data integration, data warehousing, e-business, semantic query processing, peer data management, and semantic web applications. In this work, we focus on schema matching in the context of data integration [1], where the goal is the creation of mappings between data sources that are heterogeneous in format and structure. Mappings are obtained by a schema matching system from a set of semantic matches (e.g. location = area) between different schemata. A powerful means to discover matches is understanding the “meaning” behind the names denoting schema elements, called “labels” in the following [2]. In this context, lexical annotation, i.e. the explicit association of a meaning with a label w.r.t. a thesaurus (WordNet [3] in our case), is a key tool.

The strength of a thesaurus like WordNet (WN) is the presence of a wide network of semantic relationships among word meanings, which provides a corresponding inferred network of lexical relationships among the labels of different schemata. Its weakness is that it does not cover different domains of knowledge with the same detail, and that many domain-dependent words, or non-dictionary words, may not be present in it. Non-dictionary words include compound nouns (CNs) (e.g. “company address”), abbreviations (e.g. “QTY”) and acronyms (e.g. WSD — Word Sense Disambiguation).

The result of automatic lexical annotation techniques is strongly affected by the presence of such non-dictionary words in schemata. For this reason, a method to expand abbreviations and to semantically “interpret” CNs is required. In the following, we will refer to this method as schema label normalization. Schema label normalization helps in the identification of similarities between labels coming from different data sources, thus improving schema mapping accuracy.

A manual process of schema label normalization is laborious, time-consuming and error-prone. Starting from our previous work on semi-automatic lexical annotation of structured and semi-structured data sources [4], we propose a semi-automatic method for the normalization of schema labels that is able to expand abbreviations and acronyms, and to enrich WN with new CNs. Our approach uses only schema-level information and can thus be used in scenarios where data instances are not available [5].

Our method is implemented in the MOMIS (Mediator envirOnment for Multiple Information Sources) system [1], [6]. However, it may be applied in general in the context of schema mapping discovery, ontology merging, data integration systems, and web interface integration. Moreover, it might be effective for reverse engineering tasks, e.g. when an ER schema needs to be extracted from a legacy database.

The rest of the article is organized as follows. In Section 2, we define the problem in the context of schema matching; in Section 3, a brief overview of the method is given; in Sections 4, 5, and 6, we describe the subsequent phases of our method: schema label preprocessing, abbreviation expansion, and CN interpretation. In Section 7, we demonstrate the effectiveness of the method with extensive experiments on real-world data sets. A comparison of our method with related work is presented in Section 8. Finally, in Section 9, we make some concluding remarks and illustrate our future work.

Problem definition

Element labels represent an important source for assessing similarity between schema elements. This can be done semantically by comparing their meanings.

Definition 1

Lexical annotation of a schema label is the explicit assignment of a meaning to the label w.r.t. a thesaurus.

Starting from the lexical annotation of schema labels, we can derive lexical relationships between them on the basis of the semantic relationships defined in WN between their meanings.
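As an illustration of how annotated labels yield lexical relationships, the sketch below derives synonym (SYN) and broader/narrower-term (BT/NT) relationships from a tiny hand-coded meaning map; the synset names and the toy hypernym hierarchy are assumptions for the example, not the paper's actual WordNet navigation.

```python
# Illustrative sketch (not the paper's implementation): deriving lexical
# relationships between annotated labels from a hand-coded meaning map.

# Lexical annotations: label -> assumed WordNet-style synset name (Definition 1)
annotations = {
    "location": "location.n.01",
    "area":     "location.n.01",   # same meaning as "location"
    "address":  "address.n.02",
}

# Toy hypernym relation between meanings; in WordNet this would be
# navigated through the thesaurus, here it is hand-coded for the example.
hypernym_of = {"address.n.02": "location.n.01"}

def lexical_relationship(l1, l2):
    """Return SYN, BT (broader term), NT (narrower term), or None."""
    m1, m2 = annotations[l1], annotations[l2]
    if m1 == m2:
        return "SYN"
    if hypernym_of.get(m2) == m1:
        return "BT"   # the meaning of l1 is broader than that of l2
    if hypernym_of.get(m1) == m2:
        return "NT"
    return None

print(lexical_relationship("location", "area"))     # SYN
print(lexical_relationship("location", "address"))  # BT
```

Annotating both labels first and then comparing their meanings is what lets heterogeneous labels such as “location” and “area” be recognized as synonyms even though the strings differ.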

Definition 2

Let S and T be two heterogeneous schemata, and ES = {s1,

Overview of the schema label normalization method

As shown in Fig. 2, the schema label normalization method consists of three phases: (1) schema label preprocessing, (2) abbreviation expansion and (3) CN interpretation.

In this section, we briefly analyze the different phases and describe a simple example of the application of the normalization method on the schema element “DeliveryCO” belonging to the “PurchaseOrder” schema in Fig. 1.
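The three phases can be illustrated on the “DeliveryCO” label with the following minimal sketch; the abbreviation table and the camel-case tokenizer are assumptions made for the example, not the paper's actual resources.

```python
import re

# Hedged sketch of the three normalization phases applied to "DeliveryCO".
# ABBREVIATIONS is an invented lookup table standing in for the method's
# real abbreviation resources.
ABBREVIATIONS = {"co": "company", "qty": "quantity", "nbr": "number"}

def preprocess(label):
    """Phase 1: split a label on camel case, acronym runs and digits."""
    tokens = re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", label)
    return [t.lower() for t in tokens]

def expand(tokens):
    """Phase 2: replace known abbreviations with their long forms."""
    return [ABBREVIATIONS.get(t, t) for t in tokens]

def interpret(tokens):
    """Phase 3: treat the token sequence as a compound noun; in English
    endocentric CNs the head is usually the last constituent."""
    return {"compound": " ".join(tokens), "head": tokens[-1],
            "modifiers": tokens[:-1]}

result = interpret(expand(preprocess("DeliveryCO")))
print(result)
# {'compound': 'delivery company', 'head': 'company', 'modifiers': ['delivery']}
```

Chaining the phases turns the opaque label “DeliveryCO” into the comparable compound “delivery company”, which can then be annotated like any dictionary term.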

Schema label preprocessing

To perform schema label normalization, schema labels need to be preprocessed. Schema label preprocessing is divided into three main sub-steps (as shown in Fig. 2): (1) identification, (2) tokenization, and (3) classification.
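The classification sub-step can be sketched as a dictionary lookup over the tokens produced by tokenization; the tiny word list below stands in for a WordNet lookup and is an assumption made for the example.

```python
# Sketch of the classification sub-step: after a label has been tokenized,
# each token is checked against a dictionary to decide whether it is a
# dictionary word or a non-dictionary candidate (abbreviation/acronym).
# DICTIONARY is a toy stand-in for a real WordNet lookup.
DICTIONARY = {"purchase", "order", "delivery", "address", "company"}

def classify(tokens):
    """Label each token as 'dictionary word' or 'non-dictionary'."""
    return [(t, "dictionary word" if t in DICTIONARY else "non-dictionary")
            for t in tokens]

print(classify(["delivery", "co"]))
# [('delivery', 'dictionary word'), ('co', 'non-dictionary')]
```

Tokens classified as non-dictionary words are the input of the abbreviation expansion phase, while multi-token sequences of dictionary words feed the CN interpretation phase.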

Abbreviation expansion

A schema can contain both standard and ad hoc abbreviations. Standard abbreviations either denote important and frequent domain concepts (domain standard abbreviations), e.g. “Co” (Company), or are commonly used by schema designers but do not belong to any specific domain (schema standard abbreviations), e.g. “Nbr” (Number). For instance, the OTA standard3 contains a list of recommended schema
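One plausible way to realize expansion over several lookup resources is sketched below; the resources, their contents, and the weights are invented for the example and do not reflect the paper's actual lists.

```python
# Hedged sketch: score each candidate long form by the (weighted) lookup
# resources that list it, and keep the best-scoring expansion. The three
# toy resources mimic schema-standard, domain-standard (e.g. OTA) and
# user-defined abbreviation lists; weights are assumptions.
RESOURCES = [
    {"co": {"company"}, "nbr": {"number"}},   # schema standard abbreviations
    {"co": {"company", "county"}},            # domain standard abbreviations
    {"co": {"company"}},                      # user-defined dictionary
]
WEIGHTS = [1.0, 0.8, 1.2]

def expand_abbreviation(abbr):
    """Return the best-scoring candidate expansion, or abbr itself if no
    resource lists it."""
    scores = {}
    for resource, weight in zip(RESOURCES, WEIGHTS):
        for candidate in resource.get(abbr, ()):
            scores[candidate] = scores.get(candidate, 0.0) + weight
    return max(scores, key=scores.get) if scores else abbr

print(expand_abbreviation("co"))   # company
print(expand_abbreviation("xyz"))  # xyz (unknown: kept as-is)
```

Falling back to the unexpanded token when no resource matches keeps the method safe on ad hoc abbreviations that only manual inspection can resolve.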

Compound noun interpretation

In the NLP (Natural Language Processing) literature, different CN classifications have been proposed [14], [15]. In this work, we use the classification introduced in [14], where CNs are classified into four distinct categories: endocentric, exocentric, copulative, and appositional.

Endocentric CNs consist of a head (i.e. the categorical part that contains the basic meaning of the whole CN) and modifiers, which restrict the meaning of the head. An endocentric CN exhibits a modifier-head structure,
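A minimal sketch of the modifier-head structure of endocentric CNs follows; the head-is-rightmost heuristic for English and the gloss template are simplifying assumptions for the example.

```python
# Sketch of endocentric CN interpretation: the head (rightmost constituent
# in English compounds) carries the basic meaning, so the compound can be
# inserted into the thesaurus as a narrower term of its head. The gloss
# template below is an invented placeholder.

def interpret_endocentric(cn):
    """Split a space-separated CN into head and modifiers and propose a
    thesaurus insertion point (hypernym = head)."""
    tokens = cn.split()
    head, modifiers = tokens[-1], tokens[:-1]
    return {
        "head": head,
        "modifiers": modifiers,
        "hypernym": head,  # e.g. "company address" IS-A kind of "address"
        "gloss": f"a kind of {head} related to {' '.join(modifiers)}",
    }

print(interpret_endocentric("company address"))
```

Because the head determines the category of the whole compound, attaching the new CN under its head's meaning preserves the hypernym chain that the lexical relationship discovery step later exploits.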

Experimental evaluation

Our evaluation goals were as follows: (1) measuring and explaining the performance of our method, (2) checking whether our method improves the lexical annotation process and finally (3) estimating the effect of schema label normalization on the lexical relationship discovery process. To achieve these goals we conducted detailed experiments. The method was integrated within the MOMIS system. Schema label normalization is performed during the lexical annotation phase of MOMIS: in particular,
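For context, schema matching results are commonly evaluated with precision, recall, and F-measure against a gold standard of expected matches; the sketch below uses invented sample sets and is not the paper's actual evaluation code.

```python
# Generic precision/recall/F-measure over sets of discovered vs. expected
# matches; the sample match pairs below are invented for the example.

def evaluate(found, gold):
    """Return (precision, recall, F-measure) for a set of found matches
    against a gold-standard set."""
    tp = len(found & gold)                     # true positives
    precision = tp / len(found) if found else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

gold  = {("location", "area"), ("co", "company"), ("qty", "quantity")}
found = {("location", "area"), ("co", "company"), ("addr", "location")}
p, r, f = evaluate(found, gold)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.67 0.67
```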

Related work

Work related to the issues discussed in this article falls into three areas: linguistic normalization, normalization techniques in schema matching, and the use of WN in schema matching.

Conclusion and future work

In this article, we presented a method for the semi-automatic normalization of schema elements labeled with abbreviations and CNs in a data integration environment. Our method can be applied to several other contexts, including ontology merging, data-warehouses and web interface integration. The experimental results have shown the effectiveness of our method, which significantly improves the results of the automatic lexical annotation method, and, as a consequence, enhances the quality of the

Acknowledgments

This work was partially supported by the “Searching for a needle in mountains of data!” project, funded by the Fondazione Cassa di Risparmio di Modena within the Bando di Ricerca Internazionale 2008 (http://www.dbgroup.unimo.it/keymantic), and by the MIUR FIRB Network Peer for Business project (http://www.dbgroup.unimo.it/nep4b).

Serena Sorrentino is a Ph.D. student (3rd year) of the Doctorate School in “Information and Communication Technologies (ICT) — Computer Engineering and Science” at the Department of Information Engineering, University of Modena and Reggio Emilia, Italy. Her areas of research are Intelligent Data Integration, Semantic Schema Matching and Natural Language Processing. She is a member of the “DBGROUP”, led by Professor Sonia Bergamaschi.

References (49)

  • S. Bergamaschi et al., Semantic integration of semistructured and structured data sources, SIGMOD Record (1999)
  • F. Giunchiglia et al., S-Match: an algorithm and an implementation of semantic matching
  • G.A. Miller et al., WordNet: an on-line lexical database, International Journal of Lexicography (1990)
  • S. Bergamaschi et al., Automatic annotation for mapping discovery in data integration systems
  • E. Rahm et al., A survey of approaches to automatic schema matching, The VLDB Journal (2001)
  • D. Beneventano et al., Synthesizing an integrated ontology, IEEE Internet Computing (2003)
  • J. Euzenat et al., Ontology Matching (2007)
  • X. Su et al., Semantic enrichment for ontology mapping
  • J. Li, LOM: a lexicon-based ontology mapping tool
  • H. Feild et al., An empirical comparison of techniques for extracting concept abbreviations from identifiers
  • L. Ratinov et al., Abbreviation expansion in schema matching and web integration
  • R. Uthurusamy et al., Extracting knowledge from diagnostic databases, IEEE Expert: Intelligent Systems and Their Applications (1993)
  • E. Hill et al., AMAP: automatically mining abbreviation expansions in programs to enhance software maintenance tools
  • I. Plag, Word-Formation in English, Cambridge Textbooks in Linguistics (2003)
  • J.N. Levi, The Syntax and Semantics of Complex Nominals (1978)
  • K. Barker et al., Semi-automatic recognition of noun modifier relationships
  • J. Madhavan et al., Generic schema matching with Cupid
  • K. Toutanova et al., Enriching the knowledge sources used in a maximum entropy part-of-speech tagger
  • S.N. Kim et al., Automatic interpretation of noun compounds using WordNet similarity
  • T.W. Finin, The semantic interpretation of nominal compounds
  • D. Moldovan et al., Models for the semantic classification of noun phrases
  • B. Rosario et al., Classifying the semantic relations in noun compounds
  • V. Nastase et al., Learning noun-modifier semantic relations with corpus-based and WordNet-based features
  • D. Ó Séaghdha, Learning compound noun semantics, Ph.D. thesis, Computer Laboratory, University of Cambridge, published...

Sonia Bergamaschi is a Professor and Head of the Department of Information Engineering, University of Modena and Reggio Emilia, Italy. Her research activity covers knowledge representation and management in the context of very large databases, facing both theoretical and implementation aspects. She has published several papers and conference contributions. Her research has been funded by Italian institutions and by European Community projects. She is leader of the DBGROUP (www.dbgroup.unimo.it), founder of the DATARIVER start-up (www.datariver.it), and a member of the IEEE Computer Society and of ACM.

    Maciej Gawinecki is a Ph.D. student in Computer Engineering and Science at the University of Modena and Reggio Emilia, Italy. He graduated from the University of Adam Mickiewicz, Poznań, Poland. He worked as a software engineer in several agent-oriented projects in Systems Research Institute of Polish Academy of Sciences. Currently, he is a member of AgentGroup of Prof. Giacomo Cabri. His research interests include: re-use of software components, Web service discovery and semantic annotation of structured and semi-structured data.

    Laura Po is a research fellow in Computer Engineering at the University of Modena and Reggio Emilia, Italy. She is a member of the DBGROUP, led by Professor Sonia Bergamaschi. Her major research interests focus on analysis and comparison of Word Sense Disambiguation methods, the development of automatic techniques for extracting metadata and annotating data sources and the use of Description Logic in knowledge representation. She received a Ph.D. in Computer Engineering and Science from the University of Modena and Reggio Emilia in 2009. She has been involved in several research projects (National and European), and published several papers and conference contributions.
