Data & Knowledge Engineering

Volume 69, Issue 12, December 2010, Pages 1254-1273

Schema label normalization for improving schema matching

https://doi.org/10.1016/j.datak.2010.10.004

Abstract

Schema matching is the problem of finding relationships among concepts across data sources that are heterogeneous in format and in structure. Starting from the “hidden meaning” associated with schema labels (i.e. class/attribute names), it is possible to discover relationships among the elements of different schemata. Lexical annotation (i.e. annotation w.r.t. a thesaurus/lexical resource) helps in associating a “meaning” with schema labels. However, the performance of semi-automatic lexical annotation methods on real-world schemata suffers from the abundance of non-dictionary words such as compound nouns, abbreviations, and acronyms. We address this problem by proposing a schema label normalization method which increases the number of comparable labels. The method semi-automatically expands abbreviations/acronyms and annotates compound nouns, with minimal manual effort. We empirically show that our normalization method helps in the identification of similarities among schema elements of different data sources, thus improving schema matching results.

Introduction

Schema matching is a critical step in many applications, including data integration, data warehousing, e-business, semantic query processing, peer data management, and semantic web applications. In this work, we focus on schema matching in the context of data integration [1], where the goal is the creation of mappings between data sources that are heterogeneous in format and structure. Mappings are obtained by a schema matching system from a set of semantic matches (e.g. location = area) between different schemata. A powerful means to discover matches is understanding the “meaning” behind the names denoting schema elements, called “labels” in the following [2]. In this context, lexical annotation, i.e. the explicit association of a meaning with a label w.r.t. a thesaurus (WordNet [3] in our case), is a key tool.

The strength of a thesaurus like WordNet (WN) is the presence of a wide network of semantic relationships among word meanings, which provides a corresponding inferred network of lexical relationships among the labels of different schemata. Its weakness is that it does not cover different domains of knowledge with the same detail, and that many domain-dependent words, or non-dictionary words, may not be present in it. Non-dictionary words include compound nouns (CNs) (e.g. “company address”), abbreviations (e.g. “QTY”) and acronyms (e.g. WSD — Word Sense Disambiguation).

The result of automatic lexical annotation techniques is strongly affected by the presence of such non-dictionary words in schemata. For this reason, a method to expand abbreviations and to semantically “interpret” CNs is required. In the following, we will refer to this method as schema label normalization. Schema label normalization helps in the identification of similarities between labels coming from different data sources, thus improving schema mapping accuracy.

A manual process of schema label normalization is laborious, time-consuming and error-prone. Starting from our previous work on semi-automatic lexical annotation of structured and semi-structured data sources [4], we propose a semi-automatic method for the normalization of schema labels that is able to expand abbreviations and acronyms, and to enrich WN with new CNs. Our approach uses only schema-level information and can thus be used in scenarios where data instances are not available [5].

Our method is implemented in the MOMIS (Mediator envirOnment for Multiple Information Sources) system [1], [6]. However, it may be applied in general in the context of schema mapping discovery, ontology merging, data integration systems, and web interface integration. Moreover, it might be effective for reverse engineering tasks, e.g. when an ER schema needs to be extracted from a legacy database.

The rest of the article is organized as follows. In Section 2, we define the problem in the context of schema matching; in Section 3, a brief overview of the method is given; in Sections 4, 5, and 6, we describe the subsequent phases of our method: schema label preprocessing, abbreviation expansion, and CN interpretation. In Section 7, we demonstrate the effectiveness of the method with extensive experiments on real-world data sets. A comparison of our method with related work is presented in Section 8. Finally, in Section 9, we make some concluding remarks and illustrate our future work.

Problem definition

Element labels represent an important source for assessing similarity between schema elements. This can be done semantically by comparing their meanings.

Definition 1

Lexical annotation of a schema label is the explicit assignment of a meaning to the label w.r.t. a thesaurus.

Starting from the lexical annotation of schema labels, we can derive lexical relationships between them on the basis of the semantic relationships defined in WN between their meanings.
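As an illustration of how annotated labels yield lexical relationships, the sketch below derives synonym (SYN) and broader/narrower-term (BT/NT) relationships from a tiny hand-coded meaning map; the synset names and the toy hypernym hierarchy are assumptions for the example, not the paper's actual WordNet navigation.

```python
# Illustrative sketch (not the paper's implementation): deriving lexical
# relationships between annotated labels from a hand-coded meaning map.

# Lexical annotations: label -> assumed WordNet-style synset name (Definition 1)
annotations = {
    "location": "location.n.01",
    "area":     "location.n.01",   # same meaning as "location"
    "address":  "address.n.02",
}

# Toy hypernym relation between meanings; in WordNet this would be
# navigated through the thesaurus, here it is hand-coded for the example.
hypernym_of = {"address.n.02": "location.n.01"}

def lexical_relationship(l1, l2):
    """Return SYN, BT (broader term), NT (narrower term), or None."""
    m1, m2 = annotations[l1], annotations[l2]
    if m1 == m2:
        return "SYN"
    if hypernym_of.get(m2) == m1:
        return "BT"   # the meaning of l1 is broader than that of l2
    if hypernym_of.get(m1) == m2:
        return "NT"
    return None

print(lexical_relationship("location", "area"))     # SYN
print(lexical_relationship("location", "address"))  # BT
```

Annotating both labels first and then comparing their meanings is what lets heterogeneous labels such as “location” and “area” be recognized as synonyms even though the strings differ.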

Definition 2

Let S and T be two heterogeneous schemata, and ES = {s1,

Overview of the schema label normalization method

As shown in Fig. 2, the schema label normalization method consists of three phases: (1) schema label preprocessing, (2) abbreviation expansion and (3) CN interpretation.

In this section, we briefly analyze the different phases and describe a simple example of the application of the normalization method on the schema element “DeliveryCO” belonging to the “PurchaseOrder” schema in Fig. 1.
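The three phases can be illustrated on the “DeliveryCO” label with the following minimal sketch; the abbreviation table and the camel-case tokenizer are assumptions made for the example, not the paper's actual resources.

```python
import re

# Hedged sketch of the three normalization phases applied to "DeliveryCO".
# ABBREVIATIONS is an invented lookup table standing in for the method's
# real abbreviation resources.
ABBREVIATIONS = {"co": "company", "qty": "quantity", "nbr": "number"}

def preprocess(label):
    """Phase 1: split a label on camel case, acronym runs and digits."""
    tokens = re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", label)
    return [t.lower() for t in tokens]

def expand(tokens):
    """Phase 2: replace known abbreviations with their long forms."""
    return [ABBREVIATIONS.get(t, t) for t in tokens]

def interpret(tokens):
    """Phase 3: treat the token sequence as a compound noun; in English
    endocentric CNs the head is usually the last constituent."""
    return {"compound": " ".join(tokens), "head": tokens[-1],
            "modifiers": tokens[:-1]}

result = interpret(expand(preprocess("DeliveryCO")))
print(result)
# {'compound': 'delivery company', 'head': 'company', 'modifiers': ['delivery']}
```

Chaining the phases turns the opaque label “DeliveryCO” into the comparable compound “delivery company”, which can then be annotated like any dictionary term.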

Schema label preprocessing

To perform schema label normalization, schema labels need to be preprocessed. Schema label preprocessing is divided into three main sub-steps (as shown in Fig. 2): (1) identification, (2) tokenization, and (3) classification.
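The classification sub-step can be sketched as a dictionary lookup over the tokens produced by tokenization; the tiny word list below stands in for a WordNet lookup and is an assumption made for the example.

```python
# Sketch of the classification sub-step: after a label has been tokenized,
# each token is checked against a dictionary to decide whether it is a
# dictionary word or a non-dictionary candidate (abbreviation/acronym).
# DICTIONARY is a toy stand-in for a real WordNet lookup.
DICTIONARY = {"purchase", "order", "delivery", "address", "company"}

def classify(tokens):
    """Label each token as 'dictionary word' or 'non-dictionary'."""
    return [(t, "dictionary word" if t in DICTIONARY else "non-dictionary")
            for t in tokens]

print(classify(["delivery", "co"]))
# [('delivery', 'dictionary word'), ('co', 'non-dictionary')]
```

Tokens classified as non-dictionary words are the input of the abbreviation expansion phase, while multi-token sequences of dictionary words feed the CN interpretation phase.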

Abbreviation expansion

A schema can contain both standard and ad hoc abbreviations. Standard abbreviations either denote important and frequent domain concepts (domain standard abbreviations), e.g. “Co” (Company), or are commonly used by schema designers but do not belong to any specific domain (schema standard abbreviations), e.g. “Nbr” (Number). For instance, the OTA standard3 contains a list of recommended schema
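One plausible way to realize expansion over several lookup resources is sketched below; the resources, their contents, and the weights are invented for the example and do not reflect the paper's actual lists.

```python
# Hedged sketch: score each candidate long form by the (weighted) lookup
# resources that list it, and keep the best-scoring expansion. The three
# toy resources mimic schema-standard, domain-standard (e.g. OTA) and
# user-defined abbreviation lists; weights are assumptions.
RESOURCES = [
    {"co": {"company"}, "nbr": {"number"}},   # schema standard abbreviations
    {"co": {"company", "county"}},            # domain standard abbreviations
    {"co": {"company"}},                      # user-defined dictionary
]
WEIGHTS = [1.0, 0.8, 1.2]

def expand_abbreviation(abbr):
    """Return the best-scoring candidate expansion, or abbr itself if no
    resource lists it."""
    scores = {}
    for resource, weight in zip(RESOURCES, WEIGHTS):
        for candidate in resource.get(abbr, ()):
            scores[candidate] = scores.get(candidate, 0.0) + weight
    return max(scores, key=scores.get) if scores else abbr

print(expand_abbreviation("co"))   # company
print(expand_abbreviation("xyz"))  # xyz (unknown: kept as-is)
```

Falling back to the unexpanded token when no resource matches keeps the method safe on ad hoc abbreviations that only manual inspection can resolve.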

Compound noun interpretation

In the NLP (Natural Language Processing) literature, different CN classifications have been proposed [14], [15]. In this work, we use the classification introduced in [14], where CNs are classified into four distinct categories: endocentric, exocentric, copulative, and appositional.

Endocentric CNs consist of a head (i.e. the categorical part that contains the basic meaning of the whole CN) and modifiers, which restrict the meaning of the head. An endocentric CN exhibits a modifier-head structure,
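A minimal sketch of the modifier-head structure of endocentric CNs follows; the head-is-rightmost heuristic for English and the gloss template are simplifying assumptions for the example.

```python
# Sketch of endocentric CN interpretation: the head (rightmost constituent
# in English compounds) carries the basic meaning, so the compound can be
# inserted into the thesaurus as a narrower term of its head. The gloss
# template below is an invented placeholder.

def interpret_endocentric(cn):
    """Split a space-separated CN into head and modifiers and propose a
    thesaurus insertion point (hypernym = head)."""
    tokens = cn.split()
    head, modifiers = tokens[-1], tokens[:-1]
    return {
        "head": head,
        "modifiers": modifiers,
        "hypernym": head,  # e.g. "company address" IS-A kind of "address"
        "gloss": f"a kind of {head} related to {' '.join(modifiers)}",
    }

print(interpret_endocentric("company address"))
```

Because the head determines the category of the whole compound, attaching the new CN under its head's meaning preserves the hypernym chain that the lexical relationship discovery step later exploits.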

Experimental evaluation

Our evaluation goals were as follows: (1) measuring and explaining the performance of our method, (2) checking whether our method improves the lexical annotation process and finally (3) estimating the effect of schema label normalization on the lexical relationship discovery process. To achieve these goals we conducted detailed experiments. The method was integrated within the MOMIS system. Schema label normalization is performed during the lexical annotation phase of MOMIS: in particular,
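For context, schema matching results are commonly evaluated with precision, recall, and F-measure against a gold standard of expected matches; the sketch below uses invented sample sets and is not the paper's actual evaluation code.

```python
# Generic precision/recall/F-measure over sets of discovered vs. expected
# matches; the sample match pairs below are invented for the example.

def evaluate(found, gold):
    """Return (precision, recall, F-measure) for a set of found matches
    against a gold-standard set."""
    tp = len(found & gold)                     # true positives
    precision = tp / len(found) if found else 0.0
    recall = tp / len(gold) if gold else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

gold  = {("location", "area"), ("co", "company"), ("qty", "quantity")}
found = {("location", "area"), ("co", "company"), ("addr", "location")}
p, r, f = evaluate(found, gold)
print(round(p, 2), round(r, 2), round(f, 2))  # 0.67 0.67 0.67
```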

Related work

Work related to the issues discussed in this article falls into three areas: linguistic normalization, normalization techniques in schema matching, and the use of WN in schema matching.

Conclusion and future work

In this article, we presented a method for the semi-automatic normalization of schema elements labeled with abbreviations and CNs in a data integration environment. Our method can be applied to several other contexts, including ontology merging, data-warehouses and web interface integration. The experimental results have shown the effectiveness of our method, which significantly improves the results of the automatic lexical annotation method, and, as a consequence, enhances the quality of the

Acknowledgments

This work was partially supported by the “Searching for a needle in mountains of data!” project, funded by the Fondazione Cassa di Risparmio di Modena within the Bando di Ricerca Internazionale 2008 (http://www.dbgroup.unimo.it/keymantic), and by the MIUR FIRB Network Peer for Business project (http://www.dbgroup.unimo.it/nep4b).

Serena Sorrentino is a Ph.D. student (3rd year) of the Doctorate School in “Information and Communication Technologies (ICT) — Computer Engineering and Science” at the Department of Information Engineering, University of Modena and Reggio Emilia, Italy. Her areas of research are Intelligent Data Integration, Semantic Schema Matching and Natural Language Processing. She is a member of the “DBGROUP”, led by Professor Sonia Bergamaschi.

References (49)

  • S. Bergamaschi et al., Semantic integration of semistructured and structured data sources, SIGMOD Record (1999)
  • F. Giunchiglia et al., S-Match: an algorithm and an implementation of semantic matching
  • G.A. Miller et al., WordNet: an on-line lexical database, International Journal of Lexicography (1990)
  • S. Bergamaschi et al., Automatic annotation for mapping discovery in data integration systems
  • E. Rahm et al., A survey of approaches to automatic schema matching, The VLDB Journal (2001)
  • D. Beneventano et al., Synthesizing an integrated ontology, IEEE Internet Computing (2003)
  • J. Euzenat et al., Ontology Matching (2007)
  • X. Su et al., Semantic enrichment for ontology mapping
  • J. Li, LOM: a lexicon-based ontology mapping tool
  • H. Feild et al., An empirical comparison of techniques for extracting concept abbreviations from identifiers
  • L. Ratinov et al., Abbreviation expansion in schema matching and web integration
  • R. Uthurusamy et al., Extracting knowledge from diagnostic databases, IEEE Expert: Intelligent Systems and Their Applications (1993)
  • E. Hill et al., AMAP: automatically mining abbreviation expansions in programs to enhance software maintenance tools
  • I. Plag, Word-Formation in English, Cambridge Textbooks in Linguistics (2003)
  • J.N. Levi, The Syntax and Semantics of Complex Nominals (1978)
  • K. Barker et al., Semi-automatic recognition of noun modifier relationships
  • J. Madhavan et al., Generic schema matching with Cupid
  • K. Toutanova et al., Enriching the knowledge sources used in a maximum entropy part-of-speech tagger
  • S.N. Kim et al., Automatic interpretation of noun compounds using WordNet similarity
  • T.W. Finin, The semantic interpretation of nominal compounds
  • D. Moldovan et al., Models for the semantic classification of noun phrases
  • B. Rosario et al., Classifying the semantic relations in noun compounds
  • V. Nastase et al., Learning noun-modifier semantic relations with corpus-based and WordNet-based features
  • D. Ó Séaghdha, Learning compound noun semantics, Ph.D. thesis, Computer Laboratory, University of Cambridge, published...

Sonia Bergamaschi is a Professor and Head of the Department of Information Engineering, University of Modena and Reggio Emilia, Italy. Her research activity covers knowledge representation and management in the context of very large databases, facing both theoretical and implementation aspects. She has published several papers and conference contributions. Her research has been funded by Italian institutions and by European Community projects. She is leader of the DBGROUP (www.dbgroup.unimo.it), founder of the DATARIVER start-up (www.datariver.it), and a member of the IEEE Computer Society and of ACM.

    Maciej Gawinecki is a Ph.D. student in Computer Engineering and Science at the University of Modena and Reggio Emilia, Italy. He graduated from the University of Adam Mickiewicz, Poznań, Poland. He worked as a software engineer in several agent-oriented projects in Systems Research Institute of Polish Academy of Sciences. Currently, he is a member of AgentGroup of Prof. Giacomo Cabri. His research interests include: re-use of software components, Web service discovery and semantic annotation of structured and semi-structured data.

    Laura Po is a research fellow in Computer Engineering at the University of Modena and Reggio Emilia, Italy. She is a member of the DBGROUP, led by Professor Sonia Bergamaschi. Her major research interests focus on analysis and comparison of Word Sense Disambiguation methods, the development of automatic techniques for extracting metadata and annotating data sources and the use of Description Logic in knowledge representation. She received a Ph.D. in Computer Engineering and Science from the University of Modena and Reggio Emilia in 2009. She has been involved in several research projects (National and European), and published several papers and conference contributions.
