A semantic similarity method based on information content exploiting multiple ontologies

https://doi.org/10.1016/j.eswa.2012.08.049

Abstract

The quantification of the semantic similarity between terms is an important research area that provides a valuable tool for text understanding. Among the different paradigms used in related work to compute semantic similarity, information theoretic approaches have shown promising results in recent years by computing the information content (IC) of concepts from the knowledge provided by ontologies. These approaches, however, are limited by the coverage offered by the single input ontology. In this paper, we propose extending IC-based similarity measures by considering multiple ontologies in an integrated way. Several strategies are proposed according to the ontology to which the evaluated terms belong. Our proposal has been evaluated using a widely used benchmark of medical terms, with MeSH and SNOMED CT as ontologies. Results show an improvement in the similarity assessment accuracy when multiple ontologies are considered.

Highlights

► A method to compute IC-based semantic similarity from multiple ontologies is presented.
► Strategies to integrate both overlapping and disjoint ontologies are proposed.
► Evaluation has been performed with a standard benchmark and widely used medical ontologies.
► Results show an improvement in accuracy when multiple ontologies are considered.

Introduction

The estimation of the semantic similarity between terms contributes to the better understanding of textual resources. As a result, it has been applied in many different tasks such as word-sense disambiguation (Resnik, 1999), document categorization or clustering (Batet, 2011, Cilibrasi and Vitányi, 2006, Luo et al., 2011), word spelling correction (Budanitsky & Hirst, 2006), automatic language translation (Cilibrasi & Vitányi, 2006), ontology learning (Sánchez, 2010, Sánchez and Moreno, 2008a, Sánchez and Moreno, 2008b, Sánchez, Moreno, et al., 2012), semantic annotation (Sánchez, Isern, & Millán, 2011), information extraction (Atkinson et al., 2009, Sánchez and Isern, 2011), information retrieval (Al-Mubaid and Nguyen, 2006, Budanitsky and Hirst, 2006) or anonymisation of textual documents (Martínez, Sánchez, Valls, 2012, Martínez, Sánchez, Valls, et al., 2012).

Semantic similarity is understood as the degree of taxonomic proximity between terms. Similarity measures assess a numerical score that quantifies this proximity as a function of the semantic evidence observed in one or several knowledge sources. Usually, those resources consist of taxonomies and more general ontologies, which provide a formal and machine-readable way to express a shared conceptualisation by means of a unified terminology and semantic inter-relations from which semantic similarity can be assessed. In recent years, general purpose ontologies have been developed (such as WordNet), as well as domain-dependent ones (such as MeSH or SNOMED CT for the biomedical domain).

According to the theoretical principles and the way in which ontologies are analysed to estimate similarity, different families of methods can be identified. In a nutshell, edge-counting measures base the similarity assessment on the number of taxonomical links of the minimum path separating two concepts contained in a given ontology (Leacock and Chodorow, 1998, Li et al., 2003, Rada et al., 1989, Wu and Palmer, 1994). Although simple, these approaches offer limited accuracy because ontologies model a large amount of taxonomical knowledge that is not considered during the evaluation of the minimum path (Batet, Sánchez, & Valls, 2011). Feature-based approaches estimate similarity according to the weighted sum of the amount of common and non-common features (Sánchez, Batet, Isern, & Valls, 2012). By features, authors usually consider taxonomic and non-taxonomic information modelled in an ontology, in addition to concept descriptions (e.g., glosses) retrieved from dictionaries (Petrakis et al., 2006, Rodríguez and Egenhofer, 2003, Tversky, 1977). Because they consider additional semantic evidence during the assessment, they can potentially improve on edge-counting approaches. However, they usually rely on non-taxonomic features that are rarely found in ontologies (Ding et al., 2004) and require fine tuning of weighting parameters in order to integrate heterogeneous semantic evidence (Petrakis et al., 2006).
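To make the edge-counting idea concrete, the following is a minimal sketch of two measures from this family: the minimum path length and the Wu and Palmer ratio. The toy taxonomy, the concept names, and the convention that the root has depth 1 are assumptions made for illustration only; they are not taken from the paper or from any real ontology.

```python
# Toy single-inheritance taxonomy (hypothetical): child -> parent.
PARENT = {
    "disease": None,
    "infection": "disease",
    "cardiopathy": "disease",
    "pneumonia": "infection",
    "hepatitis": "infection",
    "myocarditis": "cardiopathy",
}


def ancestors(concept):
    """Return the path from `concept` up to the root, concept included."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def lcs(c1, c2):
    """Least common subsumer: the deepest concept subsuming both inputs."""
    seen = set(ancestors(c1))
    for node in ancestors(c2):
        if node in seen:
            return node
    raise ValueError("concepts share no subsumer")


def depth(concept):
    """Number of nodes from the root down to `concept` (assumed: root = 1)."""
    return len(ancestors(concept))


def path_length(c1, c2):
    """Number of taxonomic links on the shortest path between c1 and c2."""
    common = depth(lcs(c1, c2))
    return (depth(c1) - common) + (depth(c2) - common)


def wu_palmer(c1, c2):
    """Wu and Palmer ratio: 2 * depth(LCS) / (depth(c1) + depth(c2))."""
    return 2 * depth(lcs(c1, c2)) / (depth(c1) + depth(c2))


if __name__ == "__main__":
    for pair in [("pneumonia", "hepatitis"), ("pneumonia", "myocarditis")]:
        print(pair, path_length(*pair), round(wu_palmer(*pair), 3))
```

Both functions look only at the path through the least common subsumer, which illustrates the limitation noted above: the rest of the taxonomical knowledge in the ontology plays no role in the score.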

Finally, information content-based approaches, which are the focus of this work, assess the similarity between concepts as a function of the information content (IC) that both concepts have in common in a given ontology. In the past, IC was typically computed from concept distribution in tagged textual corpora (Jiang and Conrath, 1997, Lin, 1998, Resnik, 1995). However, this introduces a dependency on corpora availability and manual tagging that hampers accuracy and applicability due to data sparseness (Sánchez, Batet, Valls, & Gibert, 2010). To overcome this problem, in recent years, several authors have proposed ways to infer the IC of concepts in an intrinsic manner from the knowledge structure modelled in an ontology (Seco et al., 2004, Sánchez and Batet, 2011, Sánchez et al., 2011, Zhou et al., 2008). However, the fact that intrinsic IC-based measures rely only on ontological knowledge is also a drawback, because they depend completely on the degree of coverage and detail of the single input ontology. This limitation could be overcome by computing concepts' IC and estimating semantic similarity from multiple ontologies. As stated in Al-Mubaid and Nguyen (2009), the exploitation of multiple ontologies provides additional knowledge that can improve the similarity estimation and solve cases in which terms are not represented in an individual ontology. This is especially interesting in domains such as the biomedical one, in which several large and detailed ontologies are available, offering overlapping and complementary knowledge about the same topics.
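As an illustration of intrinsic IC, the sketch below uses the formulation attributed to Seco et al. (2004), IC(c) = 1 - log(hypo(c) + 1) / log(max_nodes), together with the classic Resnik and Lin measures. Note that this is not the Sánchez et al. (2011) model on which the paper relies; it is swapped in here because it is compact. The toy taxonomy and all concept names are hypothetical.

```python
import math

# Toy single-inheritance taxonomy (hypothetical): child -> parent.
PARENT = {
    "disease": None,
    "infection": "disease",
    "cardiopathy": "disease",
    "pneumonia": "infection",
    "hepatitis": "infection",
    "myocarditis": "cardiopathy",
}


def ancestors(concept):
    """Return the path from `concept` up to the root, concept included."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = PARENT[concept]
    return path


def lcs(c1, c2):
    """Least common subsumer of c1 and c2 in the toy taxonomy."""
    seen = set(ancestors(c1))
    return next(node for node in ancestors(c2) if node in seen)


def hyponym_count(concept):
    """Number of concepts strictly below `concept`."""
    return sum(1 for c in PARENT if c != concept and concept in ancestors(c))


def intrinsic_ic(concept):
    """Seco-style intrinsic IC: 0 for the root, 1 for leaves."""
    return 1.0 - math.log(hyponym_count(concept) + 1) / math.log(len(PARENT))


def resnik(c1, c2):
    """Resnik similarity: the IC of the least common subsumer."""
    return intrinsic_ic(lcs(c1, c2))


def lin(c1, c2):
    """Lin similarity: shared IC normalised by the IC of both concepts."""
    total = intrinsic_ic(c1) + intrinsic_ic(c2)
    # Guard (an assumption of this sketch): return 0 when neither concept has IC.
    return 2 * resnik(c1, c2) / total if total else 0.0


if __name__ == "__main__":
    print(round(resnik("pneumonia", "hepatitis"), 4))
    print(round(lin("pneumonia", "myocarditis"), 4))
```

Because the IC is derived only from the number of hyponyms in the taxonomy, a concept that is missing or poorly detailed in that taxonomy cannot be scored reliably, which is the dependency on a single ontology described above.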

As will be discussed in Section 2, few works propose similarity methods supporting more than one ontology, and all of them are framed in the context of the edge-counting and feature-based paradigms. In this paper we present a method to extend IC-based semantic similarity measures when multiple ontologies are available. As far as we know, no IC-based similarity method considering more than one input ontology has been proposed in the past. The method relies on a state-of-the-art approach to compute concepts' IC from an ontology in an intrinsic manner (Sánchez et al., 2011). On the one hand, our method permits estimating the similarity when a term or a term pair is missing in a certain ontology but is found in another one. On the other hand, in the case of overlapping knowledge (i.e., ontologies covering the same terms), our approach increases the accuracy by selecting the most reliable IC and similarity estimation from those computed from each individual ontology. The method has been evaluated by means of a widely used benchmark of biomedical terms and the above-mentioned biomedical ontologies. Results show that intrinsic IC measures are able to improve on other similarity computation paradigms. Moreover, the exploitation of several complementary and/or overlapping ontologies during the similarity assessment improved the accuracy with respect to the mono-ontology scenario.

The rest of the paper is organised as follows. Section 2 introduces related works proposing methods for semantic similarity assessment from multiple ontologies. Section 3 analyses different approaches for computing the IC of a concept, focusing on ontology-based methods, and then presents classic IC-based similarity measures. Section 4 describes our method to exploit multiple ontologies for similarity assessment, detailing the strategies proposed to tackle the problem according to the ontology to which the evaluated terms belong. Section 5 evaluates our approach, comparing it to a mono-ontology scenario. The final section contains the conclusions and some lines of future research.

Section snippets

Related work

Semantic similarity estimation methods supporting multiple ontologies are based on the edge-counting and feature-based paradigms.

In Rodríguez and Egenhofer (2003), the similarity is computed as the weighted sum of similarities between synonym sets, features (e.g., meronyms, attributes, etc.) and neighbour concepts (those linked via semantic pointers) of evaluated terms. Petrakis et al. (2006) extended the previous approach relying on the matching between synonym sets and concept glosses (i.e.,

Information content and semantic similarity

The information content (IC) of a concept states the amount of information provided by the concept when appearing in a context. In this manner, general and abstract entities present less IC when found in a discourse than more concrete and specialised ones. A proper quantification of the IC of concepts improves text understanding by enabling assessing the degree of semantic generality or concreteness of words referring to these concepts. In fact, as stated in the introduction, IC has been

Extending IC-based measures to multiple ontologies

The availability of several knowledge sources can potentially aid the similarity assessment in cases in which a concept or a concept pair is missing in one ontology but found in another, or in situations in which ontologies overlap. In this section we present a method to extend IC-based similarity measures to take advantage of multiple input ontologies.

Regarding IC-based similarity estimation, as shown in the previous section, the key point to compare a pair of concepts is to retrieve their LCS.
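The role of the LCS in a multi-ontology setting can be sketched as follows. This is only an illustration under stated assumptions, not the strategies defined in the paper, whose details are not included in this excerpt: each ontology containing both concepts yields its own LCS, the IC of that LCS is computed with a Seco-style intrinsic model, and the highest value is kept. The selection rule, the function names, and the two toy ontologies (`onto_a`, `onto_b`) are all hypothetical.

```python
import math

# Two hypothetical toy ontologies (child -> parent); not MeSH or SNOMED CT.
ONTOLOGIES = {
    "onto_a": {
        "disease": None,
        "infection": "disease",
        "pneumonia": "infection",
        "hepatitis": "infection",
    },
    "onto_b": {
        "disease": None,
        "infection": "disease",
        "respiratory_infection": "infection",
        "liver_disease": "disease",
        "pneumonia": "respiratory_infection",
        "bronchitis": "respiratory_infection",
        "hepatitis": "liver_disease",
    },
}


def ancestors(taxonomy, concept):
    """Path from `concept` up to the root of `taxonomy`, concept included."""
    path = []
    while concept is not None:
        path.append(concept)
        concept = taxonomy[concept]
    return path


def lcs(taxonomy, c1, c2):
    """Least common subsumer of c1 and c2 within one taxonomy."""
    seen = set(ancestors(taxonomy, c1))
    return next(node for node in ancestors(taxonomy, c2) if node in seen)


def intrinsic_ic(taxonomy, concept):
    """Seco-style intrinsic IC (assumed model for this sketch)."""
    hypo = sum(
        1 for c in taxonomy
        if c != concept and concept in ancestors(taxonomy, c)
    )
    return 1.0 - math.log(hypo + 1) / math.log(len(taxonomy))


def multi_ontology_resnik(ontologies, c1, c2):
    """Return (ontology name, IC of LCS) with the highest IC, or None.

    Assumption: only ontologies containing both concepts are considered,
    and the most informative LCS is selected.
    """
    scores = {
        name: intrinsic_ic(tax, lcs(tax, c1, c2))
        for name, tax in ontologies.items()
        if c1 in tax and c2 in tax
    }
    if not scores:
        return None  # the pair is not covered jointly by any single ontology
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    print(multi_ontology_resnik(ONTOLOGIES, "pneumonia", "bronchitis"))
    print(multi_ontology_resnik(ONTOLOGIES, "pneumonia", "hepatitis"))
```

In this toy setting the first pair can only be scored through `onto_b`, while the second pair is covered by both ontologies and the more specific LCS from `onto_a` is kept. Pairs whose concepts appear only in different ontologies return `None` here, since handling them requires a way to connect the ontologies that this sketch does not attempt.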

Evaluation

In order to evaluate the benefits that multiple ontologies bring to similarity assessments, we have applied the IC-based measures introduced in Section 3.3 to several mono- and multi-ontology scenarios using, in the latter case, the strategies proposed in Section 4.

To enable the multi-ontology setting, we have selected a domain in which several detailed and partially overlapping ontologies are available: biomedicine. In this context, SNOMED CT and MeSH knowledge sources have been used as background

Conclusions

The availability of multiple input ontologies makes it possible (i) to compute the similarity of concepts missing in one ontology but present in another, and (ii) to select the most accurate estimation from those computed from different ontologies in case of overlapping knowledge (i.e., concepts belonging to several ontologies at the same time). The former case improves the recall of the similarity estimation and avoids depending on the coverage of an individual source, a serious limitation of

Acknowledgements

This work was partly funded by the Spanish Government through the projects CONSOLIDER INGENIO 2010 CSD2007-0004 “ARES” and TIN2012-32757 “ICWT”, and by the Government of Catalonia under Grant 2009 SGR 1135.

References (39)

  • D. Sánchez et al. Ontology-based semantic similarity: A new feature-based approach. Expert Systems with Applications (2012)
  • D. Sánchez et al. Learning non-taxonomic relationships from web documents for domain ontology construction. Data & Knowledge Engineering (2008)
  • D. Sánchez et al. Learning relation axioms from text: An automatic Web-based approach. Expert Systems with Applications (2012)
  • D. Sánchez et al. Enabling semantic similarity estimation across multiple ontologies: An evaluation in the biomedical domain. Journal of Biomedical Informatics (2012)
  • H. Al-Mubaid et al. Measuring semantic similarity between biomedical concepts within multiple ontologies. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews (2009)
  • H. Al-Mubaid et al. A cluster-based approach for semantic similarity in the biomedical domain
  • M. Batet. Ontology-based semantic clustering. AI Communications (2011)
  • A. Budanitsky et al. Evaluating WordNet-based measures of semantic distance. Computational Linguistics (2006)
  • R.L. Cilibrasi et al. The Google similarity distance. IEEE Transactions on Knowledge and Data Engineering (2006)