21.08.2021  Regular Paper Open Access
Knowledgebased sentence semantic similarity: algebraical properties
 Zeitschrift:
 Progress in Artificial Intelligence
Wichtige Hinweise
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
1 Introduction
Measures of semantic similarities have been primarily developed for quantifying the extent of resemblance between two words or two concepts using preexisting resources that encode wordtoword or concepttoconcept relationships as in WordNet lexical database [
1,
2].
Accurate comparison between text snippets for the similarity determination is a fundamental prerequisite in the areas of natural language processing, information retrieval, text summarization, document clustering, question answering, automatic essay scoring and others [
3,
4]. For instance, the quantification of the similarity between candidate sentences can allow us to promote a good summary coverage and prevent redundancy in automatic text summarization [
5]. In this respect, the similarity values of sentence pairs are sometimes used as part of the statistical features of the text summarization system. Likewise, question answering applications require similarity identification between a question–answer or question–question pair [
6]. Similarly, querybased summarization crucially relies on similarity scores for summary extraction [
7]. On the other hand, semantic similarity plays a crucial role in information retrieval where the matching documents are ranked according to their similarity with the supplied query [
3,
8,
9]. Plagiarism detection is a recent area of research which is solely based on text similarity detection [
10‐
12]. Among other text similarity applications, one shall mention machine translation [
13], text classification [
3,
4,
14], database where similarity is used for schema matching [
15], and bioinformatics [
16].
Anzeige
Strictly speaking, computing sentence similarity is not trivial due to the variety of linguistic constructs and inherent ambiguity in textual expressions, which renders the task of determining semantically equivalent sentences very challenging even for human beings, especially when the contextual information is not wellknown [
4].
Conventionally, sentencetosentence similarity computation can be categorized into two streams of approaches. The first one advocates a lexical or a stringbased approach where sentence similarity quantifies the amount of overlapping of the characters constituting the two input sentences. A such approach is primarily intended to capture similarity among the input sentences in terms of the extent of the string sequence matching between the two inputs. This can provide insights to identify gaps that might be attributed to misspelling as in edit distance [
17]. The second stream corresponds to the semantic similarity which attempts to capture the overlap between the meanings conveyed by the individual sentences [
18]. This category includes several other approaches as well. Notably, one distinguishes (i) the corpusbased approach which uses statistical analysis about word cooccurrence over a large corpus as in the newly emerging distributional and embedding methods [
19]; (ii) knowledgebased approach, which relies on a handcrafted semantic net of words as in WordNet lexical database [
20]; (iii) featurebased methods where sentences are represented through a vector of predefined features [
19]; hybridbased approach that seeks to leverage features methods with corpusbased or knowledgebased approaches [
18]. In parallel, one shall also mention the growing interest to deep learning architecturebased semantic similarity where attention weighs mechanism was proposed to accommodate finegrain variations in various linguistic constructs as in Quan et al. [
21]
In the last decade, knowledgebased linguistic measures with their various variants become the stateoftheart methodology for computing the similarity scores among pairs of text snippets, which forms the basis of many commercial plagiarism detection tools, automatic summarization and information retrieval systems [
4]. In essence, these methods utilize wordlevel semantic networks and lexical relations to arrive at sentencelevel relevance. Electronic resources such as WordNet [
1,
2] and ConceptNet [
22] are acknowledged as the main lexical resources and knowledge bases for this purpose where the quantification of the semantic similarity between two sentences is evaluated as a global measure on pairwise comparison of word similarity of these sentences [
23‐
26]. Nevertheless, a such construction of sentencetosentence semantic similarity from individual word semantic similarity is also prone to at least three inherent limitations and challenges.

(A1) The order of words in individual sentences cannot be accounted for. Therefore, sentences like “students like fast exercises in class” and “fast students like class in” would yield the same semantic similarity. We refer to this challenge A1. Attempts to account for word order in the semantic similarity have been investigated by several researchers but with very limited success. For instance, Islam and Inkpen [ 27] suggested to include a wordorder similarity as a weighted additive component to the semantic similarity, where the wordorder similarity is computed by the normalized difference of common words among the tokens of two input sentences. This presupposes the existence of a set of tokens, which are common to both inputs. A very similar approach has also been adopted by Li et al. [ 25, 26]. Ozates et al. [ 28] proposed to use the dependency grammar concept. In essence their approach uses dependency tree bigram units and evaluates the similarity of the two input sentences as the amount of the bigram unit match. Nevertheless, it should be noted that handling bigram or ngram unit instead of standard bagofword representation entails substantial increase of computational time, without necessarily achieving higher accuracy score as pointed out in some other studies [ 18, 20].

(A2) The WordNet wordlevel semantic is restricted to noun and verb partofspeech (PoS) categories only. This makes nouns and verbs as the two only classes usable to calculate the similarity scores between individual words because of their hierarchical taxonomic organization in WordNet. This trivially leaves the correlation among entities of distinct PoS as well as other types of nonverbal and nonnaming entities uncounted for. For instance, WorldNet measures fail to establish the semantic relation of the word “investigate” to any of the words “investigation”, “investigator”, or “investigative” as they belong to different classes (part of speech) which are not taxonomically linked in WordNet hierarchy. Although the derivational morphology of words is contained in the WordNet lexical database, but in a distinct fashion and without explicit coherence, such limitations have already been pointed out in other studies, see, for instance [ 1, 2]. Therefore, accounting for various PoS entities in the quantification of the sentencetosentence semantic similarity remains an open challenge to the research community. We shall refer to this challenge A2.

(A3) Several WordNet semantic similarity measures have been put forward by the researchers, some of which are solely based on the hierarchy of the WordNet taxonomy [ 29, 30], while others make use corpusbased information as well [ 31‐ 33]. Therefore, the contribution of individual semantic similarity measure to sentencetosentence similarity is still to be investigated. Although several studies have reported some ad hoc and experimentalbased comparison as pointed out in [ 34‐ 38], the almost absence of theoretical studies in this respect is quite striking. We shall refer to this challenge A3.
This paper aims to address challenges
A2 and
A3. The cornerstone ingredients for addressing these challenges are twofold. To address challenge
A2, we advocate a new proposal for PoS word conversation using WordNet taxonomy and derivational relations to promote the “Alltonoun” conversation, where all lexemes are turned into their noun counterparts. The alltonoun conversion is motivated by several grounded arguments. First, noun entities are much more abundant in WordNet taxonomy than verbal or other PoS categories [
2,
39]. Second, the graph hierarchy of noun entities is more elaborated than its verbal counterpart as testified by its larger depth. Third, it provides an appropriate framework to handle named entities. A testing experiment is put forward to quantify the usefulness of a such transformation. Next, to address challenge
A2, we hypothesize that the knowledge of algebraical properties of semantic similarity measures would provide useful insights to choose appropriate semantic measure in a given context. Especially, we mainly focus on hierarchicalbased semantic similarity measures, namely pathlength measure [
37], Wu and Palmer measure [
29] and Leacock and Chodorow [
30] measure because of their dominance in the text mining literature and applications [
3,
40], where some theoretical results related to the expected performance of such measures are pointed out. Particularly, we examine the properties of these measures in terms of range, monotonicity, boundary and computational complexity. The extension from wordtoword to sentencetosentence semantic similarity is also investigated. Particularly, algebraical and intuitive properties of the sentencetosentence similarity measure when using the main WordNet hierarchicalbased semantic similarity measures are laid bare, providing a tremendous help to a user when deciding to choose a particular measure according to his/her context. In terms of research questions, this paper tackles the following:
Anzeige
Q1
What is the benefit of the “Alltonoun” derivation transformation in the calculus of the sentencetosentence semantic similarity?
Q2
What are the algebraical properties of the main hierarchical WordNet semantic similarity measures?
Q3
How are these properties extended from wordlevel semantic similarity to sentencelevel similarity?
The rest of the paper is organized as follows. Section
2 reviews WordNet semantic similarity measures where algebraical properties of the main taxonomybased measures are examined, which tackle Research Question
Q2. Section 3 investigates the sentencetosentence semantic similarities and their associated properties, contributing to Research Question
Q3. In Sect.
4, we will explore the WordNet PoS conversions detailing the implication of such conversion on sentencetosentence semantic similarity measures through exemplification and testing. Especially, the performance of the suggested alltonoun conversation in terms of sentencetosentence semantic similarity using a publicly available dataset is performed. The results were also compared to other conversion approaches, contributing to Research Question
Q1.
2 WordNet and semantic similarity
2.1 WordNet taxonomy
In WordNet [
1,
2], words are clustered into synsets, which are considered to be either synonym or at least carry the same semantic meaning. The words in each synset are grouped so that they are interchangeable in certain contexts. Typically, synsets are used to represent lexical concepts by bounding words and word senses together in a lexical matrix. A single word may appear in more than one synset, which agrees with the fact that a word may have multiple meanings or can appear in multiple parts of speech (verb, noun and adjective/adverb).
Synsets are linked with each other through various semantic relations. The most common relationships are the
Hyponym/Hypernym and
Meronym/Holonym relationships, which are defined for noun category. Hyponymy (resp. hypernym) describes the relation of being subordinate (resp. superior) or belonging to a lower (resp. higher) rank or a class. Meronymy–holonym corresponds to the relation that holds between a part and the whole. For verb category,
Troponym and
Entailment relations are examples of implemented hierarchical relationships in WordNet.
This design creates a sequence of levels going from a specific word (noun or verb) from a lower level to a broader category at the top level. In WordNet, the relation between lexicalized concepts is implemented by a pointer between the corresponding synsets. For example, if we use brackets to indicate synsets and “@ → ” to represent a relation to the meaning of ‘ISA’ or ‘ISAKINDOF’ a possible hierarchy would be:
{robin, redbreast} @ → {bird} @ → {animal, animate_being} @ → {organism, life_form, living_thing}
This manner of representing hyponymy and hypernymy yields a lexical hierarchy in the form of a tree diagram. Such hierarchies are called inheritance systems because items are inheriting information from their superordinates. Thus, all properties of the superordinate are assumed to be properties of the subordinate object as well, while the nouns in WordNet are an example of a lexical inheritance system.
In theory, it is possible to combine all hypernyms into one hierarchy subordinate to an empty synset with no superordinates called a unique beginner. In WordNet, there are several noun hierarchies each starting with a different unique beginner. These multiple hierarchies belong to distinct semantic fields, each with a different vocabulary. Furthermore, since all hyponyms inherit the information of their hypernym, each unique beginner represents a primitive semantic component of the hyponym in question.
Unlike nouns, some verbs are not organized in hierarchies. These are connected by antonym relationships and similar relationships, like adjectives. Adjectives are organized into both head and satellite synsets, each organized around antonymous pairs of adjectives. These two adjectives are considered head synsets. Satellite synsets include words whose meaning is similar to that of a single head adjective. Nouns and adjectives that are derived from verbs and adverbs that are derived from adjectives have pointers indicating these relations.
An example of WordNet hierarchies is shown in Fig.
1 (taken from WordNet 1.5). In this example, the synset {car; auto; automobile; machine; motorcar} is related to:

A broader concept (hypernym synset): {motor vehicle; automotive vehicle},

More specific concepts or hyponym synsets: e.g. {cruiser; squad car; patrol car; police car; prowl car} and {cab; taxi; hack; taxicab},

Parts of it is composed of: {bumper}; {car door}, {car mirror} and {car window} (meronymy relationship).
×
Now given a specific hierarchy of noun/verb concepts, one can estimate the semantic similarity between any two concepts by exploring the structure of the hierarchy between the two underlying concepts. This is detailed in the forthcoming sections, where a theoretical investigation of some of the key measures making use of only graph structure is carried out.
3 Taxonomybased semantic similarity
We shall use the following notations:

len(c _{i} ,c _{j} ): the length of the shortest path from synset c _{i} to synset c _{j} in WordNet lexical database.

lcs(c _{i},c _{j}): the lowest common subsumer of c _{i} and c _{j}

depth(c _{i} ): shortest distance in terms of number of edges or nodes from global root to synset c _{i,} where the global root is such that depth(root) = 1.

max_depth: the maximum depth of the taxonomy
Wu and Palmer semantic measure [
29] between two concepts or synsets, say,
c
_{1} and
c
_{2} compares the depth of the lowest common subsumer of two concepts to depth of individual concepts as in the following equation.
$$ {\text{Si}}m_{wup} \left( {c_{1} ,c_{2} } \right) = \frac{{2{*}depth\left( {lcs\left( {c_{1} ,{ }c_{2} } \right)} \right)}}{{depth\left( {c_{1} } \right) + depth\left( {c_{2} } \right)}} $$
(1)
Similarly, Leacock and Chodorow [
30] developed a similarity metric that uses the shortest path between the two concepts, normalized using the maximum depth of the taxonomy:
$$ {\text{Si}}m_{lch} \left( {c_{1} ,c_{2} } \right) =  log \frac{{len\left( {c_{1} , c_{2} } \right)}}{{2 {\text{max}}\_depth}} $$
(2)
A third simple measure employing WordNet hierarchy makes use only the path length
^{1} between the two associated synsets [
37]:
$$ Sim_{path} \left( {c_{1} ,{ }c_{2} } \right) = 1/len\left( {c_{1} ,{ }c_{2} } \right) $$
(3)
It should be noted that
Sim
_{wup},
Sim
_{path} measures are normalized within the unit interval, while
Sim
_{lch} ranges from
\(\log 2\) to
\(\log \left( {2{\text{max}}\_depth} \right)\). Normalization in unit interval can be achieved using:
$$ Sim_{lch}^{*} \left( {c_{1} ,c_{2} } \right) = \frac{{Sim_{lch} \left( {c_{1} ,c_{2} } \right)  \log 2}}{{\log \left( {\left( {2{\text{max}}\_depth} \right)} \right){ }  {\text{ log}}2}} $$
(4)
The similarity between two words, say, w
_{1} and w
_{2}, is generated from the similarity of the synsets they induce as follows:
where
\(s(w_{1} )\) (resp.
\(s(w_{2} )\)) is the set of synsets (concepts) in WordNet taxonomy that are senses of word w
_{1} (resp. w
_{2}).
$$ Sim\left( {w_{1} ,w_{2} } \right) = \mathop {\max }\limits_{{c_{1} \in s\left( {w_{1} } \right),c_{2} \in s\left( {w_{2} } \right)}} Sim\left( {c_{1} ,c_{2} } \right), $$
(5)
3.1 Properties of taxonomybased semantic similarity measure
First, looking at the hyponymy/hypernymy relation
R: “…IS…A…” or “…IS A KIND OF…”, denoted
“@ → ”, which has already been pointed out in Section A, reveals that
R acts as a partial order relation over the set of all synsets (or concepts). Indeed,

R is trivially reflexive;

R is transitive: for any synsets c _{1}, c _{2}, c _{3} such that c _{1} @ → c _{2} @ → c _{3} , entails c _{1} @ → c _{3} . For example, since “dog” is a hyponym of “mammal” and “mammal” is a hyponym of “animal”, “dog is a hyponym of animal”.

R is antisymmetric: for any synsets c _{1}, c _{2}, if c _{1} @ → c _{2} and c _{2} @ → c _{1} entails c _{1} = c _{2} .
The partial ordering follows from the fact that there are synsets, which are not related by hyponymy relationship.
However, the translation of the hyponym relationship into semantic relations in the sense of the above is not straightforward. A possible question of interest is whether there is any relationship between the value of the semantic similarity and the occurrence or absence of any hyponymy relation. In this respect, intuitively, the following holds.
Proposition 1
Synsets
\(c_{i}\) and
\(c_{j}\) are linked by hyponymy relation if either
\(c_{i} = lcs\left( {c_{i} ,c_{j} } \right)\) or
\(c_{j} = lcs\left( {c_{i} ,c_{j} } \right)\).
The preceding shows that the information about the lowest common subsumer provides a relevant information regarding the existence of hyponymy
^{2} relation. Nevertheless, such information is not straightforwardly inferred from semantic similarity measures. Let us first investigate the properties of such semantic relations in terms of range of values assigned to each of them, monotonicity and boundary cases.
Unless stated otherwise, one shall use notation
Sim
_{x}
(.,.) to stand for any of path, Wu and Palmer, normalized Leacock and Chodorow similarity measures. Consider the relation
Sim
_{x}
(c
_{i}
, c
_{j}
) “
c
_{i} is semantically related to
c
_{j} in the sense of
Sim
_{x}”, then it holds:

Reflexivity: Sim _{x} (c _{i} ,c _{i} ) = 1.

Symmetry: Sim _{x} (c _{i} ,c _{j} ) = Sim _{x} (c _{j} ,c _{i} )

\(0 \le Sim_{x} \left( {c_{i} ,c_{j} } \right) \le 1\)
The above properties were trivial and follow straightforwardly from the definitions of the similarity measures in (1) and (3–4). Other properties of
Sim
_{x} are summarized in the following Propositions whose proofs are reported to Appendix of this paper.
Proposition 2
For all synsets
\({c}_{i}, {c}_{j}\), it holds:
(i)
$$ \frac{1}{{{\text{max}}\_depth}} \le Sim_{path} \left( {c_{i} ,c_{j} } \right) \le 1 $$
(6)
(ii)
$$ \frac{2}{{{\text{max}}\_depth + 1}} \le Sim_{wup} \left( {c_{i} ,c_{j} } \right) \le 1 $$
(7)
(iii)
$$ 0 \le Sim_{lch}^{*} \left( {c_{i} ,c_{j} } \right) \le { }1 $$
(8)
Proposition 3
$$ Sim_{x} \left( {c_{i} ,c_{j} } \right) = 1 \Leftrightarrow c_{i} = c_{j} $$
(9)
Proposition 3 demonstrates that the only case where the semantic similarities take their maximum value is when the underlying pair of words belongs to the same synset.
Proposition 4
For all synsets
\({c}_{i},{c}_{j},{c}_{k},{c}_{l}\),
(i)
$$ Sim_{path} \left( {c_{i} ,{ }c_{j} } \right) = Sim_{path} \left( {c_{k} ,{ }c_{l} } \right) \Leftrightarrow Sim_{lch} \left( {c_{i} ,c_{j} } \right) = Sim_{lch} \left( {c_{k} ,c_{l} } \right) $$
(10)
(ii)
$$ Sim_{path} \left( {c_{i} ,c_{j} } \right) < Sim_{path} \left( {c_{k} ,c_{l} } \right) \Leftrightarrow Sim_{lch} \left( {c_{i} , c_{j} } \right) < Sim_{lch} \left( {c_{k} , c_{l} } \right) $$
(11)
To prove the above statements, it is enough to see that
\(Si{m}_{path}\) and
\(Si{m}_{lch}\) are related to each other through log and linear transformations, and since both logarithmic and linear transformations are strictly monotonic functions, the result follows straightforwardly. Besides, the statements in the core of Proposition 4 are also valid for the normalized Leacock and Chodorow similarity
\(S{im}_{lch}^{*}\). We shall refer to behaviour induced by expressions (
10) and (
11) to the
monotonic equivalence property fulfilled by the pathlength and Leacock and Chodorow semantic similarities. However, such a monotonic equivalence property does not hold between
\(Si{m}_{wup}\) and any of the two other semantic similarities. To see it, one shall consider the following counterexample.
Example 1
Sim
_{path} (process, attribute) = 0.2;
Sim
_{wup} (process, attribute) = 0.5
Sim
_{path} (whole, foode) = 0.1667;
Sim
_{wup} (whole, foode) = 0.5455.
So it is easy to notice that:
Sim
_{path} (process, attribute) >
Sim
_{path} (whole, foode), while
Sim
_{wup} (process, attribute) <
Sim
_{wup} (whole, foode).
Proposition 5
For all synsets
\({c}_{i}, {c}_{j}\) it holds:
(i)
$$ Sim_{path} \left( {c_{i} ,c_{j} } \right) \le Sim_{wup} \left( {c_{i} , c_{j} } \right) $$
(12)
(ii)
$$ \begin{gathered} Sim_{wup} \left( {c_{i} ,c_{j} } \right) \le Sim_{lch}^{*} \left( {c_{i} ,c_{j} } \right),\;{\text{if}}\;len(c_{i} ,c_{j} ) \le 2 \hfill \\ {\text{Otherwise}},\;Sim_{wup} \left( {c_{i} ,c_{j} } \right) > Sim_{lch}^{*} \left( {c_{i} ,c_{j} } \right) \hfill \\ \end{gathered} $$
(13)
Proposition
5 shows that for any pair of synsets that are semantically close in WordNet hierarchy (either being synonyms or one is a direct hyponym of another to ensure the condition
\(len(c_{i} ,c_{j} ) \le 2\)), the path similarity is the most conservative among the three similarity measures. Otherwise, the Wu and Palmer measure is the less conservative one. This is especially relevant when the order of magnitude of the semantic similarity is deemed important.
Statements in Proposition
5 can also be seen as providing some ordering relationship among the three semantic similarity measures,
Proposition 6
$$ c_{i} \ne c_{j} \Rightarrow Sim_{path} \left( {c_{i} ,c_{j} } \right) \le 0.5\;{\text{and}}\;Sim_{lch}^{*} \left( {c_{i} ,c_{j} } \right) < 0.7 $$
(14)
The proof of the statement (14) follows from the fact that in the case of different synsets, then trivially
\(len({c}_{i}, {c}_{j})\ge 2\)
, which, after putting the lower bound “2” in (3) and (4) and noticing that the maximum depth in WordNet 3.0 is 20, is translated into inequalities pointed out in the core of this Proposition.
Proposition 6 indicates that Wu and Palmer similarity is the only one that allows the user to expect to obtain high similarity values, close to one when using different synsets. From this perspective,
\( S{im}_{wup}\) has some advantages with respect to the other two semantic similarity measures, especially when thresholdinglike approach was employed. In other words, the range of values that can be attributed to
\(Si{m}_{path}\) and
\(Si{m}_{lch}^{*}\) contain empty slots of (0, 5 1] and (0.7 1], respectively.
Another interesting property concerns the behaviour of these semantic similarity measures when one of the synsets is a direct hyponym of the other one. Strictly speaking, intuitively, a (full) equivalence relation between the hyponymy and semantic similarity relations cannot be held as the former is antisymmetric while the latter is symmetric. Nevertheless, this does not exclude the existence of hints and/or links between the two concepts. In this course, the following holds.
Proposition 7
Assume synsets
\({c}_{1}@\to {c}_{1}^{\prime},\)
\({c}_{2}@\to {c}_{2}^{\prime},\) then it holds.
(i)
If
\({c}_{1, }{c}_{2},{c}_{1}^{\prime},{c}_{2}^{\prime}\) have the same lower common subsumer, then
\(Si{m}_{*}({c}_{1},{c}_{2})\le Si{m}_{*}({c}_{1}^{\prime},{c}_{2}^{\prime})\)
(ii)
If
\({c}_{1}^{\prime}\) and
\({c}_{2}^{\prime}\) are direct hyponyms of
\({c}_{1}\) and
\({c}_{2}\), respectively, and do not share lower common subsumer, then
\(Si{m}_{*}({c}_{1},{c}_{2})\ge Si{m}_{*}({c}_{1}^{\prime},{c}_{2}^{\prime})\). Especially,
\(Si{m}_{*}({c}_{1},{c}_{2})\ge Si{m}_{*}({c}_{1}^{\prime},{c}_{2}^{\prime})\) for path and Leacock and Chodorow semantic similarities.
(iii)
If
\({c}_{1}^{\prime} (resp. {c}_{2}^{\prime})\) is a direct distinct hyponym of
\({c}_{2}^{\prime}\) (resp.
\({c}_{1}^{\prime}\)), then no stable relationship to
\(Si{m}_{*}\) exits.
Proposition
7 indicates that the hyponymy relationship among synsets does not extend straightforwardly to semantic similarity of pairs of synsets. Especially, it is found that the preservation of the monotonicity relationship is guaranteed only when the pairs share the same lowest common subsumer. Otherwise, it has also been pointed out that path and Leacock/Chodorow semantic similarities also conserve the monotonicity in the case of direct hyponymy relationship or when one of the elements of the pair is the lowest common subsumer of the other element.
Especially, in the case of three synsets only, say,
\({c}_{1},{c}_{2},{c}_{3}\) where
\({c}_{1}@\to {c}_{1}^{\prime}\), then it holds that.
(i)
If
\({c}_{1},{c}_{2},{c}_{1}^{\prime}\) share the same lowest common subsumer, then
\(Si{m}_{*}({c}_{1},{c}_{2})\le Si{m}_{*}({c}_{1}^{\prime},{c}_{2})\).
(ii)
Similarly, if
\({c}_{2}@\to {c}_{2}^{\prime}\), where
c’
_{2} share the same lowest common subsumer with (
c
_{1}
,c
_{2}), then
\(Sim_{*} (c_{1} ,c_{2} ) \le Sim_{*} (c_{1} ,c^{\prime}_{2} )\).
On the other hand, regarding the boundary values of the various semantic similarity measures when one of the synset is a direct hyponym of the other one, the following holds.
Proposition 8
Assume that
\({c}_{i}\) is a direct hyponym (or hypernym) of
\({c}_{j}\), then it holds that.
(i)
$$ Sim_{wup} \left( {c_{i} ,c_{j} } \right) \ge 0.8 $$
(15)
(ii)
$$ Sim_{path} \left( {c_{i} ,c_{j} } \right) = 0.5 $$
(16)
(iii)
$$ Sim_{lch}^{*} \left( {c_{i} ,c_{j} } \right) = 1  \frac{\log \left( 2 \right)}{{\log \left( {2{\text{max}}\_depth} \right)  \log \left( 2 \right)}} $$
(17)
It should be noted that the results pointed out in Proposition
8 are not necessarily held in reverse order; namely, for instance, if
\(Si{m}_{wp}({c}_{i},{c}_{j})=0.8\), this does not necessarily entail that
\({c}_{i}\) (resp.
\({c}_{j}\)) is a hyponym of
\({c}_{j}\) (resp.
\({c}_{i}\)).
However, the reverse implication holds in case of path or normalized Leacock and Chodron semantic similarity measures because whenever the length between the two synsets is two, it implicitly entails that one is a direct hyponym of the other one.
Proposition 9
Given a sequence of hyponymy relations as
then it holds that
$$ c_{1} @ \to c_{2} @ \to c_{3} @ \to \ldots c_{n  1} @ \to c_{n} , $$
(18)
$$Si{m}_{*}({c}_{1},{c}_{2})\ge Si{m}_{*}({c}_{1},{c}_{3})\ge Si{m}_{*}({c}_{1},{c}_{4})\ge ...\ge Si{m}_{*}({c}_{1},{c}_{n1})\ge Si{m}_{*}({c}_{1},{c}_{n})$$
(19)
Proposition
9 indicates that when a direct hyponym is available, this yields the highest semantic similarity measure with its associated hypernym among all possible other distinct hyponyms. Proposition
9 also highlights typical scenarios in which the monotonicity of the hyponym relation is preserved when translated into a semantic similarity measure.
The result pointed out in Proposition
9 is in a full agreement with the intuition behind the concept of synset in the sense that the more the hyponym synset is close to the current synset, the more one expects the underlying semantic similarity becomes higher. The preceding reveals the importance of the concept of direct hyponymy to ensure the satisfaction of the monotonicity relation. To see it, it suffices to see the example of Fig.
2a, b in Appendix where c’
_{2} is also a hyponym (but not a direct hyponym) of c
_{1} and nothing prevents
\(Sim_{*} (c_{1} ,c_{2} )\) to be greater than
\(Sim_{*} (c_{1} ,c^{\prime}_{2} )\).
×
Finally, exploring the computational complexity of the three semantic similarity measures, and denoting by
T(Sim
_{x}
), the execution time of the similarity measure
Sim
_{x}, the following holds (in asymptotic behaviour).
Proposition 10
T(Sim
_{path}
) ≤
T(Sim
_{lch}
) ≤
T(Sim
_{wup}
)
To prove the above inequality, consider two concepts c
_{1}, c
_{2} (same reasoning applies for two words w
_{1}, w
_{2} as well by maximization over all induced concepts), one shall notice the following.
The calculus of
len(c
_{1}
,c
_{2}
) required in
Sim
_{path} and
Sim
_{lch} can be performed in
O(V + E) time where V and E stand for the number of vertices and edges in the WordNet hierarchy (using Breadthfirst search algorithm).
Noticing that the least common subsumer
lcs(c
_{1}
,c
_{2}
) can be inferred straightforwardly when a shortest path from c
_{1} and c
_{2} (
len(c
_{1}
,c
_{2}
)) is found without any additional exploration of the network, while additional network scrutinizing is required to calculate
depth(c
_{1}
) and
depth(c
_{2}
) in
Sim
_{wup}. This yields
Sim
_{path}(c
_{1}, c
_{2}) ≤
Sim
_{wup}(c
_{1}, c
_{2}) and
Sim
_{lch}(c
_{1}, c
_{2}) ≤
Sim
_{wup}(c
_{1}, c
_{2}).
To compare
T(Sim
_{path}
) and
T(Sim
_{lch}
), the issue boils down to the comparison between the computational complexity of the arithmetic division and logarithmic operations. In this respect, if
len(c
_{1}
,c
_{2}
) and the outcome of the arithmetic/logarithmic operations are ndigit number, then the arithmetic division operation in
Sim
_{path} can be executed in
O(M(n)) where M(n) stands for the complexity of the chosen multiplication, using, for instance, Newton–Raphson division algorithm, while the complexity of the logarithmic operation
O(M(n)k) for some constant k. Therefore,
Sim
_{path} ≤
Sim
_{lch} on asymptotic level. See Borwein [
42] for further details on complexity on arithmetic and functional operations.
3.2 Summary
Table
1 summarizes the main properties of the three semantic similarities: pathlength, Wu and Palmer, Leacock and Chodorow. We shall distinguish between properties fulfilled by a single semantic similarity irrespective of others and properties that highlight the interactions among the various semantic similarities. Individual properties of similarity measures include reflexivity, symmetry, range, (one argument, twoarguments) monotonicity with respect to hyponymy/hypernymy relationship and incremental monotonicity evolution.
Table 1
Summary of individual properties of the similarity measures
Similarity
measure

Range

Monotonicity with respect to hyponymy/hypernym

Incremental monotonicity evolution

Reflexivity

Symmetry


Path length

\(\frac{1}{\mathrm{max}\_depth}\le Si{m}_{path}({c}_{i},{c}_{j})\le 0.5\)
\(Si{m}_{path}\left({c}_{i},{c}_{j}\right)=1\, if\, same\, synsets\)

\(Si{m}_{path}\left(.,x\right)\) or
\(Si{m}_{path}(x,.)\) is monotonic
\(Si{m}_{path}\left(x,y\right)\) is monotonic if all inputs have same lcs

\(Si{m}_{path}(x,y)=0.5\)
when x is a direct
hyponym/hypernym of y

Yes

Yes

Leacock and Chodorow

\(0\le Si{m}_{lch}^{*}({c}_{i},{c}_{j})<0.7\)
\(Si{m}_{lch}^{*}\left({c}_{i},{c}_{j}\right) =1\, for\, same\, synsets\)

\(Si{m}_{lch}\left(.,x\right)\) or
\(Si{m}_{lch}(x,.)\) is monotonic
\(Si{m}_{lch}\left(x,y\right)\) is monotonic if all inputs have same lcs

\(Si{m}_{lch}(x,y)=1\frac{\mathrm{log}(2)}{\mathrm{log}(2\mathrm{max}\_depth)\mathrm{log}(2)}\)
when x is a direct
hyponym/hypernym of y

Yes

Yes

Wu and Palmer

\(\frac{2}{\mathrm{max}\_depth+1}\le Si{m}_{wup}({c}_{i},{c}_{j})\le 1\)

\(Si{m}_{wup}\left(.,x\right)\) or
\(Si{m}_{wup}(x,.)\) is monotonic
\(Si{m}_{wup}\left(x,y\right)\) is monotonic if all inputs have same lcs

\(Si{m}_{wup}(x,y)\ge 0.8\)
when x is a direct
hyponym/hypernym of y

Yes

Yes

In Table
2, we summarize the interactive properties of the three semantic similarities highlighting the properties of score ordering, time complexity and monotonic equivalence.
Table 2
Summary of interactive properties of the similarity measures
Property

Fulfilment


Score ordering

\(Si{m}_{path}(x,y) \le Si{m}_{wup}(x, y)\) for all x, y
\(Si{m}_{wup}(x,y)\le Si{m}_{lch}^{*}(x,y)\), if
\(len(x,y)\le 2\), Otherwise
\(Si{m}_{wup}(x,y)>Si{m}_{lch}^{*}(x,y)\)

Time complexity

\(T(Si{m}_{path})\le T(Si{m}_{lch})\le T(Si{m}_{wup})\)

Monotonic equivalence

\(Si{m}_{path}\) and
\(Si{m}_{lch}\) are monotonically equivalent

These properties can provide valuable insights to any researcher or computational linguistic specialist who is interested in applying WordNetbased semantic similarity in text mining and retrieval as will be detailed in the discussion section of this paper. This also yields another look and revisit of the results that might be expected from WordNetbased semantic similarity analysis.
4 Sentence semantic similarity
4.1 From word semantic similarity to sentence similarity
Since the sentence is constituted of a set of words, the sentencetosentence semantic similarity is intuitively linked to wordtoword semantic similarities, although a sentence is more than a simple bag of words because of the importance of word disposition, part of speech, punctuation, among others, in conveying a specific meaning to the sentence. For this purpose, various linguistic manipulations were suggested to combine word semantic values to achieve sentencetosentence semantic scores. Borrowed from information retrieval techniques, traditionally, words of each sentence are represented using weights built from some corpuslike information, e.g., information content of each word, frequency times inverse document frequency (TFIDF), among others [
3,
13,
14], then one uses a cosinelike similarity measure to infer the overall score of the similarity between the two input sentences. Nevertheless, such reasoning fails to account for the WordNet semantic similarity measure. Li et al. [
25,
26] proposed a hybrid approach that combines both vectorbased and WordNet models, where the overall sentence similarity score is given as a convex combination of semantic similarity and wordorder similarity. The latter provides a rough indication of how similar the word order between the two sentences is. The suggested sentence semantic similarity accounts for both information content and pathlength semantic similarity. A simple extension of the word semantic similarity to sentence similarity that accounts for word part of speech but discarding the explicit corpusbased information is suggested by Mihalcea et al. [
23] where the similarity between the pair of words that ensures the maximum score among all other pairs is accounted for. This approach has also been used in the query system [
40]. It is worth noticing that the computation of wordpairwise similarity scores, using either pathbased or information content, can only be enabled if the terms belong to the same part of speech. More formally, expression (
20) forms the basis of the sentence semantic resemblance between sentences S
_{A} and S
_{B} [
23]:
where the semantic similarity of the word
w of sentence
S
_{A} and sentence
S
_{B,}
\({Sim}_{*}\left(w,{S}_{B}\right)\), is given as the semantic similarity of the word in
S
_{B} of the same part of speech as
w yielding the highest semantic similarity with
w. Similar reasoning applies to
\({Sim}_{*}\left(w,{S}_{A}\right)\).
$$ Sim_{g}^{*} (S_{A} ,S_{B} ) = \frac{{\sum_{{w \in S_{A} }} {\mathop {\max }\limits_{\begin{subarray}{l} \, x \in S_{B} , \, \\ {\text{PoS(x)}} = {\text{PoS(w)}} \end{subarray} } Sim_{*} (w,x)} + \sum_{{w \in S_{B} }} {\mathop {\max }\limits_{\begin{subarray}{l} \, x \in S_{A} , \, \\ {\text{PoS(x)}} = {\text{PoS(w)}} \end{subarray} } Sim_{*} (w,x)} }}{{S_{A}  + S_{B} }} \, $$
(20)
From (
20), it is worth noticing that only verbs and nouns are accounted for in the sentencetosentence similarity measure, which suggests that all other partofspeech wording of the sentence (s) is ignored. The normalization factor in the denominator of expression (
20) ensures the overall sentence similarity to be within the unit interval. Trivially, any of word semantic similarity highlighted in expressions (
1–
3) can be employed to quantify
Sim
_{*}
(w, x). A more concise formulating of (20) that accounts for normalization issues and distinguishes the contribution of both sentences to overall semantic similarity, in the same spirit as Mihalcea et al.’s proposal [
23],
^{3} can be provided as.
$$Sim_{g}^{*} (S_{A} ,S_{B} ) = \frac{1}{2}\left( {\frac{{\sum_{{w \in S_{A} }} {\mathop {\max }\limits_{\begin{subarray}{l} \, x \in S_{B} , \, \\ {\text{PoS(x)}} = {\text{PoS(w)}} \end{subarray} } Sim_{*} (w,x)} }}{{S_{A} }} + \frac{{\sum_{{w \in S_{B} }} {\mathop {\max }\limits_{\begin{subarray}{l} \, x \in S_{A} , \, \\ {\text{PoS(x)}} = {\text{PoS(w)}} \end{subarray} } Sim_{*} (w,x)} }}{{S_{B} }}} \right) $$
(21)
Because of use of
Sim
_{*}(,), expression (
21) assumes that the semantic scores are only linked to
WordNet taxonomy without any reference to the corpus. In order to exemplify expression (
21), one can write sentences
S
_{A} and
S
_{B} as bag of nouns and verbs (other part of speech, e.g. adverbs, adjectives as well as any stopwords do not contribute to (
21)):
S
_{A}: N
_{11}, N
_{21},..N
_{p1}, V
_{11},V
_{21},…,V
_{q1}
S
_{B}: N
_{12}, N
_{22},..N
_{m2}, V
_{12},V
_{22},…,V
_{n2}where
p,
m are the number of nouns in sentences
A and
B, respectively, while
q and
n stand for the number of verbs in sentences
A and
B, respectively. Let us consider permutations σ and ϕ defined on sets {1,2,…,
max(p,m)} and {1,2,…,
max(q,n))}, respectively, such that:
With
$$ S_{A} :N_{\sigma (1)1} ,N_{\sigma (2)1} ,...,N_{\sigma (max(p,m))1} ,{\text{ V}}_{\phi (1)1} ,{\text{V}}_{\phi (2)1} ,...,{\text{V}}_{\phi (max(q,n))1} \;S_{B} :N_{\sigma (1)2} ,N_{\sigma (2)2} ,...,N_{\sigma (max(p,m))2} ,{\text{ V}}_{\phi (1)2} ,{\text{V}}_{\phi (2)2} ,...,{\text{V}}_{\phi (max(q,n))2} $$
$$ \mathop {\max }\limits_{i = 1,m} Sim_{*} (N_{j1} ,N_{i2} ) = Sim_{*} (N_{j1} ,N_{\sigma (i)2} ),{\text{ j}} = 1{\text{ to }}p $$
(22)
$$ \mathop {\max }\limits_{i = 1,n} Sim_{*} (V_{j1} ,V_{i2} ) = Sim_{*} (V_{j1} ,V_{\phi (i)2} ),\;{\text{j}} = 1{\text{ to }}q, $$
(23)
$$ \mathop {\max }\limits_{i = 1,p} Sim_{*} (N_{j2} ,N_{i1} ) = Sim_{*} (N_{j2} ,N_{\sigma (i)1} ),{\text{ j}} = 1{\text{ to }}m $$
(24)
$$ \mathop {\max }\limits_{i = 1,q} Sim_{*} (V_{j2} ,V_{i1} ) = Sim_{*} (V_{j2} ,V_{\phi (i)1} ),{\text{ j}} = 1{\text{ to }}q, $$
(25)
It should be noted that extra nouns and/or verbs in sentences
S
_{A} or
S
_{B} correspond to a duplication of already existing nouns/verbs in order to ensure that that the pair (
\(N_{\sigma (i)1}\)
\(N_{\sigma (i)2}\)) yields the highest semantic similarity measure. Similar reasoning applies to pairs (
\(V_{\phi (i)1}\)
\(V_{\phi (i)2}\)) (
i = 1,2…). As a consequence of the above construction, expression (
21) boils down to
$$ \begin{aligned} Sim_{g}^{*} (S_{A} ,S_{B} ) = & \frac{{\sum\nolimits_{i = 1,p} {Sim_{ * } \left( {{\text{N}}_{i1} ,{\text{N}}_{\sigma \left( i \right)2} } \right)} + \sum_{i = 1,q} {Sim_{ * } \left( {{\text{V}}_{i1} ,{\text{V}}_{\phi \left( i \right)2} } \right)} }}{{2\left( {p + q} \right)}} \\ & + \frac{{\sum_{i = 1,m} {Sim_{ * } \left( {{\text{N}}_{i2} ,{\text{N}}_{\sigma \left( i \right)1} } \right)} + \sum_{i = 1,n} {Sim_{ * } \left( {{\text{V}}_{i2} ,{\text{V}}_{\phi \left( i \right)1} } \right)} }}{{2\left( {m + n} \right)}} \\ \end{aligned} $$
(26)
Example 2
Consider the following set of sentences:
S
_{A}: Students are heavily involved in exams these days.
S
_{B}: Only few students are expected to achieve first class marks.
In the above sentences, tokens “are”, “these”, “only”, “few”, “first” are commonly treated as part of stopword list, which are eliminated as part of preprocessing stage associated with the above sentences. Similarly, some tokens are converted into their root forms, e.g. “students” to “student”, “days” to “day”, “involved” to “involve”, etc., as part of preprocessing stage.
Therefore, the use of expression (
21), or equivalently (
26), yields
$$ Sim_{g}^{*} (S_{A} ,S_{B} ) = \frac{1}{2}\left( \begin{gathered} \frac{{\max \left( {Sim(student,student),Sim(student,class),Sim(student,mark)} \right)}}{{S_{A} }} \hfill \\ + \frac{{\max \left( {Sim(exam,student),Sim(exam,class),Sim(exam,mark)} \right)}}{{S_{A} }} \hfill \\ + \frac{{\max \left( {Sim(day,student),Sim(day,class),Sim(day,mark)} \right)}}{{S_{A} }} \hfill \\ + \frac{{\max \left( {Sim(involve,s\exp ect),Sim(involve,achieve)} \right)}}{{S_{A} }} \hfill \\ + \frac{{\max \left( {Sim(student,student),Sim(student,exam),Sim(student,day)} \right)}}{{S_{B} }} \hfill \\ + \frac{{\max \left( {Sim(class,student),Sim(class,exam),Sim(class,day)} \right)}}{{S_{B} }} \hfill \\ + \frac{{\max \left( {Sim(mark,student),Sim(mark,exam),Sim(mark,day)} \right)}}{{S_{B} }} \hfill \\ + \frac{Sim(\exp ect,involve) + Sim(achieve,involve)}{{S_{B} }} \hfill \\ \end{gathered} \right) $$

Using Wu and Palmer similarity measure, we have:$$ Sim_{g}^{WP} (S_{A} ,S_{B} ) = \frac{1}{2}\left( \begin{gathered} \frac{1 + 0.71 + 0.67 + 0.5}{4} + \hfill \\ \frac{1 + 0.71 + 0.71 + 0.5 + 0.4}{5} \hfill \\ \end{gathered} \right) \approx 0.69 $$

Using pathlength measure:$$ Sim_{g}^{PL} (S_{A} ,S_{B} ) = \frac{1}{2}\left( \begin{gathered} \frac{1 + 0.17 + 0.17 + 0.33}{4} + \hfill \\ \frac{1 + 0.17 + 0.2 + 0.33 + 0.25}{5} \hfill \\ \end{gathered} \right) \approx 0.40 $$
Using (normalized) Leacock and Chodorow similarity measure yields:
$$ Sim_{g}^{NLC} (S_{A} ,S_{B} ) = \frac{1}{2}\left( \begin{gathered} \frac{1 + 0.40 + 0.40 + 0.58}{4} + \hfill \\ \frac{1 + 0.40 + 0.46 + 0.58 + 0.47}{5} \hfill \\ \end{gathered} \right) \approx 0.59 $$
From the above example, it is worth noticing the following:

The stopwords as well as adjectives/adverbs, although they are important in conveying meaning to the underlined sentence, are not taken into account in the above sentencetosentence similarity. An intuitive way to account for the occurrence of such tokens consists of expanding the range of S _{A}, for instance, to include all tokens in sentence A including stopwords and adverbs/adjectives. However, this would substantially reduce the score of the sentencetosentence similarity. Besides, such integration of extra tokens would not take into account any meaning of such wording so that if another sentence, say S’ _{A} contains adverb/adjective (s), which are antonyms in sentence S _{A}, still will induce the same scoring value! Consequently, discarding such token seems a rational attitude in this respect.

Some tokens, like “involved”, can appear in both verb and adjective categories. However, since only verbal entities are present in the taxonomy, the handling of the token as such seems convenient and intuitive.
On the other hand, from the generic expression (
21), one can straightforwardly notice some interesting special cases:

Assume that sentence S _{A} reduces, after some text preprocessing task, to one single noun N _{1} and one single verb V _{1}, while sentence S _{B} reduces to noun N _{2} and verb V _{2}, then$${Sim}_{g}^{*}\left({S}_{A},{S}_{B}\right)=\frac{1}{2}\left(Si{m}_{*}\left({N}_{1},{N}_{2}\right)+Si{m}_{*}\left({V}_{1},{V}_{2}\right)\right)$$(27)

In case of identical sentences, or at least, identical nouns and verbs in both sentences S _{A} and S _{B}, then it is easy to see that \({Sim}_{g}^{*}({S}_{A},{S}_{B})=1\)

In case where sentences, after preprocessing, reduce to a single noun or verb, then sentence similarity boils down to the corresponding word semantic similarity; that is, assuming N _{1} and N _{2} (resp. V _{1} and V _{2}) be the nouns (resp. verbs) associated with sentence S _{A} and S _{B}, respectively, then$$ Sim_{g}^{*} \left( {S_{A} ,B} \right) = Sim_{*} \left( {N_{1} ,N_{2} } \right) $$$$ \left( {{\text{resp}}{.}\;Sim_{g} \left( {S_{A} ,S_{B} } \right) = Sim_{*} \left( {V_{1} ,V_{2} } \right)} \right) $$

There are situations in which the semantic similarity between the two sentences is not defined. Indeed, this occurs if the two sentences contain neither naming nor verbal expressions, or in case where one sentence contains only naming expression and the other one only verbal expression. In such cases, there is no analogy between the parts of speech in the two sentences, which renders the application of expression ( 21) void. Similarly, this also occurs if at least one of the two sentences contains neither verbal nor naming expression (s). An example of such sentences is: S _{A}: “How are you?”; S _{B}: “Hi there”.

An interesting case is related to the situation where a sentence contains repeated words or semantically equivalent name or verb expressions. It will thereby be of interest to see how such repetition influences the sentence similarity score. In this course, the following holds.
Proposition 11
Consider two sentences S
_{1} and S
_{2} such that.
where
p,
q (resp.
s and
l) are the number of names and verbs in sentence S
_{1} (resp. S
_{2}).
$$ S_{1} = \left( {N_{11} ,{ }N_{12} ,{ }..N_{1p} ,{ }V_{11} ,V_{12} ,..V_{1q} } \right) $$
$$ S_{2} = \left( {N_{21} ,{ }N_{22} ,{ }..N_{2s} ,{ }V_{21} ,V_{22} ,..V_{1l} } \right) $$
Assume nouns N
_{1i}, i = 1 to p (resp. N
_{2j}, j = 1 to s) are semantically equivalent. Similarly, verbs V
_{1i}, i = 1 to
q (resp. V
_{2j}, j = 1 to
l) are also assumed semantically equivalent. Then
where
\(S_{1}^{\prime} = \left( {N_{11} ,V_{11} } \right)\) and
\(S_{2}^{\prime} = \left( {N_{21} ,V21{ }} \right).\)
$$ Sim_{g}^{*} \left( {S_{1} ,S_{2} } \right) = Sim_{g}^{*} \left( {S_{1}^{\prime} ,S_{2}^{\prime} } \right) \Leftrightarrow \frac{p}{q} = \frac{l}{s} $$
(28)
Proposition
11 states the conditions under which the use of semantically equivalent nouns and verbs in the sentences preserves the overall semantic similarity score of the original sentences (without introducing extra equivalent nouns and verbs). Especially it has been pointed out that such equivalence holds only if the ratio between the number of equivalent nouns and the number of equivalent verbs in the first sentence is inversely proportional to the ratio yield by the second sentence. This includes, for instance, the case where the two sentences have the same number of equivalent nouns as well as the same number of equivalent verbs. However, if the condition stated in the core of Proposition
11 is not held then the use of semantically equivalent nouns and verbs may either increase or decrease the overall semantic similarity of the two sentences when compared to semantic similarity of original sentences (which do not contain equivalent nouns/verbs). To see it, consider for instance the following sentences:
$$ S_{1} = \left( {N_{11} ,N_{12} ,V_{11} } \right) $$
$$ S_{2} = \left( {N_{21} ,V_{21} } \right) $$
After some manipulations, one obtains
$$\begin{aligned} Sim_g^*\left( {{S_1},{S_2}} \right) = & \frac{1}{2}\left[ \begin{gathered} \frac{{2Si{m_*}\left( {{N_{11}},{N_{21}}} \right) + Si{m_*}\left( {{V_{11}},{V_{21}}} \right)}}{3} \hfill \\ + \frac{{Si{m_*}\left( {{N_{11}},{N_{21}}} \right) + Si{m_*}\left( {{V_{11}},{V_{21}}} \right)}}{2} \hfill \\ \end{gathered} \right] \\ = & \frac{1}{2}\left[ {\frac{{7Si{m_*}\left( {{N_{11}},{N_{21}}} \right) + 5Si{m_*}\left( {{V_{11}},{V_{21}}} \right)}}{6}} \right] \\ \end{aligned} $$
(29)
Consider now the corresponding sentences without additional equivalent verb/noun; namely,
\( S_{1}^{\prime} = \left( {N_{11} ,V_{11} } \right)\) and
\(S_{2}^{\prime} = S_{2} .\) Therefore,
\(Sim_{g}^{*} \left( {S_{1} ,S_{2} } \right) \ge Sim_{g}^{*} \left( {S_{1}^{\prime} ,S_{2}^{\prime} } \right)\) is equivalent to
$$ \left( {\frac{{7Sim_{*} (N_{11} ,N_{21} ) + 5Sim_{*} (V_{11} ,V_{21} )}}{6}} \right) \ge \left( {\frac{{Sim_{*} (N_{11} ,N_{21} ) + Sim_{*} (V_{11} ,V_{21} )}}{2}} \right) $$
(30)
This entails
$$ Sim_{*} \left( {N_{11} ,N_{21} } \right)  Sim_{*} \left( {V_{11} ,V_{21} } \right) $$
(31)
The latter inequality is not systematically held and is rather context dependent. To see it, let us consider the following example.
Example 4
S
_{1}: These persons look great among available individuals → (person, individual, look).
S
_{2}: Students seem active (student, seem)
$$ Sim_{g}^{*} \left( {S_{1} ,S_{2} } \right)  Sim_{g} \left( {S_{1}^{\prime} ,S_{2}^{\prime} } \right) = Sim_{*} \left( {individual,student} \right)  Sim_{*} \left( {look,seem} \right) = 0.33  1 < 0 $$
(using path semantic similarity).
Proposition 12
Consider two sentences S
_{1} and S
_{2} such that
$$ S_{1} = \left( {N_{11} ,{ }N_{12} ,..N_{1p} ,V_{11} ,{ }V_{12} ,..V_{1q} } \right) $$
$$ S_{2} = \left( {N_{21} ,{ }N_{22} ,..N_{2p} ,V_{21} ,{ }V_{22} ,..V_{2q} } \right) $$
Assume there is onetoone direct hyponymy relation:
$$ N_{1i} @ \to N_{2i} ;\; V_{1j} @ \to V_{2j} ,\; i = 1,p;\;j = 1,q. $$
Then, it holds
$$ Sim_{g}^{*} \left( {S_{A} ,S_{B} } \right) = \frac{{\mathop \sum \nolimits_{i = 1}^{p} Sim_{*} \left( {N_{1i} ,N_{2i} } \right) + \mathop \sum \nolimits_{j}^{q} Sim_{*} \left( {V_{1i} ,V_{2j} } \right)}}{p + q} $$
(32)
Proof
Strictly speaking, Proposition
12 provides an example of how the hyponymy relation can be used to simplify expression (
21). Interestingly, the outcome pointed out in Proposition
12, as highlighted in expression (
32), is similar to that of (
26) when sentences
S
_{A} and
S
_{B} have the same number of nouns and verbs, where permutations boils down to identity, e.g.
\(\sigma \left(i\right)=\varnothing \left(i\right)=i\) for
i = 1,2,..
Another question of interest relates to the behaviour of the sentence similarity rule with respect to a possible contraction or expansion, namely given sentences S
_{A} and S
_{B}, where
S
_{A}: N
_{11}, N
_{21},..N
_{p1}, V
_{11},V
_{21},…,V
_{q1}.
S
_{B}: N
_{12}, N
_{22},..N
_{m2}, V
_{12},V
_{22},…,V
_{n2,}
Then given a new sentence
S’
_{A} (resp.
S’
_{B}) issued from S
_{A} (resp.
S
_{B}) by, say, omitting some of its terms, e.g.
S’
_{A}: N
_{11}, N
_{21},..N
_{p’1}, V
_{11},V
_{21},…, V
_{q’1},
p’ <
p, q’ <
q.
S’
_{B}: N
_{12}, N
_{22},..N
_{m’2}, V
_{12},V
_{22},…,V
_{n’2,}
m’ <
m, n’ <
n,the problem here is how the semantic similarity of sentence
S
_{A} and
S
_{B} compares to that of sentences
S’
_{A} and
S’
_{B}? Generic answer to such question by either increase or decrease in sentence semantic similarity without any further constraint does not exist. Nevertheless, the following provides a skeleton of an answer to such problem.
Proposition 13
Let
S
_{A}: N
_{11}, N
_{21},..N
_{p1}, V
_{11},V
_{21},…,V
_{q1}.
S
_{B}: N’
_{12}, N’
_{22},..N’
_{p2}, V’
_{12},V’
_{22},…,V’
_{q2,} such that
\(\underset{i=1,p}{\mathrm{max}} Si{m}_{*}({N}_{j1},{N}_{i2}^{\prime})=Si{m}_{*}({N}_{j1},{N}_{j2}^{\prime})\) j = 1 to p, and
$$\underset{i=1,p}{\mathrm{max}}Si{m}_{*}({N}_{j1},{N}_{i2}^{\prime})=Si{m}_{*}({N}_{j1},{N}_{j2}^{\prime})$$
If a new word, say noun
\({N}_{p+\mathrm{1,2}}\) is added to sentence
S
_{B} (similar result applies to
\({V}_{q+\mathrm{1,2}}\)), yielding a new sentence.
S’
_{B}: N’
_{12}, N’
_{22}, N’
_{p2}, N
_{p+1,2.} V’
_{12}, V’
_{22},…,V’
_{q2,}
Then
\({Sim}_{g}^{*}({S}_{A},{S}_{B})\le {Sim}_{g}^{*}({S}_{A},{S}_{B}^{\prime})\) if and only if there exists a noun N
_{k1} in sentence
S
_{A} such that
$$ \max \left( \begin{gathered} \mathop {\max }\limits_{i = 1,q} {\text{ Sim}}_{*} (V_{i1} ,V^{\prime}_{i2} ), \, \mathop {\max }\limits_{i = 1,p} {\text{ Sim}}_{*} (N_{i1} ,N^{\prime}_{i2} ), \hfill \\ \mathop {\max }\limits_{i = 1,p} {\text{ Sim}}_{*} (N_{i1} ,N_{p + 1,2} ) \hfill \\ \end{gathered} \right) = {\text{Sim}}_{*} (N_{k1} ,N_{p + 1,2} ) \, $$
(33)
Proposition
13 indicates that adding an extra element to the sentence would bring an increase in semantic sentence similarity if the added element is sufficiently semantically similar to one of the elements of the sentence in the sense of yielding the highest semantic similarity score among all possible pairs between the elements of the two sentences. As a special case, if one adds an element which is a synonym to one of the components of the sentence, then this always yields an increase of the semantic similarity. This is because the use of synonyms would make the quantity
\({\text{Sim}}_{*} (N_{k1} ,N_{p + 1,2} )\) equal to one, which renders constraint (33) in Proposition
13 always valid.
Proposition
13 can easily be extended in case of adding more than one element as follows.
Proposition 14
Using same arguments of Proposition
13, given.
S”
_{B}: N’
_{12}, N’
_{22},..N’
_{p2}, N
_{p+1,2.,} N
_{p+s,2..,} V’
_{12},V’
_{22},…,V’
_{q2,}
Then
\(Sim_{g}^{*} (S_{A} ,S_{B} ) \le Sim_{g}^{*} (S_{A} ,S^{\prime\prime}_{B} )\) if and only if there exist nouns
\(N_{{k_{i} 1}}\)(i = 1 to s) in sentence S
_{A} such that
$$ \left\{ {\begin{array}{*{20}c} {Sim_{*} \left( {N_{{k_{1} 1,{ }}} N_{p + 1,2} } \right) = } \\ {max(\mathop {\max }\limits_{i = 1,q} { }Sim_{*} \left( {V_{i1} ,V_{i2}^{\prime} } \right), \mathop {\max }\limits_{i = 1,p} S\left( {N_{i1} ,N_{i2}^{\prime} } \right),\mathop {\max }\limits_{i = 1,p} Sim_{*} \left( {N_{i1} ,N_{p + 1,2} } \right))} \\ \end{array} } \right\} $$
$$ \left\{ {\begin{array}{*{20}c} {Sim_{*} \left( {N_{{k_{1} 1,{ }}} N_{p + s,2} } \right) = } \\ {max(\mathop {\max }\limits_{i = 1,q} { }Sim_{*} \left( {V_{i1} ,V_{i2}^{\prime} } \right), \mathop {\max }\limits_{i = 1,p} Sim\left( {N_{i1} ,N_{i2}^{\prime} } \right),\mathop {\max }\limits_{i = 1,p} Sim_{*} \left( {N_{i1} ,N_{p + s,2} } \right))} \\ \end{array} } \right\} $$
The proof of Proposition
14 follows the same spirit as that of Proposition
13, so the detail is omitted.
Proposition
14 shows that adding extra elements to the original sentence increase the overall sentence similarity score as long as each of the added elements is sufficiently similar to one of the elements of the original sentence in the sense described above. More generally, this can be written
under constraint (33), for any nonempty (in terms of both noun and verb expressions) sentence S
_{A}.
$$ S_{B} \subset S_{B}^{\prime} \Rightarrow Sim_{g}^{*} \left( {S_{A} ,S_{B} } \right) \le Sim_{g}^{*} \left( {S_{A} ,S{^{\prime}}_{B} } \right) $$
(34)
The inclusion operator employed in (
34) is in the sense of setinclusion of bag of words constituted by the set of nouns and verbs of each sentence.
On the other hand, it should be noted that the disposition of the sentences S
_{A} and S
_{B} described in the core of Proposition
13 is not a prerequisite for the result as this can be deduced from any pair of sentences using appropriate permutation of set of indices as already pointed out in (
22–
26).
In summary, one shall mention in response to research question Q3 that except trivial symmetry and reflexivity relations, other wordtowordlevel similarity properties are not straightforwardly extended to sentencetosentence similarity. For instance, score ordering and time complexity are fulfilled in the sense that
\(Si{m}_{path}({S}_{A},{S}_{B}) \le Si{m}_{wup}({S}_{A}, {S}_{B})\) and
\(T(Si{m}_{path}\left({S}_{A},{S}_{B}\right)) \le T(Si{m}_{lch}\left({S}_{A},{S}_{B}\right)))\le T(Si{m}_{wup}\left({S}_{A},{S}_{B}\right))\).
However, further constraints are still required to meet the monotonicity property for instance.
5 Effect of partofspeech conversion
So far, the handling of nouns and verbs part of speech were performed independently in (
21). However, such independence can also be debatable from different perspectives. First, the dissymmetry of verbs and nouns in WordNet taxonomy cannot be ignored as already pointed out in the introduction of this paper at least with respect to the parameters of the hierarchy since the number of noun synsets, maximum depth and average number of hyponyms are substantially higher in case of nouns [
2]. Second, nouns and verbs that have the same lexical root are very common in the English language. For instance, the words “connect” and “connection” have the same semantic meaning. Third, it is quite common that a single word does support several partofspeech categories in the sense that it can be considered as a noun as well as a verb according to context. Fourth, there are often namedbased or verbalbased sentences where either verb or noun category is missing, which renders the application of quantification (
21) almost void. To see it, let us consider the following example.
Example 5
S
_{A}: Housing is included in the tuition fee.
S
_{B}: The hostel is part of the tuition fee.
After, an initial text preprocessing, the first sentence boils down to (Housing, tuition, fee, include) while sentence B entails tokens (hostel, part, tuition, fee). It is therefore easy to notice that unlike sentence A, sentence B contains no verb category, which would make the word “include” not contributing to the overall similarity sentence score.
To handle this difficulty, the key is to use the existing linguistic transformations which can turn verb, adjective, adverb into the corresponding noun and vice versa, see [
41,
43‐
45] for details and exemplifications. This motivates this subsequent section where verb–noun conversations are investigated to enhance the sentence semantic similarity calculus. Especially, it is expected that such conversion would allow us to overcome the possible ambiguity or inconsistency where (
21) accounts equally for both verb and noun pairwise semantic similarity measures. Besides, it is also expected to account for both adjective and adverb categories in the semantic similarity calculus. For instance, in Example
5, if a such conversion took place, “include” would be converted into, say, noun “inclusion”, which provides a good counterpart to word “part” in sentence B when computing word semantic similarity. As such, the transformation of this verb to the same part of speech as the other constituent words in the other sentence would raise the similarity score using (
21) from 0.89 to 0.92. The effect of this single verb–noun conversion is therefore revealed in the change of the similarity score, where, intuitively, as the two sentences have close semantic meaning, one expects high semantic similarity scores.
On the other hand, one can also think of using a nountoverb conversation. In this case, the new calculus of the sentence similarity score decreases to 0.402. The reason for this is that a such strategy cannot handle some of the nouns e.g. “university”, “tuition”, “year” (because of the absence of verbal counterpart in WordNet taxonomy), which, in turn, will negatively influence the overall score of the sentence similarity. Therefore, given the more elaborated structure of noun category and the existence of semantically related nouns for most verbs, adjectives and adverbs, the transformation alltonouns is more appealing. For this purpose, an approach employing WordNet taxonomy has been put forward and investigated in our previous work reported in [
38,
39,
41], although further work is still required in order to explore the intuitive and physical constraints induced by such transformations.
The highlevel description of the implementation of this transformed sentencetosentence similarity is illustrated in Fig.
2.
In order to evaluate the performance of the alltonoun transformation on semantic similarity score, we have conducted an experiment using Benchmark dataset and with comparison with some stateoftheart methods.
6 Dataset
We used O’Shea et al. [
46] dataset, which consists of 65 sentence pairs where human similarity judgements are assigned to each pair. During this dataset creation, 32 graduate native speakers were assigned to score the similarity degree between each pair using scores from 0.0 to 4.0 so that a score 0 corresponds to unrelated sentence pair, 1 for vaguely similar in meaning, 2 for sentences very much a like in meaning, 3 for sentences strongly related in meaning and 4 for sentences that are identical in meaning.
7 Evaluation of semantic similarity measures
Given the provided human judgement in terms of the similarity between the sentences of each pair, we can evaluate the usefulness of the elaborated semantic similarity Sim by calculating the correlation coefficient between the similarity measure and human judgment using the following expression, where
\(h_{i} , Sim\left( {p_{i} } \right)\) stand for the human and machine (via semantic similarity Sim) assigned score to the ith pair of sentences and n is the total number of pairs:
$$ r_{s} = \frac{{n\mathop \sum \nolimits_{i = 1}^{n} h_{i} { }Sim\left( {p_{i} } \right)  \mathop \sum \nolimits_{i = 1}^{n} h_{i} { }\mathop \sum \nolimits_{i = 1}^{n} Sim\left( {p_{i} } \right)}}{{\sqrt {\left( {n\mathop \sum \nolimits_{i = 1}^{n} h_{i}^{2}  \left( {\mathop \sum \nolimits_{i = 1}^{n} h_{i} } \right)^{2} } \right)} \sqrt {\left( {n\mathop \sum \nolimits_{i = 1}^{n} \left[ {Sim\left( {p_{i} } \right)} \right]^{2}  \left( {\mathop \sum \nolimits_{i = 1}^{n} \left[ {Sim\left( {p_{i} } \right)} \right]} \right)^{2} } \right)} }} $$
When another semantic similarity measure (
\(Sim\left( {p_{i} } \right))\) is employed another correlation index
\(r_{s}\) is generated. We shall consider two stateoftheart and baseline models in this experiment in addition to the canonical extension (
26).

LSA corresponds to the case where the sentence similarity obtained using latent semantic analysis approach [ 47].

WN corresponds to the case where the sentence similarity is calculated using the canonical extension ( 26) with Wu and Palmer wordtoword semantic similarity measure. The choice of Wu and Palmer similarity is justified by the behaviour of the similarity measures, especially in terms of the range of the values that can be assigned to. In the implementation, we used Pederson’s [ 48, 49] implementation module of the Wu and Palmer semantic similarity measure. For preprocessing, we used Illinois Partofspeech tagger [ 50] to identify various test segments.

WNwC corresponds to the case where the semantic similarity is calculated using the canonical extension ( 26) with Wu and Palmer wordtoword semantic similarity measure after performing the “alltonoun” conversation. This conversation is performed using the CatVar model [ 43] because it is found to yield better results than morphosemantic links [ 44, 45].
We computed the sentencetosentence similarity of each pair of the dataset using STASIS, LSA, WN and WNwC methods. We next calculated the correlation index with human judgments for each of the above methods. Finally, to gather the statistics about each method, we also calculated the mean, median, standard deviation, maximum and minimum values of the similarity score over all pairs of the dataset. The result of this analysis is summarized in Table
3.
Table 3
Comparison of the sentence similarity scores using the different methods and correlation with human judgment
Methods

Correlation
\({r}_{s}\)

Mean value of similarity score of sentence pairs

Min–Max value of similarity

Standard deviation

Median

Processing time (sec)


STASIS

0.816

0.589433

[0.209 1]

0.193619

0.6145

0.561

LSA

0.838

0.687667

[0.505 1]

0.143315

0.685

0.134

WN

0.821

0.656186

[0.362 1]

0.168924

0.6272

0.343

WNwC

0.846

0.695833

[0.397 1]

0.155162

0.683

0.423

The results summary in the table shows that the WNwC (WordNetbased alltonoun conversion) method achieves better performance than other benchmark methods in terms of correlation with human judgment, although we noticed that the performance of the LSA algorithm is quite close with less variability in the scores than our alltonouns approach).
Table
3 also reports indication about the execution time of each method when performed on a standard PC machine equipped with Intel Core i510400F processor and 16 GB RAM. The results show that the developed approach, although induces extra processing time to handle the PoS conversion task as compared to conventional WN model, the execution time can fit soft realtime requirement, and it outperform the stateoftheart STASIS model, which involves an extra component associated with word ordering similarity that may be computationally burdened. It should be noted that the above results were obtained using our local reimplementation of the STASIS that uses locally stored WordNet semantic similarity values. Otherwise, if one uses the WordNet server to compute the semantic similarity, the execution time will be much longer and the result will be less reliable as it would depend on the server availability and traffic as well. However, LSA is computationally much more effective because it relies on solely on matrix manipulation and no external lexical database.
Next, in order to find out whether the results in terms of correlation with human judgment are statistically significant, we compared the population of the semantic similarity scores of various pairs using each method and used a statistical ttest.
More formally, we have run a ttest comparing the sentence similarity scores of the 65 sentence pairs and analysed the difference of each method with that of WNwC to see whether there is a statistical difference between the two approaches. Through the statistical tests, as highlighted in Table
4, we found that the WordNetbased alltonoun conversion significantly improves over the conventional sentence semantic similarity as it yields (t(64) = 3.5986,
p < 0.0005). This means that we have evidence to reject the Null hypothesis and conclude that the improvement with the alltonoun conversion is statistically significant.
Table 4
Statistical Significant Test (Ttest) results at a 95% Confidence Interval
Paired variables

MoDs

Lower interval

Upper interval

T value

P value


STASIS–WNwC

− 0.1064

Inf

− 0.0597025

− 3.8715

0.0002833

WN–WNwC

− 0.0396474

Inf

− 0.01611506

− 2.8627

0.003861

LSA–WNwC

− 0.0081667

Inf

0.02323585

− 0.44188

0.3309

The paired t test results in Table
4 also confirmed that WordNetbased PoS conversion method achieves significant improvement over the STASIS (
p < 0.0005) similarity measure. However, the improvement over LSA method does not seem significant, although the associated t value (0.44188) lies in the required confidence interval (Inf—0.02323585).
8 Discussions and implications
The results highlighted in this paper provide significant insights and implication from theoretical and practical perspective.

The algebraical and interactive properties of the investigated WordNet semantic similarities, namely pathlength, Wu and Palmer and Leacock and Chodorow measures, provide valuable insights to data mining or natural language processing researchers and practitioners.To exemplify this reasoning, consider a commonly employed example of designing a data mining task that uses a thresholding on a semantic similarity measure, say, \(Si{m}_{x}\left({c}_{1},{c}_{2}\right)\ge \zeta \), where, often the threshold \(\zeta \) is chosen either empirically or by imposing some default values. Nevertheless, if a pathlength measure was employed, and we set any value \(\zeta >0.5\), would yield a useless outcome. This is because we can predict through the results pointed out in Sect. 2 that such threshold will only enable us to capture one single instance corresponding to synonyms. The same reasoning applies for Leacock and Chodorow measure when we set a threshold \(\zeta >0.7.\) However, such restriction does not apply in case of Wu and Palmer measure.Similarly, the incremental evolution property provides us with useful insights in terms of what we may expect as a result when one slightly changes the input along the lexical hierarchy. Especially, it shows that an incremental moves up or down within the hierarchy as in direct hyponymy/hypernymy relation would yield a constant similarity score in case of pathlength or Leacock and Chodorow measures, regardless of the words employed, while Wu and Palmer measure ensures a high score (beyond 0.8) whose exact value depends on the individual words employed. Such knowledge can be very useful, for instance, to predict the robustness of a given plagiarism detection system that is built on WordNet semantic similarity layer. Likewise, the interactive properties show, for instance, that if hard time requirements were imposed, then pathlength similarity should be prioritized.

The results gained from Sect. 3 have direct implications on the type of handling that may be needed for the underlined text mining task. This involves the type of preprocessing that should be performed prior to calling upon the sentencetosentence similarity module. This depends on whether the NLP task favours maximizing similarity score or minimizing it, according, for instance, to the criticality of the false positive in the underscored task. This can provide insights whether short sentences are deemed more important or not. And, if rephrasing is permitted, how such operation could be performed in such a way to maximize or minimize similarity score. Especially, the finding highlights the importance of PoS tagging as an initial scrutinizing task. If such analysis reveals that the two input sentences contain only distinct PoS word categories, then one concludes immediately that the sentencetosentence similarity yields zero similarity score without any further processing. This also highlights the importance of accurate partofspeech tagger in the initial scrutinizing stage as any error has a substantial consequence on the outcome. Interestingly, this provides some guidelines on the choice of stopwords list as well, such that the PoS aspect could be taken into account in order to ensure there is sufficient coverage in terms of number of noun and verb entities that trigger nonzero sentencetosentence similarity score.

The findings in Sect. 4 highlight the benefits of the PoS transformation, especially the “alltonoun” transformation as a tool that can overcome some inherent limitation of the canonical sentencetosentence semantic similarity where only noun and verbal entities are handled. Nevertheless, this should not hide the inherent limitations of such an approach which primarily relies on either manually created database or lexical transformation rules, which are not error free, and can lead to an amplification of the meaning shift from the correct sense. The analysis of the computational complexity revealed the importance to account for WordNet server latency and reliability, which is often outside the scope of the user. Therefore, any attempt to use backup and locally stored data instead of server data is of paramount importance in this regard.

In overall, the development pursued in this paper, despite their novelty and relevance to AI and NLP community, is also subject to several inherent limitations, which can be rooted back to either the selected semantic similarity measures or the generic pipeline employed. First, we restricted our analysis to the three commonly employed WordNet similarity measures that explore the hierarchical structure only, which leaves other similarity measures unexplored, including those proposed as extensions/refinements of pathlength and Wu and Palmer measures, see [ 51] for an overview of structured similarity measures. Second, the exploration of properties at either word level or sentence level is not meant to be exhaustive as we restricted to only few properties that might be of interest to data mining and natural language processing community. Third, the PoS conversion process utilized WordNet’s conceptual relations and some other linguistics/grammatical rules. Therefore, such conversation, sometimes, may not be accurate, especially when the underlined word has multiple senses, which opens wide the door of word sense disambiguation problem. Fourth, the extension from word level to sentence level considered only the canonical extension highlighted in expression ( 21). Nevertheless, it should be noted that this is far to be the unique representation of such extension and several interesting proposals have been reported in the computational linguistic community. One may mention, for instance, approaches that exploit the syntactic information in the sentence linking verb entity to their subject and complement object in a way to maintain up to some extent the structure of the sentence, ngram models or enforcing some specific sentence ontology representation, see [ 18] for an overview of such alternative representations.
9 Conclusion and future work
This paper investigates the sentence semantic similarity using the WordNet lexical database where three pathbased word similarity measures have been contrasted and investigated. Some theoretical and algebraical properties of the underlined semantic similarities have been examined and scrutinized, providing the user with tremendous help when deciding on the choice of a specific word semantic similarity measure. Similarly, the canonical extension from wordlevel semantic similarity to sentencelevel semantic similarity has been examined with respect to the properties of the underscored extension. It has particularly been highlighted the impact of the PoS tagging on the outcome of the sentence similarity. To handle inherent limitations of word semantic similarity, especially when deficiency in word category is observed in a sentence, a new proposal using PoS WordNetbased conversion has been put forward. The performance of the proposal in terms of correlation with human judgment has been compared to conventional canonical extension and some stateoftheart techniques using O’Shea et al. [
46] publicly available dataset. The comparative analysis demonstrated the feasibility and superiority of the proposal where using the alltonoun conversion in the canonical extension expression is found to outperform the conventional canonical extension as well as stateoftheart STASIS and LSA method. Besides, using the ttest testing, this outperformance is found to be statistically significant at
p < 0.005 for conventional extension and STASIS. As an immediate future work, it is expected to extend the “Alltonoun” conversion scheme by incorporating ways of handling named entities as this plays special linguistic role, and let the overall sentence similarity be constituted of two parts: one part will be related to standard sentencelevel input and another one dedicated to namedentity similarity only. Furthermore, it is also among our plans to emphasize the similarity of sentence’s corresponding frame roles using semantic role labelling, which enables us to account to a large extent for the sentence grammar. In the case of lexical resource, we intend to employ ConceptNet database instead of WordNet as part of the future works in our ongoing research.
Acknowledgements
This work is partly supported by EU YoungRes project (#823701), which is gratefully acknowledged.
Open AccessThis article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit
http://creativecommons.org/licenses/by/4.0/.
Appendix
Proof of Proposition 3
The implication
\(Si{m}_{x}\left({c}_{i},{c}_{j}\right)=1\Leftarrow {(c}_{i}={c}_{j})\) is trivial from the reflexivity property of the three semantic similarity measures. To prove the reverse implication
\(Si{m}_{x}\left({c}_{i},{c}_{j}\right)=1\Rightarrow {c}_{i}={c}_{j}\), one shall proceed for each similarity measure, and noticing that
\(le\) only if
\(c_{i} = c_{j}\).
Using pathlength measure, we have:
$$ \frac{1}{{len\left( {c_{i} ,c_{j} } \right)}} = 1 \Rightarrow len\left( {c_{i} ,c_{j} } \right) = 1 \Rightarrow c_{i} = c_{j} $$
Using normalized Leacock and Chodorow measure, we have from (4),
\({\text{Sim}}_{{{\text{lch}}}} \left( {c_{i} ,{ }c_{j} } \right) = \log \left( {2{\text{*max}}\_depth} \right)\), So
$$ \frac{{2{\text{*max}}\_depth}}{{len\left( {c_{i} ,c_{j} } \right)}} = 2{\text{*max}}\_depth \Rightarrow len\left( {c_{{i{ }}} ,{ }c_{j} } \right) = 1 \Rightarrow {\text{c}}_{{\text{i}}} = c_{j} $$
Using Wu and Palmer’s measure. Let us assume that
\(c_{i} {\text{, c}}_{j}\) have distinct nodes in the network. Then, let depth(lcs((c
_{i}, c
_{j})) = l, length(c
_{i}, lcs) = l
_{1}, length(c
_{j}, lcs) = l
_{2}. Therefore, depth(c
_{i}) = l + l
_{1}, depth(c
_{j}) = l + l
_{2}. So,
$$ Sim_{wp} \left( {{\varvec{c}}_{i} ,{\varvec{c}}_{j} } \right) = 1 \Rightarrow \frac{2l}{{2l + l_{1} + l_{2} }} = 1 \, \Rightarrow \, l_{1} + l_{2} { = 0} \Rightarrow (l_{1} = 0{\text{ and }}l_{2} { = 0)} \Rightarrow {\text{c}}_{i} {\text{ = c}}_{j} \, $$
Proof of Proposition 5
To prove the statement, let us consider without loss of generality the generic taxonomy of Fig.
3 showing the path between the two synsets c
_{1} and c
_{2} as well as their lower common subsumer.
$$ {\varvec{Sim}}_{{{\varvec{path}}}} \left( {{\varvec{c}}_{{\varvec{i}}} ,{\varvec{c}}_{{\varvec{j}}} } \right) = \frac{1}{{{\varvec{p}} + {\varvec{q}}}},{ }{\varvec{Sim}}_{{{\varvec{wup}}}} \left( {{\varvec{c}}_{{\varvec{i}}} ,{\varvec{c}}_{{\varvec{j}}} } \right) = \frac{{2{\varvec{l}}}}{{2{\varvec{l}} + {\varvec{p}} + {\varvec{q}}}} $$
×
Since parameters p, q and l are positively valued, it holds
that
\(+ q + 2l \ge p + q\), this again entails (since 2 l > l):
$$ \frac{1}{p + q + 2l} \le \frac{1}{{p + q{ }}} \le \frac{2l}{{p + q}} $$
Therefore, the inequality
\(Sim_{path} \left( {c_{i} ,c_{j} } \right) \le Sim_{wup} \left( {c_{i} ,c_{j} } \right)\) trivially holds.
Denoting for simplicity
\(x = len\left( {c_{i} ,c_{j} } \right)\), d = max_depth, then
\(Sim_{wup} \left( {c_{i} ,c_{j} } \right) \le Sim_{lch}^{*} \left( {c_{i} ,c_{j} } \right)\) is equivalent to
\(\frac{{\log \left( \frac{x}{2d} \right)  \log 2}}{{\log \left( {2d} \right)  \log 2}} \ge \frac{1}{x}\) or, equivalently
$$ \frac{{  \log x + \log \left( {2d} \right)  \log 2}}{{\log \left( {2d} \right) + \log 2}}  \frac{1}{2} \ge 0 $$
(A.1)
By derivating (A.1) with respect to x, we have
\( \frac{1}{x}\left( {\frac{1}{{\log \left( {2d} \right)  \log 2}} + \frac{1}{x}} \right) \ge 0 ,\) which entails
$$ x \le \log \left( {2d} \right)  \log 2 \approx 2.98 $$
Proof of Proposition 7
To illustrate the skeleton of the proof, let us consider the generic two examples shown in Fig.
4.
×
To prove statement in (i), notice that Fig.
4a highlights a typical scenario which
\(c_{1} , c_{2} , c_{1}^{\prime}\) and
\(c_{2}^{\prime}\) have the same lower common subsumer. In such case, it holds that
$$ len\left( {c_{1}^{\prime} , c_{2}^{\prime} } \right) = p^{\prime} + q^{\prime} \le p + q = len\left( {c_{1} ,c_{2} } \right) \Rightarrow Sim_{path} \left( {c_{1}^{\prime} , c_{2}^{\prime} } \right) \ge Sim_{path} \left( {c_{1} ,c_{2} } \right) $$
This also entails
\(Sim_{lch}^{*} \left( {c_{1}^{\prime} , c_{2}^{\prime} } \right) \ge Sim_{lch}^{*} \left( {c_{1} ,c_{2} } \right)\) similarly, we have
$$ Sim_{wup}^{*} \left( {c_{1}^{\prime} ,c_{2,}^{\prime} } \right) = \frac{2l}{{p^{\prime} + q^{\prime} + 2l}} \ge \frac{2l}{{p + q + 2l}} = Sim_{wup} \left( {c_{1} ,c_{2} } \right) $$
To prove statement (ii) where synsets are such that
\(c_{1}^{\prime}\) and
\(c_{2}^{\prime}\) are direct hyponyms of
\(c_{1}\) and
\(c_{2}\) without a common lowest common subsumer, one notices that such scenario implicitly entails that either c
_{1} is lowest common subsume of c
_{2} or vice versa. For instance, if c
_{1} is lowest common subsume, the following diagram holds
$$ c_{2} \to c_{2}^{\prime} \ldots \to c_{1} \to c_{2}^{\prime} \to ...Root $$
In such case, it holds that
$$ len(c^{\prime}_{1} ,c^{\prime}_{2} ) = len(c_{1} ,c_{2} ) \Rightarrow Sim_{path} (c^{\prime}_{1} ,c^{\prime}_{2} ) = Sim_{path} (c_{1} ,c_{2} ) $$
For similar arguments,
\(Sim_{lch}^{*} \left( {c_{1}^{\prime} ,c_{2}^{\prime} } \right) = Sim_{lch}^{*} \left( {c_{1} ,c_{2} } \right)\), while
$$ Sim_{wp} (c^{\prime}_{1} ,c^{\prime}_{2} ) = \frac{2(l  1)}{{len(c_{1} ,c_{2} ) + l  1 + l  1}} = \frac{2(l  1)}{{len(c_{1} ,c_{2} ) + 2(l  1)}} \ge \frac{2l}{{len(c_{1} ,c_{2} ) + 2l}} = Sim_{wp} (c_{1} ,c_{2} ) $$
So, in both cases it holds that
\(Sim_{*} \, (c_{1} ,c_{2} ) \ge Sim_{*} \, (c^{\prime}_{1} ,c^{\prime}_{2} )\).
To prove (iii), it is enough to see Fig.
4b, where
\(len\left( {c_{1}^{\prime} ,c_{2}^{\prime} } \right) = p^{\prime} + l  q^{\prime}\) while
\(len\left( {c_{1} ,c_{2} } \right) = p + q\). Since l is fully independent of q,
\(len\left( {c_{1}^{\prime} ,c_{2}^{\prime} } \right)\) can be greater, equal or smaller than
\(len\left( {c_{1} ,c_{2} } \right)\) so that no specific ordering can be established. Same reasoning applies when calculating the depth of the synsets, which renders
\(Sim_{*} \left( {c_{1} ,c_{2} } \right)\) and
\(Sim_{*} (c_{1}^{\prime} ,c_{2}^{\prime}\)) not comparable.
Proof of Proposition 8
(i)
From the assumption that
\(c_{i}\) is a direct hyponym of
\(c_{j}\), it follows
\(c_{i}\) is also the least common subsumer of the two synsets. So, if depth(
\(c_{i}\)) = l, then depth (
\(c_{j}\)) = l + 1. Therefore,
\(Sim_{wp} \left( {c_{i} ,c_{j} } \right) = \frac{2l}{{2l + 1}}\). Noticing that the above expression is nondecreasing in l, and for distinct synsets, the minimum value of l is 2, which, after substituting in the above expression, yields
\(Sim_{wp} \left( {c_{i} ,c_{j} } \right) = 0.8.\) The result follows straightforwardly that if
\(c_{i}\) is a direct hyponym of
\(c_{j}\), then len(
\(c_{i} ,c_{j}\)) = 2, so after substituting in (3) and (4), the result (ii) and (iii) of Proposition
8 are trivial.
Proof of Proposition 9
The hyponymy relation can be represented by the diagram below c
_{1} → c
_{2} → c
_{3} → …c
_{n1} → c
_{n} → …
RootNode.
Given that
\(len\left( {c_{1} ,c_{2} } \right) = 2 \le len\left( {c_{1} ,c_{3} } \right) = 3 \le \ldots \le .len\left( {c_{1} ,c_{k} } \right) = k\, for\, k4,\,n.\) This indicates that (19) trivially holds for path and Leacock and Chodorow similarity. For Wu and Palmer similarity, assume a length
l from c
_{n} till
RootNode, then it holds.
While
$$ Sim\left( {c_{1} ,c_{2} } \right) = \frac{{2\left( {l + n  1} \right)}}{l + n + l + n  1} = \frac{{2\left( {l + n  1} \right)}}{{2\left( {l + n  1} \right)  1}} = \frac{1}{{1 + \frac{1}{{2\left( {l + n  1} \right)}}}} $$
$$ \begin{gathered} Sim(c_{1} ,c_{k} ) = \frac{2(l + n  k  1)}{{l + n + l + n  k  1}} = \frac{2(l + n  k  1)}{{2(l + n  k  1) + k + 1}} \hfill \\ \, = \frac{1}{{1 + \frac{k + 1}{{2(l + n  k  1)}}}},{\text{ k = 3,n}} \hfill \\ \end{gathered} $$
Noticing that
\(\frac{1}{2(l+n1)}<\frac{k+1}{2(l+nk1)}\) since this is equivalent to
\([2l  2(k + 1)l] + [2n  2(k + 1)n] + [  2  2(k + 1)] < 0\), which trivially holds since each expression under square bracket on the left hand side of the last inequality is always negatively valued for k greater or equal than 3. This yields
\(Si{m}_{*}\left({c}_{1},{c}_{2}\right)\ge Si{m}_{*}\left({c}_{1},{c}_{k}\right)\) for k = to n, which, by a simple induction reasoning, yields inequality (
20).
Proof of Proposition 11
Applying expression (
21) with respect to the abovedefined S
_{1} and S
_{2} leads to, after noticing, that due to semantic equivalence assumption
\(Si{m}_{*}({N}_{1i},{N}_{2j})=Si{m}_{*}({N}_{11},{N}_{21})\),
j = 1 to s,
i= 1 to
p, and
\(Si{m}_{*}({V}_{1i},{V}_{2j})=Si{m}_{*}({V}_{11},{V}_{21})\), j = 1 to l, i = 1 to
q, we have
$${Sim}_{g}^{*}({S}_{1},{S}_{2})=\frac{1}{2}\left[\frac{pSi{m}_{*}\left({N}_{11},{N}_{21}\right)+qSi{m}_{*}({V}_{11},{V}_{21})}{p+q}+\frac{sSi{m}_{*}\left({N}_{21},{N}_{11}\right)+lSi{m}_{*}({V}_{21},{V}_{11})}{s+l}\right]$$
$$ = \frac{1}{2}\left[ {Sim_{*} (N_{11} ,N_{21} )\left( {\frac{p}{p + q} + \frac{s}{s + l}} \right) + Sim_{*} (V_{11} ,V_{21} )\left( {\frac{q}{p + q} + \frac{l}{s + l}} \right)} \right] $$
Noticing that
\(Si{m}_{g}({S}_{1}^{\prime},{S}_{2}^{\prime})\) would have the same form as (
27), reconciling the last expression with that of (
27) yields
$$ \frac{p}{p + q} + \frac{s}{s + l} = 1{\text{ and }}\frac{q}{p + q} + \frac{l}{s + l} = 1 $$
After some manipulations, the last expressions are equivalent to
\(ps=ql\), which is again equivalent to
\(\frac{p}{q}=\frac{l}{s}\).
Proof of Proposition 13
From the definition of S
_{A} and S
_{B}, it holds
$$ Sim_{g}^{*} (S_{A} ,S_{B} ) = \frac{{\sum_{i = 1,i \ne k}^{p} {Sim_{*} (N_{i1} ,N^{\prime}_{i2} )} + Sim_{*} (N_{k1} ,N_{p + 1,2} ) + \sum_{i = 1}^{q} {Sim_{*} (V_{i1} ,V^{\prime}_{i2} )} }}{p + q} $$
Similarly,
$$\begin{aligned} Sim_g^*({S_A},S_B^\prime ) = & \frac{{\sum\limits_{i = 1,i \ne k}^p {Si{m_*}({N_{i1}},N_{i2}^\prime )} + Si{m_*}({N_{k1}},{N_{p + 1,2}}) + \sum\limits_{i = 1}^q {Si{m_*}({V_{i1}},V_{i2}^\prime )} }}{{2(p + q)}} \\ & + \frac{{\sum\limits_{i = 1}^p {Si{m_*}({N_{i1}},N_{i2}^\prime )} + Si{m_*}({N_{k1}},{N_{p + 1,2}}) + \sum\limits_{i = 1}^q {Si{m_*}({V_{i1}},V_{i2}^\prime )} }}{{2(p + q + 1)}}{\kern 1pt} \\ \end{aligned} $$
In the first fraction of
\(Sim_{g} (S_{A} ,S^{\prime}_{B} )\), it assumes that N
_{p+1,2} does not influence the semantic similarity of other pairs. In the opposite case, the overall still is valid as will be demonstrated later on. By rewriting.
$$\sum\nolimits_{i = 1}^{p} {Sim_{*} (N_{i1} ,N^{\prime}_{i2} )} = \sum\nolimits_{i = 1,i \ne k}^{p} {Sim_{*} (N_{i1} ,N^{\prime}_{i2} )} + Sim_{*} (N_{k1} ,N^{\prime}_{k2} )$$
And after some manipulations, subtracting
\(Sim_{g} (S_{A} ,S_{B} )\) from
\(Sim_{g} (S_{A} ,S^{\prime}_{B} )\), we have
$${Sim}_{g}^{*}\left({S}_{A},{S}_{B}^{\prime}\right)Si{m}_{g}\left({S}_{A},{S}_{B}\right)\propto $$
$$\left(2p+2q+l\right)\sum_{i=1,i\ne k}^{p}Si{m}_{*}\left({N}_{i1},{N}_{i2}^{\prime} \right)+\left(2p+2q+1\right)\sum_{i=1}^{q}Si{m}_{*}\left({V}_{i1},{V}_{i2}\right)$$
$$+\left(2p+2q+1\right)Si{m}_{*}\left({N}_{k1},{N}_{p+\mathrm{1,2}}\right)+\left(p+q\right)Si{m}_{*}\left({N}_{k1},{N}_{k2}^{\prime}\right)$$
$$2\left(p+q+1\right)\sum_{i=1,i\ne k}^{p}Si{m}_{*}\left({N}_{i1},{N}_{i2}^{\prime} \right)2\left(p+q+1\right)\sum_{i=1,i\ne k}^{q}Si{m}_{*}\left({V}_{i1},{V}_{i2}\right)$$
$$\left(2p+2q+1\right)Si{m}_{*}\left({N}_{k1},{N}_{k2}^{\prime}\right)$$
This can be simplified as
\(Sim_{g} (S_{A} ,S^{\prime}_{B} )  Sim_{g} (S_{A} ,S_{B} )\)
$$ \begin{gathered} \propto \, (2p + 2q + 1)Sim_{*} (N_{k1} ,N_{p + 1,2} )  \sum\limits_{i = 1,i \ne k}^{p} {Sim_{*} (N_{i1} ,N^{\prime}_{i2} )}  \sum\limits_{i = 1}^{q} {Sim_{*} (V_{i1} ,V^{\prime}_{i2} )} \hfill \\ {  }(p + q)Sim_{*} (N_{k1} ,N^{\prime}_{k2} ) \hfill \\ \end{gathered} $$
The last expression can be rewritten as
$$\left[\left(p1\right)Si{m}_{*}\left({N}_{k1},{N}_{p+\mathrm{1,2}}\right)\sum_{i=1,i\ne k}^{p}Si{m}_{*}\left({N}_{i1},{N}_{i2}^{\prime}\right)\right]$$
$$+\left[qSi{m}_{*}\left({N}_{k1},{N}_{p+\mathrm{1,2}}\right)\sum_{i=1,}^{q}Si{m}_{*}\left({V}_{i1},{V}_{i2}^{\prime}\right)\right]$$
$$+\left(p+q+2\right)\left[Si{m}_{*}\left({N}_{k1},{N}_{k2}^{\prime}\right)\right]Si{m}_{*}({N}_{k1},{N}_{k2}^{\prime})$$
It is then easy to see that each expression under square bracket in above is positively valued as
\(Sim_{ * } \left( {N_{k1} ,N_{p + \_1,2} } \right)\) is supremum over semantic similarity of any chosen pairs belonging to either noun or verb part of speech. This makes
$${Sim}_{g}^{*}\left({S}_{A},{{S}^{\prime}}_{B}\right){Sim}_{g}^{*}\left({S}_{A},{S}_{B}\right)>0$$
In case where N
_{p+1,2} does influence the semantic similarity of other pairs, say, there is a noun N
_{m1} in sentence S
_{A} such that
\(Sim_{ * } \left( {N_{m1} ,N_{m2}^{\prime} } \right) \le Sim_{ * } \left( {N_{m1} ,N_{p + 1,2}^{\prime} } \right)\),
\(Sim_{g} \left( {S_{A} ,S_{B}^{\prime} } \right)\) is changed as (
\({Sim}_{g}^{*}\left({S}_{A},{S}_{B}\right)\) remains unchanged):
$$ \begin{gathered} Sim_{g}^{*} (S_{A} ,S^{\prime}_{B} ) = \frac{{\sum\nolimits_{i = 1,i \ne k}^{p} {Sim_{*} (N_{i1} ,N^{\prime}_{i2} )} + Sim_{*} (N_{k1} ,N_{p + 1,2} ) + \sum\nolimits_{i = 1}^{q} {Sim_{*} (V_{i1} ,V^{\prime}_{i2} )} }}{2(p + q)} \hfill \\ \, + \frac{{\sum\nolimits_{i = 1}^{p} {Sim_{*} (N_{i1} ,N^{\prime}_{i2} )} + Sim_{*} (N_{k1} ,N_{p + 1,2} ) + \sum\nolimits_{i = 1}^{q} {Sim_{*} (V_{i1} ,V^{\prime}_{i2} )} }}{2(p + q + 1)} \, \hfill \\ \end{gathered} $$
Using similar reasoning as above, the quantity
\({Sim}_{g}^{*}\left({S}_{A},{{S}^{\prime}}_{B}\right){Sim}_{g}^{*}\left({S}_{A},{S}_{B}\right)\) is proportional to
where each entity under bracket is also positively valued.
$$ \begin{gathered} \left[ {(p  2)Sim_{*} (N_{k1} ,N_{p + 1,2} )  \sum\limits_{i = 1,i \ne k,i \ne m}^{p} {Sim_{*} (N_{i1} ,N^{\prime}_{i2} )} } \right] \hfill \\ + \left[ {(qSim_{*} (N_{k1} ,N_{p + 1,2} )  \sum\limits_{i = 1}^{q} {Sim_{*} (V_{i1} ,V^{\prime}_{i2} )} } \right] \hfill \\ + \, (p + q + 1)\left[ {Sim_{*} (N_{m1} ,N_{p + 1,2} )  Sim_{*} (N_{m1} ,N^{\prime}_{m2} )} \right] \hfill \\ + (p + q + 2)\left[ {Sim_{*} (N_{k1} ,N_{p + 1,2} )  Sim_{*} (N_{k1} ,N^{\prime}_{k2} )} \right] \hfill \\ + \, \left[ {Sim_{*} (N_{k1} ,N_{p + 1,2} )  Sim_{*} (N_{m1} ,N^{\prime}_{m2} )} \right], \hfill \\ \end{gathered} $$
The same reasoning applies if there is more than only element N
_{i1} in sentence S
_{A}, for which the inequality
\(Sim_{*} (N_{i1} ,N^{\prime}_{i2} ) \le Sim_{*} (N_{i1} ,N_{p + 1,2} )\) is held. The details is omitted for its similarity with previous one. This completes the proof.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Footnotes
1
One shall also mention the all the three semantic similarity use the concept of path length as well, and are sometimes cited under the category of path length measures.
2
We mentioned here and in the whole section hyponymy / hypernymy relation for illustration only as it is the dominant relation in WordNet lexical database, but the reasoning applies equally to meronymy, troponymy relations or any other relation that use hierarchical relation only of WordNet.