1 Introduction
- (A1) The order of words in individual sentences cannot be accounted for. Therefore, sentences like “students like fast exercises in class” and “fast students like class in” would yield the same semantic similarity. We refer to this challenge as A1. Attempts to account for word order in semantic similarity have been made by several researchers, but with very limited success. For instance, Islam and Inkpen [27] suggested including a word-order similarity as a weighted additive component of the semantic similarity, where the word-order similarity is computed from the normalized difference of common words among the tokens of the two input sentences. This presupposes the existence of a set of tokens common to both inputs. A very similar approach has also been adopted by Li et al. [25, 26]. Ozates et al. [28] proposed to use the dependency grammar concept; in essence, their approach uses dependency-tree bigram units and evaluates the similarity of the two input sentences as the degree of bigram-unit match. Nevertheless, it should be noted that handling bigram or n-gram units instead of the standard bag-of-words representation entails a substantial increase in computational time, without necessarily achieving a higher accuracy score, as pointed out in other studies [18, 20].
- (A2) The WordNet word-level semantics is restricted to the noun and verb part-of-speech (PoS) categories only. This makes nouns and verbs the only two classes usable for calculating similarity scores between individual words, because of their hierarchical taxonomic organization in WordNet. This trivially leaves the correlation among entities of distinct PoS, as well as other types of non-verbal and non-naming entities, unaccounted for. For instance, WordNet measures fail to establish the semantic relation of the word “investigate” to any of the words “investigation”, “investigator”, or “investigative”, as they belong to different classes (parts of speech) that are not taxonomically linked in the WordNet hierarchy. Although the derivational morphology of words is contained in the WordNet lexical database, it is stored in a distinct fashion and without explicit coherence; such limitations have already been pointed out in other studies, see, for instance, [1, 2]. Therefore, accounting for various PoS entities in the quantification of sentence-to-sentence semantic similarity remains an open challenge for the research community. We shall refer to this challenge as A2.
- (A3) Several WordNet semantic similarity measures have been put forward by researchers, some of which are based solely on the hierarchy of the WordNet taxonomy [29, 30], while others make use of corpus-based information as well [31–33]. Therefore, the contribution of individual semantic similarity measures to sentence-to-sentence similarity is still to be investigated. Although several studies have reported ad hoc, experiment-based comparisons, as pointed out in [34–38], the near absence of theoretical studies in this respect is quite striking. We shall refer to this challenge as A3.
2 WordNet and semantic similarity
2.1 WordNet taxonomy
- A broader concept (hypernym synset): {motor vehicle; automotive vehicle},
- More specific concepts or hyponym synsets: e.g. {cruiser; squad car; patrol car; police car; prowl car} and {cab; taxi; hack; taxicab},
- Parts it is composed of: {bumper}, {car door}, {car mirror} and {car window} (meronymy relationship).
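As an illustration, these relations can be retrieved programmatically; the snippet below is a minimal sketch using NLTK's WordNet interface (an assumption on tooling, not the paper's setup) for the synset {car; auto; automobile; machine; motorcar}.

```python
# Minimal sketch: browsing WordNet relations for the "car" synset.
# Assumes NLTK and its WordNet data are installed (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

car = wn.synset('car.n.01')  # {car; auto; automobile; machine; motorcar}

# Hypernym (broader concept), e.g. {motor vehicle; automotive vehicle}
print([s.name() for s in car.hypernyms()])

# Hyponyms (more specific concepts), e.g. police car, taxicab, ...
print([s.name() for s in car.hyponyms()][:5])

# Part meronyms (parts it is composed of), e.g. bumper, car door, ...
print([s.name() for s in car.part_meronyms()][:5])
```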
3 Taxonomy-based semantic similarity
- len(ci,cj): the length of the shortest path from synset ci to synset cj in the WordNet lexical database.
- lcs(ci,cj): the lowest common subsumer of ci and cj.
- depth(ci): the shortest distance, in terms of the number of edges or nodes, from the global root to synset ci, where the global root is such that depth(root) = 1.
- max_depth: the maximum depth of the taxonomy.
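These quantities map directly onto WordNet API calls; the snippet below is a minimal sketch (again assuming NLTK's WordNet interface) that retrieves them for a pair of noun synsets, with the offset convention depth(root) = 1 applied explicitly.

```python
# Minimal sketch: retrieving len, lcs, depth and max_depth from the noun taxonomy.
# Assumes NLTK's WordNet data; the "+ 1" follows the convention depth(root) = 1.
from nltk.corpus import wordnet as wn

dog, cat = wn.synset('dog.n.01'), wn.synset('cat.n.01')

path_len = dog.shortest_path_distance(cat)             # len(c_i, c_j): edges on the shortest path
lcs = dog.lowest_common_hypernyms(cat)[0]               # lcs(c_i, c_j): lowest common subsumer
depth_dog = dog.min_depth() + 1                         # depth(c_i), counted from a root of depth 1
max_depth = 1 + max(s.max_depth() for s in wn.all_synsets('n'))  # taxonomy max_depth (takes a while)

print(path_len, lcs.name(), depth_dog, max_depth)
```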
3.1 Properties of taxonomy-based semantic similarity measure
- R is trivially reflexive;
- R is transitive: for any synsets c1, c2, c3, if c1 @→ c2 and c2 @→ c3, then c1 @→ c3. For example, since “dog” is a hyponym of “mammal” and “mammal” is a hyponym of “animal”, “dog” is a hyponym of “animal”.
- R is anti-symmetric: for any synsets c1, c2, if c1 @→ c2 and c2 @→ c1, then c1 = c2.
- Reflexivity: \(Sim_{x}(c_{i},c_{i}) = 1\).
- Symmetry: \(Sim_{x}(c_{i},c_{j}) = Sim_{x}(c_{j},c_{i})\).
- Boundedness: \(0 \le Sim_{x}(c_{i},c_{j}) \le 1\).
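These properties can be checked numerically on a few synset pairs; the snippet below is a minimal sketch assuming NLTK's implementations of the path-length and Wu and Palmer measures.

```python
# Minimal sketch: numerically checking reflexivity, symmetry and boundedness on a few synset pairs.
# Assumes NLTK's WordNet interface; the measures shown are path-length and Wu and Palmer.
from nltk.corpus import wordnet as wn

pairs = [('dog.n.01', 'cat.n.01'), ('car.n.01', 'bicycle.n.01'), ('boy.n.01', 'boy.n.01')]
for name_a, name_b in pairs:
    a, b = wn.synset(name_a), wn.synset(name_b)
    for sim in (wn.path_similarity, wn.wup_similarity):
        s_ab, s_ba = sim(a, b), sim(b, a)
        assert 0.0 <= s_ab <= 1.0        # boundedness
        if a == b:
            assert s_ab == 1.0           # reflexivity
        # symmetry: both directions should report the same score
        print(f"{name_a:>10} vs {name_b:>10}: {s_ab:.3f} / {s_ba:.3f}")
```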
3.2 Summary
Similarity measure | Range | Monotonicity with respect to hyponymy/hypernymy | Incremental monotonicity evolution | Reflexivity | Symmetry |
---|---|---|---|---|---|
Path length | \(\frac{1}{\mathrm{max}\_depth}\le Si{m}_{path}({c}_{i},{c}_{j})\le 0.5\) \(Si{m}_{path}\left({c}_{i},{c}_{j}\right)=1\, if\, same\, synsets\) | \(Si{m}_{path}\left(.,x\right)\) or \(Si{m}_{path}(x,.)\) is monotonic \(Si{m}_{path}\left(x,y\right)\) is monotonic if all inputs have same lcs | \(Si{m}_{path}(x,y)=0.5\) when x is a direct hyponym/hypernym of y | Yes | Yes |
Leacock and Chodorow | \(0\le Si{m}_{lch}^{*}({c}_{i},{c}_{j})<0.7\) \(Si{m}_{lch}^{*}\left({c}_{i},{c}_{j}\right) =1\, for\, same\, synsets\) | \(Si{m}_{lch}\left(.,x\right)\) or \(Si{m}_{lch}(x,.)\) is monotonic \(Si{m}_{lch}\left(x,y\right)\) is monotonic if all inputs have same lcs | \(Si{m}_{lch}(x,y)=1-\frac{\mathrm{log}(2)}{\mathrm{log}(2\mathrm{max}\_depth)-\mathrm{log}(2)}\) when x is a direct hyponym/hypernym of y | Yes | Yes |
Wu and Palmer | \(\frac{2}{\mathrm{max}\_depth+1}\le Si{m}_{wup}({c}_{i},{c}_{j})\le 1\) | \(Si{m}_{wup}\left(.,x\right)\) or \(Si{m}_{wup}(x,.)\) is monotonic \(Si{m}_{wup}\left(x,y\right)\) is monotonic if all inputs have same lcs | \(Si{m}_{wup}(x,y)\ge 0.8\) when x is a direct hyponym/hypernym of y | Yes | Yes |
Property | Fulfilment |
---|---|
Score ordering | \(Si{m}_{path}(x,y) \le Si{m}_{wup}(x, y)\) for all x, y; \(Si{m}_{wup}(x,y)\le Si{m}_{lch}^{*}(x,y)\) if \(len(x,y)\le 2\), otherwise \(Si{m}_{wup}(x,y)>Si{m}_{lch}^{*}(x,y)\) |
Time complexity | \(T(Si{m}_{path})\le T(Si{m}_{lch})\le T(Si{m}_{wup})\) |
Monotonic equivalence | \(Si{m}_{path}\) and \(Si{m}_{lch}\) are monotonically equivalent |
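The score ordering and the relative timings summarized above can be probed empirically; the sketch below assumes NLTK's measures (so absolute timings will differ from a WordNet::Similarity setup) and leaves the Leacock and Chodorow score unnormalized, as NLTK reports it.

```python
# Minimal sketch: probing the score ordering Sim_path <= Sim_wup and relative timings.
# Assumes NLTK's WordNet interface; NLTK's lch score is the raw -log value, i.e. not normalized to [0, 1].
import time
from nltk.corpus import wordnet as wn

pairs = [(wn.synset(a), wn.synset(b)) for a, b in
         [('dog.n.01', 'cat.n.01'), ('car.n.01', 'bicycle.n.01'), ('coast.n.01', 'hill.n.01')]]

for name, sim in [('path', wn.path_similarity), ('lch', wn.lch_similarity), ('wup', wn.wup_similarity)]:
    start = time.perf_counter()
    scores = [sim(a, b) for _ in range(200) for a, b in pairs]
    elapsed = time.perf_counter() - start
    print(f"{name:4s} sample scores={[round(s, 3) for s in scores[:3]]} time={elapsed:.3f}s")

# Score ordering: the path-length score never exceeds the Wu and Palmer score on these pairs.
for a, b in pairs:
    assert a.path_similarity(b) <= a.wup_similarity(b)
```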
4 Sentence semantic similarity
4.1 From word semantic similarity to sentence similarity
- Using the Wu and Palmer similarity measure, we have:
$$Sim_{g}^{WP}(S_{A},S_{B}) = \frac{1}{2}\left(\frac{1 + 0.71 + 0.67 + 0.5}{4} + \frac{1 + 0.71 + 0.71 + 0.5 + 0.4}{5}\right) \approx 0.69$$
- Using the path-length measure:
$$Sim_{g}^{PL}(S_{A},S_{B}) = \frac{1}{2}\left(\frac{1 + 0.17 + 0.17 + 0.33}{4} + \frac{1 + 0.17 + 0.2 + 0.33 + 0.25}{5}\right) \approx 0.40$$
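To make the computation above concrete, the following is a minimal sketch of a canonical extension of this form (the bidirectional average of best-match word similarities, with nouns matched against nouns and verbs against verbs), assuming NLTK for tagging and the Wu and Palmer measure; tokenization, tagging and the first-sense choice are simplifications, so the exact numbers may differ from the worked example and from the paper's expression (26).

```python
# Minimal sketch of a bidirectional best-match sentence similarity, in the spirit of the
# computation above: nouns are matched against nouns and verbs against verbs, and each
# token contributes its best Wu and Palmer score in the other sentence.
# Assumes NLTK with 'punkt', 'averaged_perceptron_tagger' and 'wordnet' data downloaded;
# the first WordNet sense is taken for each token, which is a simplification.
import nltk
from nltk.corpus import wordnet as wn

def content_synsets(sentence):
    """Return (token, synset) pairs for the nouns and verbs found in WordNet."""
    out = []
    for token, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
        pos = wn.NOUN if tag.startswith('N') else wn.VERB if tag.startswith('V') else None
        if pos:
            synsets = wn.synsets(token, pos=pos)
            if synsets:
                out.append((token, synsets[0]))
    return out

def directed(src, dst):
    """Mean over src tokens of the best same-PoS similarity found in dst."""
    best_scores = []
    for _, s in src:
        candidates = [s.wup_similarity(t) or 0.0 for _, t in dst if t.pos() == s.pos()]
        best_scores.append(max(candidates) if candidates else 0.0)
    return sum(best_scores) / len(best_scores)

def sentence_similarity(sent_a, sent_b):
    """Average of the two directed best-match scores; None when undefined (no nouns/verbs)."""
    syn_a, syn_b = content_synsets(sent_a), content_synsets(sent_b)
    if not syn_a or not syn_b:
        return None
    return 0.5 * (directed(syn_a, syn_b) + directed(syn_b, syn_a))

print(sentence_similarity("Students like fast exercises in class",
                          "Pupils enjoy quick drills during lessons"))
```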
- Stopwords as well as adjectives/adverbs, although important in conveying the meaning of the underlying sentence, are not taken into account in the above sentence-to-sentence similarity. An intuitive way to account for the occurrence of such tokens consists of expanding the scope of |SA| to include all tokens of sentence A, including stopwords and adverbs/adjectives. However, this would substantially reduce the score of the sentence-to-sentence similarity. Besides, such integration of extra tokens would not take their meaning into account: if another sentence, say S’A, contains adverbs/adjectives that are antonyms of those in sentence SA, it would still induce the same score! Consequently, discarding such tokens seems a rational choice in this respect.
- Some tokens, like “involved”, can appear in both the verb and adjective categories. However, since only the verbal entry is present in the taxonomy, handling such a token as a verb seems convenient and intuitive.
- Assume that sentence SA reduces, after some text preprocessing, to a single noun N1 and a single verb V1, while sentence SB reduces to noun N2 and verb V2; then
$$Sim_{g}^{*}\left(S_{A},S_{B}\right)=\frac{1}{2}\left(Sim_{*}\left(N_{1},N_{2}\right)+Sim_{*}\left(V_{1},V_{2}\right)\right)\quad (27)$$
- In the case of identical sentences, or at least identical nouns and verbs in both sentences SA and SB, it is easy to see that \(Sim_{g}^{*}(S_{A},S_{B})=1\).
- In the case where the sentences, after preprocessing, reduce to a single noun or verb, the sentence similarity boils down to the corresponding word semantic similarity; that is, assuming N1 and N2 (resp. V1 and V2) are the nouns (resp. verbs) associated with sentences SA and SB, respectively, then
$$Sim_{g}^{*}\left(S_{A},S_{B}\right) = Sim_{*}\left(N_{1},N_{2}\right)\quad \left(\text{resp. } Sim_{g}^{*}\left(S_{A},S_{B}\right) = Sim_{*}\left(V_{1},V_{2}\right)\right)$$
- There are situations in which the semantic similarity between the two sentences is not defined. Indeed, this occurs if the two sentences contain neither naming nor verbal expressions, or if one sentence contains only naming expressions and the other only verbal expressions. In such cases, there is no analogy between the parts of speech of the two sentences, which renders the application of expression (21) void. Similarly, this also occurs if at least one of the two sentences contains neither verbal nor naming expressions. An example of such sentences is: SA: “How are you?”; SB: “Hi there”.
- An interesting case concerns the situation where a sentence contains repeated words or semantically equivalent noun or verb expressions. It is thereby of interest to see how such repetition influences the sentence similarity score. In this respect, the following holds.
5 Effect of part-of-speech conversion
6 Dataset
7 Evaluation of semantic similarity measures
- LSA corresponds to the case where the sentence similarity is obtained using the latent semantic analysis approach [47].
- WN corresponds to the case where the sentence similarity is calculated using the canonical extension (26) with the Wu and Palmer word-to-word semantic similarity measure. The choice of the Wu and Palmer similarity is justified by the behaviour of the similarity measures, especially in terms of the range of values they can take. In the implementation, we used Pedersen’s [48, 49] implementation module of the Wu and Palmer semantic similarity measure. For preprocessing, we used the Illinois part-of-speech tagger [50] to identify the various text segments.
- WNwC corresponds to the case where the semantic similarity is calculated using the canonical extension (26) with the Wu and Palmer word-to-word semantic similarity measure after performing the “all-to-noun” conversion. This conversion is performed using the CatVar model [43], because it is found to yield better results than morphosemantic links [44, 45].
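The “all-to-noun” conversion here relies on the CatVar model; as a rough, publicly available alternative, one can follow WordNet's own derivational (morphosemantic) links, which, as noted above, tend to perform worse. A minimal sketch of that alternative is given below; the function name and the first-match heuristic are illustrative assumptions, not the paper's procedure.

```python
# Minimal sketch: mapping a verb/adjective to a related noun via WordNet derivational links.
# This is the morphosemantic-link alternative to CatVar discussed in the text, not the CatVar model itself.
from nltk.corpus import wordnet as wn

def to_noun(word, pos):
    """Return a derivationally related noun lemma for `word`, or None if not found."""
    for synset in wn.synsets(word, pos=pos):
        for lemma in synset.lemmas():
            for related in lemma.derivationally_related_forms():
                if related.synset().pos() == wn.NOUN:
                    return related.name()
    return None

print(to_noun('investigate', wn.VERB))   # e.g. 'investigation' or 'investigator'
print(to_noun('fast', wn.ADJ))           # adjective-to-noun conversion, may return None
```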
Methods | Correlation \({r}_{s}\) | Mean value of similarity score of sentence pairs | Min–Max value of similarity | Standard deviation | Median | Processing time (sec) |
---|---|---|---|---|---|---|
STASIS | 0.816 | 0.589433 | [0.209, 1] | 0.193619 | 0.6145 | 0.561 |
LSA | 0.838 | 0.687667 | [0.505, 1] | 0.143315 | 0.685 | 0.134 |
WN | 0.821 | 0.656186 | [0.362, 1] | 0.168924 | 0.6272 | 0.343 |
WNwC | 0.846 | 0.695833 | [0.397, 1] | 0.155162 | 0.683 | 0.423 |
Paired variables | MoD (mean of differences) | Lower interval | Upper interval | T value | P value |
---|---|---|---|---|---|
STASIS–WNwC | −0.1064 | −Inf | −0.0597025 | −3.8715 | 0.0002833 |
WN–WNwC | −0.0396474 | −Inf | −0.01611506 | −2.8627 | 0.003861 |
LSA–WNwC | −0.0081667 | −Inf | 0.02323585 | −0.44188 | 0.3309 |
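For reference, statistics of the kind reported in the two tables above (Spearman correlation against human judgements and one-sided paired t-tests between methods) can be computed as sketched below; the score arrays are placeholders, not the study's data.

```python
# Minimal sketch: Spearman correlation and a one-sided paired t-test between two methods.
# The arrays below are placeholders; substitute the per-pair similarity scores and human ratings.
import numpy as np
from scipy.stats import spearmanr, ttest_rel

human  = np.array([0.96, 0.59, 0.13, 0.77, 0.41])   # human similarity judgements (placeholder)
wn_wc  = np.array([0.93, 0.62, 0.40, 0.71, 0.45])   # WNwC scores per sentence pair (placeholder)
stasis = np.array([0.88, 0.55, 0.21, 0.64, 0.39])   # STASIS scores per sentence pair (placeholder)

rho, _ = spearmanr(wn_wc, human)
print(f"Spearman r_s = {rho:.3f}")

# One-sided paired t-test: is the mean of the (STASIS - WNwC) differences negative?
t_stat, p_value = ttest_rel(stasis, wn_wc, alternative='less')
print(f"t = {t_stat:.4f}, p = {p_value:.4g}")
```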
8 Discussions and implications
- The algebraic and interactive properties of the investigated WordNet semantic similarities, namely the path-length, Wu and Palmer, and Leacock and Chodorow measures, provide valuable insights to data mining and natural language processing researchers and practitioners. To exemplify this reasoning, consider the commonly employed design of a data mining task that thresholds a semantic similarity measure, say \(Si{m}_{x}\left({c}_{1},{c}_{2}\right)\ge \zeta \), where the threshold \(\zeta \) is often chosen either empirically or by imposing some default value. Nevertheless, if the path-length measure were employed, setting any value \(\zeta >0.5\) would yield a useless outcome. This is because we can predict, through the results pointed out in Sect. 3, that such a threshold will only capture the single instance corresponding to synonyms. The same reasoning applies to the Leacock and Chodorow measure when we set a threshold \(\zeta >0.7\). However, such a restriction does not apply in the case of the Wu and Palmer measure. Similarly, the incremental evolution property provides useful insights into what to expect when one slightly changes the input along the lexical hierarchy. In particular, it shows that an incremental move up or down the hierarchy, as in a direct hyponymy/hypernymy relation, yields a constant similarity score for the path-length and Leacock and Chodorow measures, regardless of the words employed, while the Wu and Palmer measure ensures a high score (beyond 0.8) whose exact value depends on the individual words employed. Such knowledge can be very useful, for instance, to predict the robustness of a plagiarism detection system built on a WordNet semantic similarity layer. Likewise, the interactive properties show, for instance, that if hard time requirements are imposed, then the path-length similarity should be prioritized.
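As a concrete illustration of the thresholding argument above, the following minimal sketch (assuming NLTK's path-length and Wu and Palmer measures) shows that a threshold \(\zeta > 0.5\) on the path-length measure retains only identical synsets, whereas the same threshold remains usable with the Wu and Palmer measure.

```python
# Minimal sketch: effect of a threshold zeta > 0.5 on path-length versus Wu and Palmer scores.
# Assumes NLTK's WordNet interface; the pairs and the threshold are illustrative.
from nltk.corpus import wordnet as wn

zeta = 0.6
pairs = [('dog.n.01', 'dog.n.01'), ('dog.n.01', 'cat.n.01'), ('car.n.01', 'vehicle.n.01')]

for name_a, name_b in pairs:
    a, b = wn.synset(name_a), wn.synset(name_b)
    kept_path = a.path_similarity(b) >= zeta   # only identical synsets survive
    kept_wup = a.wup_similarity(b) >= zeta     # closely related synsets can still survive
    print(f"{name_a:12s} {name_b:14s} path kept: {kept_path}  wup kept: {kept_wup}")
```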
- The results gained from Sect. 4 have direct implications for the type of handling that may be needed for the underlying text mining task. This involves the type of preprocessing that should be performed prior to calling the sentence-to-sentence similarity module, depending on whether the NLP task favours maximizing or minimizing the similarity score, according, for instance, to the criticality of false positives in the task at hand. This can provide insights into whether short sentences are deemed more important or not and, if rephrasing is permitted, how such an operation could be performed so as to maximize or minimize the similarity score. In particular, the finding highlights the importance of PoS tagging as an initial scrutinizing task. If such analysis reveals that the two input sentences contain only distinct PoS word categories, then one concludes immediately, without any further processing, that the sentence-to-sentence similarity yields a zero similarity score. This also highlights the importance of an accurate part-of-speech tagger in the initial scrutinizing stage, as any error has a substantial consequence on the outcome. Interestingly, this provides some guidelines on the choice of the stopword list as well, such that the PoS aspect can be taken into account in order to ensure sufficient coverage, in terms of the number of noun and verb entities, to trigger a nonzero sentence-to-sentence similarity score.
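A minimal sketch of such an initial PoS scrutiny step is given below (assuming NLTK's tagger, an illustrative choice rather than the Illinois tagger used in the experiments); it simply reports whether two sentences share at least one noun or verb category before any similarity computation is attempted.

```python
# Minimal sketch: initial PoS scrutiny - do the two sentences share a noun or verb category?
# Assumes NLTK with 'punkt' and 'averaged_perceptron_tagger' data; a more accurate tagger can be substituted.
import nltk

def pos_categories(sentence):
    """Return the subset of {'N', 'V'} categories present in the sentence."""
    tags = {tag[0] for _, tag in nltk.pos_tag(nltk.word_tokenize(sentence))}
    return tags & {'N', 'V'}

def worth_comparing(sent_a, sent_b):
    """True only if both sentences contain at least one common noun/verb category."""
    return bool(pos_categories(sent_a) & pos_categories(sent_b))

print(worth_comparing("How are you?", "Hi there"))                        # likely False
print(worth_comparing("Students like exercises", "Pupils enjoy drills"))  # True
```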
- The findings in Sect. 5 highlight the benefits of the PoS transformation, especially the “all-to-noun” transformation, as a tool that can overcome some inherent limitations of the canonical sentence-to-sentence semantic similarity, where only noun and verbal entities are handled. Nevertheless, this should not hide the inherent limitations of such an approach, which relies primarily on either a manually created database or lexical transformation rules; these are not error free and can amplify the meaning shift away from the correct sense. The analysis of the computational complexity revealed the importance of accounting for WordNet server latency and reliability, which is often outside the control of the user. Therefore, any attempt to use backup, locally stored data instead of server data is of paramount importance in this regard.
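In the spirit of the backup suggestion above, one simple mitigation is to memoize word-to-word similarity look-ups against locally stored WordNet data; the sketch below is illustrative, and the cache layer and function name are assumptions rather than part of the paper's pipeline.

```python
# Minimal sketch: memoizing word-to-word similarity look-ups to avoid repeated (possibly remote) queries.
# Uses NLTK's locally stored WordNet data; the cache layer itself is an illustrative design choice.
from functools import lru_cache
from nltk.corpus import wordnet as wn

@lru_cache(maxsize=100_000)
def cached_wup(word_a, word_b):
    """Wu and Palmer similarity of the first noun senses, cached by word pair."""
    syn_a, syn_b = wn.synsets(word_a, pos=wn.NOUN), wn.synsets(word_b, pos=wn.NOUN)
    if not syn_a or not syn_b:
        return 0.0
    return syn_a[0].wup_similarity(syn_b[0]) or 0.0

print(cached_wup('car', 'bicycle'))   # first call computes the score
print(cached_wup('car', 'bicycle'))   # second call is served from the local cache
print(cached_wup.cache_info())
```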
- Overall, the developments pursued in this paper, despite their novelty and relevance to the AI and NLP community, are also subject to several inherent limitations, which can be traced back to either the selected semantic similarity measures or the generic pipeline employed. First, we restricted our analysis to the three commonly employed WordNet similarity measures that explore the hierarchical structure only, which leaves other similarity measures unexplored, including those proposed as extensions or refinements of the path-length and Wu and Palmer measures; see [51] for an overview of structure-based similarity measures. Second, the exploration of properties at either the word level or the sentence level is not meant to be exhaustive, as we restricted ourselves to only a few properties that might be of interest to the data mining and natural language processing community. Third, the PoS conversion process utilized WordNet's conceptual relations and some other linguistic/grammatical rules. Therefore, such conversion may sometimes be inaccurate, especially when the underlying word has multiple senses, which opens wide the door to the word sense disambiguation problem. Fourth, the extension from the word level to the sentence level considered only the canonical extension highlighted in expression (21). Nevertheless, it should be noted that this is far from being the unique representation of such an extension, and several interesting proposals have been reported in the computational linguistics community. One may mention, for instance, approaches that exploit the syntactic information in the sentence, linking the verb entity to its subject and complement object so as to maintain, to some extent, the structure of the sentence, n-gram models, or enforcing a specific sentence ontology representation; see [18] for an overview of such alternative representations.