1 Introduction
2 State of the art
2.1 Overview on the sentences similarity methods
2.2 LMF-ISO 24613 standard
3 The proposed method
3.1 Preprocessing
- Tokenization: input sentences are broken up into tokens (words).
- Punctuation removal: punctuation signs carry no useful information for comparing sentences, so they are removed to obtain more significant results.
- Lemmatization: morphological variants are reduced to their base (lemma) form.
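The three steps above can be sketched as a small pipeline. The whitespace tokenizer, the punctuation regex, and the toy lemma table are illustrative assumptions, not the tools actually used for Arabic in this work:

```python
import re

# Toy lemma lookup, purely for illustration; a real system would use a
# morphological analyzer backed by the LMF dictionary.
LEMMAS = {"decreed": "decree", "healing": "heal", "trees": "tree"}

def tokenize(sentence):
    """Break the input sentence into word tokens."""
    return sentence.split()

def strip_punctuation(tokens):
    """Drop punctuation signs, which are treated as unimportant here."""
    cleaned = [re.sub(r"[^\w]", "", t) for t in tokens]
    return [t for t in cleaned if t]

def lemmatize(tokens):
    """Map each morphological variant to its base form (toy lookup)."""
    return [LEMMAS.get(t.lower(), t.lower()) for t in tokens]

def preprocess(sentence):
    """Apply the three preprocessing steps in order."""
    return lemmatize(strip_punctuation(tokenize(sentence)))
```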
3.2 Similarity scores attribution
- MC is the number of common words between the sentences S1 and S2,
- MS1 is the number of words contained in sentence S1 and
- MS2 is the number of words contained in sentence S2.
- Case 1: if Wi appears in the sentence, then Ti is set to 1.
- Case 2: if Wi is not contained in the sentence, then a semantic similarity score is computed between Wi and each word in the sentence using the synonymy relations of the LMF standardized dictionary (extracted from the Sense Relation class). The most similar word to Wi in the sentence is the one with the highest similarity score \(\theta \); Ti is then set to \(\theta \).
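The two cases can be sketched as follows. Here `word_similarity` stands in for the synonymy-based score derived from the LMF dictionary's Sense Relation class; its implementation is assumed, not reproduced from the paper:

```python
def semantic_vector(joint_words, sentence_words, word_similarity):
    """Build the semantic vector T for one sentence over the joint word set.

    word_similarity(w1, w2) is assumed to return a score in [0, 1] based
    on the synonym sets of the LMF standardized dictionary.
    """
    vector = []
    for w in joint_words:
        if w in sentence_words:
            # Case 1: the word appears literally in the sentence.
            vector.append(1.0)
        else:
            # Case 2: take the highest synonymy score theta over the sentence.
            theta = max((word_similarity(w, s) for s in sentence_words),
                        default=0.0)
            vector.append(theta)
    return vector
```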
- MC is the number of common words between the two synonym sets,
- MW1 is the number of words contained in the w1 synonym set and
- MW2 is the number of words contained in the w2 synonym set.

From the generated semantic vectors, as described above, we compute the semantic similarity score between them, which we call SM(S1, S2), using the Cosine similarity [18]:
$$\begin{aligned} \mathrm{SM(S1,S2)}= \frac{V1 \cdot V2}{||V1|| \cdot ||V2||}, \end{aligned}$$
(3)
where:
- V1 is the semantic vector of sentence S1 and
- V2 is the semantic vector of sentence S2.

Semantic knowledge, and especially semantic arguments, which aim at characterizing the meanings of lexical units in sentences, has attracted considerable interest in both the linguistic and computational linguistic domains. A semantic argument can be defined as a semantic linguistic property that serves as a valuable means of comprehending the specific meaning of a sentence. Moreover, a semantic argument is characterized by its semantic class and its thematic role, which provides information about the relationships between words and a mechanism of interaction among the syntactic processors. The thematic role refers to a semantic relationship between a predicate and its arguments. For example, the semantic argument “the broom-handle” plays a different thematic role in sentence S1: “He banged the broom-handle on the ceiling”, and in S2: “He banged the ceiling with the broom-handle”, because it is an object in S1 and an instrument in S2. Likewise, the semantic argument “the ceiling” plays the role of a location in S1 and of an object in S2.
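Equation (3) is the standard cosine between the two semantic vectors; a minimal sketch:

```python
import math

def cosine_similarity(v1, v2):
    """SM(S1, S2) = (V1 . V2) / (||V1|| * ||V2||), as in Eq. (3)."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = (math.sqrt(sum(a * a for a in v1)) *
            math.sqrt(sum(b * b for b in v2)))
    # Guard against zero vectors, which Eq. (3) leaves undefined.
    return dot / norm if norm else 0.0
```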
- ASC is the number of common semantic arguments between the two sentences,
- ASS1 is the number of semantic arguments contained in sentence S1 and
- ASS2 is the number of semantic arguments contained in sentence S2.
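The counts above suggest an overlap ratio over the two argument sets. The paper's exact combination of ASC, ASS1 and ASS2 is not reproduced here; the Dice-style ratio 2·ASC / (ASS1 + ASS2) below is an assumption for illustration, with arguments modeled as (word, thematic role) pairs:

```python
def syntactico_semantic_similarity(args_s1, args_s2):
    """Overlap of semantic arguments between two sentences.

    args_s1, args_s2: collections of (word, thematic_role) pairs.
    The Dice-style combination is assumed, not taken from the paper.
    """
    set1, set2 = set(args_s1), set(args_s2)
    asc = len(set1 & set2)          # common semantic arguments
    total = len(set1) + len(set2)   # ASS1 + ASS2
    return 2 * asc / total if total else 0.0
```

With the broom-handle example, the same words carry different roles in S1 and S2, so the two argument sets share nothing and the score is 0.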
3.3 Supervised learning
- Vi is the feature vector extracted from the pair of sentences,
- SL is the lexical similarity score between the elements of a pair of sentences,
- SM is the semantic similarity score between the elements of a pair of sentences,
- SSM is the syntactico-semantic similarity score between the elements of a pair of sentences and
- Di is the Boolean criterion representing the class (similar or dissimilar) of the vector Vi.
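One way to assemble these elements into training examples is sketched below; the container and function names are assumptions, not the paper's implementation:

```python
from typing import NamedTuple

class PairFeatures(NamedTuple):
    """Feature vector Vi for one sentence pair, plus its class label Di."""
    sl: float   # lexical similarity score SL
    sm: float   # semantic similarity score SM
    ssm: float  # syntactico-semantic similarity score SSM
    di: bool    # True if the pair is annotated as similar

def build_training_set(scored_pairs):
    """scored_pairs: iterable of (SL, SM, SSM, Di) tuples from annotation."""
    return [PairFeatures(sl, sm, ssm, bool(di))
            for sl, sm, ssm, di in scored_pairs]
```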
- \(\alpha \) is the weight attributed to lexical similarity,
- \(\beta \) is the weight attributed to semantic similarity,
- \(\gamma \) is the weight attributed to syntactico-semantic similarity and
- C is a constant.
- If Sim(S1, S2) \(\ge \) threshold, then the sentences are similar.
- If Sim(S1, S2) < threshold, then the sentences are not similar.
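The weight definitions suggest a linear combination of the three scores followed by the threshold rule above. The linear form and the example threshold value are assumptions for illustration; in the paper the weights and threshold are learned:

```python
def combined_similarity(sl, sm, ssm, alpha, beta, gamma, c=0.0):
    """Sim(S1, S2) as a weighted sum of the three similarity scores.

    The form alpha*SL + beta*SM + gamma*SSM + C is inferred from the
    weight definitions, not quoted from the paper.
    """
    return alpha * sl + beta * sm + gamma * ssm + c

def are_similar(sl, sm, ssm, alpha, beta, gamma, c=0.0, threshold=0.5):
    """Decision rule: similar iff Sim(S1, S2) >= threshold."""
    return combined_similarity(sl, sm, ssm, alpha, beta, gamma, c) >= threshold
```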
4 Experiments and results
4.1 The databases
Dataset | #Pairs |
---|---|
Lissan Al-Arab | 480 |
Al-Wassit | 266 |
Al-Muhit | 178 |
Taj Al-Arous | 456 |
Arabic sentence | English translation | Human similarity (mean) | Our proposed method |
---|---|---|---|
× | God decreed the patient’s survival | 0.7 | 0.75 |
× | God decreed the patient’s healing | | |
× | The beggar took the food | 0.6 | 0.75 |
× | The beggar took in the food | | |
× | I feel pain | 0.5 | 0.5 |
× | I have a stomachache in my belly | | |
× | He gave a person the money | 0.7 | 0.75 |
× | He gave the money to him | | |
× | He wrote him the ground | 0.3 | 0.25 |
× | He wrote him a letter | | |
× | The rain continues | 0.4 | 0.25 |
× | The sky continues with the rain | | |
× | Shelving tree | 1 | 1 |
× | Shelving trees | | |
× | The nurse gave an injection | 0.45 | 0.5 |
× | The nurse gave the patient an injection | | |
× | God kept him away | 0 | 0 |
× | God did not keep him away | | |
× | He weakened his enemies | 0.3 | 0.5 |
× | He weakened his enemies with wounds | | |
4.2 An experiment with human similarities of Arabic sentence pairs
4.3 Results and discussion
Measure | Correlation r |
---|---|
Our proposed measure | 0.92 |
Mean of all participants | 0.938 |
Worst participant | 0.73 |
Best participant | 0.947 |
Precision (%) | Recall (%) | F-score (%) |
---|---|---|
88.12 | 83.24 | 85.61 |