Published in: Arabian Journal for Science and Engineering 11/2019

26-07-2019 | Research Article - Computer Engineering and Computer Science

Sentence Embedding and Convolutional Neural Network for Semantic Textual Similarity Detection in Arabic Language

Authors: Adnen Mahmoud, Mounir Zrigui

Abstract

The continuous growth of textual sources on the web has made paraphrasing easier. Detecting it has become a challenge in natural language processing applications such as plagiarism detection, information retrieval and extraction, and question answering. Unlike for Western languages such as English, few works have addressed extrinsic paraphrase detection in Arabic. In this context, we propose a deep learning-based approach that measures the extent to which an original and a suspect document express the same meaning. First, the word2vec algorithm extracts relevant features by modeling each word in the context of its neighbors. The resulting word vectors are then averaged to produce sentence vector representations. Next, a convolutional neural network captures further contextual information and computes the degree of semantic relatedness. Given the lack of publicly available resources, a paraphrased corpus was developed using the skip-gram model, which performed better at replacing an original word with its most similar word of the same grammatical class from a vocabulary. Finally, the proposed system detected contextual relationships between Arabic documents with a precision of 85% and a recall of 86.8%, improving on previous studies.
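
The averaging step described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the three-dimensional vectors in `TOY_VECTORS` are invented values standing in for trained word2vec embeddings, the tokens are English placeholders rather than Arabic data, and cosine similarity is swapped in as a simple stand-in for the convolutional neural network that the paper uses to score semantic relatedness.

```python
import math

# Hypothetical toy embeddings; a real system would load trained word2vec vectors.
TOY_VECTORS = {
    "cat": [1.0, 0.0, 0.0],
    "kitten": [0.9, 0.1, 0.0],
    "sleeps": [0.0, 1.0, 0.0],
    "naps": [0.0, 0.9, 0.1],
    "car": [0.0, 0.0, 1.0],
}


def sentence_vector(tokens, vectors):
    """Average the vectors of all tokens that have an embedding."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        raise ValueError("no token in the sentence has a vector")
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# An "original" sentence, a paraphrased "suspect", and an unrelated sentence.
original = sentence_vector(["cat", "sleeps"], TOY_VECTORS)
suspect = sentence_vector(["kitten", "naps"], TOY_VECTORS)
unrelated = sentence_vector(["car"], TOY_VECTORS)

print("original vs suspect:  ", round(cosine(original, suspect), 3))
print("original vs unrelated:", round(cosine(original, unrelated), 3))
```

With these toy values the paraphrased pair scores much higher than the unrelated pair. Averaging discards word order, which is one reason a further model such as the paper's convolutional network may be layered on top of the sentence vectors.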

Metadata

Title: Sentence Embedding and Convolutional Neural Network for Semantic Textual Similarity Detection in Arabic Language
Authors: Adnen Mahmoud, Mounir Zrigui
Publication date: 26-07-2019
Publisher: Springer Berlin Heidelberg
Published in: Arabian Journal for Science and Engineering, Issue 11/2019
Print ISSN: 2193-567X
Electronic ISSN: 2191-4281
DOI: https://doi.org/10.1007/s13369-019-04039-7
