Published in: Arabian Journal for Science and Engineering 4/2021

03-01-2021 | Research Article-Computer Engineering and Computer Science

Index Term Selection Heuristics for Arabic Text Retrieval

Author: Yaser A. Al-Lahham

Abstract

Selecting index terms for Arabic text is challenging because of the language's complex morphology, and it significantly affects the efficiency of any information retrieval system. Many index term selection methods have been proposed in the literature. Most are based on root extraction and stemming; others apply complex linguistic rules or machine learning tools. This paper proposes a simple method that uses heuristics to select a representative subset of terms to form the index. As a basis, the heuristics select index terms from Arabic words carrying the prefix ‘AL’ (definite words). In addition, the method selects further words according to any of the following heuristics: words that precede or succeed definite terms, words that follow certain linking words, words that follow prepositions in semi-sentences, and words that represent named entities. The heuristics were tested on the TREC-2001/2002 Arabic test collection. The results show that the proposed method is effective: it outperforms indexing all terms stemmed by either of two well-known stemmers. For example, choosing definite words and words that represent named entities outperforms selecting all terms stemmed by the LIGHT10 stemmer by 8.4% in mean average precision, while reducing the index size by 27.8%.
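
The base heuristic and the neighbouring-word heuristic described in the abstract can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names, the whitespace tokenization, and the minimum-length guard are assumptions made for illustration, and the sketch ignores clitics attached before the article (for example a conjunction or preposition fused to the word), linking words, and named entities.

```python
# Illustrative sketch only; names and thresholds are assumptions, not the paper's code.
DEFINITE_PREFIX = "ال"  # the Arabic definite article, transliterated 'AL'


def is_definite(word: str) -> bool:
    """Return True if the word starts with the definite article.

    The length guard (at least two letters after the prefix) is an
    assumed, crude filter against very short strings.
    """
    return word.startswith(DEFINITE_PREFIX) and len(word) > len(DEFINITE_PREFIX) + 1


def select_index_terms(tokens, include_preceding=False, include_succeeding=False):
    """Select definite words and, optionally, their immediate neighbours.

    Terms are returned once each, in order of first selection.
    """
    selected = []
    seen = set()

    def add(term):
        if term not in seen:
            seen.add(term)
            selected.append(term)

    for i, token in enumerate(tokens):
        if not is_definite(token):
            continue
        if include_preceding and i > 0:
            add(tokens[i - 1])
        add(token)
        if include_succeeding and i + 1 < len(tokens):
            add(tokens[i + 1])
    return selected


if __name__ == "__main__":
    # Hypothetical example sentence, tokenized on whitespace.
    tokens = "قرأ الطالب كتاب المعلم في المكتبة".split()
    print(select_index_terms(tokens))
    print(select_index_terms(tokens, include_preceding=True))
```

With the neighbouring-word option enabled, a function word such as a preposition can be picked up when it directly precedes a definite word, so a practical version would presumably combine this step with stop-word filtering.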

Metadata
Title
Index Term Selection Heuristics for Arabic Text Retrieval
Author
Yaser A. Al-Lahham
Publication date
03-01-2021
Publisher
Springer Berlin Heidelberg
Published in
Arabian Journal for Science and Engineering / Issue 4/2021
Print ISSN: 2193-567X
Electronic ISSN: 2191-4281
DOI
https://doi.org/10.1007/s13369-020-05022-3
