DOI: 10.1145/3404835.3463237
short-paper
Open Access

Morphologically Annotated Amharic Text Corpora

Published: 11 July 2021

ABSTRACT

In information retrieval (IR), documents that match the query are retrieved. Search engines usually conflate word variants into a common stem when indexing documents, because a query and a relevant document need not use exactly the same word variant. Stemmers are known to be effective for IR in many languages. However, stemmers and morphological analyzers are still missing for some languages; this is the case for Amharic, which is the working language of Ethiopia. Morphological analysis is the key to deriving the stems, roots (primary lexical units) and grammatical markers of words, such as person, tense and negation markers. This paper presents morphologically annotated Amharic lexicons as well as stem-based and root-based morphologically annotated corpora, which the research community could use as benchmark collections to evaluate either morphological analyzers or information retrieval for Amharic. We expect these resources to foster research in Amharic IR.
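
The conflation of word variants described above can be illustrated with a minimal sketch. It is not the paper's method or resource: the suffix list, the `toy_stem`, `build_index` and `search` helpers, and the English sample documents are all hypothetical stand-ins chosen for readability, and Amharic morphology would need a real morphological analyzer rather than suffix stripping.

```python
from collections import defaultdict

# Hypothetical toy suffix list for English (an assumption for illustration);
# a real stemmer or an Amharic morphological analyzer is far more involved.
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of document ids containing a variant of it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index


def search(index, query):
    """Return the documents that match every query term after stemming."""
    postings = [index.get(toy_stem(term), set()) for term in query.split()]
    return set.intersection(*postings) if postings else set()


docs = {
    "d1": "farmers planted maize",
    "d2": "planting season starts",
    "d3": "the plant grows",
}
index = build_index(docs)

# The query variant "plants" occurs in no document, yet all three documents
# are retrieved because every variant is indexed under the stem "plant".
print(search(index, "plants"))
```

Because both documents and queries pass through the same stemming function, a query only has to share a stem with a document, not an exact surface form.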

Supplemental Material

rsp1138 Morphologically Annotated Amharic Text Corpora.mp4 (mp4, 15.4 MB)

    Published in

      SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
      July 2021
      2998 pages
      ISBN: 9781450380379
      DOI: 10.1145/3404835

      Copyright © 2021 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 792 of 3,983 submissions, 20%
