ABSTRACT
In information retrieval (IR), documents that match the query are retrieved. Search engines usually conflate word variants into a common stem when indexing documents, because a document can be relevant to a query without using exactly the same word variant. Stemmers are known to be effective for IR in many languages. However, there are still languages for which stemmers or morphological analyzers are missing; this is the case for Amharic, the working language of Ethiopia. Morphological analysis is the key to deriving stems, roots (primary lexical units), and grammatical markers of words, such as person, tense, and negation markers. This paper presents morphologically annotated Amharic lexicons as well as stem-based and root-based morphologically annotated corpora, which the research community could use as benchmark collections to evaluate either morphological analyzers or information retrieval systems for Amharic. These resources are expected to foster research in Amharic IR.
Index Terms
- Morphologically Annotated Amharic Text Corpora