short-paper

Open Access

Morphologically Annotated Amharic Text Corpora

Authors:
Tilahun Yeshambel

Addis Ababa University, Addis Ababa, Ethiopia

Josiane Mothe

Univ. de Toulouse, Toulouse, France

Yaregal Assabie

Addis Ababa University, Addis Ababa, Ethiopia

SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2021, Pages 2349–2355. https://doi.org/10.1145/3404835.3463237

Published: 11 July 2021

ABSTRACT

In information retrieval (IR), documents that match the query are retrieved. Search engines usually conflate word variants into a common stem when indexing documents, because queries and documents need not use exactly the same word variant for a document to be relevant. Stemmers are known to be effective for IR in many languages. However, stemmers and morphological analyzers are still missing for some languages; this is the case for Amharic, the working language of Ethiopia. Morphological analysis is key to deriving the stems, roots (primary lexical units), and grammatical markers of words, such as person, tense, and negation markers. This paper presents morphologically annotated Amharic lexicons as well as stem-based and root-based morphologically annotated corpora, which the research community can use as benchmark collections to evaluate either morphological analyzers or information retrieval for Amharic. Such resources are believed to foster research in Amharic IR.
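
The conflation step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' method and not an Amharic stemmer: the suffix list, the `toy_stem` function, and the sample documents are all hypothetical, and English is used only to keep the example readable. The sketch shows how indexing by stem lets a query match documents that contain a different variant of the same word.

```python
from collections import defaultdict
from typing import Dict, Set

# Hypothetical toy suffix list for English (assumption for illustration only).
SUFFIXES = ("ing", "ed", "es", "s")


def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each stem to the set of document ids that contain a variant of it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return dict(index)


def search(index: Dict[str, Set[int]], query: str) -> Set[int]:
    """Return documents that share at least one stem with the query."""
    results: Set[int] = set()
    for token in query.split():
        results |= index.get(toy_stem(token), set())
    return results


docs = {
    1: "the analyzer indexed documents",
    2: "indexing a document",
    3: "retrieval systems",
}
index = build_index(docs)
matches = search(index, "index documents")  # documents 1 and 2 share stems with the query
```

A morphologically rich language such as Amharic would need far more than suffix stripping, which is why the abstract points to morphological analysis for deriving stems and roots.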

Index Terms

Information systems → Information retrieval → Document representation → Dictionaries

Published in
SIGIR '21: Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2021, 2998 pages
ISBN: 9781450380379
DOI: 10.1145/3404835
General Chairs: Fernando Diaz (Google), Chirag Shah (University of Washington), Torsten Suel (New York University)
Program Chairs: Pablo Castells (Universidad Autónoma de Madrid, Amazon), Rosie Jones (Spotify), Tetsuya Sakai (Waseda University)
Copyright © 2021 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Amharic
corpus
information retrieval
morphological annotation
under-resourced language
Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%