
Discover Computing

Published: 1 June 2009

Classifying Amharic webnews

Authors: Lars Asker, Atelach Alemu Argaw, Björn Gambäck, Samuel Eyassu Asfeha, Lemma Nigussie Habte

Published in: Discover Computing | Issue 3/2009


Abstract

We present work aimed at compiling an Amharic corpus from the Web and automatically categorizing the texts. Amharic is the second most spoken Semitic language in the world (after Arabic) and is used for countrywide communication in Ethiopia. It is highly inflectional and quite dialectally diversified. We discuss the issues of compiling and annotating a corpus of Amharic news articles from the Web. This corpus was then used in three sets of text classification experiments. Working with a less-researched language highlights a number of practical issues that might otherwise receive less attention or go unnoticed. The purpose of the experiments has not primarily been to develop a cutting-edge text classification system for Amharic, but rather to put the spotlight on some of these issues. The first two sets of experiments investigated the use of Self-Organizing Maps (SOMs) for document classification. Testing on small datasets, we first looked at classifying unseen data into 10 predefined categories of news items, and then at clustering it around query content, taking 16 queries as class labels. The third set of experiments investigated the effect of operations such as stemming and part-of-speech tagging on text classification performance. We compared three representations while constructing classification models based on bagging of decision trees for the 10 predefined news categories. The best accuracy was achieved using the full text as representation. A representation using only the nouns performed almost equally well, confirming the assumption that most of the information required for distinguishing between various categories actually is contained in the nouns, while stemming did not have much effect on the performance of the classifier.
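The three document representations compared in the decision-tree experiments can be sketched roughly as follows. This is only an illustration under stated assumptions: the tag name "N", the toy English tokens, and the `toy_stem` helper are invented here, and the abstract does not specify the tagset, stemmer, or feature extraction actually used.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

TaggedDoc = List[Tuple[str, str]]  # (token, part-of-speech tag)

def toy_stem(token: str) -> str:
    """Placeholder stemmer (assumption): drops a plural 's'. Not an Amharic stemmer."""
    if token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def representations(doc: TaggedDoc,
                    stem: Callable[[str], str] = toy_stem) -> Dict[str, Counter]:
    """Build bag-of-words counts for three alternative representations."""
    return {
        "full_text": Counter(tok for tok, _ in doc),
        "stemmed": Counter(stem(tok) for tok, _ in doc),
        # "N" is a hypothetical noun tag used only in this example.
        "nouns_only": Counter(tok for tok, tag in doc if tag == "N"),
    }

doc = [("ministers", "N"), ("discuss", "V"), ("new", "ADJ"),
       ("roads", "N"), ("minister", "N")]
for name, bag in representations(doc).items():
    print(name, dict(bag))
```

Each bag of counts could then be turned into a feature vector for whatever classifier is being evaluated; the nouns-only bag is simply the full-text bag restricted to tokens carrying the noun tag.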


An international standard for Amharic was agreed on as late as 1998, following Amendment 10 to ISO-10646-1. It was incorporated into Unicode in 2000: www.unicode.org/charts/PDF/U1200.pdf.

The number of languages in a country is as much a political as a linguistic issue. Estimates of the number of languages of Ethiopia thus range from 70 to 420, depending on the source: the 1994 Ethiopian census listed 77 distinct, living languages plus a category for “other languages” (Hudson 1999), the Ethnologue (Gordon Jr 2005) claims 82 (plus 4 extinct), and Hudson (2006) gives only 75 (including 4 extinct).

Together with the Semitic languages, the Cushitic languages make up two of the branches of the Afro-Asiatic language family; the other branches are Berber, Chadic, Egyptian, and Omotic (Gordon Jr 2005).

There should be a census every 10 years, according to the Ethiopian constitution. However, the census of 2004 was delayed due to political unrest, and initiated only in 2007. No results have been published so far.

The number of speakers of a language is also an issue influenced by political and economic interests. Thus the makers of the ‘Wazéma2001’ software for Ethiopic character encoding (www.gzamargna.net) state that there are some 90 million speakers of Amharic (Negga 2008). However, it is generally accepted that Amharic is the second largest Semitic language, since the differences in size are on the order of a magnitude: first-language speakers of Arabic number well over 200 million, while those of Hebrew and Tigrinya are in the order of 5 million and those of Gurage (a group of Ethiopian languages) about 2 million, with other Semitic languages counting their speakers in thousands (see, e.g., Gordon Jr 2005).

‘fidel’ (lit. ‘alphabet’ in Amharic) refers both to the characters as such and the entire script. The script is also known as ‘Ethiopic’. This is a bit misleading since it (or variants of it) is (or has been) used by several languages in the Horn of Africa region, including Amharic, Tigrinya, Gurage (Semitic); Sidamo and Blin (Cushitic); and Wolaytta (Omotic)—even though Eritrea, following its independence in 1993, has adopted a policy that all non-Semitic languages should use Roman-based alphabets.

SOV (Subject–Object–Verb) refers to the basic word-order of the language. In contrast, most Western-European languages have an SVO word-order.

Stemming is then normally used as “a poor man’s version” of full-scale morphological analysis, mainly aiming at stripping off prefixes and suffixes while leaving the root forms unchanged. For morphologically poor languages such as English, this basically amounts to the same thing as morphological analysis, while for most other natural languages, procedures for altering the roots (and infixes), splitting compounds, handling derivational processes, etc., are needed in order to perform a complete morphological analysis.
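As a rough illustration of affix stripping, the sketch below removes at most one prefix and one suffix from an English word. The affix lists and the minimum stem length are invented for the example; this is not the Amharic stemmer used in the experiments.

```python
PREFIXES = ("un", "re")          # hypothetical, illustrative only
SUFFIXES = ("ing", "ed", "s")    # hypothetical; checked in this order
MIN_STEM = 3                     # never strip below this many characters

def strip_affixes(word: str) -> str:
    """Strip at most one prefix and one suffix; the root itself is never altered."""
    stem = word.lower()
    for prefix in PREFIXES:
        if stem.startswith(prefix) and len(stem) - len(prefix) >= MIN_STEM:
            stem = stem[len(prefix):]
            break
    for suffix in SUFFIXES:
        if stem.endswith(suffix) and len(stem) - len(suffix) >= MIN_STEM:
            stem = stem[:-len(suffix)]
            break
    return stem

for w in ["walked", "replaying", "books", "sang"]:
    print(w, "->", strip_affixes(w))
```

Because the root is left untouched, an irregular form such as "sang" is not related to "sing"; handling such root changes is what separates full morphological analysis from stemming.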

Offline Explorer Pro 3.5 from MetaProducts Corporation: www.metaproducts.com.

The Ethiopian calendar runs approximately 7 years and 8 months behind the Gregorian calendar, so the data came from the Ethiopian years 1993–1997.
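The year offset can be made concrete with a short sketch. It assumes, as holds for Gregorian years 1900 to 2099, that the Ethiopian new year falls on 11 September, or on 12 September in the year before a Gregorian leap year. The function name is hypothetical, and only the year is converted, not the month or day.

```python
import calendar
from datetime import date

def ethiopian_year(d: date) -> int:
    """Return the Ethiopian calendar year containing Gregorian date d.

    Assumption: the new-year rule below is valid for Gregorian years 1900-2099 only.
    """
    # New year is 12 September when the following Gregorian year is a leap year.
    new_year_day = 12 if calendar.isleap(d.year + 1) else 11
    if d >= date(d.year, 9, new_year_day):
        return d.year - 7
    return d.year - 8

print(ethiopian_year(date(2001, 1, 15)))   # 1993
print(ethiopian_year(date(2004, 10, 1)))   # 1997
```

The offset is 8 years before the Ethiopian new year and 7 years after it, which is where the figure of roughly 7 years and 8 months comes from.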

Emsa HTML Tag Remover v1.0 Build 20.

The transliteration was done using a file conversion utility called g2, available in the LibEth package (LibEth is a library for Ethiopic text processing written in ANSI C; www.libeth.sourceforge.net).

A tagged version of the walta_1065 corpus is available online at http://nlp.amharic.org.

www.mathworks.com.

www.compumine.com.

An inherent problem with ANN-based methods is that the results produced are not human-transparent, i.e., that it is not necessarily easy for a human to understand why the network classified its input as part of a specific output class. In contrast, decision tree-based methods (Sect. 5.3) are inherently human-transparent.

“Compared” should not be taken in a strict sense when it comes to the performance figures, since the results on English discussed in this section are not straightforwardly compatible with each other: only the figures within one particular paper are comparable, while figures from different authors seldom are. The differences pertain to which data (and which subset of that data) was used, how many categories were classified, which evaluation metrics and which averaging methods were used, and whether the results apply to binary classification only or directly to multi-class classification. We have aimed to even out some of those differences in the present discussion, though.

Available at www.daviddlewis.com/resources.

Macro-averages give equal weight to all classes, while micro-averages are computed over all documents and thus give higher weight to the more common classes. Loosely speaking, the micro-average figures on the Reuters corpus tend to be some 20–25% higher than the macro-averages.
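The difference between the two averaging schemes can be illustrated with a small sketch using per-class recall. The class names and counts below are invented for illustration and do not come from the Reuters corpus or from the experiments; the function names are likewise hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical per-class counts: (correctly classified, total documents).
counts: Dict[str, Tuple[int, int]] = {
    "politics": (90, 100),  # common class, classified well
    "sports": (8, 10),
    "culture": (1, 5),      # rare class, classified poorly
}

def macro_average(counts: Dict[str, Tuple[int, int]]) -> float:
    """Mean of the per-class scores: every class weighs the same."""
    per_class = [correct / total for correct, total in counts.values()]
    return sum(per_class) / len(per_class)

def micro_average(counts: Dict[str, Tuple[int, int]]) -> float:
    """Score pooled over all documents: common classes dominate."""
    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return correct / total

print(f"macro: {macro_average(counts):.3f}")
print(f"micro: {micro_average(counts):.3f}")
```

With these invented counts the poorly classified rare class pulls the macro-average down, while the micro-average stays close to the score of the dominant class.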

Alemayehu, N., & Willett, P. (2002). Stemming of Amharic words for information retrieval. Literary and Linguistic Computing, 17(1), 1–17.

Alemayehu, N., & Willett, P. (2003). The effectiveness of stemming for information retrieval in Amharic. Emerald Research Register, 37(4), 254–259.

Argaw, A. A. (2008). Amharic-English information retrieval with pseudo relevance feedback. In C. Peters et al., (Eds.), Advances in Multilingual and Multimodal Information Retrieval: 8th Workshop of the Cross Language Evaluation Forum, CLEF 2007, Budapest, Hungary, September 19–21, Revised Selected Papers (pp. 119–126). Berlin/Heidelberg: Springer.

Argaw, A. A., & Asker, L. (2007a). Amharic-English information retrieval. In C. Peters et al., (Eds.), Evaluation of Multilingual and Multi-modal Information Retrieval: 7th Workshop of the Cross Language Evaluation Forum, CLEF 2006, Alicante, Spain, September 20–22, 2006, Revised Selected Papers (pp. 43–50). Berlin/Heidelberg: Springer.

Argaw, A. A., & Asker, L. (2007b). An Amharic stemmer: Reducing words to their citation forms. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics. Workshop on computational approaches to semitic languages (pp. 104–110). Prague, Czech Republic: ACL.

Argaw, A. A., Asker, L., Cöster, R., & Karlgren, J. (2005). Dictionary-based Amharic–English information retrieval. In C. Peters et al., (Eds.), Multilingual Information Access for Text, Speech and Images: 5th Workshop of the Cross Language Evaluation Forum, CLEF 2004. Bath, UK, September 15–24, 2004, Revised Selected Papers (pp. 143–149). Berlin/Heidelberg: Springer.

Argaw, A. A., Asker, L., Cöster, R., Karlgren, J., & Sahlgren, M. (2006). Dictionary-based Amharic–French information retrieval. In C. Peters et al., (Eds.), Accessing Multilingual Information Repositories: 6th Workshop of the Cross Language Evaluation Forum, CLEF 2005. Vienna, Austria, September 21–23, 2005. Revised Selected Papers (pp. 83–92). Berlin/Heidelberg: Springer.

Argaw, A. A., Asker, L., & Eriksson, G. (2003). An empirical approach to building an Amharic treebank. In Proceedings of the 2nd Workshop on Treebanks and Linguistic Theories (pp. 205–208). Sweden: Växjö University.

Amine, A., Elberrichi, Z., Simonet, M., & Malki, M. (2008). Evaluation and comparison of concept based and n-grams based text clustering using SOM. INFOCOMP Journal of Computer Science, 7(1), 27–35.

Amsalu, S. (2001). The application of information retrieval techniques to Amharic. Master of Science Thesis, School of Information Studies for Africa, Addis Ababa University, Addis Ababa, Ethiopia.

Amsalu, S., & Gibbon, D. (2005). Finite state morphology of Amharic. In R. Mitkov (Ed.), Proceedings of the 5th International Conference on Recent Advances in Natural Language Processing, Borovets, Bulgaria (pp. 47–51).

Arampatzis, A. (2001). Adaptive and temporally-dependent document filtering. Doctor of Philosophy thesis, Department of Information Systems Sciences and Information Retrieval, Katholieke Universiteit Nijmegen, Nijmegen, The Netherlands.

Bayou, A. (2000). Design and development of word parser for Amharic language. Master of Science thesis, School of Information Studies for Africa, Addis Ababa University, Addis Ababa, Ethiopia.

Bayu, T. (2002). Automatic morphological analyser: An experiment using unsupervised and autosegmental approach. Master of Science thesis, School of Information Studies for Africa, Addis Ababa University, Addis Ababa, Ethiopia.

Bender, M. L., Head, S. W., & Cowley, R. (1976). The Ethiopian writing system. In M. Bender, J. Bowen, R. Cooper, & C. Ferguson (Eds.), Language in Ethiopia (pp. 120–129). London, England: Oxford University Press.

Berry, M. W., Dumais, S. T., & O’Brien, G. W. (1995). Using linear algebra for intelligent information retrieval. SIAM Review, 37(4), 573–595.

Bloor, T. (1995). The Ethiopic writing system: A profile. Journal of the Simplified Spelling Society, 19(2), 30–36.

Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123–140.

Cai, L., & Hofmann, T. (2003). Text categorization by boosting automatically extracted concepts. In Proceedings of the 26th International Conference on Research and Development in Information Retrieval (pp. 182–189). Toronto, Canada: ACM SIGIR.

CIA. (2008). The world factbook—Ethiopia. Washington, DC: The Central Intelligence Agency [Last updated 12 Feb, 2008].

Cowell, J., & Hussain, F. (2003). Amharic character recognition using a fast signature based algorithm. In Proceedings of the 7th International Conference on Image Visualization (pp. 384–389). London, England: IEEE.

Csiszár, I., & Tusnády, G. (1984). Information geometry and alternating minimization procedures. Statistics and Decisions, 1, 205–237.

Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.

Demeke, G. A., & Getachew, M. (2006). Manual annotation of Amharic news items with part-of-speech tags and its challenges. ELRC Working Papers, 2(1), 1–17.

Dumais, S. T. (1991). Improving the retrieval of information from external sources. Behavior Research Methods Instruments and Computers, 23(2), 229–236.

Dumais, S. T. (1995). Using LSI for information filtering: TREC-3 experiments. In D. K. Harman (Ed.), Proceedings of the 3rd Text Retrieval Conference (pp. 219–230). Gaithersburg, MD: National Institute of Standards and Technology.

Firdyiwek, Y., & Yacob, D. (1993). The Ethiopian script in ASCII. Journal of EthioSciences, 3(1). http://www.abyssiniacybergateway.net/fidel/sera.ps [Last updated 1 Jan 1997].

Fissaha, S., & Haller, J. (2003a). Amharic verb lexicon in the context of machine translation. In Proceedings of the 10th Conference on Traitement Automatique des Langues Naturelles, Batz-sur-Mer, France (Vol. 2, pp. 183–192).

Fissaha, S., & Haller, J. (2003b). Application of corpus-based techniques to Amharic texts. In Proceedings of the 9th Machine Translation Summit, New Orleans, Louisiana. Workshop on Machine Translation for Semitic Languages: Issues and Approaches. http://www.amtaweb.org/summit/WS2/Fissaya+Haller_paper.pdf.

Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.

Furzey, J. (1996). Enpowering socio-economic development in Africa utilizing information technology. A country study for the United Nations Economic Commission for Africa, African Studies Center, University of Pennsylvania.

Gaustad, T., & Bouma, G. (2002). Accurate stemming of Dutch for text classification. In M. Theune, A. Nijholt, & H. Hondorp (Eds.), Computational Linguistics in the Netherlands 2001: Selected Papers from the Twelfth CLIN Meeting, Rodopi, Amsterdam, The Netherlands (pp. 104–117).

GebreMeskel, T. (2003). Amharic text retrieval: An experiment using latent semantic indexing (LSI) with singular value decomposition (SVD). Master of Science thesis, School of Information Studies for Africa, Addis Ababa University, Addis Ababa, Ethiopia.

Gordon, R. G. Jr. (Ed.). (2005). Ethnologue: languages of the world (15th ed.). Dallas, TX: SIL International.

Hoi, S. C. H., Jin, R., & Lyu, M. R. (2006). Large-scale text categorization by batch mode active learning. In Proceedings of the 15th International World Wide Web Conference, Edinburgh, Scotland (pp. 633–642).

Honkela, T., Kaski, S., Lagus, K., & Kohonen, T. (1997). WEBSOM—Self-Organizing Maps of document collections. In Proceedings of WSOM’97, Workshop on Self-Organizing Maps, Espoo, Finland (pp. 310–315).

Hudson, G. (1999). Linguistic analysis of the 1994 Ethiopian census. Northeast African Studies, 6(3), 89–107.

Hudson, G. (2006). 75 Ethiopian languages: 19 Cushitic, 20 Nilosaharan, 23 Omotic, 12 Semitic, and 1 unclassified. http://www.msu.edu/hudson/Ethlgslist.htm [Last updated 29 Dec, 2006].

Hulth, A. (2004). Combining machine learning and natural language processing for automatic keyword extraction. Doctor of Philosophy thesis, Stockholm University and the Royal Institute of Technology, Department of Computer and Systems Sciences, Stockholm, Sweden.

Karlgren, J., & Sahlgren, M. (2001). From words to understanding. In Y. Uesaka, P. Kanerva, & H. Asoh (Eds.), Foundations of Real World Intelligence (pp. 294–308). Stanford, California: CSLI Publications.

Kaski, S., Honkela, T., Lagus, K., & Kohonen, T. (1996). Creating an order in digital libraries with Self-Organizing Maps. In Proceedings of the World Congress on Neural Networks, San Diego, California (pp. 814–817).

Kohonen, T. (1999). Self-organization and associative memory (3rd ed.). Heidelberg, Germany: Springer.

Kohonen, T. (2001). Self-Organizing Maps (3rd ed.). Berlin, Germany: Springer.

Kohonen, T., Kaski, S., Lagus, K., Salojärvi, J., Honkela, J., Paatero, V., & Saarela, A. (2000). Self organization of a massive document collection. IEEE Transactions on Neural Networks, 11(3), 574–585.

Larkey, L. S. (2002). Improving stemming for Arabic information retrieval: Light stemming and co-occurrence analysis. In Proceedings of the 25th International Conference on Research and Development in Information Retrieval (pp. 275–282). Tampere, Finland: ACM SIGIR.

Li, F., & Yang, Y. (2003). A loss function analysis for classification methods in text categorization. In Proceedings of the 20th International Conference on Machine Learning, Washington, D.C. (pp. 472–479).

Lin, X., Soergel, D., & Marchionini, G. (1991). A self-organizing semantic map for information retrieval. In Proceedings of the 14th International Conference on Research and Development in Information Retrieval (pp. 262–269). Chicago, IL: ACM SIGIR.

Miniwatts Marketing Group. (2008). Internet world users by language. http://www.internetworldstats.com/languages.htm [Last updated 30 Jun, 2008].

Negga, W. (2008). Wazéma System: an Ethiopian computer writing system for Windows NT/2000/XP/Vista Version 2.1. Croydon, England. www.gzamargna.net.

Ng, H. T., Goh, W. B., & Low, K. L. (1997). Feature selection, perceptron learning, and a usability case study for text categorization. In Proceedings of the 20th International Conference on Research and Development in Information Retrieval (pp. 67–73). Philadelphia, PA: ACM SIGIR.

Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.

Ruiz, M. E., & Srinivasan, P. (1999). Hierarchical neural networks for text categorization. In Proceedings of the 22nd International Conference on Research and Development in Information Retrieval (pp. 281–282). Berkeley, CA: ACM SIGIR.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. Rumelhart, & J. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press.

Salton, G., & McGill, M. (1983). Introduction to modern information retrieval. New York, NY: McGraw-Hill.

Schütze, H., Hull, D. A., & Pedersen, J. O. (1995). A comparison of classifiers and document representations for the routing problem. In E. A. Fox, P. Ingwersen, & R. Fidel (Eds.), Proceedings of the 18th International Conference on Research and Development in Information Retrieval (pp. 229–237). Seattle, WA: ACM SIGIR.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.

Sintayehu, Z. (2001). Automatic classification of Amharic news items: The case of the Ethiopian News Agency. Master of Science thesis, School of Information Studies for Africa, Addis Ababa University, Addis Ababa, Ethiopia.

Subramanya, A., & Bilmes, J. (2008). Soft-supervised learning for text classification. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (pp. 1090–1099). Honolulu, Hawaii: ACL.

Syiam, M. M., Fayed, Z. T., & Habib, M. B. (2006). An intelligent system for Arabic text classification. International Journal of Intelligent Computing and Information Sciences, 6(1), 1–19.

Tambouratzis, G., Hairetakis, N., Markantonatou, S., & Carayannis, G. (2003). Applying the SOM model to text classification according to register and stylistic content. International Journal of Neural Systems, 13(1), 1–11.

Xu, J., Fraser, A., & Weischedel, R. (2002). Empirical studies in strategies for Arabic retrieval. In Proceedings of the 25th International Conference on Research and Development in Information Retrieval (pp. 269–274). Tampere, Finland: ACM SIGIR.

Yacob, D. (1997). The system for Ethiopic representation in ASCII—1997 standard. http://www.abyssiniacybergateway.net/fidel/sera-97.html.

Yacob, D. (2005). Developments towards an electronic Amharic corpus. In Proceedings of the 12th Conference on Traitement Automatique des Langues Naturelles, Dourdan, France. Workshop on Under-Resourced Languages. http://yacob.org/papers/DanielYacob-TALN2005.pdf.

Title: Classifying Amharic webnews
Authors: Lars Asker, Atelach Alemu Argaw, Björn Gambäck, Samuel Eyassu Asfeha, Lemma Nigussie Habte
Publication date: 1 June 2009
Publisher: Springer Netherlands
Published in: Discover Computing / Issue 3/2009
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI: https://doi.org/10.1007/s10791-008-9080-x

