Published in: Discover Computing 3/2009

1 June 2009

Classifying Amharic webnews

Written by: Lars Asker, Atelach Alemu Argaw, Björn Gambäck, Samuel Eyassu Asfeha, Lemma Nigussie Habte


Abstract

We present work aimed at compiling an Amharic corpus from the Web and automatically categorizing the texts. Amharic is the second most spoken Semitic language in the world (after Arabic) and is used for countrywide communication in Ethiopia. It is highly inflectional and quite dialectally diversified. We discuss the issues of compiling and annotating a corpus of Amharic news articles from the Web. This corpus was then used in three sets of text classification experiments. Working with a less-researched language highlights a number of practical issues that might otherwise receive less attention or go unnoticed. The purpose of the experiments has not primarily been to develop a cutting-edge text classification system for Amharic, but rather to put the spotlight on some of these issues. The first two sets of experiments investigated the use of Self-Organizing Maps (SOMs) for document classification. Testing on small datasets, we first looked at classifying unseen data into 10 predefined categories of news items, and then at clustering it around query content, taking 16 queries as class labels. The third set of experiments investigated the effect of operations such as stemming and part-of-speech tagging on text classification performance. We compared three representations while constructing classification models based on bagging of decision trees for the 10 predefined news categories. The best accuracy was achieved using the full text as representation. A representation using only the nouns performed almost equally well, confirming the assumption that most of the information required for distinguishing between various categories is actually contained in the nouns, while stemming did not have much effect on the performance of the classifier.


Footnotes
1
An international standard for Amharic was agreed on as late as 1998, following Amendment 10 to ISO-10646-1. It was incorporated into Unicode in 2000: www.unicode.org/charts/PDF/U1200.pdf.
 
2
The number of languages in a country is as much a political as a linguistic issue. The reported number of languages of Ethiopia thus varies from 70 to 420, depending on the source; however, the 1994 Ethiopian census listed 77 distinct, living languages plus a category for “other languages” (Hudson 1999), while the Ethnologue (Gordon Jr 2005) claims 82 (plus 4 extinct) and Hudson (2006) only gives 75 (including 4 extinct).
 
3
Together with the Semitic languages, the Cushitic languages make up two of the branches of the Afro-Asiatic language family; the other branches are Berber, Chadic, Egyptian, and Omotic (Gordon Jr 2005).
 
4
There should be a census every 10 years, according to the Ethiopian constitution. However, the census of 2004 was delayed due to political unrest, and initiated only in 2007. No results have been published so far.
 
5
The number of speakers of a language is also an issue influenced by political and economic interests. Thus the makers of the ‘Wazéma2001’ software for Ethiopic character encoding (www.gzamargna.net) state that there are some 90 million speakers of Amharic (Negga 2008). However, it is generally accepted that Amharic is the second largest Semitic language, since the differences in size are on the order of a magnitude: the first-language speakers of Arabic number well over 200 million, while those of Hebrew and Tigrinya are on the order of 5 million, and those of Gurage (a group of Ethiopian languages) about 2 million, with other Semitic languages counting their speakers in thousands (see, e.g., Gordon Jr 2005).
 
6
‘fidel’ (lit. ‘alphabet’ in Amharic) refers both to the characters as such and the entire script. The script is also known as ‘Ethiopic’. This is a bit misleading since it (or variants of it) is (or has been) used by several languages in the Horn of Africa region, including Amharic, Tigrinya, Gurage (Semitic); Sidamo and Blin (Cushitic); and Wolaytta (Omotic), even though Eritrea, following its independence in 1993, has adopted a policy that all non-Semitic languages should use Roman-based alphabets.
 
7
SOV (Subject–Object–Verb) refers to the basic word-order of the language. In contrast, most Western-European languages have an SVO word-order.
 
8
Stemming is then normally used as “a poor man’s version” of full-scale morphological analysis, mainly aiming at stripping off prefixes and suffixes, while leaving the root forms unchanged. For morphology-poor languages such as English, this basically amounts to the same thing as morphological analysis, while for most other natural languages procedures altering the roots (and infixes), splitting compounds, handling derivational processes, etc., are needed in order to perform a complete morphological analysis.
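As a rough illustration of the affix stripping described in this footnote, the following minimal sketch removes at most one prefix and one suffix and never touches the root. The affix lists, the `min_stem` threshold, and the function name `strip_affixes` are hypothetical and use English forms purely for readability; this is not the Amharic stemmer used in the paper.

```python
# Toy affix-stripping stemmer: a sketch only, with made-up English affix lists.
PREFIXES = ("un", "re")          # hypothetical prefix list
SUFFIXES = ("ing", "ed", "s")    # hypothetical suffix list


def strip_affixes(word, min_stem=3):
    """Strip at most one prefix and one suffix, keeping at least min_stem characters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            word = word[:-len(suffix)]
            break
    return word


print(strip_affixes("replayed"))   # play
print(strip_affixes("unlocking"))  # lock
print(strip_affixes("sang"))       # sang (root-internal change is not handled)
```

The last call shows the limitation the footnote points to: a form whose root itself changes is left as it is, which is why languages with root-altering morphology need more than affix stripping.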
 
9
Offline Explorer Pro 3.5 from MetaProducts Corporation: www.metaproducts.com.
 
10
The Ethiopian calendar runs approximately 7 years and 8 months behind the Gregorian calendar, so the data came from the Ethiopian years 1993–1997.
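The offset can be made concrete with a small sketch that maps a Gregorian date to an Ethiopian year. It assumes the usual rule that the Ethiopian new year falls on 11 September, or on 12 September when the following Gregorian year is a leap year, which holds for Gregorian years 1901 to 2099; the function name `ethiopian_year` and the sample dates are illustrative and are not taken from the paper.

```python
from datetime import date


def ethiopian_year(d: date) -> int:
    """Ethiopian year for a Gregorian date (sketch, valid for 1901-2099)."""
    # Assumed rule: new year on 11 Sep, or 12 Sep before a Gregorian leap year.
    new_year_day = 12 if (d.year + 1) % 4 == 0 else 11
    if (d.month, d.day) >= (9, new_year_day):
        return d.year - 7   # from the Ethiopian new year to 31 December
    return d.year - 8       # from 1 January to the day before the new year


print(ethiopian_year(date(2001, 1, 15)))  # 1993
print(ethiopian_year(date(2004, 12, 1)))  # 1997
```

The difference is thus 7 years for part of the Gregorian year and 8 years for the rest, which is where the "approximately 7 years and 8 months" comes from.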
 
11
Emsa HTML Tag Remover v1.0 Build 20.
 
12
The transliteration was done using a file conversion utility called g2 available in the LibEth package (LibEth is a library for Ethiopic text processing written in ANSI C; www.libeth.sourceforge.net).
 
13
A tagged version of the walta_1065 corpus is available online at http://nlp.amharic.org.
 
16
An inherent problem with ANN-based methods is that the results produced are not human-transparent, i.e., that it is not necessarily easy for a human to understand why the network classified its input as part of a specific output class. In contrast, decision tree-based methods (Sect. 5.3) are inherently human-transparent.
 
17
“Compared” should not be taken in a strict sense when it comes to the performance figures, since the results on English discussed in this section are not straightforwardly compatible with each other: only the figures inside one particular paper are comparable, while figures from one author to another seldom are. The differences pertain to which data (and which subset of that data) has been used, how many categories were classified, which metrics for the evaluation and which metrics for counting averages were used, as well as whether the results apply to binary classification only or directly to multi-class classification. We have aimed to even out some of those differences in the present discussion, though.
 
19
Macro-averages give equal weight to all classes, while micro-averages compute the averages over all documents and thus give higher weight to the more common classes. Loosely speaking, the micro-average figures on the Reuters corpus tend to be some 20–25% higher than the macro-averages.
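The distinction can be illustrated with a small sketch on made-up data. It uses recall as the metric and assumes single-label classification; the labels, the counts, and the function name `macro_micro_recall` are hypothetical and do not come from the experiments in the paper.

```python
from collections import Counter


def macro_micro_recall(gold, predicted):
    """Return (macro, micro) averaged recall for single-label classification."""
    total = Counter(gold)
    correct = Counter(g for g, p in zip(gold, predicted) if g == p)
    per_class = {label: correct[label] / total[label] for label in total}
    macro = sum(per_class.values()) / len(per_class)  # every class counts equally
    micro = sum(correct.values()) / len(gold)         # every document counts equally
    return macro, micro


# Hypothetical data: a common class (90 documents, 81 classified correctly)
# and a rare class (10 documents, 5 classified correctly).
gold = ["politics"] * 90 + ["sports"] * 10
predicted = ["politics"] * 81 + ["sports"] * 9 + ["sports"] * 5 + ["politics"] * 5

macro, micro = macro_micro_recall(gold, predicted)
print(f"macro={macro:.2f} micro={micro:.2f}")  # macro=0.70 micro=0.86
```

Because the common class is also the better-classified one in this toy setting, the micro-average ends up above the macro-average, the same direction as the gap described for the Reuters corpus.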
 
References
Alemayehu, N., & Willett, P. (2002). Stemming of Amharic words for information retrieval. Literary and Linguistic Computing, 17(1), 1–17.
Alemayehu, N., & Willett, P. (2003). The effectiveness of stemming for information retrieval in Amharic. Emerald Research Register, 37(4), 254–259.
Argaw, A. A. (2008). Amharic-English information retrieval with pseudo relevance feedback. In C. Peters et al. (Eds.), Advances in Multilingual and Multimodal Information Retrieval: 8th Workshop of the Cross Language Evaluation Forum, CLEF 2007, Budapest, Hungary, September 19–21, Revised Selected Papers (pp. 119–126). Berlin/Heidelberg: Springer.
Argaw, A. A., & Asker, L. (2007a). Amharic-English information retrieval. In C. Peters et al. (Eds.), Evaluation of Multilingual and Multi-modal Information Retrieval: 7th Workshop of the Cross Language Evaluation Forum, CLEF 2006, Alicante, Spain, September 20–22, 2006, Revised Selected Papers (pp. 43–50). Berlin/Heidelberg: Springer.
Argaw, A. A., & Asker, L. (2007b). An Amharic stemmer: Reducing words to their citation forms. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics. Workshop on computational approaches to semitic languages (pp. 104–110). Prague, Czech Republic: ACL.
Argaw, A. A., Asker, L., Cöster, R., & Karlgren, J. (2005). Dictionary-based Amharic–English information retrieval. In C. Peters et al. (Eds.), Multilingual Information Access for Text, Speech and Images: 5th Workshop of the Cross Language Evaluation Forum, CLEF 2004. Bath, UK, September 15–24, 2004, Revised Selected Papers (pp. 143–149). Berlin/Heidelberg: Springer.
Argaw, A. A., Asker, L., Cöster, R., Karlgren, J., & Sahlgren, M. (2006). Dictionary-based Amharic–French information retrieval. In C. Peters et al. (Eds.), Accessing Multilingual Information Repositories: 6th Workshop of the Cross Language Evaluation Forum, CLEF 2005. Vienna, Austria, September 21–23, 2005. Revised Selected Papers (pp. 83–92). Berlin/Heidelberg: Springer.
Argaw, A. A., Asker, L., & Eriksson, G. (2003). An empirical approach to building an Amharic treebank. In Proceedings of the 2nd Workshop on Treebanks and Linguistic Theories (pp. 205–208). Sweden: Växjö University.
Amine, A., Elberrichi, Z., Simonet, M., & Malki, M. (2008). Evaluation and comparison of concept based and n-grams based text clustering using SOM. INFOCOMP Journal of Computer Science, 7(1), 27–35.
Amsalu, S. (2001). The application of information retrieval techniques to Amharic. Master of Science thesis, School of Information Studies for Africa, Addis Ababa University, Addis Ababa, Ethiopia.
Amsalu, S., & Gibbon, D. (2005). Finite state morphology of Amharic. In R. Mitkov (Ed.), Proceedings of the 5th International Conference on Recent Advances in Natural Language Processing, Borovets, Bulgaria (pp. 47–51).
Arampatzis, A. (2001). Adaptive and temporally-dependent document filtering. Doctor of Philosophy thesis, Department of Information Systems Sciences and Information Retrieval, Katholieke Universiteit Nijmegen, Nijmegen, The Netherlands.
Bayou, A. (2000). Design and development of word parser for Amharic language. Master of Science thesis, School of Information Studies for Africa, Addis Ababa University, Addis Ababa, Ethiopia.
Bayu, T. (2002). Automatic morphological analyser: An experiment using unsupervised and autosegmental approach. Master of Science thesis, School of Information Studies for Africa, Addis Ababa University, Addis Ababa, Ethiopia.
Bender, M. L., Head, S. W., & Cowley, R. (1976). The Ethiopian writing system. In M. Bender, J. Bowen, R. Cooper, & C. Ferguson (Eds.), Language in Ethiopia (pp. 120–129). London, England: Oxford University Press.
Berry, M. W., Dumais, S. T., & O’Brien, G. W. (1995). Using linear algebra for intelligent information retrieval. SIAM Review, 37(4), 573–595.
Bloor, T. (1995). The Ethiopic writing system: A profile. Journal of the Simplified Spelling Society, 19(2), 30–36.
Cai, L., & Hofmann, T. (2003). Text categorization by boosting automatically extracted concepts. In Proceedings of the 26th International Conference on Research and Development in Information Retrieval (pp. 182–189). Toronto, Canada: ACM SIGIR.
CIA. (2008). The world factbook – Ethiopia. Washington, DC: The Central Intelligence Agency [Last updated 12 Feb, 2008].
Cowell, J., & Hussain, F. (2003). Amharic character recognition using a fast signature based algorithm. In Proceedings of the 7th International Conference on Image Visualization (pp. 384–389). London, England: IEEE.
Csiszár, I., & Tusnády, G. (1984). Information geometry and alternating minimization procedures. Statistics and Decisions, 1, 205–237.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407.
Demeke, G. A., & Getachew, M. (2006). Manual annotation of Amharic news items with part-of-speech tags and its challenges. ELRC Working Papers, 2(1), 1–17.
Dumais, S. T. (1991). Improving the retrieval of information from external sources. Behavior Research Methods Instruments and Computers, 23(2), 229–236.
Dumais, S. T. (1995). Using LSI for information filtering: TREC-3 experiments. In D. K. Harman (Ed.), Proceedings of the 3rd Text Retrieval Conference (pp. 219–230). Gaithersburg, MD: National Institute of Standards and Technology.
Fissaha, S., & Haller, J. (2003a). Amharic verb lexicon in the context of machine translation. In Proceedings of the 10th Conference on Traitement Automatique des Langues Naturelles, Batz-sur-Mer, France (Vol. 2, pp. 183–192).
Freund, Y., & Schapire, R. E. (1997). A decision-theoretic generalization of on-line learning and application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.
Furzey, J. (1996). Empowering socio-economic development in Africa utilizing information technology. A country study for the United Nations Economic Commission for Africa, African Studies Center, University of Pennsylvania.
Gaustad, T., & Bouma, G. (2002). Accurate stemming of Dutch for text classification. In M. Theune, A. Nijholt, & H. Hondorp (Eds.), Computational Linguistics in the Netherlands 2001: Selected Papers from the Twelfth CLIN Meeting, Rodopi, Amsterdam, The Netherlands (pp. 104–117).
GebreMeskel, T. (2003). Amharic text retrieval: An experiment using latent semantic indexing (LSI) with singular value decomposition (SVD). Master of Science thesis, School of Information Studies for Africa, Addis Ababa University, Addis Ababa, Ethiopia.
Gordon, R. G. Jr. (Ed.). (2005). Ethnologue: Languages of the world (15th ed.). Dallas, TX: SIL International.
Hoi, S. C. H., Jin, R., & Lyu, M. R. (2006). Large-scale text categorization by batch mode active learning. In Proceedings of the 15th International World Wide Web Conference, Edinburgh, Scotland (pp. 633–642).
Honkela, T., Kaski, S., Lagus, K., & Kohonen, T. (1997). WEBSOM – Self-Organizing Maps of document collections. In Proceedings of WSOM’97, Workshop on Self-Organizing Maps, Espoo, Finland (pp. 310–315).
Hudson, G. (1999). Linguistic analysis of the 1994 Ethiopian census. Northeast African Studies, 6(3), 89–107.
Hulth, A. (2004). Combining machine learning and natural language processing for automatic keyword extraction. Doctor of Philosophy thesis, Stockholm University and the Royal Institute of Technology, Department of Computer and Systems Sciences, Stockholm, Sweden.
Karlgren, J., & Sahlgren, M. (2001). From words to understanding. In Y. Uesaka, P. Kanerva, & H. Asoh (Eds.), Foundations of Real World Intelligence (pp. 294–308). Stanford, California: CSLI Publications.
Kaski, S., Honkela, T., Lagus, K., & Kohonen, T. (1996). Creating an order in digital libraries with Self-Organizing Maps. In Proceedings of the World Congress on Neural Networks, San Diego, California (pp. 814–817).
Kohonen, T. (1999). Self-organization and associative memory (3rd ed.). Heidelberg, Germany: Springer.
Kohonen, T. (2001). Self-Organizing Maps (3rd ed.). Berlin, Germany: Springer.
Kohonen, T., Kaski, S., Lagus, K., Salojärvi, J., Honkela, J., Paatero, V., & Saarela, A. (2000). Self organization of a massive document collection. IEEE Transactions on Neural Networks, 11(3), 574–585.
Larkey, L. S. (2002). Improving stemming for Arabic information retrieval: Light stemming and co-occurrence analysis. In Proceedings of the 25th International Conference on Research and Development in Information Retrieval (pp. 275–282). Tampere, Finland: ACM SIGIR.
Li, F., & Yang, Y. (2003). A loss function analysis for classification methods in text categorization. In Proceedings of the 20th International Conference on Machine Learning, Washington, D.C. (pp. 472–479).
Lin, X., Soergel, D., & Marchionini, G. (1991). A self-organizing semantic map for information retrieval. In Proceedings of the 14th International Conference on Research and Development in Information Retrieval (pp. 262–269). Chicago, IL: ACM SIGIR.
Negga, W. (2008). Wazéma System: An Ethiopian computer writing system for Windows NT/2000/XP/Vista Version 2.1. Croydon, England. www.gzamargna.net.
Ng, H. T., Goh, W. B., & Low, K. L. (1997). Feature selection, perceptron learning, and a usability case study for text categorization. In Proceedings of the 20th International Conference on Research and Development in Information Retrieval (pp. 67–73). Philadelphia, PA: ACM SIGIR.
Quinlan, J. R. (1986). Induction of decision trees. Machine Learning, 1(1), 81–106.
Ruiz, M. E., & Srinivasan, P. (1999). Hierarchical neural networks for text categorization. In Proceedings of the 22nd International Conference on Research and Development in Information Retrieval (pp. 281–282). Berkeley, CA: ACM SIGIR.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. In D. Rumelhart & J. McClelland (Eds.), Parallel distributed processing (Vol. 1, pp. 318–362). Cambridge, MA: MIT Press.
Salton, G., & McGill, M. (1983). Introduction to modern information retrieval. New York, NY: McGraw-Hill.
Schütze, H., Hull, D. A., & Pedersen, J. O. (1995). A comparison of classifiers and document representations for the routing problem. In E. A. Fox, P. Ingwersen, & R. Fidel (Eds.), Proceedings of the 18th International Conference on Research and Development in Information Retrieval (pp. 229–237). Seattle, WA: ACM SIGIR.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1–47.
Sintayehu, Z. (2001). Automatic classification of Amharic news items: The case of the Ethiopian News Agency. Master of Science thesis, School of Information Studies for Africa, Addis Ababa University, Addis Ababa, Ethiopia.
Subramanya, A., & Bilmes, J. (2008). Soft-supervised learning for text classification. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing (pp. 1090–1099). Honolulu, Hawaii: ACL.
Syiam, M. M., Fayed, Z. T., & Habib, M. B. (2006). An intelligent system for Arabic text classification. International Journal of Intelligent Computing and Information Sciences, 6(1), 1–19.
Tambouratzis, G., Hairetakis, N., Markantonatou, S., & Carayannis, G. (2003). Applying the SOM model to text classification according to register and stylistic content. International Journal of Neural Systems, 13(1), 1–11.
Xu, J., Fraser, A., & Weischedel, R. (2002). Empirical studies in strategies for Arabic retrieval. In Proceedings of the 25th International Conference on Research and Development in Information Retrieval (pp. 269–274). Tampere, Finland: ACM SIGIR.
Metadata
Title
Classifying Amharic webnews
Written by
Lars Asker
Atelach Alemu Argaw
Björn Gambäck
Samuel Eyassu Asfeha
Lemma Nigussie Habte
Publication date
1 June 2009
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 3/2009
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-008-9080-x
