DOI: 10.1145/1390749.1390767
research-article

A novel Arabic lemmatization algorithm

Published: 24 July 2008

ABSTRACT

Tokenization is a fundamental step in processing textual data, preceding the tasks of information retrieval, text mining, and natural language processing. It is language-dependent and includes normalization, stop-word removal, lemmatization, and stemming.
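
The stages named above can be sketched as a small pipeline. This is a minimal illustration, not the system described in this paper: the four-word stop list, the normalization rules (stripping diacritics and tatweel, unifying alef variants), and the function names `normalize` and `tokenize` are all assumptions made for the example. The word-reduction step (stemming or lemmatization) is left as a pluggable function.

```python
import re

# Hypothetical, tiny stop-word list for illustration only; the system in the
# abstract uses a list of more than 2200 Arabic words.
STOP_WORDS = {
    "\u0641\u064A",        # fi ("in")
    "\u0645\u0646",        # min ("from")
    "\u0639\u0644\u0649",  # ala ("on")
    "\u0648",              # wa ("and")
}

# Assumed normalization rules: drop diacritics (U+064B..U+0652) and tatweel (U+0640).
_DIACRITICS = re.compile("[\u064B-\u0652\u0640]")
# Assumed normalization rule: map alef with madda/hamza (U+0622, U+0623, U+0625)
# to bare alef (U+0627).
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")
# A word is a run of Arabic letters (U+0621..U+064A).
_WORD = re.compile("[\u0621-\u064A]+")


def normalize(text):
    """Strip diacritics and tatweel, then unify alef variants."""
    text = _DIACRITICS.sub("", text)
    return _ALEF_VARIANTS.sub("\u0627", text)


def tokenize(text, reduce_word=lambda word: word):
    """Normalize, split into words, drop stop words, then reduce each word.

    `reduce_word` stands in for a stemmer or lemmatizer; the default leaves
    words unchanged.
    """
    words = _WORD.findall(normalize(text))
    return [reduce_word(word) for word in words if word not in STOP_WORDS]
```

Passing a different `reduce_word` swaps the stemmer or lemmatizer without touching the earlier stages, which is the kind of comparison the abstract describes.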

Stemming and lemmatization share the goal of reducing a word to its base form. However, lemmatization is more robust than stemming because it typically draws on a vocabulary and morphological analysis rather than simply removing the word's suffix. In this work, we introduce a novel lemmatization algorithm for the Arabic language.
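
The contrast can be made concrete with a toy example. The sketch below uses English words for readability and is not the Arabic algorithm introduced in this paper: the suffix list, the lookup table, and the names `naive_stem` and `toy_lemmatize` are hypothetical. It shows only the general idea that a suffix stripper works on surface form alone, while a lemmatizer can consult a vocabulary.

```python
# Hypothetical suffix list for a naive suffix-stripping stemmer.
SUFFIXES = ("ing", "es", "ed", "s")

# Hypothetical vocabulary mapping inflected forms to dictionary forms.
LEMMA_TABLE = {
    "went": "go",
    "mice": "mouse",
    "studies": "study",
    "running": "run",
}


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def toy_lemmatize(word):
    """Look the word up in the vocabulary; fall back to suffix stripping."""
    return LEMMA_TABLE.get(word, naive_stem(word))


for word in ("studies", "running", "went"):
    print(word, naive_stem(word), toy_lemmatize(word))
```

The stemmer produces "studi" and "runn" and leaves "went" unchanged, whereas the vocabulary lookup returns "study", "run", and "go".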

The lemmatizer proposed here is part of a comprehensive Arabic tokenization system with a stop-word list exceeding 2200 Arabic words. Currently, there are two leading Arabic stemmers: the root-based stemmer and the light stemmer. We hypothesize that lemmatization is more effective than stemming in mining Arabic text. We investigate the impact of our new lemmatizer on unsupervised data mining techniques in comparison with the leading Arabic stemmers, and we conclude that lemmatization is a better word normalization method than stemming for Arabic text.

Published in

AND '08: Proceedings of the second workshop on Analytics for noisy unstructured text data
July 2008
130 pages
ISBN: 9781605581965
DOI: 10.1145/1390749

Copyright © 2008 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Qualifiers: research-article

Overall Acceptance Rate: 15 of 22 submissions, 68%
