
A novel Arabic lemmatization algorithm

Authors:
Eiman Al-Shammari, Kuwait University, Fairfax, VA
Jessica Lin, George Mason University, Fairfax, VA

AND '08: Proceedings of the second workshop on Analytics for noisy unstructured text data, July 2008, Pages 113–118
https://doi.org/10.1145/1390749.1390767

Published: 24 July 2008

ABSTRACT

Tokenization is a fundamental step in processing textual data, preceding the tasks of information retrieval, text mining, and natural language processing. It is language-dependent and includes normalization, stop-word removal, lemmatization, and stemming.
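As a rough illustration of the normalization and stop-word removal stages named above, the following Python sketch strips diacritics and tatweel, unifies alef variants, and filters a tiny stop-word list. The function names, the particular normalization rules, and the four-word stop list are assumptions made for illustration; they are not taken from the authors' system.

```python
import re

# Hypothetical, tiny stop-word list for illustration only
# (the abstract describes a list exceeding 2200 words).
STOP_WORDS = {"في", "من", "على", "الى"}

# Short vowels and other harakat (U+064B..U+0652) plus tatweel (U+0640).
DIACRITICS = re.compile("[\u064B-\u0652\u0640]")
# Alef with madda, hamza above, hamza below: mapped to bare alef (assumed rule).
ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")
# A run of Arabic letters and diacritics counts as one token.
ARABIC_TOKEN = re.compile("[\u0621-\u0652]+")


def normalize(token):
    """Remove diacritics and tatweel, then unify alef variants."""
    token = DIACRITICS.sub("", token)
    return ALEF_VARIANTS.sub("\u0627", token)


def tokenize(text):
    """Split text into Arabic tokens, normalize them, and drop stop words."""
    normalized = [normalize(t) for t in ARABIC_TOKEN.findall(text)]
    return [t for t in normalized if t not in STOP_WORDS]


if __name__ == "__main__":
    print(tokenize("ذهب الولد إلى المدرسة"))
```

Normalizing before the stop-word lookup is a deliberate choice in this sketch: it lets one list entry match several spelling variants of the same word.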

Both stemming and lemmatization aim to reduce a word to its base form. However, lemmatization is more robust than stemming because it typically draws on a vocabulary and morphological analysis rather than simply removing a word's suffix. In this work, we introduce a novel lemmatization algorithm for the Arabic language.
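The contrast between the two approaches can be sketched in Python as follows. This is not the authors' algorithm: the affix lists, the one-entry lexicon, and the function names are hypothetical and chosen only to show why a vocabulary lookup can succeed where affix stripping cannot, for example on an Arabic broken plural whose form changes inside the word.

```python
ARTICLE = "ال"                 # definite article prefix
SUFFIXES = ("ات", "ون", "ين")  # a few regular plural endings (assumed list)

# Hypothetical toy lexicon mapping a surface form to its lemma.
# "كتب" (books) is a broken plural of "كتاب" (book): no suffix to strip.
LEMMA_LEXICON = {"كتب": "كتاب"}


def strip_article(word):
    """Drop the definite article if at least three letters remain."""
    if word.startswith(ARTICLE) and len(word) - len(ARTICLE) >= 3:
        return word[len(ARTICLE):]
    return word


def light_stem(word):
    """Affix stripping only: remove the article and one listed suffix."""
    word = strip_article(word)
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word


def lemmatize(word):
    """Vocabulary lookup first, falling back to affix stripping."""
    bare = strip_article(word)
    return LEMMA_LEXICON.get(bare, light_stem(bare))


if __name__ == "__main__":
    for w in ("المعلمون", "الكتب"):
        print(w, light_stem(w), lemmatize(w))
```

Both functions agree on the regular plural, but only the lexicon-backed version maps the broken plural to its singular. A realistic lemmatizer would also need to resolve ambiguity, since an undiacritized form can correspond to more than one lemma.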

The lemmatizer proposed here is part of a comprehensive Arabic tokenization system with a stop-word list exceeding 2200 Arabic words. Currently, there are two leading Arabic stemmers: the root-based stemmer and the light stemmer. We hypothesize that lemmatization is more effective than stemming for mining Arabic text. We investigate the impact of our new lemmatizer on unsupervised data mining techniques in comparison to the leading Arabic stemmers, and conclude that lemmatization is a better word normalization method than stemming for Arabic text.


Index Terms

A novel Arabic lemmatization algorithm
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information retrieval
  2. Information systems applications
    1. Data mining
      1. Clustering


Published in
AND '08: Proceedings of the second workshop on Analytics for noisy unstructured text data
July 2008
130 pages
ISBN:9781605581965
DOI:10.1145/1390749
Conference Chairs:
Daniel Lopresti, Lehigh University
Shourya Roy, IBM India Research Lab
Klaus Schulz, University of Munich
L. Venkata Subramaniam, India Research Lab
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
Arabic
lemmatization
stemming
text mining
tokenization
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 15 of 22 submissions, 68%