Cognitive Computation

Published: 7 June 2017

An Efficient Corpus-Based Stemmer

Authors: Jasmeet Singh, Vishal Gupta

Published in: Cognitive Computation | Issue 5/2017

Abstract

Word stemming is a linguistic process in which the various inflected forms of a word are mapped to their base form. It is one of the basic text pre-processing steps in Natural Language Processing and Information Retrieval. Stemming is applied at the pre-processing stage to address vocabulary mismatch or to reduce the size of the vocabulary and, consequently, the dimensionality of the training data for statistical models. In this article, we present a fully unsupervised, corpus-based text stemming method that clusters morphologically related words based on lexical knowledge. The proposed method performs cognitive-inspired computing to discover morphologically related words from the corpus without any human intervention or language-specific knowledge. Its performance is evaluated in inflection removal (approximating lemmas) and Information Retrieval tasks. Retrieval experiments in four languages, using standard Text Retrieval Conference, Cross-Language Evaluation Forum, and Forum for Information Retrieval Evaluation collections, show that the proposed stemming method performs significantly better than no stemming. For the highly inflectional languages Marathi and Hungarian, the improvement in Mean Average Precision is nearly 50% compared to unstemmed words. Moreover, the proposed unsupervised method outperforms strong state-of-the-art language-independent and rule-based stemming methods in all four languages. Beyond Information Retrieval, the proposed method also performs significantly better in the inflection removal experiments. The proposed unsupervised, language-independent stemming method can be used as a multipurpose tool for tasks such as approximating lemmas, improving retrieval performance, or other Natural Language Processing applications.
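To make the general idea of corpus-based stemming by clustering more concrete, here is a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the similarity measure or the clustering procedure, so the prefix-based similarity, the single-link grouping, the 0.6 threshold, and all function names below are assumptions chosen purely for illustration.

```python
from collections import defaultdict


def common_prefix_len(a: str, b: str) -> int:
    """Length of the longest common prefix of two words."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def lexical_similarity(a: str, b: str) -> float:
    """Illustrative similarity: shared prefix length over the longer word's length."""
    return common_prefix_len(a, b) / max(len(a), len(b))


def cluster_words(words, threshold=0.6):
    """Single-link clustering: words end up together if a chain of
    pairwise similarities >= threshold connects them (union-find)."""
    vocab = sorted(set(words))
    parent = {w: w for w in vocab}

    def find(w):
        # Follow parent links to the cluster root, halving the path as we go.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, a in enumerate(vocab):
        for b in vocab[i + 1:]:
            if lexical_similarity(a, b) >= threshold:
                root_a, root_b = find(a), find(b)
                if root_a != root_b:
                    parent[root_b] = root_a

    clusters = defaultdict(list)
    for w in vocab:
        clusters[find(w)].append(w)
    return list(clusters.values())


def build_stem_map(words, threshold=0.6):
    """Map every word to the shortest member of its cluster (ties broken alphabetically)."""
    stem_of = {}
    for members in cluster_words(words, threshold):
        stem = min(members, key=lambda w: (len(w), w))
        for w in members:
            stem_of[w] = stem
    return stem_of


if __name__ == "__main__":
    corpus_vocab = ["connect", "connected", "connecting", "connection",
                    "walk", "walks", "walked"]
    stems = build_stem_map(corpus_vocab)
    print(stems["connecting"], stems["walked"])  # connect walk
```

The all-pairs comparison is quadratic in the vocabulary size, which is acceptable for a toy vocabulary but would need blocking (for example, comparing only words that share a short prefix) on a real corpus.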

Title: An Efficient Corpus-Based Stemmer
Authors: Jasmeet Singh, Vishal Gupta
Publication date: 7 June 2017
Publisher: Springer US
Published in: Cognitive Computation / Issue 5/2017
Print ISSN: 1866-9956
Electronic ISSN: 1866-9964
DOI: https://doi.org/10.1007/s12559-017-9479-z
