Published in: Cognitive Computation 5/2017

7 June 2017

An Efficient Corpus-Based Stemmer

Authors: Jasmeet Singh, Vishal Gupta


Abstract

Word stemming is a linguistic process in which the inflected forms of a word are mapped to their base form. It is one of the basic text pre-processing steps in Natural Language Processing and Information Retrieval, where it is used to mitigate vocabulary mismatch or to reduce the vocabulary size, and consequently the dimensionality of the training data for statistical models. In this article, we present a fully unsupervised, corpus-based stemming method that clusters morphologically related words based on lexical knowledge. The proposed method uses cognitive-inspired computing to discover morphologically related words from the corpus without human intervention or language-specific knowledge. Its performance is evaluated on inflection removal (approximating lemmas) and Information Retrieval tasks. Retrieval experiments in four languages using standard Text Retrieval Conference, Cross-Language Evaluation Forum, and Forum for Information Retrieval Evaluation collections show that the proposed method performs significantly better than no stemming. For the highly inflectional languages Marathi and Hungarian, Mean Average Precision improves by nearly 50% over unstemmed words. Moreover, the proposed method outperforms state-of-the-art language-independent and rule-based stemming methods in all four languages. Besides Information Retrieval, it also performs significantly better in the inflection removal experiments. The proposed unsupervised, language-independent stemmer can therefore serve as a multipurpose tool for tasks such as approximating lemmas, improving retrieval performance, or other Natural Language Processing applications.
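
To make the idea of clustering morphologically related words on lexical evidence more concrete, the following is a minimal sketch and not the authors' algorithm, which the abstract does not specify. It assumes a simple shared-prefix similarity, single-link clustering, and the cluster's common prefix as the stem. The function names and the `threshold` and `min_prefix` parameters are hypothetical choices made for illustration only.

```python
from collections import defaultdict
from os.path import commonprefix


def prefix_similarity(a: str, b: str) -> float:
    """Shared-prefix length divided by the longer word's length (assumed measure)."""
    shared = len(commonprefix([a, b]))
    return shared / max(len(a), len(b))


def cluster_words(vocabulary, threshold=0.5, min_prefix=3):
    """Single-link clustering: link two words if their prefix similarity
    reaches `threshold` and they share at least `min_prefix` characters."""
    words = sorted(set(vocabulary))
    parent = {w: w for w in words}

    def find(w):
        # Union-find lookup with path halving.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    for i, a in enumerate(words):
        for b in words[i + 1:]:
            if len(commonprefix([a, b])) < min_prefix:
                continue
            if prefix_similarity(a, b) >= threshold:
                parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for w in words:
        clusters[find(w)].append(w)
    return list(clusters.values())


def build_stem_map(vocabulary, **kwargs):
    """Map every word to the common prefix of its cluster, used here as the stem."""
    stem_of = {}
    for cluster in cluster_words(vocabulary, **kwargs):
        stem = commonprefix(cluster)
        for w in cluster:
            stem_of[w] = stem
    return stem_of
```

Calling `build_stem_map` on a small vocabulary such as `["connect", "connected", "connecting", "walk", "walked", "walking", "table"]` groups the `connect` and `walk` variants into separate clusters and leaves `table` on its own. A purely lexical measure like this one over-merges unrelated words that happen to share a long prefix, which is one reason corpus-based methods bring in additional evidence from the text collection.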

Metadata
Title
An Efficient Corpus-Based Stemmer
Authors
Jasmeet Singh
Vishal Gupta
Publication date
7 June 2017
Publisher
Springer US
Published in
Cognitive Computation / Issue 5/2017
Print ISSN: 1866-9956
Electronic ISSN: 1866-9964
DOI
https://doi.org/10.1007/s12559-017-9479-z
