2015 | OriginalPaper | Book chapter

A Comparative Evaluation of Statistical Part-of-Speech Taggers for Russian

Authors: Rinat Gareev, Vladimir Ivanov

Published in: Information Retrieval

Publisher: Springer International Publishing

Abstract

Part-of-speech (POS) tagging is an essential step in many text processing applications. Several studies address this task for Russian, but their results are not directly comparable because shared datasets and tools are lacking. We propose a POS tagging evaluation framework for Russian that is built from existing third-party resources available to researchers. We applied the framework to compare several implementations of statistical classifiers: HunPos, the Stanford POS tagger, the OpenNLP implementation of the MaxEnt Markov Model, and our own re-implementation of Tiered Conditional Random Fields. The best tagger trained on a corpus of fewer than one million words achieved an accuracy above 93%. We expect that the evaluation framework will facilitate future studies and improvements in POS tagging for Russian.
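As a rough illustration of the accuracy figure mentioned in the abstract, the sketch below computes per-token tagging accuracy for several taggers against one shared gold standard. It is not the authors' framework: the function name `tagging_accuracy`, the tagger labels, and the toy tag sequences are all hypothetical, and it assumes every tagger was run on identically tokenized sentences.

```python
from typing import Dict, List, Sequence


def tagging_accuracy(
    gold: Sequence[Sequence[str]], predicted: Sequence[Sequence[str]]
) -> float:
    """Return the fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must contain the same number of sentences")
    correct = 0
    total = 0
    for gold_sent, pred_sent in zip(gold, predicted):
        if len(gold_sent) != len(pred_sent):
            raise ValueError("sentence lengths differ; tokenization must match")
        correct += sum(g == p for g, p in zip(gold_sent, pred_sent))
        total += len(gold_sent)
    if total == 0:
        raise ValueError("no tokens to evaluate")
    return correct / total


# Toy gold standard and toy tagger outputs (hypothetical data, not from the paper).
gold_tags: List[List[str]] = [["NOUN", "VERB"], ["ADJ", "NOUN", "VERB"]]
tagger_outputs: Dict[str, List[List[str]]] = {
    "tagger_a": [["NOUN", "VERB"], ["ADJ", "NOUN", "NOUN"]],
    "tagger_b": [["NOUN", "NOUN"], ["ADV", "NOUN", "NOUN"]],
}

# Score every tagger on the same gold data so the numbers are directly comparable.
scores = {name: tagging_accuracy(gold_tags, pred) for name, pred in tagger_outputs.items()}
for name, score in sorted(scores.items()):
    print(f"{name}: {score:.2%}")
```

Scoring all taggers against the same gold sentences is what makes the numbers comparable; results reported on different corpora or tagsets cannot be ranked this way.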

Footnotes
4
i.e., parts of speech: noun, verb, adjective, etc.

5
The size is approximate; it was estimated from the number of sentences.

6
The labelling of grammemes can be found at http://opencorpora.org/dict.php.

8
There are also several predefined bidirectional architectures, but we experienced technical issues with them.

9
The special POS tag for name initials in the RNC.

Metadata
Title
A Comparative Evaluation of Statistical Part-of-Speech Taggers for Russian
Authors
Rinat Gareev
Vladimir Ivanov
Copyright year
2015
DOI
https://doi.org/10.1007/978-3-319-25485-2_8