Published in: International Journal of Speech Technology 3/2016

01-08-2016

Corpus based part-of-speech tagging

Authors: Chengyao Lv, Huihua Liu, Yuanxing Dong, Yunliang Chen


Abstract

In natural language processing, a part-of-speech (POS) tagger is a crucial subsystem in a wide range of applications. It labels (or classifies) unannotated words of natural language with POS labels corresponding to categories such as noun, verb or adjective. Mainstream approaches are generally corpus-based: a POS tagger learns from a corpus of pre-annotated data how to correctly tag unlabeled data. This article presents a brief account of the state of the art in POS tagging. POS tagging approaches use a labeled corpus to train computational models. Several typical models from three kinds of tagging are introduced: rule-based tagging, statistical approaches and evolutionary algorithms. The advantages and pitfalls of each kind of tagging are discussed and analyzed. Some rule-based and stochastic methods have achieved accuracies of 93–96 %, while some evolutionary algorithms reach about 96–97 %.
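To make the corpus-based idea concrete, the sketch below trains a most-frequent-tag baseline on a tiny annotated corpus and applies it to unlabeled words. It is an illustration only, not one of the models surveyed in the article: the toy corpus, the tag names and the function names (`train_unigram_tagger`, `tag`, `accuracy`) are all assumptions introduced here.

```python
from collections import Counter, defaultdict

# Toy pre-annotated corpus (hypothetical data, not taken from the article).
TRAIN = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("small", "ADJ"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("chase", "VERB"), ("cats", "NOUN")],
]


def train_unigram_tagger(sentences):
    """Learn each word's most frequent tag, plus a default tag for unknown words."""
    tag_counts_per_word = defaultdict(Counter)
    overall_tag_counts = Counter()
    for sentence in sentences:
        for word, pos in sentence:
            tag_counts_per_word[word.lower()][pos] += 1
            overall_tag_counts[pos] += 1
    lexicon = {
        word: counts.most_common(1)[0][0]
        for word, counts in tag_counts_per_word.items()
    }
    # Unknown words fall back to the most frequent tag in the training corpus.
    default_tag = overall_tag_counts.most_common(1)[0][0]
    return lexicon, default_tag


def tag(words, lexicon, default_tag):
    """Label each unannotated word with its learned tag, or the default tag."""
    return [(word, lexicon.get(word.lower(), default_tag)) for word in words]


def accuracy(gold_sentences, lexicon, default_tag):
    """Fraction of tokens whose predicted tag matches the gold annotation."""
    correct = total = 0
    for sentence in gold_sentences:
        words = [word for word, _ in sentence]
        predicted = tag(words, lexicon, default_tag)
        for (_, gold_tag), (_, predicted_tag) in zip(sentence, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total if total else 0.0


lexicon, default_tag = train_unigram_tagger(TRAIN)
print(tag(["the", "bird", "sleeps"], lexicon, default_tag))

# Hypothetical held-out sentence used to measure tagging accuracy.
HELD_OUT = [[("the", "DET"), ("small", "ADJ"), ("dog", "NOUN"), ("runs", "VERB")]]
print(accuracy(HELD_OUT, lexicon, default_tag))
```

Accuracy figures such as those quoted in the abstract are measured in this way, as the share of held-out tokens that receive the correct tag. A baseline like this ignores context entirely, which is the gap that rule-based, statistical and evolutionary taggers aim to close.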


Metadata
Title
Corpus based part-of-speech tagging
Authors
Chengyao Lv
Huihua Liu
Yuanxing Dong
Yunliang Chen
Publication date
01-08-2016
Publisher
Springer US
Published in
International Journal of Speech Technology / Issue 3/2016
Print ISSN: 1381-2416
Electronic ISSN: 1572-8110
DOI
https://doi.org/10.1007/s10772-016-9356-2
