Skip to main content
Top

2020 | OriginalPaper | Chapter

High Order N-gram Model Construction and Application Based on Natural Annotation

Authors : Qibo Wang, Gaoqi Rao, Endong Xun

Published in: Chinese Lexical Semantics

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

The language model based on the n-gram grammar plays an important role in NLP tasks. In this paper, language models based on language boundary are proposed to conquer the challenge of the very big language data: intra-sentence boundary model and inter-sentence boundary model. We developed a training tool on the Hadoop platform based on MapReduce programming, and conducted the prefix tree to compress and store the model. We implemented our model in identifying the boundary in the syntactic parsing, achieving a good result.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Li, Z., Sun, M.: Punctuation as implicit annotations for Chinese word segmentation. Comput. Linguist. 35(4), 505–512 (2009)CrossRef Li, Z., Sun, M.: Punctuation as implicit annotations for Chinese word segmentation. Comput. Linguist. 35(4), 505–512 (2009)CrossRef
2.
go back to reference Rao, G., et al.: Natural annotation research in large-scale corpora with a focus on Chinese word segmentation. Acta Sci. Nat. Univ. Pekin. 49(1), 140–146 (2013) Rao, G., et al.: Natural annotation research in large-scale corpora with a focus on Chinese word segmentation. Acta Sci. Nat. Univ. Pekin. 49(1), 140–146 (2013)
3.
go back to reference Rosenfeld, R., Carbonell, J., Rudnicky, A., et al.: Adaptive statistical language modeling: a maximum entropy approach. A maximum entropy approach (1994) Rosenfeld, R., Carbonell, J., Rudnicky, A., et al.: Adaptive statistical language modeling: a maximum entropy approach. A maximum entropy approach (1994)
4.
go back to reference Huang, X., Alleva, F., Hon, H.W., et al.: The SPHINX-II speech recognition system: an overview. Comput. Speech Lang. 7(2), 137–148 (1992)CrossRef Huang, X., Alleva, F., Hon, H.W., et al.: The SPHINX-II speech recognition system: an overview. Comput. Speech Lang. 7(2), 137–148 (1992)CrossRef
5.
go back to reference Ney, H., Essen, U., Kneser, R.: On structuring probabilistic dependences in stochastic language modelling. Comput. Speech Lang. 8(1), 1–38 (1994)CrossRef Ney, H., Essen, U., Kneser, R.: On structuring probabilistic dependences in stochastic language modelling. Comput. Speech Lang. 8(1), 1–38 (1994)CrossRef
6.
go back to reference Brown, P.F., Desouza, P.V., Mercer, R.L., et al.: Class-based n-gram models of natural language. Comput. Linguist. 18(4), 467–479 (1992) Brown, P.F., Desouza, P.V., Mercer, R.L., et al.: Class-based n-gram models of natural language. Comput. Linguist. 18(4), 467–479 (1992)
7.
go back to reference Goodman, J.T.: A bit of progress in language modeling. Comput. Speech Lang. 15(4), 403–434 (2001)CrossRef Goodman, J.T.: A bit of progress in language modeling. Comput. Speech Lang. 15(4), 403–434 (2001)CrossRef
8.
go back to reference Kuhn, R.: Speech recognition and the frequency of recently used words: a modified Markov model for natural language. In: Proceedings of ACL, pp. 348–350 (1988) Kuhn, R.: Speech recognition and the frequency of recently used words: a modified Markov model for natural language. In: Proceedings of ACL, pp. 348–350 (1988)
9.
go back to reference Kuhn, R., De Mori, R.: A cache-based natural language model for speech recognition. IEEE Trans. Pattern Anal. Mach. Intell. 14(6), 219–228 (1992) Kuhn, R., De Mori, R.: A cache-based natural language model for speech recognition. IEEE Trans. Pattern Anal. Mach. Intell. 14(6), 219–228 (1992)
10.
go back to reference Kuhn, R., Mori, R.D.: Correction to: a cache-based natural language model for speech re-production (1992) Kuhn, R., Mori, R.D.: Correction to: a cache-based natural language model for speech re-production (1992)
11.
go back to reference Stolcke, A.: SRILM-an extensible language modeling toolkit. In: Interspeech, pp. 17–43 (2002) Stolcke, A.: SRILM-an extensible language modeling toolkit. In: Interspeech, pp. 17–43 (2002)
12.
go back to reference Federico, M., Cettolo, M.: Efficient handling of n-gram language models for statistical machine translation. In: Proceedings of the 2nd WSMT, pp. 88–95. ACL (2007) Federico, M., Cettolo, M.: Efficient handling of n-gram language models for statistical machine translation. In: Proceedings of the 2nd WSMT, pp. 88–95. ACL (2007)
13.
go back to reference Nguyen, P., Gao, J., Mahajan, M.: MSRLM: a scalable language modeling toolkit. Microsoft Research MSR-TR-2007-144 (2007) Nguyen, P., Gao, J., Mahajan, M.: MSRLM: a scalable language modeling toolkit. Microsoft Research MSR-TR-2007-144 (2007)
14.
go back to reference Zhang, R.: Research on Large Model and Its Application in Machine Translation, Ph.D thesis of Xiamen University (2009) Zhang, R.: Research on Large Model and Its Application in Machine Translation, Ph.D thesis of Xiamen University (2009)
15.
go back to reference Zhang, Y., Hildebrand, A.S., Vogel, S.: Distributed language modeling for N-best list re-ranking. In: EMNLP, pp. 216–223 (2007) Zhang, Y., Hildebrand, A.S., Vogel, S.: Distributed language modeling for N-best list re-ranking. In: EMNLP, pp. 216–223 (2007)
16.
go back to reference Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)CrossRef Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)CrossRef
17.
go back to reference Yu, X.: Estimating language models using Hadoop and HBase. Ph.D thesis of University of Edinburgh (2008) Yu, X.: Estimating language models using Hadoop and HBase. Ph.D thesis of University of Edinburgh (2008)
18.
go back to reference Zhou, Q., Sun, M., Huang, C.: Automatic identification of Chinese maximal noun phrases. J. Softw. 11(2), 195–201 (2000) Zhou, Q., Sun, M., Huang, C.: Automatic identification of Chinese maximal noun phrases. J. Softw. 11(2), 195–201 (2000)
19.
go back to reference Zhao, J., Huang, C.: Chinese basic noun phrase recognition model based on conversion. J. Chin. Inf. Process. 13(2), 1–7 (1999) Zhao, J., Huang, C.: Chinese basic noun phrase recognition model based on conversion. J. Chin. Inf. Process. 13(2), 1–7 (1999)
20.
go back to reference Li, H., Yang, F., Zhu, J.: Transductive HMM based text chunking. Comput. Sci. 31(2), 152–154 (2004) Li, H., Yang, F., Zhu, J.: Transductive HMM based text chunking. Comput. Sci. 31(2), 152–154 (2004)
21.
go back to reference Ma, Y., Liu, Y.: Base noun phrase identification based on HMM and candidates sorting by weighted templates. In: Proceedings of CCL (2005) Ma, Y., Liu, Y.: Base noun phrase identification based on HMM and candidates sorting by weighted templates. In: Proceedings of CCL (2005)
22.
go back to reference Liu, F., Zhao, T., Yu, H.: Statistics based Chinese chunk Parsin. J. Chin. Inf. Process. 14(6), 28–32 (2000) Liu, F., Zhao, T., Yu, H.: Statistics based Chinese chunk Parsin. J. Chin. Inf. Process. 14(6), 28–32 (2000)
23.
go back to reference Huang, D., Wang, Y.: Chunk parsing based on SVM and error-driven learning methods. J. Chin. Inf. Process. 20(6), 17–24 (2006) Huang, D., Wang, Y.: Chunk parsing based on SVM and error-driven learning methods. J. Chin. Inf. Process. 20(6), 17–24 (2006)
24.
go back to reference Li, Y., Zhu, J., Yao, T.: Combined multiple classifiers based on a stacking algorithm and their application to Chinese text Chinese text chunking. J. Comput. Res. Dev. 42(5), 844–848 (2005)CrossRef Li, Y., Zhu, J., Yao, T.: Combined multiple classifiers based on a stacking algorithm and their application to Chinese text Chinese text chunking. J. Comput. Res. Dev. 42(5), 844–848 (2005)CrossRef
25.
go back to reference Liu, S., Li, Y., Zhang, L.: Chinese text chunking using co-training method. J. Chin. Inf. Process. 19(3), 73–79 (2005) Liu, S., Li, Y., Zhang, L.: Chinese text chunking using co-training method. J. Chin. Inf. Process. 19(3), 73–79 (2005)
Metadata
Title
High Order N-gram Model Construction and Application Based on Natural Annotation
Authors
Qibo Wang
Gaoqi Rao
Endong Xun
Copyright Year
2020
DOI
https://doi.org/10.1007/978-3-030-38189-9_34

Premium Partner