
2020 | Original Paper | Book Chapter

High Order N-gram Model Construction and Application Based on Natural Annotation

Written by: Qibo Wang, Gaoqi Rao, Endong Xun

Published in: Chinese Lexical Semantics

Publisher: Springer International Publishing


Abstract

N-gram language models play an important role in NLP tasks. In this paper, we propose two language models based on linguistic boundaries to address the challenge of very large language data: an intra-sentence boundary model and an inter-sentence boundary model. We developed a training tool on the Hadoop platform using MapReduce programming, and used a prefix tree to compress and store the model. We applied our model to boundary identification in syntactic parsing and achieved good results.
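The abstract names two building blocks: MapReduce-style n-gram counting and prefix-tree storage. The following is a minimal single-process sketch of how these two ideas can fit together, not the authors' Hadoop tool. The boundary markers `<s>` and `</s>`, the function names `map_ngrams`, `reduce_counts`, and `build_trie`, and the `NgramTrie` class are all hypothetical names introduced for illustration; the abstract does not specify how the intra-sentence and inter-sentence boundary models are defined.

```python
from collections import Counter
from itertools import chain

# Hypothetical boundary markers; the paper's actual boundary symbols are not given.
BOS, EOS = "<s>", "</s>"


def map_ngrams(tokens, n):
    """Map step: emit (ngram, 1) for every n-gram of one boundary-padded sentence."""
    padded = [BOS] + list(tokens) + [EOS]
    for i in range(len(padded) - n + 1):
        yield tuple(padded[i:i + n]), 1


def reduce_counts(pairs):
    """Reduce step: sum the emitted counts for each distinct n-gram key."""
    counts = Counter()
    for ngram, c in pairs:
        counts[ngram] += c
    return counts


class TrieNode:
    __slots__ = ("children", "count")

    def __init__(self):
        self.children = {}  # token -> TrieNode
        self.count = 0      # count of the n-gram that ends at this node


class NgramTrie:
    """Prefix tree: n-grams sharing a prefix share the nodes for that prefix."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, ngram, count):
        node = self.root
        for tok in ngram:
            node = node.children.setdefault(tok, TrieNode())
        node.count += count

    def lookup(self, ngram):
        node = self.root
        for tok in ngram:
            node = node.children.get(tok)
            if node is None:
                return 0  # unseen n-gram
        return node.count


def build_trie(sentences, n):
    """Run the map and reduce steps in-process, then load the counts into a trie."""
    pairs = chain.from_iterable(map_ngrams(s, n) for s in sentences)
    trie = NgramTrie()
    for ngram, c in reduce_counts(pairs).items():
        trie.insert(ngram, c)
    return trie


if __name__ == "__main__":
    toy = [["a", "b", "c"], ["a", "b", "d"]]
    trie = build_trie(toy, 2)
    print(trie.lookup(("a", "b")), trie.lookup((BOS, "a")))
```

In a real Hadoop job the map and reduce functions would run on separate workers over sharded input; here they run in one process only to show the data flow. The trie saves space because every n-gram that begins with the same tokens reuses the same path from the root.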

Metadata
Title
High Order N-gram Model Construction and Application Based on Natural Annotation
Written by
Qibo Wang
Gaoqi Rao
Endong Xun
Copyright year
2020
DOI
https://doi.org/10.1007/978-3-030-38189-9_34