Published in: Neural Computing and Applications 8/2021

08-08-2020 | Original Article

Learning Chinese word representation better by cascade morphological n-gram

Authors: Zongyang Xiong, Ke Qin, Haobo Yang, Guangchun Luo


Abstract

Word embedding maps words or phrases to vectors of real numbers and is a precondition for text classification, sentiment analysis and text mining with deep neural networks. Taking English as an example, most current word embedding algorithms learn the vectors from the distribution of a word's prefixes, suffixes, etyma and the entire word itself. Unlike English words, Chinese words are composed of components and strokes, and those components and strokes usually hint at the meaning of the word. Their distributions must therefore be fully considered and learnt when embedding Chinese words. In this paper, we propose a component-based cascade n-gram (CBC n-gram) model and a stroke-based cascade n-gram (SBC n-gram) model. By overlaying component and stroke n-gram vectors on word vectors, we improve Chinese word embeddings so that they preserve as much morphological information as possible at different granularity levels. We evaluate our models on word similarity, word analogy and text classification tasks using wordsim-240, wordsim-296, the Chinese word analogy dataset and the Fudan Corpus, respectively. Experimental comparisons show that our models outperform other state-of-the-art methods.
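
To make the overlay idea concrete, here is a minimal sketch in Python. It is not the authors' implementation: the component and stroke decompositions of 智能, the n-gram range, the mean composition, and all names (ngrams, VectorTable, embed) are assumptions for illustration. It composes a word's embedding from the word vector plus component- and stroke-level n-gram vectors, so morphological information at both granularities contributes.

import numpy as np

# Illustrative only: unit decompositions, the n-gram range, and the mean
# composition are assumptions for this sketch, not the paper's exact
# CBC/SBC n-gram formulation.

DIM = 100
rng = np.random.default_rng(0)

def ngrams(units, n_min=2, n_max=3):
    """All contiguous n-grams of a unit sequence for n in [n_min, n_max]."""
    return [tuple(units[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(units) - n + 1)]

class VectorTable:
    """Lazy lookup table standing in for a trainable embedding matrix."""
    def __init__(self):
        self._vecs = {}
    def __getitem__(self, key):
        return self._vecs.setdefault(key, rng.normal(scale=0.1, size=DIM))

word_tab, comp_tab, stroke_tab = VectorTable(), VectorTable(), VectorTable()

def embed(word, components, strokes):
    """Overlay component- and stroke-level n-gram vectors on the word
    vector, so both morphological granularities shape the embedding."""
    parts = [word_tab[word]]
    parts += [comp_tab[g] for g in ngrams(components)]
    parts += [stroke_tab[g] for g in ngrams(strokes)]
    return np.mean(parts, axis=0)

# Hypothetical decompositions for 智能 ("intelligence"); real component
# and stroke inventories would come from a dictionary resource.
vec = embed("智能",
            components=["矢", "口", "日", "厶", "月"],
            strokes=["横", "撇", "点", "竖", "横折"])
print(vec.shape)  # (100,)

In a trained model the lookup tables would of course be learned, for example under a skip-gram-style objective over the composed vectors, rather than filled with random values as in this sketch.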


Metadata
Title
Learning Chinese word representation better by cascade morphological n-gram
Authors
Zongyang Xiong
Ke Qin
Haobo Yang
Guangchun Luo
Publication date
08-08-2020
Publisher
Springer London
Published in
Neural Computing and Applications / Issue 8/2021
Print ISSN: 0941-0643
Electronic ISSN: 1433-3058
DOI
https://doi.org/10.1007/s00521-020-05198-7
