
2018 | OriginalPaper | Chapter

Overview of Character-Based Models for Natural Language Processing

Authors : Heike Adel, Ehsaneddin Asgari, Hinrich Schütze

Published in: Computational Linguistics and Intelligent Text Processing

Publisher: Springer International Publishing

Abstract

Character-based models have become increasingly popular for many natural language processing tasks, especially owing to the success of neural networks. They make it possible to model text sequences directly, without a tokenization step, and thereby simplify the traditional preprocessing pipeline. This paper provides an overview of character-based models for a variety of natural language processing tasks. We group existing work into three categories: tokenization-based approaches, bag-of-n-gram models, and end-to-end models. For each category, we present prominent examples of studies, with a particular focus on recent character-based deep learning work.
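As an illustration of the bag-of-n-gram idea mentioned in the abstract (a minimal sketch, not code from the paper), a character n-gram representation can be computed directly from the raw, untokenized string:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Return counts of character n-grams, operating directly on the
    raw string -- no tokenization step is required."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Trigrams of a short untokenized string
counts = char_ngrams("san francisco", n=3)
print(counts["san"])  # the trigram "san" occurs once
```

Such count vectors can then feed any standard classifier, which is the essence of the bag-of-n-gram approaches surveyed in the paper.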


Footnotes
1
There are also difficult cases in English, such as “Yahoo!” or “San Francisco-Los Angeles flights”.
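A quick sketch (illustrative only) of why such cases are hard: a naive tokenizer that splits on whitespace and hyphens tears the city names in this example apart:

```python
import re

text = "San Francisco-Los Angeles flights"
# A naive tokenizer that splits on whitespace and hyphens
naive_tokens = re.split(r"[\s\-]+", text)
print(naive_tokens)  # ['San', 'Francisco', 'Los', 'Angeles', 'flights']
# The multiword units "San Francisco" and "Los Angeles" are lost.
```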
 
2
In our view, morpheme-based models are not true instances of character-level models, since linguistically motivated morphological segmentation is a step equivalent to tokenization, only on a different level. We therefore do not cover most work on morphological segmentation in this paper.
 
85.
go back to reference Shannon, C.E.: Prediction and entropy of printed english. Bell Labs Tech. J. 30(1), 50–64 (1951)CrossRef Shannon, C.E.: Prediction and entropy of printed english. Bell Labs Tech. J. 30(1), 50–64 (1951)CrossRef
86.
go back to reference Sperr, H., Niehues, J., Waibel, A.: Letter n-gram-based input encoding for continuous space language models. In: Workshop on Continuous Vector Space Models and their Compositionality, pp. 30–39 (2013) Sperr, H., Niehues, J., Waibel, A.: Letter n-gram-based input encoding for continuous space language models. In: Workshop on Continuous Vector Space Models and their Compositionality, pp. 30–39 (2013)
87.
go back to reference Srivastava, R.K., Greff, K., Schmidhuber, J.: Highway networks. In: ICML 2015 Deep Learing Workshop (2015) Srivastava, R.K., Greff, K., Schmidhuber, J.: Highway networks. In: ICML 2015 Deep Learing Workshop (2015)
88.
go back to reference Sutskever, I., Martens, J., Hinton, G.E.: Generating text with recurrent neural networks. In: International Conference on Machine Learning, pp. 1017–1024 (2011) Sutskever, I., Martens, J., Hinton, G.E.: Generating text with recurrent neural networks. In: International Conference on Machine Learning, pp. 1017–1024 (2011)
89.
go back to reference Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, pp. 3104–3112 (2014) Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: Advances in Neural Information Processing Systems, pp. 3104–3112 (2014)
90.
go back to reference Tiedemann, J., Nakov, P.: Analyzing the use of character-level translation with sparse and noisy datasets. In: Recent Advances in Natural Language Processing, RANLP 2013, 9–11 September, 2013, Hissar, Bulgaria, pp. 676–684 (2013) Tiedemann, J., Nakov, P.: Analyzing the use of character-level translation with sparse and noisy datasets. In: Recent Advances in Natural Language Processing, RANLP 2013, 9–11 September, 2013, Hissar, Bulgaria, pp. 676–684 (2013)
91.
go back to reference Murthy, V., Khapra, M.M., Bhattacharyya, P.: Sharing network parameters for crosslingual named entity recognition. CoRR abs/1607.00198 (2016) Murthy, V., Khapra, M.M., Bhattacharyya, P.: Sharing network parameters for crosslingual named entity recognition. CoRR abs/1607.00198 (2016)
92.
go back to reference Vergyri, D., Kirchhoff, K., Duh, K., Stolcke, A.: Morphology-based language modeling for arabic speech recognition. In: Annual Conference of the International Speech Communication Association, 4, 2245–2248 (2004) Vergyri, D., Kirchhoff, K., Duh, K., Stolcke, A.: Morphology-based language modeling for arabic speech recognition. In: Annual Conference of the International Speech Communication Association, 4, 2245–2248 (2004)
93.
go back to reference Vilar, D., Peter, J.T., Ney, H.: Can we translate letters? In: Workshop on Statistical Machine Translation (2007) Vilar, D., Peter, J.T., Ney, H.: Can we translate letters? In: Workshop on Statistical Machine Translation (2007)
94.
go back to reference Vylomova, E., Cohn, T., He, X., Haffari, G.: Word representation models for morphologically rich languages in neural machine translation. CoRR abs/1606.04217 (2016) Vylomova, E., Cohn, T., He, X., Haffari, G.: Word representation models for morphologically rich languages in neural machine translation. CoRR abs/1606.04217 (2016)
95.
go back to reference Wang, L., Cao, Z., Xia, Y., de Melo, G.: Morphological segmentation with window LSTM neural networks. In: AAAI Conference on Artificial Intelligence (2016) Wang, L., Cao, Z., Xia, Y., de Melo, G.: Morphological segmentation with window LSTM neural networks. In: AAAI Conference on Artificial Intelligence (2016)
96.
go back to reference Wieting, J., Bansal, M., Gimpel, K., Livescu, K.: Charagram: embedding words and sentences via character n-grams. In: Conference on Empirical Methods in Natural Language Processing (2016) Wieting, J., Bansal, M., Gimpel, K., Livescu, K.: Charagram: embedding words and sentences via character n-grams. In: Conference on Empirical Methods in Natural Language Processing (2016)
97.
go back to reference Wu, Y., et al.: Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144 (2016) Wu, Y., et al.: Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR abs/1609.08144 (2016)
98.
go back to reference Xiao, Y., Cho, K.: Efficient character-level document classification by combining convolution and recurrent layers. CoRR abs/1602.00367 (2016) Xiao, Y., Cho, K.: Efficient character-level document classification by combining convolution and recurrent layers. CoRR abs/1602.00367 (2016)
99.
go back to reference Yaghoobzadeh, Y., Schütze, H.: Multi-level representations for fine-grained typing of knowledge base entities. In: Conference of the European Chapter of the Association for Computational Linguistics (2017) Yaghoobzadeh, Y., Schütze, H.: Multi-level representations for fine-grained typing of knowledge base entities. In: Conference of the European Chapter of the Association for Computational Linguistics (2017)
100.
go back to reference Yang, Z., Chen, W., Wang, F., Xu, B.: A character-aware encoder for neural machine translation. In: International Conference on Computational Linguistics, pp. 3063–3070 (2016) Yang, Z., Chen, W., Wang, F., Xu, B.: A character-aware encoder for neural machine translation. In: International Conference on Computational Linguistics, pp. 3063–3070 (2016)
101.
go back to reference Yang, Z., Salakhutdinov, R., Cohen, W.W.: Multi-task cross-lingual sequence tagging from scratch. CoRR abs/1603.06270 (2016) Yang, Z., Salakhutdinov, R., Cohen, W.W.: Multi-task cross-lingual sequence tagging from scratch. CoRR abs/1603.06270 (2016)
102.
go back to reference Yu, L., Buys, J., Blunsom, P.: Online segment to segment neural transduction. In: Conference on Empirical Methods in Natural Language Processing, pp. 1307–1316 (2016) Yu, L., Buys, J., Blunsom, P.: Online segment to segment neural transduction. In: Conference on Empirical Methods in Natural Language Processing, pp. 1307–1316 (2016)
103.
go back to reference Zhang, X., LeCun, Y.: Text understanding from scratch. CoRR abs/1502.01710 (2015) Zhang, X., LeCun, Y.: Text understanding from scratch. CoRR abs/1502.01710 (2015)
104.
go back to reference Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. In: Advances in Neural Information Processing Systems, pp. 649–657 (2015) Zhang, X., Zhao, J., LeCun, Y.: Character-level convolutional networks for text classification. In: Advances in Neural Information Processing Systems, pp. 649–657 (2015)
Metadata
Title
Overview of Character-Based Models for Natural Language Processing
Authors
Heike Adel
Ehsaneddin Asgari
Hinrich Schütze
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-77113-7_1