2020 | OriginalPaper | Chapter

Deep Learning Models for Representing Out-of-Vocabulary Words

Authors: Johannes V. Lochter, Renato M. Silva, Tiago A. Almeida

Published in: Intelligent Systems

Publisher: Springer International Publishing

Abstract

Communication has become increasingly dynamic with the popularization of social networks and applications that allow people to express themselves and communicate instantly. In this scenario, the quality of distributed representation models is degraded by new words that appear frequently or that derive from spelling errors. These words, which are unknown to the models and are called out-of-vocabulary (OOV) words, must be handled properly so as not to degrade the quality of natural language processing (NLP) applications, which depend on an appropriate vector representation of the texts. To better understand this problem and to find the best techniques for handling OOV words, in this study we present a comprehensive performance evaluation of deep learning models for representing OOV words. We performed an intrinsic evaluation using a benchmark dataset and an extrinsic evaluation using different NLP tasks: text categorization, named entity recognition, and part-of-speech tagging. Although the results indicated that the best technique for handling OOV words differs across tasks, Comick, a deep learning method that infers the embedding of an OOV word from its context and its morphological structure, obtained promising results.
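
To make the problem concrete, the sketch below (not the chapter's code) illustrates one widely used strategy for representing OOV words: FastText-style subword embeddings, which compose a vector for an unseen word from its character n-grams. It assumes the gensim library; the toy corpus and the misspelled word are illustrative assumptions only.

    # A minimal sketch, assuming gensim is installed: FastText-style subword
    # embeddings build a vector for an out-of-vocabulary word from its
    # character n-grams, so even an unseen misspelling gets a representation.
    from gensim.models import FastText

    corpus = [
        ["social", "networks", "let", "people", "communicate", "instantly"],
        ["spelling", "errors", "often", "produce", "unknown", "words"],
        ["vector", "representations", "support", "nlp", "applications"],
    ]

    # min_n/max_n set the character n-gram range used for subword vectors.
    model = FastText(sentences=corpus, vector_size=32, window=3,
                     min_count=1, min_n=3, max_n=5, epochs=50)

    oov = "communicatng"  # a misspelling never seen during training
    print(oov in model.wv.key_to_index)             # False: the word is OOV
    print(model.wv[oov].shape)                      # (32,): vector composed from n-grams
    print(model.wv.similarity(oov, "communicate"))  # typically high: shared n-grams
                                                    # place it near the correct word

Unlike this purely morphological composition, Comick also exploits the context in which the OOV word occurs when inferring its embedding, which is the property the evaluation above highlights.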


Footnotes

1. In this study, we use the term sample to denote an instance or a text document.
2. Transformers. Available at https://huggingface.co/transformers. Accessed on 2020/10/07 14:45:55.
3. HiCE. Available at https://github.com/acbull/HiCE. Accessed on 2020/10/07 14:45:55.
4. PyTorch GitHub. Available at https://bit.ly/2B7LS3U. Accessed on 2020/10/07 14:45:55.
5. HiCE. Available at https://github.com/acbull/HiCE. Accessed on 2020/10/07 14:45:55.
6. Keras. Available at https://keras.io/. Accessed on 2020/10/07 14:45:55.
7. TensorFlow. Available at https://www.tensorflow.org/. Accessed on 2020/10/07 14:45:55.
8. NER dataset with unusual and previously unseen entities in the context of emerging discussions.
9. NER dataset whose sentences contain technical terms from the biology domain.
10. POS tagging dataset of Twitter messages.
 
Metadata
Title: Deep Learning Models for Representing Out-of-Vocabulary Words
Authors: Johannes V. Lochter, Renato M. Silva, Tiago A. Almeida
Copyright Year: 2020
DOI: https://doi.org/10.1007/978-3-030-61377-8_29
