2020 | Original Paper | Book Chapter

DeepBT and NLP Data Augmentation Techniques: A New Proposal and a Comprehensive Study

Authors: Taynan Maier Ferreira, Anna Helena Reali Costa

Published in: Intelligent Systems

Publisher: Springer International Publishing

Abstract

Data augmentation methods, a family of techniques designed for the synthetic generation of training data, have shown remarkable results in various deep learning and machine learning tasks. Despite the widespread and successful adoption of data augmentation within the computer vision community, techniques designed for natural language processing (NLP) tasks have advanced much more slowly and have had limited success in achieving performance gains. As a consequence, with the exception of applications of back-translation to machine translation tasks, these techniques have not been as thoroughly explored by the wider NLP community. Recent research on the subject also still lacks a practical understanding of the relationship between data augmentation and several important aspects of model design, such as hyperparameters and regularization parameters. In this paper, we perform a comprehensive study of NLP data augmentation techniques, comparing their relative performance under different settings. We also propose Deep Back-Translation, a novel NLP data augmentation technique, and apply it to benchmark datasets. We analyze the quality of the synthetic data generated, evaluate its performance gains, and compare all of these aspects to existing data augmentation procedures.

Footnotes
1
The term Noised Back-Translation was not used in [7]; it was coined in [3].
 
References
2.
Basile, V., et al.: SemEval-2019 task 5: multilingual detection of hate speech against immigrants and women in Twitter. In: Proceedings of the 13th International Workshop on Semantic Evaluation, pp. 54–63. Association for Computational Linguistics, Minneapolis, June 2019. https://doi.org/10.18653/v1/S19-2007
3.
Caswell, I., Chelba, C., Grangier, D.: Tagged back-translation. In: Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pp. 53–63. Association for Computational Linguistics, Florence, August 2019. https://doi.org/10.18653/v1/W19-5206
4.
Cortis, K., et al.: SemEval-2017 task 5: fine-grained sentiment analysis on financial microblogs and news. In: Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017), pp. 519–535. Association for Computational Linguistics, Stroudsburg (2017). https://doi.org/10.18653/v1/S17-2089
5.
Dao, T., Gu, A., Ratner, A., Smith, V., De Sa, C., Re, C.: A kernel theory of modern data augmentation. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 97, pp. 1528–1537. PMLR, Long Beach, 09–15 June 2019. http://proceedings.mlr.press/v97/dao19b.html
6.
Davis, B., Cortis, K., Vasiliu, L., Koumpis, A., Mcdermott, R., Handschuh, S.: Social sentiment indices powered by X-scores. In: ALLDATA 2016, The Second International Conference on Big Data, Small Data, Linked Data and Open Data, Lisbon, Portugal (2016)
7.
Edunov, S., Ott, M., Auli, M., Grangier, D.: Understanding back-translation at scale. In: Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 489–500. Association for Computational Linguistics, Brussels, October–November 2018. https://doi.org/10.18653/v1/D18-1045
9.
Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. The MIT Press, Cambridge (2016)
10.
Graça, M., Kim, Y., Schamper, J., Khadivi, S., Ney, H.: Generalizing back-translation in neural machine translation. In: Proceedings of the Fourth Conference on Machine Translation (Volume 1: Research Papers), pp. 45–52. Association for Computational Linguistics, Florence, August 2019. https://doi.org/10.18653/v1/W19-5205
13.
Hoang, V.C.D., Koehn, P., Haffari, G., Cohn, T.: Iterative back-translation for neural machine translation. In: Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pp. 18–24. Association for Computational Linguistics, Melbourne, July 2018. https://doi.org/10.18653/v1/W18-2703
14.
Imamura, K., Fujita, A., Sumita, E.: Enhancement of encoder and attention using target monolingual corpora in neural machine translation. In: Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pp. 55–63. Association for Computational Linguistics, Melbourne, July 2018. https://doi.org/10.18653/v1/W18-2707
15.
Kobayashi, S.: Contextual augmentation: data augmentation by words with paradigmatic relations. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 452–457. Association for Computational Linguistics, New Orleans, June 2018. https://doi.org/10.18653/v1/N18-2072
16.
Konda, K.R., Bouthillier, X., Memisevic, R., Vincent, P.: Dropout as data augmentation. arXiv abs/1506.08700 (2015)
17.
Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Proceedings of the 25th International Conference on Neural Information Processing Systems. NIPS 2012, vol. 1, pp. 1097–1105. Curran Associates Inc., Red Hook (2012)
18.
Mikołajczyk, A., Grochowski, M.: Data augmentation for improving deep learning in image classification problem. In: 2018 International Interdisciplinary PhD Workshop (IIPhDW), pp. 117–122 (2018)
19.
Mohammad, S., Bravo-Marquez, F., Salameh, M., Kiritchenko, S.: SemEval-2018 task 1: affect in tweets. In: Proceedings of The 12th International Workshop on Semantic Evaluation, pp. 1–17. Association for Computational Linguistics, New Orleans, June 2018. https://doi.org/10.18653/v1/S18-1001
20.
Pennington, J., Socher, R., Manning, C.D.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
21.
Salamon, J., Bello, J.P.: Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Process. Lett. 24(3), 279–283 (2017)
22.
Sennrich, R., Haddow, B., Birch, A.: Improving neural machine translation models with monolingual data. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 86–96. Association for Computational Linguistics, Berlin, August 2016. https://doi.org/10.18653/v1/P16-1009
24.
Sugiyama, A., Yoshinaga, N.: Data augmentation using back-translation for context-aware neural machine translation. In: Proceedings of the Fourth Workshop on Discourse in Machine Translation (DiscoMT 2019), pp. 35–44. Association for Computational Linguistics, Hong Kong, November 2019. https://doi.org/10.18653/v1/D19-6504
25.
Taylor, L., Nitschke, G.: Improving deep learning with generic data augmentation. In: 2018 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1542–1547 (2018)
26.
Wei, J., Zou, K.: EDA: easy data augmentation techniques for boosting performance on text classification tasks. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 6382–6388. Association for Computational Linguistics, Hong Kong, November 2019. https://doi.org/10.18653/v1/D19-1670
27.
Wen, Q., Sun, L., Song, X., Gao, J., Wang, X., Xu, H.: Time series data augmentation for deep learning: a survey (2020)
29.
Zhang, Y., Wallace, B.C.: A sensitivity analysis of (and practitioners’ guide to) convolutional neural networks for sentence classification. In: Proceedings of the 8th International Joint Conference on Natural Language Processing, pp. 253–263 (2017)
30.
Zhao, D., Yu, G., Xu, P., Luo, M.: Equivalence between dropout and data augmentation: a mathematical check. Neural Netw. Off. J. Int. Neural Netw. Soc. 115, 82–89 (2019)
Metadata
Title
DeepBT and NLP Data Augmentation Techniques: A New Proposal and a Comprehensive Study
Authors
Taynan Maier Ferreira
Anna Helena Reali Costa
Copyright year
2020
DOI
https://doi.org/10.1007/978-3-030-61377-8_30