
30.01.2023 | Original Article

Data augmentation using Heuristic Masked Language Modeling

Authors: Xiaorong Liu, Yuan Zhong, Jie Wang, Ping Li

Published in: International Journal of Machine Learning and Cybernetics | Issue 8/2023


Abstract

Data augmentation has played an important role in improving the generalization and performance of data-driven deep learning models in recent years. However, most existing data augmentation methods in NLP either demand heavy manual effort or deliver only limited improvement, which restricts their practical application. To this end, we propose a simple yet effective approach named Heuristic Masked Language Modeling (HMLM), which obtains high-quality data through the masked language modeling objective embedded in pre-trained models. More specifically, HMLM first identifies the core words of a sentence and masks some of its non-core fragments. These masked fragments are then filled with words generated by the pre-trained model to match the contextual semantics. Compared with previous data augmentation approaches, the proposed method creates more grammatical and contextually coherent augmented data without heavy cost. We conducted experiments on typical text classification tasks, namely intent recognition, news classification, and sentiment analysis. Experimental results demonstrate that our method is comparable to state-of-the-art data augmentation approaches.
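The abstract describes the HMLM pipeline only at a high level: detect core words, mask some non-core fragments, and let a pre-trained masked language model refill them in context. The following is a minimal illustrative sketch of that idea, not the authors' implementation; it assumes the Hugging Face `transformers` fill-mask pipeline, and the frequency-based `core_words` function is a hypothetical stand-in for the paper's core-word identification step.

```python
# Illustrative HMLM-style augmentation sketch (assumption: Hugging Face
# `transformers` is installed; this is not the paper's actual code).
import random
from collections import Counter

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
MASK = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT-style models


def core_words(sentence, keep_ratio=0.5):
    """Toy core-word detector: treat the rarer half of the sentence's
    vocabulary as core. A real system might use TF-IDF or TextRank."""
    tokens = [t.lower() for t in sentence.split()]
    freq = Counter(tokens)
    ranked = sorted(set(tokens), key=lambda t: (freq[t], t))  # rare words first
    return set(ranked[: max(1, int(len(ranked) * keep_ratio))])


def hmlm_augment(sentence, mask_prob=0.3, seed=0):
    """Mask some non-core tokens and let the pre-trained model refill them."""
    rng = random.Random(seed)
    core = core_words(sentence)
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in core or rng.random() >= mask_prob:
            continue  # core words are never masked
        tokens[i] = MASK
        # Fill one slot at a time so each prediction sees the latest context.
        best = fill_mask(" ".join(tokens))[0]  # top-ranked candidate
        tokens[i] = best["token_str"].strip()
    return " ".join(tokens)


print(hmlm_augment("the staff were friendly and the food was wonderful"))
```

Filling one mask at a time keeps every prediction conditioned on the already-filled context, which is one straightforward way to realize the "match the contextual semantics" step the abstract describes.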

Metadata
Title
Data augmentation using Heuristic Masked Language Modeling
Authors
Xiaorong Liu
Yuan Zhong
Jie Wang
Ping Li
Publication date
30.01.2023
Publisher
Springer Berlin Heidelberg
Published in
International Journal of Machine Learning and Cybernetics / Issue 8/2023
Print ISSN: 1868-8071
Electronic ISSN: 1868-808X
DOI
https://doi.org/10.1007/s13042-023-01784-y
