
10.11.2023 | Original Article

Fine-tuning pretrained transformer encoders for sequence-to-sequence learning

Authors: Hangbo Bao, Li Dong, Wenhui Wang, Nan Yang, Songhao Piao, Furu Wei

Published in: International Journal of Machine Learning and Cybernetics | Issue 5/2024

Abstract

In this paper, we introduce s2s-ft, a method for adapting pretrained bidirectional Transformer encoders, such as BERT and RoBERTa, to sequence-to-sequence tasks like abstractive summarization and question generation. By employing a unified modeling approach and carefully designed self-attention masks, s2s-ft leverages the generative capabilities of pretrained Transformer encoders without requiring an additional decoder. We conduct extensive experiments comparing three fine-tuning algorithms (causal fine-tuning, masked fine-tuning, and pseudo-masked fine-tuning) and various pretrained models for initialization. The results demonstrate that s2s-ft achieves strong performance across different tasks and languages. Additionally, the method extends successfully to multilingual pretrained models, such as XLM-RoBERTa, and is evaluated on multilingual generation tasks. Our work highlights the importance of reducing the discrepancy between masked language model pretraining and sequence-to-sequence fine-tuning, and showcases the effectiveness and extensibility of the s2s-ft method.
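
The "well-designed self-attention masks" mentioned in the abstract are what let a single bidirectional encoder play both encoder and decoder roles: source tokens attend to each other bidirectionally, while target tokens attend to the full source segment and only to preceding (and current) target positions, so generation remains left-to-right. The following is a minimal, illustrative PyTorch sketch of such a mask; the function name and tensor layout are assumptions for illustration, not the paper's actual implementation.

import torch

def seq2seq_attention_mask(src_len: int, tgt_len: int) -> torch.Tensor:
    """Build a (src_len + tgt_len) x (src_len + tgt_len) self-attention mask.

    Source tokens attend bidirectionally among themselves; target tokens
    attend to the whole source segment and causally to earlier target
    tokens. A value of 1 means "may attend", 0 means "masked out".
    """
    total = src_len + tgt_len
    mask = torch.zeros(total, total, dtype=torch.long)

    # Source block: full bidirectional attention among source tokens.
    mask[:src_len, :src_len] = 1

    # Target rows: every target token sees the entire source segment...
    mask[src_len:, :src_len] = 1

    # ...and a causal (lower-triangular) view of the target segment.
    mask[src_len:, src_len:] = torch.tril(
        torch.ones(tgt_len, tgt_len, dtype=torch.long)
    )
    return mask

if __name__ == "__main__":
    # Toy example: 4 source tokens, 3 target tokens.
    print(seq2seq_attention_mask(4, 3))

In a real fine-tuning setup, such a mask would typically be converted to additive form (0 for allowed positions, a large negative value for masked ones) and applied to the attention scores in every encoder layer.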

Metadata
Title
Fine-tuning pretrained transformer encoders for sequence-to-sequence learning
Authors
Hangbo Bao
Li Dong
Wenhui Wang
Nan Yang
Songhao Piao
Furu Wei
Publication date
10.11.2023
Publisher
Springer Berlin Heidelberg
Published in
International Journal of Machine Learning and Cybernetics / Issue 5/2024
Print ISSN: 1868-8071
Electronic ISSN: 1868-808X
DOI
https://doi.org/10.1007/s13042-023-01992-6
