2024 | OriginalPaper | Chapter

Layer-Wise Sparse Training of Transformer via Convolutional Flood Filling

Authors : Bokyeong Yoon, Yoonsang Han, Gordon Euhyun Moon

Published in: Advances in Knowledge Discovery and Data Mining

Publisher: Springer Nature Singapore

Abstract

Sparsifying the Transformer has garnered considerable interest, as training the Transformer is computationally very demanding. Prior efforts to sparsify the Transformer have used either a fixed pattern or a data-driven approach to reduce the number of operations involved in the computation of multi-head attention, which is the main bottleneck of the Transformer. However, existing methods suffer from unavoidable problems, including the potential loss of essential sequence features and an increase in model size. In this paper, we propose a novel sparsification scheme for the Transformer that integrates convolution filters and the flood filling method to efficiently capture the layer-wise sparse pattern in attention operations. Our sparsification approach significantly reduces the computational complexity and memory footprint of the Transformer during training. We develop efficient implementations of the layer-wise sparsified attention algorithm on GPUs, demonstrating that our SPION achieves up to a 2.78\(\times \) speedup over existing state-of-the-art sparse Transformer models while maintaining high evaluation quality.
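
The abstract's core idea (detecting a layer-wise sparse attention pattern with convolution filters and then growing it with flood filling) can be illustrated with a small NumPy sketch. The code below is an assumption-laden illustration, not the paper's algorithm or SPION's GPU kernels: the thresholds, the 3x3 averaging filter, the 4-connectivity, and all function names are hypothetical choices made only to show how a convolution over thresholded attention scores can seed a flood fill that yields a sparsity mask.

    # Minimal sketch (assumptions only) of: threshold attention scores, convolve to
    # find locally dense regions, flood-fill from those regions to build a mask.
    import numpy as np
    from collections import deque

    def conv2d_same(mat, kernel):
        """Naive 'same'-padded 2D cross-correlation, adequate for small inputs."""
        kh, kw = kernel.shape
        ph, pw = kh // 2, kw // 2
        padded = np.pad(mat, ((ph, ph), (pw, pw)))
        out = np.zeros_like(mat, dtype=float)
        for i in range(mat.shape[0]):
            for j in range(mat.shape[1]):
                out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
        return out

    def flood_fill_mask(seed_mask, candidate_mask):
        """Grow seed cells through 4-connected neighbours that are also candidates."""
        n, m = seed_mask.shape
        visited = np.zeros_like(seed_mask, dtype=bool)
        queue = deque(zip(*np.nonzero(seed_mask)))
        while queue:
            i, j = queue.popleft()
            if visited[i, j]:
                continue
            visited[i, j] = True
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < n and 0 <= nj < m and candidate_mask[ni, nj] and not visited[ni, nj]:
                    queue.append((ni, nj))
        return visited  # seeds plus all connected candidate cells

    def layerwise_sparsity_mask(scores, filter_size=3, score_q=0.8, conv_q=0.9):
        """Hypothetical pipeline: threshold scores, convolve, flood-fill the densest peaks."""
        candidate = scores >= np.quantile(scores, score_q)       # keep high-score cells
        kernel = np.ones((filter_size, filter_size)) / filter_size**2
        density = conv2d_same(candidate.astype(float), kernel)   # local density of strong scores
        seeds = density >= np.quantile(density, conv_q)          # densest neighbourhoods as seeds
        return flood_fill_mask(seeds, candidate)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        attn_scores = rng.random((64, 64))                       # stand-in for one head's QK^T scores
        attn_scores[np.abs(np.subtract.outer(range(64), range(64))) <= 2] += 1.0  # fake diagonal structure
        mask = layerwise_sparsity_mask(attn_scores)
        print(f"mask density: {mask.mean():.3f}")                # fraction of attention entries retained

In such a scheme, the resulting boolean mask would restrict which query-key pairs are evaluated in subsequent attention computations, which is where the reduction in operations and memory footprint described in the abstract would come from.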

Metadata
Title
Layer-Wise Sparse Training of Transformer via Convolutional Flood Filling
Authors
Bokyeong Yoon
Yoonsang Han
Gordon Euhyun Moon
Copyright Year
2024
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-97-2253-2_13
