16-06-2023 | Research Article-Computer Engineering and Computer Science

Compact Image Transformer Based on Convolutional Variational Autoencoder with Augmented Attention Backbone for Target Recognition in Infrared Images

Authors: Billel Nebili, Atmane Khellal, Abdelkrim Nemra, Said Yacine Boulahia, Laurent Mascarilla

Published in: Arabian Journal for Science and Engineering | Issue 3/2024

Abstract

Recently, the Vision Transformer (ViT) has become a relevant alternative to convolutional neural networks (CNNs) for image classification tasks. However, ViT requires pre-training on large datasets, making it unsuitable for certain scientific fields, such as infrared imaging, where the amount of training data is limited. To address this limitation, we propose a Compact image Transformer based on a convolutional variational Autoencoder with an Augmented attention backbone (referred to as AA-CiT) for target recognition in infrared images, which learns efficiently from scratch even on small datasets. This is achieved through three main adaptations of the original ViT architecture, in which we introduce convolutions into its different parts to fully benefit from the properties of both paradigms: attention and convolution. First, we improve the tokenization step by introducing a new module based on a local convolutional variational autoencoder. Second, convolutional features are incorporated into ViT’s encoder, which introduces some of the inductive biases of CNNs into the proposed transformer. Finally, we exploit a new sequence pooling technique on top of ViT’s encoder to make the model compact and more accurate. These modifications overcome the difficulties of ViT training and eliminate both the need for a class token and the heavy reliance on positional embeddings. We validate our approach through extensive experiments on the FLIR-SEEK dataset. Overall, we achieve a \(3\%\) improvement in classification accuracy compared to the conventional ViT while relying on only \(14\%\) of ViT’s parameters.
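As a concrete illustration of the third adaptation, the sketch below shows one common form of sequence pooling: a learned, attention-weighted average of the encoder’s output tokens replaces the class token before classification. This is a minimal PyTorch sketch under stated assumptions; the module names, dimensions, and classifier head are illustrative, not the authors’ exact implementation.

```python
# Minimal sketch (assumption, not the authors' code): sequence pooling
# replaces the class token with a learned attention-weighted average
# of the encoder's output tokens.
import torch
import torch.nn as nn


class SeqPool(nn.Module):
    """Pool a token sequence (B, N, D) into a single vector (B, D)."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.attn = nn.Linear(embed_dim, 1)  # one importance score per token

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.attn(tokens), dim=1)  # (B, N, 1)
        return (weights * tokens).sum(dim=1)               # (B, D)


class ClassifierHead(nn.Module):
    """Sequence pooling followed by a linear classifier (illustrative)."""

    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.pool = SeqPool(embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.fc(self.pool(tokens))


if __name__ == "__main__":
    encoder_tokens = torch.randn(8, 196, 256)  # dummy encoder output
    head = ClassifierHead(embed_dim=256, num_classes=6)
    print(head(encoder_tokens).shape)  # torch.Size([8, 6])
```

Because the pooled vector is a weighted combination of all output tokens, the classifier can emphasize informative spatial regions without carrying an extra class token through every encoder layer, which is consistent with the compactness claim above.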

Metadata
Title
Compact Image Transformer Based on Convolutional Variational Autoencoder with Augmented Attention Backbone for Target Recognition in Infrared Images
Authors
Billel Nebili
Atmane Khellal
Abdelkrim Nemra
Said Yacine Boulahia
Laurent Mascarilla
Publication date
16-06-2023
Publisher
Springer Berlin Heidelberg
Published in
Arabian Journal for Science and Engineering / Issue 3/2024
Print ISSN: 2193-567X
Electronic ISSN: 2191-4281
DOI
https://doi.org/10.1007/s13369-023-08012-3
