16-06-2023 | Research Article-Computer Engineering and Computer Science

Compact Image Transformer Based on Convolutional Variational Autoencoder with Augmented Attention Backbone for Target Recognition in Infrared Images

Authors: Billel Nebili, Atmane Khellal, Abdelkrim Nemra, Said Yacine Boulahia, Laurent Mascarilla

Published in: Arabian Journal for Science and Engineering | Issue 3/2024

Abstract

Recently, the Vision Transformer (ViT) has become a relevant alternative to convolutional neural networks (CNNs) for image classification tasks. However, ViT requires pre-training on large datasets, making it unsuitable for certain scientific fields, such as infrared imaging, where the amount of training data is limited. To address this limitation, we propose a Compact image Transformer based on a convolutional variational Autoencoder with an Augmented attention backbone (referred to as AA-CiT) for target recognition in infrared images, which learns efficiently from scratch even on small datasets. This is achieved through three main adaptations of the original ViT architecture, in which we introduce convolutions into its different parts to fully benefit from the properties of both paradigms: attention and convolution. First, we improve the tokenization step by introducing a new module based on a local convolutional variational autoencoder. Second, convolutional features are incorporated into ViT’s encoder, which introduces some of the inductive biases of CNNs into the proposed transformer. Finally, we exploit a new sequence pooling technique on top of ViT’s encoder to make the model compact and more accurate. These modifications overcome the difficulties of ViT training and eliminate both the need for a class token and the heavy reliance on positional embeddings. We validate our approach through extensive experiments on the FLIR-SEEK dataset. Overall, we achieve a \(3\%\) improvement in classification accuracy compared to the conventional ViT while relying on only \(14\%\) of ViT’s parameters.
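As a concrete illustration of the third adaptation, the sketch below shows one common form of sequence pooling: a learned, attention-weighted average of the encoder’s output tokens replaces the class token before classification. This is a minimal PyTorch sketch under stated assumptions; the module names, dimensions, and classifier head are illustrative, not the authors’ exact implementation.

```python
# Minimal sketch (assumption, not the authors' code): sequence pooling
# replaces the class token with a learned attention-weighted average
# of the encoder's output tokens.
import torch
import torch.nn as nn


class SeqPool(nn.Module):
    """Pool a token sequence (B, N, D) into a single vector (B, D)."""

    def __init__(self, embed_dim: int):
        super().__init__()
        self.attn = nn.Linear(embed_dim, 1)  # one importance score per token

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.attn(tokens), dim=1)  # (B, N, 1)
        return (weights * tokens).sum(dim=1)               # (B, D)


class ClassifierHead(nn.Module):
    """Sequence pooling followed by a linear classifier (illustrative)."""

    def __init__(self, embed_dim: int, num_classes: int):
        super().__init__()
        self.pool = SeqPool(embed_dim)
        self.fc = nn.Linear(embed_dim, num_classes)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        return self.fc(self.pool(tokens))


if __name__ == "__main__":
    encoder_tokens = torch.randn(8, 196, 256)  # dummy encoder output
    head = ClassifierHead(embed_dim=256, num_classes=6)
    print(head(encoder_tokens).shape)  # torch.Size([8, 6])
```

Because the pooled vector is a weighted combination of all output tokens, the classifier can emphasize informative spatial regions without carrying an extra class token through every encoder layer, which is consistent with the compactness claim above.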

Metadata
Title
Compact Image Transformer Based on Convolutional Variational Autoencoder with Augmented Attention Backbone for Target Recognition in Infrared Images
Authors
Billel Nebili
Atmane Khellal
Abdelkrim Nemra
Said Yacine Boulahia
Laurent Mascarilla
Publication date
16-06-2023
Publisher
Springer Berlin Heidelberg
Published in
Arabian Journal for Science and Engineering / Issue 3/2024
Print ISSN: 2193-567X
Electronic ISSN: 2191-4281
DOI
https://doi.org/10.1007/s13369-023-08012-3
