Published in: Arabian Journal for Science and Engineering 3/2024

16.06.2023 | Research Article-Computer Engineering and Computer Science

Compact Image Transformer Based on Convolutional Variational Autoencoder with Augmented Attention Backbone for Target Recognition in Infrared Images

Authors: Billel Nebili, Atmane Khellal, Abdelkrim Nemra, Said Yacine Boulahia, Laurent Mascarilla


Abstract

Recently, the Vision Transformer (ViT) has become a relevant alternative to convolutional neural networks (CNNs) for image classification tasks. However, ViT typically requires pre-training on large datasets, making it unsuitable for scientific fields such as infrared imaging, where the amount of training data is limited. To address this, we propose a Compact image Transformer based on a convolutional variational Autoencoder with an Augmented attention backbone (referred to as AA-CiT) for target recognition in infrared images, which learns efficiently from scratch even on small datasets. This is achieved through three main adaptations of the original ViT architecture, introducing convolutions into different parts of the model to benefit fully from both paradigms: attention and convolution. First, we improve the tokenization step with a new module based on a local convolutional variational autoencoder. Second, convolutional features are incorporated into ViT's encoder, injecting some of the inductive bias of CNNs into the proposed transformer. Finally, we exploit a recent sequence pooling technique on top of ViT's encoder to make the model compact and more accurate. These modifications overcome the difficulties of training ViT and eliminate both the need for a class token and the heavy reliance on positional embeddings. We validate our approach through extensive experiments on the FLIR-SEEK dataset. Overall, we achieve a \(3\%\) improvement in classification accuracy over the conventional ViT while using only \(14\%\) of its parameters.
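To make the sequence pooling step concrete, below is a minimal PyTorch sketch of attention-based sequence pooling in the spirit of compact transformer designs: a learned softmax weighting collapses the encoder's token sequence into one feature vector, removing the need for a class token. The module name, dimensions, and token count here are illustrative assumptions, not the authors' exact design.

    import torch
    import torch.nn as nn

    class SequencePooling(nn.Module):
        """Collapse a token sequence (B, N, D) into one vector (B, D)
        using a learned attention weighting over the N tokens."""

        def __init__(self, embed_dim: int):
            super().__init__()
            # One scalar attention logit per token
            self.attention = nn.Linear(embed_dim, 1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, num_tokens, embed_dim)
            weights = torch.softmax(self.attention(x), dim=1)  # (B, N, 1)
            # Weighted sum over tokens: (B, 1, N) @ (B, N, D) -> (B, D)
            return torch.matmul(weights.transpose(1, 2), x).squeeze(1)

    # Hypothetical usage: pool 64 encoder tokens of width 256 per image
    pool = SequencePooling(embed_dim=256)
    tokens = torch.randn(8, 64, 256)
    features = pool(tokens)  # (8, 256), fed to the classification head

Because the pooled vector is a convex combination of all tokens, the classifier can weight informative patches adaptively rather than relying on a single fixed class token.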

Metadata
Title
Compact Image Transformer Based on Convolutional Variational Autoencoder with Augmented Attention Backbone for Target Recognition in Infrared Images
Authors
Billel Nebili
Atmane Khellal
Abdelkrim Nemra
Said Yacine Boulahia
Laurent Mascarilla
Publication date
16.06.2023
Publisher
Springer Berlin Heidelberg
Published in
Arabian Journal for Science and Engineering / Issue 3/2024
Print ISSN: 2193-567X
Electronic ISSN: 2191-4281
DOI
https://doi.org/10.1007/s13369-023-08012-3
