2024 | OriginalPaper | Chapter

Transformer based Multitask Learning for Image Captioning and Object Detection

Authors: Debolena Basak, P. K. Srijith, Maunendra Sankar Desarkar

Published in: Advances in Knowledge Discovery and Data Mining

Publisher: Springer Nature Singapore

Abstract

In several real-world scenarios, such as autonomous navigation and mobility, image captioning and object detection play a crucial role in building a visual understanding of the surroundings. This work introduces a novel multitask learning framework that combines the two tasks in a joint model. We propose TICOD, a Transformer-based Image Captioning and Object Detection model that trains both tasks jointly by combining the losses of the image captioning and object detection networks. Through this joint training, the model benefits from the complementary information shared between the two tasks, leading to improved image captioning performance. Our approach uses a transformer-based architecture that integrates image captioning and object detection into a single end-to-end network and performs both tasks jointly. We evaluate its effectiveness through comprehensive experiments on the MS-COCO dataset, where our model outperforms baselines from the image captioning literature with a \(3.65\%\) improvement in BERTScore.
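The core idea, a shared backbone whose captioning and detection losses are summed into one training objective, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the module names, the padding convention, and the task-weighting factor lam are all assumptions.

```python
import torch.nn as nn


class JointCaptionDetectionLoss(nn.Module):
    """Sketch of a joint objective: caption loss + detection loss.

    `backbone`, `caption_head`, and `detection_head` are placeholder
    modules standing in for the transformer components the abstract
    describes; they are not the authors' actual architecture.
    """

    def __init__(self, backbone, caption_head, detection_head, lam=1.0):
        super().__init__()
        self.backbone = backbone
        self.caption_head = caption_head
        self.detection_head = detection_head
        self.lam = lam  # assumed task-weighting hyperparameter
        # Assumes token id 0 is padding; adjust for the real vocabulary.
        self.xent = nn.CrossEntropyLoss(ignore_index=0)

    def forward(self, images, caption_tokens, detection_targets):
        feats = self.backbone(images)  # shared image features

        # Captioning branch: teacher-forced next-token prediction,
        # scored with cross-entropy against the shifted caption.
        logits = self.caption_head(feats, caption_tokens[:, :-1])
        l_cap = self.xent(
            logits.reshape(-1, logits.size(-1)),
            caption_tokens[:, 1:].reshape(-1),
        )

        # Detection branch: assume the head computes its own loss
        # (classification + box regression), torchvision-style.
        l_det = self.detection_head(feats, detection_targets)

        # Joint objective from the abstract: combine the two task
        # losses (the weighting scheme here is our assumption).
        return l_cap + self.lam * l_det
```

In practice, lam would be tuned on validation data so that neither task dominates; the abstract states only that the two losses are combined, not how they are weighted.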

Metadata
Title
Transformer based Multitask Learning for Image Captioning and Object Detection
Authors
Debolena Basak
P. K. Srijith
Maunendra Sankar Desarkar
Copyright Year
2024
Publisher
Springer Nature Singapore
DOI
https://doi.org/10.1007/978-981-97-2253-2_21
