2021 | OriginalPaper | Chapter

Fusion Models for Improved Image Captioning

Authors: Marimuthu Kalimuthu, Aditya Mogadala, Marius Mosbach, Dietrich Klakow

Published in: Pattern Recognition. ICPR International Workshops and Challenges

Publisher: Springer International Publishing


Abstract

Visual captioning aims to generate textual descriptions for images or videos. Traditionally, image captioning models are trained on human-annotated datasets such as Flickr30k and MS-COCO, which are limited in size and diversity. This limitation hinders the generalization capabilities of these models and makes them prone to errors. Language models, however, can be trained on vast amounts of freely available unlabelled data and have recently emerged as successful language encoders [10] and coherent text generators [4]. Meanwhile, several unimodal and multimodal fusion techniques have proven to work well for natural language generation [11] and automatic speech recognition [30]. Building on these recent developments, and with the aim of improving the quality of generated captions, the contribution of this work is twofold: first, we propose a generic multimodal model fusion framework for caption generation and emendation, in which we use different fusion strategies to integrate a pretrained Auxiliary Language Model (AuxLM) into a traditional encoder-decoder visual captioning framework. Second, we employ the same fusion strategies to integrate a pretrained Masked Language Model (MLM), namely BERT, with a visual captioning model, viz. Show, Attend, and Tell, to emend both syntactic and semantic errors in captions. Our caption emendation experiments on three benchmark image captioning datasets, viz. Flickr8k, Flickr30k, and MS-COCO, show improvements over the baseline, indicating the usefulness of the proposed multimodal fusion strategies. Further, we perform a preliminary qualitative analysis of the emended captions and identify error categories based on the type of corrections.
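The kind of fusion the abstract describes can be sketched as a cold-fusion-style gate (after Sriram et al. [30]) that combines the caption decoder's hidden state with a pretrained AuxLM's hidden state before predicting the next word. The class name, dimensions, and wiring below are illustrative assumptions for intuition, not the authors' released code.

```python
import torch
import torch.nn as nn

class ColdFusionGate(nn.Module):
    """Gated fusion of a caption decoder's hidden state with a pretrained
    AuxLM's hidden state (cold-fusion style). Dimensions are illustrative."""

    def __init__(self, dec_dim: int, lm_dim: int, vocab_size: int):
        super().__init__()
        self.lm_proj = nn.Linear(lm_dim, dec_dim)           # project AuxLM state
        self.gate = nn.Linear(dec_dim + dec_dim, dec_dim)   # gate over LM features
        self.out = nn.Linear(dec_dim + dec_dim, vocab_size) # vocabulary logits

    def forward(self, dec_h: torch.Tensor, lm_h: torch.Tensor) -> torch.Tensor:
        lm_feat = torch.tanh(self.lm_proj(lm_h))
        # The gate, conditioned on both states, decides how much LM signal to keep.
        g = torch.sigmoid(self.gate(torch.cat([dec_h, lm_feat], dim=-1)))
        fused = torch.cat([dec_h, g * lm_feat], dim=-1)
        return self.out(fused)
```

At each decoding step the captioning model would pass its current hidden state and the AuxLM's state through this module instead of a plain output projection, letting the gate suppress the LM when the visual evidence disagrees with it.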
Literature
2. Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pp. 65–72. Association for Computational Linguistics (2005)
3. Bengio, Y., Ducharme, R., Vincent, P., Janvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003)
5. Callison-Burch, C., Osborne, M., Koehn, P.: Re-evaluating the role of Bleu in machine translation research. In: 11th Conference of the EACL, Trento, Italy (2006)
6. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. CoRR abs/2005.12872 (2020)
7. Cho, J., et al.: Language model integration based on memory control for sequence to sequence speech recognition. In: ICASSP, Brighton, United Kingdom, pp. 6191–6195 (2019)
8. Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of EMNLP, pp. 1724–1734 (2014)
9. Dauphin, Y.N., Fan, A., Auli, M., Grangier, D.: Language modeling with gated convolutional networks. In: Proceedings of the 34th ICML, pp. 933–941 (2017)
11. Fan, A., Lewis, M., Dauphin, Y.: Hierarchical neural story generation. In: Proceedings of the 56th Annual Meeting of the ACL, Melbourne, pp. 889–898 (2018)
13. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on CVPR, pp. 770–778 (2016)
14. Kalimuthu, M., Nunnari, F., Sonntag, D.: A competitive deep neural network approach for the ImageCLEFmed caption 2020 task. In: Working Notes of CLEF 2020, Thessaloniki. CEUR Workshop Proceedings, vol. 2696. CEUR-WS.org (2020)
15. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR, San Diego (2015)
16. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: NIPS 2012, Lake Tahoe, pp. 1106–1114 (2012)
17. Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out, pp. 74–81. Association for Computational Linguistics (2004)
18. Lu, X., Wang, B., Zheng, X., Li, X.: Exploring models and data for remote sensing image caption generation. Trans. Geosci. Remote Sens. 56, 2183–2195 (2017)
19. Merity, S., Keskar, N.S., Socher, R.: Regularizing and optimizing LSTM language models. In: Proceedings of ICLR 2018, Vancouver, Conference Track Proceedings (2018)
20. Mikolov, T., Karafiát, M., Burget, L., Cernocký, J., Khudanpur, S.: Recurrent neural network based language model. In: INTERSPEECH, pp. 1045–1048 (2010)
23. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of ACL, pp. 311–318 (2002)
24. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: NeurIPS 2019, pp. 8026–8037 (2019)
25. Pelka, O., Friedrich, C.M., García Seco de Herrera, A., Müller, H.: Overview of the ImageCLEFmed 2020 concept prediction task: medical image understanding. In: CLEF 2020 Working Notes, CEUR Workshop Proceedings, vol. 2696 (2020)
26. Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)
27. Rohrbach, A., Hendricks, L.A., Burns, K., Darrell, T., Saenko, K.: Object hallucination in image captioning. In: Proceedings of EMNLP, Brussels, pp. 4035–4045 (2018)
28. Sammani, F., Elsayed, M.: Look and modify: modification networks for image captioning. In: 30th BMVC 2019, Cardiff, UK, p. 75. BMVA Press (2019)
29. Sammani, F., Melas-Kyriazi, L.: Show, edit and tell: a framework for editing image captions. In: Proceedings of CVPR 2020, Seattle, pp. 4808–4816. IEEE (2020)
30. Sriram, A., Jun, H., Satheesh, S., Coates, A.: Cold fusion: training seq2seq models together with language models. In: Proceedings of Interspeech 2018, pp. 387–391 (2018)
31. Vaswani, A., et al.: Attention is all you need. In: NIPS 2017, pp. 5998–6008 (2017)
32. Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: consensus-based image description evaluation. In: Proceedings of CVPR 2015, pp. 4566–4575. IEEE CS (2015)
33. Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: CVPR, pp. 3156–3164. IEEE Computer Society (2015)
35. Xu, K., et al.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057 (2015)
Metadata
Title
Fusion Models for Improved Image Captioning
Authors
Marimuthu Kalimuthu
Aditya Mogadala
Marius Mosbach
Dietrich Klakow
Copyright Year
2021
DOI
https://doi.org/10.1007/978-3-030-68780-9_32