Published in: International Journal of Multimedia Information Retrieval 4/2022

06.10.2022 | Regular Paper

Tri-RAT: optimizing the attention scores for image captioning

Authors: You Yang, Yongzhi An, Juntao Hu, Longyue Pan


Abstract

Attention mechanisms and grid features are widely used in current vision-language tasks such as image captioning. Attention scores are the key factor in the success of the attention mechanism. However, because the Transformer is a hierarchical structure, the connection between the attention scores of different layers is weak. In addition, geometric information is inevitably lost when grid features are flattened before being fed into a Transformer. Bias scores encoding geometric position information should therefore be added to the attention scores. Since the Transformer architecture contains three different kinds of attention modules, we build three independent residual attention paths (RAPs) that propagate the attention scores of the previous layer as a prior for the current attention computation. This operation acts like a residual connection between attention scores, strengthening the inter-layer connection and giving each attention layer a more global view. We then replace the conventional attention module in the encoder with a novel residual attention with relative position module, which combines relative position scores with the attention scores. Because residual attention may increase internal covariate shift, we introduce a residual attention with layer normalization on query vectors module in the decoder to improve the data distribution. Finally, we assemble these modules into our Residual Attention Transformer with three RAPs (Tri-RAT) for the image captioning task. The proposed model is competitive with state-of-the-art models on the MS COCO benchmark, reaching 135.8% CIDEr on the "Karpathy" offline test split and 135.3% CIDEr on the online test server.
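To make the mechanism concrete, here is a minimal PyTorch sketch of a single attention layer combining the three ingredients described in the abstract: a residual attention path over pre-softmax scores, a relative-position bias for the encoder, and layer normalization on query vectors for the decoder. The module name, tensor shapes, and the `norm_query` flag are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualAttention(nn.Module):
    """Hypothetical sketch, not the authors' code: multi-head self-attention
    whose pre-softmax scores receive (a) the previous layer's scores as a
    residual prior (RAP) and (b) an optional relative-position bias."""

    def __init__(self, d_model=512, n_heads=8, norm_query=False):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # decoder variant (assumption): LayerNorm on query vectors
        # to counter the covariate shift introduced by residual attention
        self.q_norm = nn.LayerNorm(d_model) if norm_query else None

    def forward(self, x, rel_pos_bias=None, prev_scores=None):
        # x: (B, N, d_model) flattened grid features
        # rel_pos_bias: (n_heads, N, N) geometric bias scores, or None
        # prev_scores: (B, n_heads, N, N) scores from the previous layer, or None
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        if self.q_norm is not None:
            q = self.q_norm(q)
        q = q.view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, N, self.n_heads, self.d_head).transpose(1, 2)

        # scaled dot-product attention scores
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        # relative-position bias added to the attention scores (encoder)
        if rel_pos_bias is not None:
            scores = scores + rel_pos_bias.unsqueeze(0)
        # residual attention path: previous layer's scores act as a prior
        if prev_scores is not None:
            scores = scores + prev_scores

        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        # return the pre-softmax scores so the next layer can reuse them
        return self.out(out), scores
```

Under these assumptions, stacking such layers and threading the returned `scores` from each layer into the next is what realizes the residual connection between attention scores and gives every layer a view of earlier attention distributions.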


Metadata
Title
Tri-RAT: optimizing the attention scores for image captioning
Authors
You Yang
Yongzhi An
Juntao Hu
Longyue Pan
Publication date
06.10.2022
Publisher
Springer London
Published in
International Journal of Multimedia Information Retrieval / Issue 4/2022
Print ISSN: 2192-6611
Electronic ISSN: 2192-662X
DOI
https://doi.org/10.1007/s13735-022-00260-7
