Published in: International Journal of Multimedia Information Retrieval 4/2022

06.10.2022 | Regular Paper

Tri-RAT: optimizing the attention scores for image captioning

Authors: You Yang, Yongzhi An, Juntao Hu, Longyue Pan


Abstract

Attention mechanisms and grid features are widely used in current vision-language tasks such as image captioning. Attention scores are the key factor in the success of the attention mechanism. However, because the Transformer is a hierarchical structure, the connection between the attention scores of different layers is weak. In addition, geometric information is inevitably lost when grid features are flattened before being fed into a Transformer. Bias scores encoding geometric position information should therefore be added to the attention scores. Since the Transformer architecture contains three different kinds of attention modules, we build three independent residual attention paths (RAPs) that propagate the attention scores of the previous layer as a prior for the current attention computation. This operation acts like a residual connection between attention scores, strengthening the inter-layer connection and giving each attention layer a more global view. We then replace the conventional attention module in the encoder with a novel residual attention with relative position module, which combines relative position scores with the attention scores. Because residual attention may increase internal covariate shift, we introduce a residual attention with layer normalization on query vectors module in the decoder to improve the data distribution. Finally, we assemble these modules into our Residual Attention Transformer with three RAPs (Tri-RAT) for the image captioning task. The proposed model is competitive with state-of-the-art models on the MS COCO benchmark, reaching 135.8% CIDEr on the "Karpathy" offline test split and 135.3% CIDEr on the online test server.
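To make the mechanism concrete, here is a minimal PyTorch sketch of a single attention layer combining the three ingredients described in the abstract: a residual attention path over pre-softmax scores, a relative-position bias for the encoder, and layer normalization on query vectors for the decoder. The module name, tensor shapes, and the `norm_query` flag are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualAttention(nn.Module):
    """Hypothetical sketch, not the authors' code: multi-head self-attention
    whose pre-softmax scores receive (a) the previous layer's scores as a
    residual prior (RAP) and (b) an optional relative-position bias."""

    def __init__(self, d_model=512, n_heads=8, norm_query=False):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # decoder variant (assumption): LayerNorm on query vectors
        # to counter the covariate shift introduced by residual attention
        self.q_norm = nn.LayerNorm(d_model) if norm_query else None

    def forward(self, x, rel_pos_bias=None, prev_scores=None):
        # x: (B, N, d_model) flattened grid features
        # rel_pos_bias: (n_heads, N, N) geometric bias scores, or None
        # prev_scores: (B, n_heads, N, N) scores from the previous layer, or None
        B, N, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        if self.q_norm is not None:
            q = self.q_norm(q)
        q = q.view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(B, N, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(B, N, self.n_heads, self.d_head).transpose(1, 2)

        # scaled dot-product attention scores
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        # relative-position bias added to the attention scores (encoder)
        if rel_pos_bias is not None:
            scores = scores + rel_pos_bias.unsqueeze(0)
        # residual attention path: previous layer's scores act as a prior
        if prev_scores is not None:
            scores = scores + prev_scores

        attn = F.softmax(scores, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, -1)
        # return the pre-softmax scores so the next layer can reuse them
        return self.out(out), scores
```

Under these assumptions, stacking such layers and threading the returned `scores` from each layer into the next is what realizes the residual connection between attention scores and gives every layer a view of earlier attention distributions.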


Metadata
Title
Tri-RAT: optimizing the attention scores for image captioning
Authors
You Yang
Yongzhi An
Juntao Hu
Longyue Pan
Publication date
06.10.2022
Publisher
Springer London
Published in
International Journal of Multimedia Information Retrieval / Issue 4/2022
Print ISSN: 2192-6611
Electronic ISSN: 2192-662X
DOI
https://doi.org/10.1007/s13735-022-00260-7
