Published in: Multimedia Systems 4/2023

02-04-2023 | Regular Paper

Triple-level relationship enhanced transformer for image captioning

Authors: Anqi Zheng, Shiqi Zheng, Cong Bai, Deng Chen


Abstract

Region features and grid features are widely used in image captioning. Because they are typically extracted by different networks, fusing them requires establishing connections between them; however, these connections often rely on simple coordinate alignment, which leaves the generated captions without precise expression of visual relationships. Scene-graph features, by contrast, contain object-relationship information: through multi-layer computation, the extracted relationship information is higher-level and more complete, and can compensate for the shortcomings of region and grid features to some extent. This paper therefore proposes a Triple-Level Relationship Enhanced Transformer (TRET), which processes the three kinds of features in parallel. TRET obtains and combines object-relationship features at different levels, so that the different features complement one another. Specifically, we obtain high-level object-relationship information with the help of Graph Based Attention, and fuse low-level relationship information with high-level object-relationship information with the help of Cross Relationship Enhanced Attention, so as to better align the visual and textual modalities. To validate our model, we conduct comprehensive experiments on the MS-COCO dataset. The results indicate that our method achieves better performance than existing state-of-the-art methods and effectively improves the description of object relationships in the generated captions.
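The abstract does not specify the internal form of the Graph Based Attention or Cross Relationship Enhanced Attention modules, but the core operation underlying such cross-feature fusion is cross-attention: queries from one feature set attend over keys/values from another. The sketch below is a minimal single-head, NumPy-only illustration of that mechanism; the feature dimensions, the function names, and the use of grid features as queries over scene-graph node features are all illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention: each query attends over all keys."""
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)   # (n_q, n_k) similarity matrix
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ values                  # (n_q, d) fused features

rng = np.random.default_rng(0)
grid_feats = rng.normal(size=(49, 64))    # e.g. a 7x7 grid of visual features (hypothetical dims)
graph_feats = rng.normal(size=(12, 64))   # scene-graph node/relation features (hypothetical dims)

# Each grid cell gathers high-level relationship information from the scene graph.
fused = cross_attention(grid_feats, graph_feats, graph_feats)
print(fused.shape)  # (49, 64)
```

In a full transformer encoder this would be multi-head, preceded by learned linear projections of queries, keys, and values, and wrapped in residual connections and layer normalization; the sketch keeps only the attention step to show how one feature level can be enriched with relational information from another.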


Metadata
Title
Triple-level relationship enhanced transformer for image captioning
Authors
Anqi Zheng
Shiqi Zheng
Cong Bai
Deng Chen
Publication date
02-04-2023
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 4/2023
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-023-01073-2
