Published in: Multimedia Systems 4/2023

17-04-2023 | Regular Paper

Hierarchical cross-modal contextual attention network for visual grounding

Authors: Xin Xu, Gang Lv, Yining Sun, Yuxia Hu, Fudong Nian

Abstract

This paper explores the task of visual grounding (VG), which aims to localize a region of an image given a sentence query. VG has advanced significantly with Transformer-based frameworks, which capture image and text context without region proposals. However, previous research has rarely explored hierarchical semantics or cross-interactions between the two uni-modal encoders. This paper therefore proposes a Hierarchical Cross-modal Contextual Attention Network (HCCAN) for the VG task. HCCAN comprises a visual-guided text contextual attention module, a text-guided visual contextual attention module, and a Transformer-based multi-modal feature fusion module. This design not only captures intra-modality and inter-modality relationships through self-attention mechanisms but also models the hierarchical semantics of textual and visual content in a common space. Experiments on four standard benchmarks, Flickr30K Entities, RefCOCO, RefCOCO+, and RefCOCOg, demonstrate the effectiveness of the proposed method. The code is publicly available at https://github.com/cutexin66/HCCAN.
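To make the guided-attention idea concrete, the sketch below shows one plausible shape of such a block in PyTorch: features of one modality act as queries attending over the other modality (inter-modality), followed by self-attention that refines the result (intra-modality). This is a minimal illustration under assumed dimensions, not the authors' implementation; the class name, layer layout, and shapes are all hypothetical.

```python
import torch
import torch.nn as nn

class CrossModalContextualAttention(nn.Module):
    """Hypothetical sketch of one guided-attention block:
    cross-attention over the guiding modality, then self-attention
    to refine the fused features, with residual + LayerNorm."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor, guide_feats: torch.Tensor) -> torch.Tensor:
        # Inter-modality: queries from one modality, keys/values from the other.
        ctx, _ = self.cross_attn(query_feats, guide_feats, guide_feats)
        x = self.norm1(query_feats + ctx)
        # Intra-modality: self-attention refines the guided features.
        refined, _ = self.self_attn(x, x, x)
        return self.norm2(x + refined)

# Toy shapes: one image as 49 patch tokens, one sentence as 12 word tokens.
visual = torch.randn(1, 49, 256)
text = torch.randn(1, 12, 256)
block = CrossModalContextualAttention()
out = block(visual, text)  # text-guided visual features, shape (1, 49, 256)
```

In a full model of this kind, a symmetric block with the roles of `visual` and `text` swapped would produce visual-guided text features, and the two streams would then feed a Transformer fusion module.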


Metadata
Publisher
Springer Berlin Heidelberg
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-023-01097-8
