Published in: Multimedia Systems 4/2023

17-04-2023 | Regular Paper

Hierarchical cross-modal contextual attention network for visual grounding

Authors: Xin Xu, Gang Lv, Yining Sun, Yuxia Hu, Fudong Nian

Abstract

This paper explores the task of visual grounding (VG), which aims to localize a region of an image given a sentence query. VG has advanced significantly with Transformer-based frameworks, which capture image and text context without region proposals. However, previous research has rarely explored hierarchical semantics or cross-interactions between the two uni-modal encoders. This paper therefore proposes a Hierarchical Cross-modal Contextual Attention Network (HCCAN) for the VG task. HCCAN comprises a visual-guided text contextual attention module, a text-guided visual contextual attention module, and a Transformer-based multi-modal feature fusion module. This design not only captures intra-modality and inter-modality relationships through self-attention mechanisms but also models the hierarchical semantics of textual and visual content in a common space. Experiments on four standard benchmarks, Flickr30K Entities, RefCOCO, RefCOCO+, and RefCOCOg, demonstrate the effectiveness of the proposed method. The code is publicly available at https://github.com/cutexin66/HCCAN.
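To make the guided-attention idea concrete, the sketch below shows one plausible shape of such a block in PyTorch: features of one modality act as queries attending over the other modality (inter-modality), followed by self-attention that refines the result (intra-modality). This is a minimal illustration under assumed dimensions, not the authors' implementation; the class name, layer layout, and shapes are all hypothetical.

```python
import torch
import torch.nn as nn

class CrossModalContextualAttention(nn.Module):
    """Hypothetical sketch of one guided-attention block:
    cross-attention over the guiding modality, then self-attention
    to refine the fused features, with residual + LayerNorm."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor, guide_feats: torch.Tensor) -> torch.Tensor:
        # Inter-modality: queries from one modality, keys/values from the other.
        ctx, _ = self.cross_attn(query_feats, guide_feats, guide_feats)
        x = self.norm1(query_feats + ctx)
        # Intra-modality: self-attention refines the guided features.
        refined, _ = self.self_attn(x, x, x)
        return self.norm2(x + refined)

# Toy shapes: one image as 49 patch tokens, one sentence as 12 word tokens.
visual = torch.randn(1, 49, 256)
text = torch.randn(1, 12, 256)
block = CrossModalContextualAttention()
out = block(visual, text)  # text-guided visual features, shape (1, 49, 256)
```

In a full model of this kind, a symmetric block with the roles of `visual` and `text` swapped would produce visual-guided text features, and the two streams would then feed a Transformer fusion module.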


Metadata
Publisher
Springer Berlin Heidelberg
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-023-01097-8
