
01-06-2022 | Regular Paper

Closed-loop reasoning with graph-aware dense interaction for visual dialog

Authors: An-An Liu, Guokai Zhang, Ning Xu, Junbo Guo, Guoqing Jin, Xuanya Li

Published in: Multimedia Systems | Issue 5/2022


Abstract

Visual dialog is an attractive vision-language task in which a model must predict the correct answer given a question, the dialog history, and an image. Although researchers have offered diversified solutions for connecting text with vision, multi-modal information still interacts inadequately for semantic alignment. To address this problem, we propose closed-loop reasoning with graph-aware dense interaction, which aims to discover cues through the dynamic structure of a graph and leverage them to enrich both dialog and image features. Moreover, we analyze the statistics of the linguistic entities hidden in the dialog to verify the reliability of the graph construction. Experiments on two VisDial datasets indicate that our model achieves competitive results against previous methods. An ablation study and parameter analysis further demonstrate the effectiveness of our model.
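
The abstract describes a graph built over dialog entities whose dynamic structure guides dense interaction with image features, iterated in a closed loop so that refined dialog features re-attend to the image. The following is a minimal illustrative sketch of what one such reasoning round could look like in PyTorch; the class name GraphDenseInteraction, the projection layers, the residual updates, and the number of rounds are all hypothetical choices for exposition, not the authors' implementation.

```python
# A minimal, hypothetical sketch of graph-aware dense interaction for
# visual dialog (PyTorch). All names and design choices are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphDenseInteraction(nn.Module):
    """One reasoning round: message passing over a dynamic dialog-entity
    graph, then dense cross-modal attention from graph nodes to image regions."""
    def __init__(self, dim: int):
        super().__init__()
        self.edge_proj = nn.Linear(dim * 2, 1)  # scores a dynamic adjacency
        self.node_proj = nn.Linear(dim, dim)    # transforms aggregated messages
        self.q_proj = nn.Linear(dim, dim)       # queries from graph nodes
        self.k_proj = nn.Linear(dim, dim)       # keys from image regions
        self.v_proj = nn.Linear(dim, dim)       # values from image regions

    def forward(self, nodes: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # nodes:   (B, N, D) dialog-entity features (question + history)
        # regions: (B, R, D) image-region features (e.g., from a detector)
        B, N, D = nodes.shape
        # 1) Dynamic graph structure: edge weights scored from node pairs.
        pairs = torch.cat([nodes.unsqueeze(2).expand(B, N, N, D),
                           nodes.unsqueeze(1).expand(B, N, N, D)], dim=-1)
        adj = torch.softmax(self.edge_proj(pairs).squeeze(-1), dim=-1)  # (B,N,N)
        # 2) Message passing: each node aggregates its neighbors' features.
        nodes = nodes + F.relu(self.node_proj(adj @ nodes))
        # 3) Dense interaction: graph nodes attend over image regions.
        attn = torch.softmax(
            self.q_proj(nodes) @ self.k_proj(regions).transpose(1, 2) / D ** 0.5,
            dim=-1)                                                     # (B,N,R)
        visual = attn @ self.v_proj(regions)                            # (B,N,D)
        # Updated nodes feed the next round, closing the reasoning loop.
        return nodes + visual

# Closed-loop usage: iterating lets refined dialog features re-attend
# to the image. Three rounds is an arbitrary choice for this sketch.
model = GraphDenseInteraction(dim=512)
nodes = torch.randn(2, 10, 512)    # 10 dialog entities per example
regions = torch.randn(2, 36, 512)  # 36 detected image regions
for _ in range(3):
    nodes = model(nodes, regions)
```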


Metadata
Title
Closed-loop reasoning with graph-aware dense interaction for visual dialog
Authors
An-An Liu
Guokai Zhang
Ning Xu
Junbo Guo
Guoqing Jin
Xuanya Li
Publication date
01-06-2022
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 5/2022
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-022-00947-1
