
01-06-2022 | Regular Paper

Closed-loop reasoning with graph-aware dense interaction for visual dialog

Authors: An-An Liu, Guokai Zhang, Ning Xu, Junbo Guo, Guoqing Jin, Xuanya Li

Published in: Multimedia Systems | Issue 5/2022


Abstract

Visual dialog is an attractive vision-language task in which a model must predict the correct answer given a question, the dialog history, and an image. Although researchers have offered diversified solutions for connecting text with vision, multi-modal information still interacts inadequately for semantic alignment. To address this problem, we propose closed-loop reasoning with graph-aware dense interaction, which aims to discover cues through the dynamic structure of a graph and leverage them to enrich both dialog and image features. Moreover, we analyze the statistics of the linguistic entities hidden in the dialog to verify the reliability of the graph construction. Experiments on two VisDial datasets indicate that our model achieves competitive results against previous methods. An ablation study and parameter analysis further demonstrate the effectiveness of our model.
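
The abstract describes a graph built over dialog entities whose dynamic structure guides dense interaction with image features, iterated in a closed loop so that refined dialog features re-attend to the image. The following is a minimal illustrative sketch of what one such reasoning round could look like in PyTorch; the class name GraphDenseInteraction, the projection layers, the residual updates, and the number of rounds are all hypothetical choices for exposition, not the authors' implementation.

```python
# A minimal, hypothetical sketch of graph-aware dense interaction for
# visual dialog (PyTorch). All names and design choices are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphDenseInteraction(nn.Module):
    """One reasoning round: message passing over a dynamic dialog-entity
    graph, then dense cross-modal attention from graph nodes to image regions."""
    def __init__(self, dim: int):
        super().__init__()
        self.edge_proj = nn.Linear(dim * 2, 1)  # scores a dynamic adjacency
        self.node_proj = nn.Linear(dim, dim)    # transforms aggregated messages
        self.q_proj = nn.Linear(dim, dim)       # queries from graph nodes
        self.k_proj = nn.Linear(dim, dim)       # keys from image regions
        self.v_proj = nn.Linear(dim, dim)       # values from image regions

    def forward(self, nodes: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # nodes:   (B, N, D) dialog-entity features (question + history)
        # regions: (B, R, D) image-region features (e.g., from a detector)
        B, N, D = nodes.shape
        # 1) Dynamic graph structure: edge weights scored from node pairs.
        pairs = torch.cat([nodes.unsqueeze(2).expand(B, N, N, D),
                           nodes.unsqueeze(1).expand(B, N, N, D)], dim=-1)
        adj = torch.softmax(self.edge_proj(pairs).squeeze(-1), dim=-1)  # (B,N,N)
        # 2) Message passing: each node aggregates its neighbors' features.
        nodes = nodes + F.relu(self.node_proj(adj @ nodes))
        # 3) Dense interaction: graph nodes attend over image regions.
        attn = torch.softmax(
            self.q_proj(nodes) @ self.k_proj(regions).transpose(1, 2) / D ** 0.5,
            dim=-1)                                                     # (B,N,R)
        visual = attn @ self.v_proj(regions)                            # (B,N,D)
        # Updated nodes feed the next round, closing the reasoning loop.
        return nodes + visual

# Closed-loop usage: iterating lets refined dialog features re-attend
# to the image. Three rounds is an arbitrary choice for this sketch.
model = GraphDenseInteraction(dim=512)
nodes = torch.randn(2, 10, 512)    # 10 dialog entities per example
regions = torch.randn(2, 36, 512)  # 36 detected image regions
for _ in range(3):
    nodes = model(nodes, regions)
```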


Metadata
Title
Closed-loop reasoning with graph-aware dense interaction for visual dialog
Authors
An-An Liu
Guokai Zhang
Ning Xu
Junbo Guo
Guoqing Jin
Xuanya Li
Publication date
01-06-2022
Publisher
Springer Berlin Heidelberg
Published in
Multimedia Systems / Issue 5/2022
Print ISSN: 0942-4962
Electronic ISSN: 1432-1882
DOI
https://doi.org/10.1007/s00530-022-00947-1
