2021 | OriginalPaper | Chapter

IQ-VQA: Intelligent Visual Question Answering

Authors: Vatsal Goel, Mohit Chandak, Ashish Anand, Prithwijit Guha

Published in: Pattern Recognition. ICPR International Workshops and Challenges

Publisher: Springer International Publishing

Abstract

Despite tremendous progress in the field of Visual Question Answering, models today still tend to be inconsistent and brittle. We therefore propose a model-independent cyclic framework which increases the consistency and robustness of any VQA architecture. We train our models to answer the original question, generate an implication based on the answer, and then learn to answer the generated implication correctly. As part of the cyclic framework, we propose a novel implication generator which generates implied questions from any question-answer pair. As a baseline for future work on consistency, we provide a new human-annotated VQA-Implications dataset. The dataset consists of 30k implications of three types - Logical Equivalence, Necessary Condition and Mutual Exclusion - made from the VQA validation dataset. We show that our framework improves the consistency of VQA models on both the rule-based dataset and the VQA-Implications dataset, and improves their robustness, without degrading their performance.
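
To make the cyclic framework concrete, below is a minimal training-step sketch assuming a generic PyTorch-style VQA model and implication generator. The interfaces (vqa_model, implication_generator) and the loss weighting are illustrative assumptions, not the authors' released implementation.

    # Hypothetical sketch of the cyclic consistency framework described in the abstract.
    # vqa_model and implication_generator are assumed interfaces, not the authors' code.
    import torch.nn.functional as F

    def cyclic_training_step(vqa_model, implication_generator, image, question, answer_label):
        # 1. Answer the original question.
        answer_logits = vqa_model(image, question)
        vqa_loss = F.cross_entropy(answer_logits, answer_label)

        # 2. Generate an implied question (and its expected answer) from the
        #    original question-answer pair.
        predicted_answer = answer_logits.argmax(dim=-1)
        implied_question, implied_answer = implication_generator(question, predicted_answer)

        # 3. Learn to answer the generated implication correctly, encouraging
        #    consistent answers across related questions.
        implied_logits = vqa_model(image, implied_question)
        consistency_loss = F.cross_entropy(implied_logits, implied_answer)

        # Total loss; the 0.5 weighting is an assumption for illustration only.
        return vqa_loss + 0.5 * consistency_loss

In this sketch the implication generator is treated as producing a single implied question with its expected answer; in the paper it covers the three implication types listed above.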

Metadata
Title
IQ-VQA: Intelligent Visual Question Answering
Authors
Vatsal Goel
Mohit Chandak
Ashish Anand
Prithwijit Guha
Copyright Year
2021
DOI
https://doi.org/10.1007/978-3-030-68790-8_28