2018 | Original Paper | Book Chapter

Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question Answering

Authors: Medhini Narasimhan, Alexander G. Schwing

Published in: Computer Vision – ECCV 2018

Publisher: Springer International Publishing

Abstract

Question answering is an important task for autonomous agents and virtual assistants alike, and has been shown to help people with disabilities navigate an overwhelming environment efficiently. Many existing methods focus on observation-based questions, ignoring our ability to seamlessly combine observed content with general knowledge. To study interactions with a knowledge base, a dataset was recently introduced, and keyword matching techniques were shown to yield compelling results despite being vulnerable to misconceptions caused by synonyms and homographs. To address this issue, we develop a learning-based approach which goes straight to the facts via a learned embedding space. We demonstrate state-of-the-art results on the challenging, recently introduced fact-based visual question answering (FVQA) dataset, outperforming competing methods by more than \(5\%\).
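
To give a rough sense of the retrieval idea the abstract describes, the following is a minimal, hypothetical sketch: both the question (with its image) and every knowledge-base fact are mapped into a shared embedding space, and retrieval becomes nearest-neighbour search there rather than keyword matching. This is not the authors' implementation; the toy fact list, the function names, and the random stand-in encoders are all illustrative assumptions.

```python
# Hypothetical sketch of embedding-based fact retrieval (not the paper's code).
# Trained encoders would map a question-image pair near its supporting fact;
# here random unit vectors stand in for those learned mappings.
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # embedding dimensionality (illustrative)

facts = [
    "(cat, CapableOf, climbing trees)",
    "(umbrella, UsedFor, shielding rain)",
    "(bicycle, IsA, vehicle)",
]

def embed_query(question: str, image_features: np.ndarray) -> np.ndarray:
    # Placeholder for a learned encoder (e.g., word vectors + recurrent net
    # fused with CNN image features); inputs are ignored in this sketch.
    h = rng.standard_normal(DIM)
    return h / np.linalg.norm(h)

def embed_fact(fact: str) -> np.ndarray:
    # Placeholder for a learned fact encoder.
    h = rng.standard_normal(DIM)
    return h / np.linalg.norm(h)

# Fact embeddings can be precomputed offline once the encoder is trained.
fact_matrix = np.stack([embed_fact(f) for f in facts])

def retrieve(question: str, image_features: np.ndarray, top_k: int = 1):
    """Return the top_k facts by cosine similarity to the query embedding."""
    q = embed_query(question, image_features)
    scores = fact_matrix @ q  # cosine similarity, since all vectors are unit norm
    best = np.argsort(-scores)[:top_k]
    return [(facts[i], float(scores[i])) for i in best]

print(retrieve("What can the animal in the picture do?", np.zeros(2048)))
```

The point of this design, as the abstract frames it, is that similarity in a trained space is robust where keyword matching is brittle: synonyms (e.g., "umbrella" and "parasol") can land near each other, while homographs are disambiguated by the image context folded into the query embedding.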

Metadata
Title
Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question Answering
Authors
Medhini Narasimhan
Alexander G. Schwing
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-030-01237-3_28