
2022 | Book

Visual Question Answering

From Theory to Application

Authors: Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu

Publisher: Springer Nature Singapore

Book series: Advances in Pattern Recognition

About this book

Visual Question Answering (VQA) typically combines a visual input, such as an image or a video, with a natural language question about that input and generates a natural language answer as output. It is by nature a multi-disciplinary research problem, involving computer vision (CV), natural language processing (NLP), knowledge representation and reasoning (KR), etc.

Further, VQA is an ambitious undertaking, as it must overcome the challenges of general image understanding and the question-answering task, as well as the difficulties entailed by using large-scale databases with mixed-quality inputs. However, with the advent of deep learning (DL), driven by advanced techniques in both CV and NLP and the availability of relevant large-scale datasets, we have recently seen enormous strides in VQA, with more systems and promising results emerging.

This book provides a comprehensive overview of VQA, covering fundamental theories, models, datasets, and promising future directions. Given its scope, it can be used as a textbook on computer vision and natural language processing, especially for researchers and students in the area of visual question answering. It also highlights the key models used in VQA.

Table of Contents

Frontmatter
Chapter 1. Introduction
Abstract
Visual question answering (VQA) is a challenging task that has received increasing attention from the computer vision, natural language processing and other AI communities. Given an image and a question in natural language, reasoning over the visual elements of the image and general knowledge is required to infer the correct answer, which may be presented in different formats. In this chapter, we first explain the motivation behind realizing VQA, i.e., the necessity of this new task and the benefits that the artificial intelligence (AI) field can derive from it. Subsequently, we categorize the VQA problem from different perspectives, including data type and task level. Finally, we present an overview and describe the structure of this book.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu

Preliminaries

Frontmatter
Chapter 2. Deep Learning Basics
Abstract
Deep learning basics are essential for the visual question answering task since multimodal information is usually complex and multidimensional. Therefore, in this chapter, we present basic information regarding deep learning, covering the following: (1) neural networks, (2) convolutional neural networks, (3) recurrent neural networks and their variants, (4) encoder/decoder structure, (5) attention mechanism, (6) memory networks, (7) transformer networks and BERT, and (8) graph neural networks.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
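
To make one of the building blocks listed in this chapter concrete, here is a minimal, generic sketch of scaled dot-product attention in PyTorch; the function name and toy dimensions are illustrative and not taken from the book.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    """Generic scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / d_k ** 0.5  # (batch, q_len, k_len)
    weights = F.softmax(scores, dim=-1)                   # attention distribution
    return weights @ value, weights

# Toy example: 2 queries attending over 5 key/value vectors of dimension 8.
q = torch.randn(1, 2, 8)
k = torch.randn(1, 5, 8)
v = torch.randn(1, 5, 8)
context, attn = scaled_dot_product_attention(q, k, v)
print(context.shape, attn.shape)  # torch.Size([1, 2, 8]) torch.Size([1, 2, 5])
```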
Chapter 3. Question Answering (QA) Basics
Abstract
The main objective of the question answering (QA) task is to provide relevant answers to questions posed in natural language, drawing on either a prestructured database or a collection of natural language documents [11]. The basic architecture usually consists of three components: a question processing unit, a document processing unit and an answer processing unit. The question processing unit first analyzes the structure of the given question and transforms it into a meaningful format compatible with the QA domain. The document processing unit generates a dataset or a model that provides information for answer generation. The answer processing unit extracts the answer from the retrieved information and the formatted question. In this chapter, we discuss the QA task from the following aspects: rule-based methods, information retrieval-based methods, neural semantic parsing-based methods and approaches that take knowledge bases into account.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
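
The three-unit architecture described in the abstract can be illustrated with a deliberately simplified toy pipeline; every function name, the keyword heuristic and the two example documents below are hypothetical, included only to show how the units hand information to one another.

```python
# Toy sketch of question processing -> document processing -> answer processing.
DOCUMENTS = [
    "The Eiffel Tower is located in Paris.",
    "The Great Wall of China stretches across northern China.",
]

def process_question(question: str) -> set[str]:
    """Question processing unit: reduce the question to normalized keywords."""
    stop = {"what", "where", "is", "the", "a", "of", "in"}
    return {w.strip("?.,").lower() for w in question.split()} - stop

def process_documents(keywords: set[str]) -> str:
    """Document processing unit: retrieve the document with the largest keyword overlap."""
    return max(DOCUMENTS, key=lambda d: len(keywords & set(d.lower().split())))

def process_answer(document: str, keywords: set[str]) -> str:
    """Answer processing unit: keep content words not already in the question (toy heuristic)."""
    candidates = [w.strip(".") for w in document.split() if w.strip(".").lower() not in keywords]
    return candidates[-1]

question = "Where is the Eiffel Tower?"
kw = process_question(question)
doc = process_documents(kw)
print(process_answer(doc, kw))  # -> "Paris"
```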

Image-Based VQA

Frontmatter
Chapter 4. Classical Visual Question Answering
Abstract
VQA has received considerable attention from both the computer vision and natural language processing research communities in recent years. Given an image and a corresponding question in natural language, a VQA system is required to comprehend the question and find the essential visual elements in the image to predict the correct answer. In this chapter, we first introduce the prevalent datasets for VQA tasks, such as the COCO-QA, VQA v1 and VQA v2 datasets. Subsequently, we present a detailed description of several classical VQA methods, classified into joint embedding methods, attention-based methods, memory networks and compositional methods.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
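
As a rough illustration of the joint embedding family mentioned above, the sketch below encodes the question with an LSTM, projects a precomputed image feature, fuses the two by element-wise product and classifies over a fixed answer vocabulary; all dimensions and the fusion choice are assumptions, not the book's reference implementation.

```python
import torch
import torch.nn as nn

class JointEmbeddingVQA(nn.Module):
    """Minimal joint-embedding baseline: LSTM question encoder, projected image
    feature, element-wise fusion, answer classifier (illustrative dimensions)."""

    def __init__(self, vocab_size=10000, num_answers=3000,
                 img_dim=2048, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, image_feat, question_tokens):
        _, (h, _) = self.lstm(self.embed(question_tokens))  # final hidden state
        q = h[-1]                                            # (batch, hidden_dim)
        v = torch.relu(self.img_proj(image_feat))            # (batch, hidden_dim)
        return self.classifier(q * v)                        # answer logits

model = JointEmbeddingVQA()
logits = model(torch.randn(2, 2048), torch.randint(0, 10000, (2, 14)))
print(logits.shape)  # torch.Size([2, 3000])
```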
Chapter 5. Knowledge-Based VQA
Abstract
Tasks such as VQA often require common sense and factual information in addition to the information learned from a task-specific dataset. Therefore, the knowledge-based VQA task has been established. In this chapter, we first introduce the main datasets proposed for knowledge-based VQA and knowledge bases such as DBpedia and ConceptNet. Subsequently, we classify methods from three aspects: knowledge embedding, question-to-query translation and knowledge base querying.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
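
The question-to-query translation idea can be illustrated with a toy template that maps one question pattern onto a SPARQL query against the public DBpedia endpoint; the endpoint URL, parameters and response handling are assumptions about the public service rather than code from the chapter.

```python
import requests

def capital_of(country: str) -> str:
    """Toy template: 'What is the capital of X?' -> SPARQL query over DBpedia."""
    query = f"""
    SELECT ?capitalLabel WHERE {{
      dbr:{country} dbo:capital ?capital .
      ?capital rdfs:label ?capitalLabel .
      FILTER (lang(?capitalLabel) = 'en')
    }}"""
    resp = requests.get(
        "https://dbpedia.org/sparql",
        params={"query": query, "format": "application/sparql-results+json"},
        timeout=10,
    )
    bindings = resp.json()["results"]["bindings"]
    return bindings[0]["capitalLabel"]["value"] if bindings else "unknown"

# "What is the capital of France?" -> template match -> SPARQL -> answer
print(capital_of("France"))  # expected: "Paris"
```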
Chapter 6. Vision-and-Language Pretraining for VQA
Abstract
Multimodal (e.g., vision-and-language) pretraining has emerged as a popular topic, and many representation learning models have been proposed in recent years. In this chapter, we focus on vision-and-language pretraining models that can be adapted to the VQA task. To this end, we first introduce three general pretraining models, namely ELMo, GPT and BERT, which in their original form consider only the representation of natural language. Subsequently, we describe vision-and-language pretraining models, which can be regarded as extensions of these language-only models. Specifically, we categorize them into two types: single stream and two stream. Finally, we describe how to fine-tune these models for specific downstream tasks, e.g., VQA, visual commonsense reasoning (VCR) and referring expression comprehension (REC).
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
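
The single-stream versus two-stream distinction can be sketched with generic transformer layers: a single-stream model feeds concatenated text and region tokens through one shared encoder, while a two-stream model encodes each modality separately and exchanges information via cross-attention. The code below is a schematic with illustrative dimensions, not the code of any specific published model.

```python
import torch
import torch.nn as nn

d = 256
text_tokens = torch.randn(2, 14, d)    # pretend word embeddings
region_feats = torch.randn(2, 36, d)   # pretend projected image-region features

# Single stream: concatenate both modalities and run one shared encoder.
single_stream = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=2)
fused_single = single_stream(torch.cat([text_tokens, region_feats], dim=1))

# Two stream: encode each modality separately, then exchange information
# with cross-attention (text attends to image regions here).
text_enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=2)
img_enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=2)
cross_attn = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)

t, v = text_enc(text_tokens), img_enc(region_feats)
fused_two, _ = cross_attn(query=t, key=v, value=v)

print(fused_single.shape, fused_two.shape)  # (2, 50, 256) (2, 14, 256)
```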

Video-Based VQA

Frontmatter
Chapter 7. Video Representation Learning
Abstract
Video representation learning generates visual semantic representations from given videos, which is vital for video-related tasks such as human action understanding and video question answering. Video representations can be categorized into handcrafted local features and deep-learned features: handcrafted local features are extracted using manually designed descriptors, whereas deep-learned features are extracted automatically by neural networks. In this chapter, we discuss video representation learning from these two aspects: handcrafted features and features generated by deep architectures.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
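
A minimal sketch of the deep-learned route is to pass a short clip through a 3D CNN and keep the penultimate activation as the clip feature; the snippet below uses torchvision's r3d_18 backbone (untrained here, purely for shape illustration), whereas practical systems would use a backbone pretrained on a large video corpus.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

model = r3d_18()           # 3D ResNet-18 video backbone from torchvision
model.fc = nn.Identity()   # drop the classification head to expose features
model.eval()

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, channels, frames, height, width)
with torch.no_grad():
    feat = model(clip)
print(feat.shape)  # torch.Size([1, 512]) -- one feature vector per clip
```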
Chapter 8. Video Question Answering
Abstract
Video question answering, first introduced in 2014, is more complex than classical visual (static image) question answering. For video question answering, both datasets and models are essential for research. Therefore, in this chapter, we first present the most popular datasets for video question answering, ranging from datasets containing physical objects to those characterizing the real world, and subsequently introduce several models based on the encoder-decoder framework.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
Chapter 9. Advanced Models for Video Question Answering
Abstract
In Chap. 8, we introduced several traditional models for video question answering based on the encoder-decoder framework. However, other models beyond this framework exhibit well-designed architectures and strong performance. In this chapter, we group these methods into four categories, i.e., attention on spatiotemporal features, memory networks, spatiotemporal graph neural networks, and multitask pretraining, and discuss the characteristics of these frameworks.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu

Advanced Topics in VQA

Frontmatter
Chapter 10. Embodied VQA
Abstract
It is a long-standing goal for scientists to develop robots that can perceive their surroundings, communicate with humans in natural language and complete commands as requested. Several sub-tasks have been proposed to approach this goal in a sequential manner: Vision-and-Language Navigation requires an intelligent agent to follow detailed instructions using visual perception, Remote Object Localization gives the agent shorter and more abstract instructions, Embodied QA expects the agent to actively explore the environment and answer inquiries, and Interactive QA requires the agent to actively interact with a virtual environment to answer inquiries. In this chapter, we first briefly introduce the mainstream simulators, datasets and evaluation criteria that benchmark progress in this field, such as Matterport3D, iGibson and Habitat. Subsequently, we describe the motivation, methodology and key performance of several methods corresponding to each sub-task.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
Chapter 11. Medical VQA
Abstract
Inspired by the rise of VQA research in the general domain, the Medical VQA task has received great attention from the computer vision, natural language processing and biomedical research communities in recent years. Given a medical image and a clinically relevant question about the visual elements in the image, a Medical VQA system is required to deeply comprehend both the medical image and the question to predict the correct answer. In this chapter, we first introduce the mainstream datasets used for Medical VQA, such as the VQA-RAD, VQA-Med, PathVQA and SLAKE datasets. Then, we elaborate on the prevalent methods for Medical VQA. These methods can be classified into three categories based on their main characteristics: classical VQA methods, meta-learning methods and BERT-based methods.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
Chapter 12. Text-Based VQA
Abstract
VQA requires reasoning about the visual content of an image. In a large proportion of images, however, visual content is not the only source of information. Text that can be recognized by optical character recognition (OCR) tools provides considerably more useful, high-level semantic information, such as street names, product brands and prices, which is not available in any other form in the scene. Interpreting this written information in human environments is essential for performing most everyday tasks, such as making a purchase, using public transportation and finding a location in a city. Hence, the new TextVQA task has been proposed. In this chapter, we briefly introduce the main datasets that benchmark progress in this field, including TextVQA [29], ST-VQA [2] and OCR-VQA [25]. Subsequently, we describe an important tool (OCR) that is a prerequisite for the reasoning process, as text must first be recognized. Next, we select three representative and effective models for this problem and describe them sequentially.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
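
The OCR step that precedes reasoning can be sketched with an off-the-shelf tool such as pytesseract; the image path below is hypothetical, and the surveyed TextVQA systems rely on their own OCR modules rather than this particular tool.

```python
from PIL import Image
import pytesseract

def extract_ocr_tokens(image_path: str) -> list[str]:
    """Run OCR on an image and return the recognized word tokens."""
    text = pytesseract.image_to_string(Image.open(image_path))
    return [tok for tok in text.split() if tok.strip()]

question = "What is the name of the street?"
ocr_tokens = extract_ocr_tokens("street_scene.jpg")  # hypothetical local image
# A TextVQA model would now reason jointly over the question, visual features,
# and these OCR tokens, often copying an OCR token directly as the answer.
print(ocr_tokens[:10])
```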
Chapter 13. Visual Question Generation
Abstract
To explore how questions about images are posed and to abstract the events caused by objects in an image, the visual question generation (VQG) task has been established. In this chapter, we classify VQG methods according to whether their objective is data augmentation or visual understanding.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
Chapter 14. Visual Dialogue
Abstract
Visual dialogue is an important and complicated vision-and-language task that processes the visual features of images and the textual features of captions, questions and dialogue histories to answer questions. To accomplish this task, a machine must exhibit the abilities of perception, multimodal reasoning, relationship mining and visual coreference resolution. In this chapter, we briefly describe the challenges associated with this task and introduce the two main benchmarks. Subsequently, we present a comprehensive review of the associated methods, classified into four categories.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
Chapter 15. Referring Expression Comprehension
Abstract
Referring expression comprehension (REC) aims to localize objects in images based on natural language queries. In contrast to the object detection task, in which the queried object labels are predefined, in REC the queries are only observed at test time. REC is challenging because it requires a comprehensive understanding of complicated natural language and various types of visual information. In this chapter, we first describe the task and subsequently introduce prevalent datasets proposed for REC, such as the RefCOCO, RefCOCO+ and RefCOCOg datasets. Finally, we classify the methods in the REC domain into three main categories: two-stage models, one-stage models and reasoning process comprehension.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
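
The core of the two-stage family can be sketched as ranking precomputed region proposals by their similarity to the encoded referring expression; the dimensions and cosine-similarity scoring below are illustrative assumptions rather than a specific published model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProposalRanker(nn.Module):
    """Score region-proposal features against a query embedding and pick the best box."""

    def __init__(self, region_dim=2048, query_dim=300, joint_dim=512):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, joint_dim)
        self.query_proj = nn.Linear(query_dim, joint_dim)

    def forward(self, region_feats, query_feat):
        r = F.normalize(self.region_proj(region_feats), dim=-1)  # (num_regions, joint_dim)
        q = F.normalize(self.query_proj(query_feat), dim=-1)     # (joint_dim,)
        scores = r @ q                                            # cosine similarity per region
        return scores.argmax(), scores

ranker = ProposalRanker()
regions = torch.randn(36, 2048)   # features for 36 region proposals
expression = torch.randn(300)     # pooled embedding of "the red mug on the left"
best_idx, scores = ranker(regions, expression)
print(int(best_idx), scores.shape)  # index of the chosen region, torch.Size([36])
```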

Summary and Outlook

Frontmatter
Chapter 16. Summary and Outlook
Abstract
Visual question answering is a significant topic in current AI research and has been linked to many applications, such as AI assistants and dialog systems. As a cross-disciplinary task, it has attracted considerable attention from researchers in different communities, such as computer vision and natural language processing. VQA is a typical cross-modal task, since it requires machines to simultaneously understand visual content (images and videos) and natural language and, in certain cases, common sense knowledge. Nevertheless, certain challenges must be addressed before artificial general intelligence can be realized.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
Backmatter
Metadata
Title
Visual Question Answering
Authors
Qi Wu
Peng Wang
Xin Wang
Xiaodong He
Wenwu Zhu
Copyright year
2022
Publisher
Springer Nature Singapore
Electronic ISBN
978-981-19-0964-1
Print ISBN
978-981-19-0963-4
DOI
https://doi.org/10.1007/978-981-19-0964-1