
2022 | Book

Visual Question Answering

From Theory to Application

Authors: Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu

Publisher: Springer Nature Singapore

Book series: Advances in Pattern Recognition

About this book

Visual Question Answering (VQA) typically combines a visual input, such as an image or a video, with a natural language question about that input and generates a natural language answer as output. It is by nature a multi-disciplinary research problem, involving computer vision (CV), natural language processing (NLP), knowledge representation and reasoning (KR), etc.

Further, VQA is an ambitious undertaking, as it must overcome the challenges of general image understanding and the question-answering task, as well as the difficulties entailed by using large-scale databases with mixed-quality inputs. However, with the advent of deep learning (DL), driven by advanced techniques in both CV and NLP and the availability of relevant large-scale datasets, we have recently seen enormous strides in VQA, with more systems and promising results emerging.

This book provides a comprehensive overview of VQA, covering fundamental theories, models, datasets, and promising future directions. Given its scope, it can be used as a textbook on computer vision and natural language processing, especially for researchers and students in the area of visual question answering. It also highlights the key models used in VQA.

Table of Contents

Frontmatter
Chapter 1. Introduction
Abstract
Visual question answering (VQA) is a challenging task that has received increasing attention from the computer vision, natural language processing and other AI communities. Given an image and a question in natural language, reasoning over the visual elements of the image and general knowledge is required to infer the correct answer, which may be presented in different formats. In this chapter, we first explain the motivation behind realizing VQA, i.e., the necessity of this new task and the benefits that the artificial intelligence (AI) field can derive from it. Subsequently, we categorize the VQA problem from different perspectives, including data type and task level. Finally, we present an overview and describe the structure of this book.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu

Preliminaries

Frontmatter
Chapter 2. Deep Learning Basics
Abstract
Deep learning basics are essential for the visual question answering task since multimodal information is usually complex and multidimensional. Therefore, in this chapter, we present basic information regarding deep learning, covering the following: (1) neural networks, (2) convolutional neural networks, (3) recurrent neural networks and their variants, (4) encoder/decoder structure, (5) attention mechanism, (6) memory networks, (7) transformer networks and BERT, and (8) graph neural networks.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
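
To make one of the building blocks listed in this chapter concrete, here is a minimal, generic sketch of scaled dot-product attention in PyTorch; the function name and toy dimensions are illustrative and not taken from the book.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    """Generic scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / d_k ** 0.5  # (batch, q_len, k_len)
    weights = F.softmax(scores, dim=-1)                   # attention distribution
    return weights @ value, weights

# Toy example: 2 queries attending over 5 key/value vectors of dimension 8.
q = torch.randn(1, 2, 8)
k = torch.randn(1, 5, 8)
v = torch.randn(1, 5, 8)
context, attn = scaled_dot_product_attention(q, k, v)
print(context.shape, attn.shape)  # torch.Size([1, 2, 8]) torch.Size([1, 2, 5])
```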
Chapter 3. Question Answering (QA) Basics
Abstract
The main objective of the question answering (QA) task is to provide relevant answers to questions posed in natural language, drawing on either a prestructured database or a collection of natural language documents [11]. The basic architecture usually consists of three components: a question processing unit, a document processing unit and an answer processing unit. The question processing unit first analyzes the structure of the given question and transforms it into a meaningful format compatible with the QA domain. The document processing unit generates a dataset or a model that provides information for answer generation. The answer processing unit extracts the answer from the retrieved information and the formatted question. In this chapter, we discuss the QA task from the following aspects: rule-based methods, information retrieval-based methods, neural semantic parsing-based methods and approaches that take knowledge bases into account.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
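
The three-unit architecture described in the abstract can be illustrated with a deliberately simplified toy pipeline; every function name, the keyword heuristic and the two example documents below are hypothetical, included only to show how the units hand information to one another.

```python
# Toy sketch of question processing -> document processing -> answer processing.
DOCUMENTS = [
    "The Eiffel Tower is located in Paris.",
    "The Great Wall of China stretches across northern China.",
]

def process_question(question: str) -> set[str]:
    """Question processing unit: reduce the question to normalized keywords."""
    stop = {"what", "where", "is", "the", "a", "of", "in"}
    return {w.strip("?.,").lower() for w in question.split()} - stop

def process_documents(keywords: set[str]) -> str:
    """Document processing unit: retrieve the document with the largest keyword overlap."""
    return max(DOCUMENTS, key=lambda d: len(keywords & set(d.lower().split())))

def process_answer(document: str, keywords: set[str]) -> str:
    """Answer processing unit: keep content words not already in the question (toy heuristic)."""
    candidates = [w.strip(".") for w in document.split() if w.strip(".").lower() not in keywords]
    return candidates[-1]

question = "Where is the Eiffel Tower?"
kw = process_question(question)
doc = process_documents(kw)
print(process_answer(doc, kw))  # -> "Paris"
```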

Image-Based VQA

Frontmatter
Chapter 4. Classical Visual Question Answering
Abstract
VQA has received considerable attention from both the computer vision and natural language processing research communities in recent years. Given an image and a corresponding question in natural language, a VQA system is required to comprehend the question and find the essential visual elements in the image to predict the correct answer. In this chapter, we first introduce the prevalent datasets for VQA tasks, such as the COCO-QA, VQA v1 and VQA v2 datasets. Subsequently, we present a detailed description of several classical VQA methods, classified into joint embedding methods, attention-based methods, memory networks and compositional methods.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
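
As a rough illustration of the joint embedding family mentioned above, the sketch below encodes the question with an LSTM, projects a precomputed image feature, fuses the two by element-wise product and classifies over a fixed answer vocabulary; all dimensions and the fusion choice are assumptions, not the book's reference implementation.

```python
import torch
import torch.nn as nn

class JointEmbeddingVQA(nn.Module):
    """Minimal joint-embedding baseline: LSTM question encoder, projected image
    feature, element-wise fusion, answer classifier (illustrative dimensions)."""

    def __init__(self, vocab_size=10000, num_answers=3000,
                 img_dim=2048, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.classifier = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_answers),
        )

    def forward(self, image_feat, question_tokens):
        _, (h, _) = self.lstm(self.embed(question_tokens))  # final hidden state
        q = h[-1]                                            # (batch, hidden_dim)
        v = torch.relu(self.img_proj(image_feat))            # (batch, hidden_dim)
        return self.classifier(q * v)                        # answer logits

model = JointEmbeddingVQA()
logits = model(torch.randn(2, 2048), torch.randint(0, 10000, (2, 14)))
print(logits.shape)  # torch.Size([2, 3000])
```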
Chapter 5. Knowledge-Based VQA
Abstract
Tasks such as VQA often require common sense and factual information in addition to the information learned from a task-specific dataset. Therefore, the knowledge-based VQA task has been established. In this chapter, we first introduce the main datasets proposed for knowledge-based VQA and knowledge bases such as DBpedia and ConceptNet. Subsequently, we classify methods from three aspects: knowledge embedding, question-to-query translation and knowledge base querying.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
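
The question-to-query translation idea can be illustrated with a toy template that maps one question pattern onto a SPARQL query against the public DBpedia endpoint; the endpoint URL, parameters and response handling are assumptions about the public service rather than code from the chapter.

```python
import requests

def capital_of(country: str) -> str:
    """Toy template: 'What is the capital of X?' -> SPARQL query over DBpedia."""
    query = f"""
    SELECT ?capitalLabel WHERE {{
      dbr:{country} dbo:capital ?capital .
      ?capital rdfs:label ?capitalLabel .
      FILTER (lang(?capitalLabel) = 'en')
    }}"""
    resp = requests.get(
        "https://dbpedia.org/sparql",
        params={"query": query, "format": "application/sparql-results+json"},
        timeout=10,
    )
    bindings = resp.json()["results"]["bindings"]
    return bindings[0]["capitalLabel"]["value"] if bindings else "unknown"

# "What is the capital of France?" -> template match -> SPARQL -> answer
print(capital_of("France"))  # expected: "Paris"
```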
Chapter 6. Vision-and-Language Pretraining for VQA
Abstract
Multimodal (e.g., vision-and-language) pretraining has emerged as a popular topic, and many representation learning models have been proposed in recent years. In this chapter, we focus on vision-and-language pretraining models that can be adapted to the VQA task. To this end, we first introduce three general pretraining models, namely ELMo, GPT and BERT, which in their original form consider only the representation of natural language. Subsequently, we describe vision-and-language pretraining models, which can be regarded as extensions of these language-only models. Specifically, we categorize them into two types: single stream and two stream. Finally, we describe how to fine-tune these models for specific downstream tasks, e.g., VQA, visual commonsense reasoning (VCR) and referring expression comprehension (REC).
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
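
The single-stream versus two-stream distinction can be sketched with generic transformer layers: a single-stream model feeds concatenated text and region tokens through one shared encoder, while a two-stream model encodes each modality separately and exchanges information via cross-attention. The code below is a schematic with illustrative dimensions, not the code of any specific published model.

```python
import torch
import torch.nn as nn

d = 256
text_tokens = torch.randn(2, 14, d)    # pretend word embeddings
region_feats = torch.randn(2, 36, d)   # pretend projected image-region features

# Single stream: concatenate both modalities and run one shared encoder.
single_stream = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=2)
fused_single = single_stream(torch.cat([text_tokens, region_feats], dim=1))

# Two stream: encode each modality separately, then exchange information
# with cross-attention (text attends to image regions here).
text_enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=2)
img_enc = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d, nhead=8, batch_first=True), num_layers=2)
cross_attn = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)

t, v = text_enc(text_tokens), img_enc(region_feats)
fused_two, _ = cross_attn(query=t, key=v, value=v)

print(fused_single.shape, fused_two.shape)  # (2, 50, 256) (2, 14, 256)
```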

Video-Based VQA

Frontmatter
Chapter 7. Video Representation Learning
Abstract
Video representation learning generates visual semantic representations from given videos, which is vital for video-related tasks such as human action understanding and video question answering. Video representations can be categorized into handcrafted local features and deep-learned features: handcrafted local features are extracted using manually designed descriptors, whereas deep-learned features are extracted automatically by neural networks. In this chapter, we discuss video representation learning from these two aspects: handcrafted features and features generated by deep architectures.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
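
A minimal sketch of the deep-learned route is to pass a short clip through a 3D CNN and keep the penultimate activation as the clip feature; the snippet below uses torchvision's r3d_18 backbone (untrained here, purely for shape illustration), whereas practical systems would use a backbone pretrained on a large video corpus.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

model = r3d_18()           # 3D ResNet-18 video backbone from torchvision
model.fc = nn.Identity()   # drop the classification head to expose features
model.eval()

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, channels, frames, height, width)
with torch.no_grad():
    feat = model(clip)
print(feat.shape)  # torch.Size([1, 512]) -- one feature vector per clip
```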
Chapter 8. Video Question Answering
Abstract
Video question answering, first introduced in 2014, is more complex than classical visual (static image) question answering. For video question answering, both datasets and models are essential for research. Therefore, in this chapter, we first present the most popular datasets for video question answering, ranging from datasets containing physical objects to those characterizing the real world, and subsequently introduce several models based on the encoder-decoder framework.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
Chapter 9. Advanced Models for Video Question Answering
Abstract
In Chap. 8, we introduced several traditional models for video question answering based on the encoder-decoder framework. However, other models beyond this framework exhibit well-designed architectures and strong performance. In this chapter, we group these methods into four categories, i.e., attention on spatiotemporal features, memory networks, spatiotemporal graph neural networks, and multitask pretraining, and discuss the characteristics of these frameworks.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu

Advanced Topics in VQA

Frontmatter
Chapter 10. Embodied VQA
Abstract
It is a long-standing goal for scientists to develop robots that can perceive their surroundings, communicate with humans in natural language and complete commands as requested. Several sub-tasks have been proposed to approach this goal in a sequential manner: Vision-and-Language Navigation requires an intelligent agent to follow detailed instructions using visual perception, Remote Object Localization gives the agent shorter and more abstract instructions, Embodied QA expects the agent to actively explore the environment and answer inquiries, and Interactive QA requires the agent to actively interact with a virtual environment to answer inquiries. In this chapter, we first briefly introduce the mainstream simulators, datasets and evaluation criteria that benchmark progress in this field, such as Matterport3D, iGibson and Habitat. Subsequently, we describe the motivation, methodology and key performance of several methods corresponding to each sub-task.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
Chapter 11. Medical VQA
Abstract
Inspired by the rise of VQA research in the general domain, the Medical VQA task has received great attention from the computer vision, natural language processing and biomedical research communities in recent years. Given a medical image and a clinically relevant question about the visual elements in the image, a Medical VQA system is required to deeply comprehend both the medical image and the question to predict the correct answer. In this chapter, we first introduce the mainstream datasets used for Medical VQA, such as the VQA-RAD, VQA-Med, PathVQA and SLAKE datasets. Then, we elaborate on the prevalent methods for Medical VQA. These methods can be classified into three categories based on their main characteristics: classical VQA methods, meta-learning methods and BERT-based methods.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
Chapter 12. Text-Based VQA
Abstract
VQA requires reasoning about the visual content of an image. In a large proportion of images, however, visual content is not the only source of information. Text that can be recognized by optical character recognition (OCR) tools provides considerably more useful, high-level semantic information, such as street names, product brands and prices, which is not available in any other form in the scene. Interpreting this written information in human environments is essential for performing most everyday tasks, such as making a purchase, using public transportation and finding a location in a city. Hence, the new TextVQA task has been proposed. In this chapter, we briefly introduce the main datasets that benchmark progress in this field, including TextVQA [29], ST-VQA [2] and OCR-VQA [25]. Subsequently, we describe an important tool (OCR) that is a prerequisite for the reasoning process, as text must first be recognized. Next, we select three representative and effective models for this problem and describe them sequentially.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
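
The OCR step that precedes reasoning can be sketched with an off-the-shelf tool such as pytesseract; the image path below is hypothetical, and the surveyed TextVQA systems rely on their own OCR modules rather than this particular tool.

```python
from PIL import Image
import pytesseract

def extract_ocr_tokens(image_path: str) -> list[str]:
    """Run OCR on an image and return the recognized word tokens."""
    text = pytesseract.image_to_string(Image.open(image_path))
    return [tok for tok in text.split() if tok.strip()]

question = "What is the name of the street?"
ocr_tokens = extract_ocr_tokens("street_scene.jpg")  # hypothetical local image
# A TextVQA model would now reason jointly over the question, visual features,
# and these OCR tokens, often copying an OCR token directly as the answer.
print(ocr_tokens[:10])
```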
Chapter 13. Visual Question Generation
Abstract
To explore how questions about images are posed and to abstract the events caused by objects in an image, the visual question generation (VQG) task has been established. In this chapter, we classify VQG methods according to whether their objective is data augmentation or visual understanding.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
Chapter 14. Visual Dialogue
Abstract
Visual dialogue is an important and complicated vision-and-language task that processes the visual features of images and the textual features of captions, questions and dialogue histories to answer questions. To accomplish this task, a machine must exhibit the abilities of perception, multimodal reasoning, relationship mining and visual coreference resolution. In this chapter, we briefly describe the challenges associated with this task and introduce the two main benchmarks. Subsequently, we present a comprehensive review of the associated methods, classified into four categories.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
Chapter 15. Referring Expression Comprehension
Abstract
Referring expression comprehension (REC) aims to localize objects in images based on natural language queries. In contrast to the object detection task, in which the queried object labels are predefined, in REC the queries are only observed at test time. REC is challenging because it requires a comprehensive understanding of complicated natural language and various types of visual information. In this chapter, we first describe the task and subsequently introduce prevalent datasets proposed for REC, such as the RefCOCO, RefCOCO+ and RefCOCOg datasets. Finally, we classify the methods in the REC domain into three main categories: two-stage models, one-stage models and reasoning process comprehension.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
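
The core of the two-stage family can be sketched as ranking precomputed region proposals by their similarity to the encoded referring expression; the dimensions and cosine-similarity scoring below are illustrative assumptions rather than a specific published model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProposalRanker(nn.Module):
    """Score region-proposal features against a query embedding and pick the best box."""

    def __init__(self, region_dim=2048, query_dim=300, joint_dim=512):
        super().__init__()
        self.region_proj = nn.Linear(region_dim, joint_dim)
        self.query_proj = nn.Linear(query_dim, joint_dim)

    def forward(self, region_feats, query_feat):
        r = F.normalize(self.region_proj(region_feats), dim=-1)  # (num_regions, joint_dim)
        q = F.normalize(self.query_proj(query_feat), dim=-1)     # (joint_dim,)
        scores = r @ q                                            # cosine similarity per region
        return scores.argmax(), scores

ranker = ProposalRanker()
regions = torch.randn(36, 2048)   # features for 36 region proposals
expression = torch.randn(300)     # pooled embedding of "the red mug on the left"
best_idx, scores = ranker(regions, expression)
print(int(best_idx), scores.shape)  # index of the chosen region, torch.Size([36])
```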

Summary and Outlook

Frontmatter
Chapter 16. Summary and Outlook
Abstract
Visual question answering is a significant topic in current AI research and has been linked to many applications, such as AI assistants and dialog systems. As a cross-disciplinary task, it has attracted considerable attention from researchers in different communities, such as computer vision and natural language processing. VQA is a typical cross-modal task, since it requires machines to simultaneously understand visual content (images and videos) and natural language and, in certain cases, common sense knowledge. Nevertheless, certain challenges must be addressed before artificial general intelligence can be realized.
Qi Wu, Peng Wang, Xin Wang, Xiaodong He, Wenwu Zhu
Backmatter
Metadata
Title
Visual Question Answering
Authors
Qi Wu
Peng Wang
Xin Wang
Xiaodong He
Wenwu Zhu
Copyright year
2022
Publisher
Springer Nature Singapore
Electronic ISBN
978-981-19-0964-1
Print ISBN
978-981-19-0963-4
DOI
https://doi.org/10.1007/978-981-19-0964-1