
2014 | Book

Fusion in Computer Vision

Understanding Complex Visual Content


About this book

This book presents a thorough overview of fusion in computer vision, from an interdisciplinary and multi-application viewpoint, describing successful approaches, evaluated in the context of international benchmarks that model realistic use cases.

Features:
- examines late fusion approaches for concept recognition in images and videos;
- describes the interpretation of visual content by incorporating models of the human visual system with content understanding methods;
- investigates the fusion of multi-modal features of different semantic levels, as well as results of semantic concept detections, for example-based event recognition in video;
- proposes rotation-based ensemble classifiers for high-dimensional data, which encourage both individual accuracy and diversity within the ensemble;
- reviews application-focused strategies of fusion in video surveillance, biomedical information retrieval, and content detection in movies;
- discusses the modeling of mechanisms of human interpretation of complex visual content.

Table of contents

Frontmatter
Chapter 1. A Selective Weighted Late Fusion for Visual Concept Recognition
Abstract
We propose a novel multimodal approach to automatically predict the visual concepts of images through an effective fusion of visual and textual features. It relies on a Selective Weighted Late Fusion (SWLF) scheme which, by optimizing an overall Mean interpolated Average Precision (MiAP), learns to automatically select and weight the best features for each visual concept to be recognized. Experiments were conducted on the MIR Flickr image collection within the ImageCLEF Photo Annotation challenge. The results demonstrate the effectiveness of SWLF: it achieved a MiAP of 43.69 % in 2011, ranking second out of the 79 submitted runs, and a MiAP of 43.67 % in 2012, ranking first out of the 80 submitted runs.
Ningning Liu, Emmanuel Dellandréa, Bruno Tellez, Liming Chen
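
The per-concept selection-and-weighting idea behind SWLF can be illustrated in a few lines. The following Python sketch is a simplified reading of a selective weighted late fusion scheme, not the authors' implementation: the function name, the greedy selection loop, and the use of validation average precision as the weight of each expert are assumptions made for illustration.

import numpy as np
from sklearn.metrics import average_precision_score

def swlf_select(expert_scores, labels):
    """Greedily select and weight experts to maximise validation AP.

    expert_scores: dict name -> array of scores on a validation set
    labels:        binary ground-truth array for one concept
    Returns (selected expert names, normalised weights) for this concept.
    """
    # Candidate weight of each expert = its individual validation AP.
    ap = {n: average_precision_score(labels, s) for n, s in expert_scores.items()}
    ranked = sorted(ap, key=ap.get, reverse=True)

    best_names, best_ap = [], -1.0
    fused = np.zeros(len(labels), dtype=float)
    for name in ranked:
        trial = fused + ap[name] * expert_scores[name]
        trial_ap = average_precision_score(labels, trial)
        if trial_ap > best_ap:          # keep the expert only if the fused AP improves
            fused, best_ap = trial, trial_ap
            best_names.append(name)
    weights = np.array([ap[n] for n in best_names])
    return best_names, weights / weights.sum()

The selection is learned independently for every concept, so different concepts may end up with different subsets of visual and textual experts.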
Chapter 2. Bag-of-Words Image Representation: Key Ideas and Further Insight
Abstract
In the context of object and scene recognition, state-of-the-art performances are obtained with visual Bag-of-Words (BoW) models of mid-level representations computed from densely sampled local descriptors (e.g., Scale-Invariant Feature Transform (SIFT)). Several methods to combine low-level features and to set mid-level parameters have been evaluated recently for image classification. In this chapter, we study in detail the different components of the BoW model in the context of image classification. Particularly, we focus on the coding and pooling steps and investigate the impact of the main parameters of the BoW pipeline. We show that an adequate combination of several low-level (sampling rate, multiscale) and mid-level (codebook size, normalization) parameters is decisive for reaching good performance. Based on this analysis, we propose a merging scheme that exploits the specificities of edge-based descriptors. Low and high contrast regions are pooled separately and combined to provide a powerful representation of images. We study the impact on classification performance of the contrast threshold that determines whether a SIFT descriptor corresponds to a low contrast region or a high contrast region. Successful experiments are provided on the Caltech-101 and Scene-15 datasets.
Marc T. Law, Nicolas Thome, Matthieu Cord
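
As a rough illustration of the pooling strategy described above, the sketch below splits local descriptors into low- and high-contrast pools around a threshold tau and concatenates the two bag-of-words histograms. The function name, the hard assignment to the codebook, and the L2 normalisation are illustrative assumptions rather than the chapter's exact pipeline.

import numpy as np

def contrast_split_bow(descriptors, contrasts, codebook, tau=0.05):
    """Toy BoW encoding with separate pooling of low- and high-contrast regions.

    descriptors: (N, D) local descriptors (e.g. dense SIFT)
    contrasts:   (N,) contrast value of each local region
    codebook:    (K, D) visual words
    tau:         contrast threshold separating the two pools
    """
    # Hard-assign each descriptor to its nearest visual word.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)

    K = len(codebook)
    low = np.bincount(words[contrasts < tau], minlength=K).astype(float)
    high = np.bincount(words[contrasts >= tau], minlength=K).astype(float)

    hist = np.concatenate([low, high])     # one histogram per contrast pool
    return hist / (np.linalg.norm(hist) + 1e-12)   # L2 normalisation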
Chapter 3. Hierarchical Late Fusion for Concept Detection in Videos
Abstract
Current research shows that the detection of semantic concepts (e.g., animal, bus, person, dancing, etc.) in multimedia documents such as videos requires the use of several types of complementary descriptors in order to achieve good results. In this work, we explore strategies for combining dozens of complementary content descriptors (or “experts”) in an efficient way, through the use of late fusion approaches, for concept detection in multimedia documents. We explore two fusion approaches that share a common structure: both start with a clustering of experts stage, continue with an intra-cluster fusion and finish with an inter-cluster fusion, and we also experiment with other state-of-the-art methods. The first fusion approach relies on a priori knowledge about the internals of each expert to group the set of available experts by similarity. The second approach automatically obtains measures on the similarity of experts from their output to group the experts using agglomerative clustering, and then combines the results of this fusion with those from other methods. In the end, we show that an additional performance boost can be obtained by also considering the context of multimedia elements.
Sabin Tiberius Strat, Alexandre Benoit, Patrick Lambert, Hervé Bredin, Georges Quénot
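
A minimal sketch of the shared structure (cluster the experts, fuse within clusters, fuse across clusters) is given below, assuming experts are represented only by their output scores and grouped by correlation-based agglomerative clustering. The cluster count and the use of plain averaging at both stages are illustrative choices, not the chapter's tuned configuration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def hierarchical_late_fusion(scores, n_clusters=4):
    """Toy two-stage late fusion of expert scores.

    scores: (E, N) array, one row of detection scores per expert
    """
    # Distance between experts = 1 - Pearson correlation of their outputs.
    dist = pdist(scores, metric="correlation")
    groups = fcluster(linkage(dist, method="average"),
                      n_clusters, criterion="maxclust")

    # Intra-cluster fusion: mean of the experts in each cluster.
    cluster_scores = np.stack([scores[groups == g].mean(axis=0)
                               for g in np.unique(groups)])
    # Inter-cluster fusion: simple mean over the cluster-level scores.
    return cluster_scores.mean(axis=0)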
Chapter 4. Fusion of Multiple Visual Cues for Object Recognition in Videos
Abstract
In this chapter, we are interested in the open problem of meaningful object recognition in video. Approaches that estimate human visual attention and incorporate it into the overall visual content understanding process have recently become popular. When estimating visual attention in complex spatio-temporal content such as video, one has to fuse multiple information channels such as motion, spatial contrast, and others. In the first part of the chapter, we address these questions and report on optimal strategies for bottom-up fusion in visual saliency estimation. The estimated visual saliency is then used in the pooling of local descriptors. We compare different pooling approaches and show results on a challenging type of visual content: videos recorded with wearable cameras for a large-scale study on Alzheimer’s disease. The results demonstrate that approaches based on saliency fusion outperform the best state-of-the-art techniques on this content.
Iván González-Díaz, Jenny Benois-Pineau, Vincent Buso, Hugo Boujut
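
One common way to use an estimated saliency map in descriptor pooling is to weight each local descriptor's vote by the saliency at its keypoint location. The sketch below illustrates that idea under simple assumptions (a precomputed fused saliency map, hard assignment to a codebook); it is a generic illustration, not the chapter's specific pooling scheme.

import numpy as np

def saliency_weighted_bow(descriptors, positions, saliency_map, codebook):
    """Toy saliency-weighted pooling of local descriptors.

    descriptors:  (N, D) local descriptors
    positions:    (N, 2) integer (row, col) keypoint positions
    saliency_map: (H, W) fused saliency map, e.g. a weighted sum of
                  spatial and motion saliency channels, values in [0, 1]
    codebook:     (K, D) visual words
    """
    # Nearest visual word for every descriptor.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    words = d2.argmin(axis=1)

    # Each descriptor votes with the saliency at its location.
    weights = saliency_map[positions[:, 0], positions[:, 1]]
    hist = np.bincount(words, weights=weights,
                       minlength=len(codebook)).astype(float)
    return hist / (hist.sum() + 1e-12)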
Chapter 5. Evaluating Multimedia Features and Fusion for Example-Based Event Detection
Abstract
Multimedia event detection (MED) is a challenging problem because of the heterogeneous content and variable quality found in large collections of Internet videos. To study the value of multimedia features and fusion for representing and learning events from a set of example video clips, we created SESAME, a system for video SEarch with Speed and Accuracy for Multimedia Events. SESAME includes multiple bag-of-words event classifiers based on single data types: low-level visual, motion, and audio features; high-level semantic visual concepts; and automatic speech recognition (ASR). Event detection performance was evaluated for each event classifier. The performance of low-level visual and motion features was improved by the use of difference coding. The accuracy of the visual concepts was nearly as strong as that of the low-level visual features. Experiments with a number of fusion methods for combining the event detection scores from these classifiers revealed that simple fusion methods, such as arithmetic mean, perform as well as or better than other, more complex fusion methods.
Gregory K. Myers, Cees G. M. Snoek, Ramakant Nevatia, Ramesh Nallapati, Julien van Hout, Stephanie Pancoast, Chen Sun, Amirhossein Habibian, Dennis C. Koelma, Koen E. A. van de Sande, Arnold W. M. Smeulders
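
The arithmetic-mean fusion that performed well in these experiments is straightforward to state in code. The sketch below assumes the per-classifier scores are first z-normalised so that they live on a comparable scale; the normalisation choice is an assumption for illustration, not necessarily the one used in SESAME.

import numpy as np

def arithmetic_mean_fusion(score_lists):
    """Toy late fusion of event-detection scores from several classifiers
    (e.g. low-level visual, motion, audio, visual concepts, ASR)."""
    normed = []
    for s in score_lists:
        s = np.asarray(s, dtype=float)
        normed.append((s - s.mean()) / (s.std() + 1e-12))   # z-normalisation
    return np.mean(normed, axis=0)                           # arithmetic mean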
Chapter 6. Rotation-Based Ensemble Classifiers for High-Dimensional Data
Abstract
In the past 20 years, Multiple Classifier Systems (MCS) have shown great potential to improve the accuracy and reliability of pattern classification. In this chapter, we discuss the major issues of MCS, including MCS topology, classifier generation, and classifier combination, and provide a summary of MCS applied to remote sensing image classification, especially for high-dimensional data. Furthermore, recently proposed rotation-based ensemble classifiers, which encourage both individual accuracy and diversity within the ensemble, are presented to classify high-dimensional data, taking hyperspectral and multi-date remote sensing images as examples. Rotation-based ensemble classifiers project the original data into a new feature space using feature extraction and subset selection methods to generate diverse individual classifiers. Two classifiers, Decision Tree (DT) and Support Vector Machine (SVM), are selected as base classifiers. Unsupervised and supervised feature extraction methods are employed in the rotation-based ensemble classifiers. Experimental results demonstrate that rotation-based ensemble classifiers are superior to Bagging, AdaBoost and random-based ensemble classifiers.
Junshi Xia, Jocelyn Chanussot, Peijun Du, Xiyan He
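
Rotation-based ensembles in the spirit of Rotation Forest can be sketched as follows: split the features into random subsets, fit a feature extraction (here PCA) on a bootstrap sample for each subset, assemble the loadings into a block rotation matrix, and train a base classifier in the rotated space. The code below is a toy version with a decision tree base classifier; the subset count, bootstrap fraction, and use of scikit-learn are illustrative assumptions, and it assumes each bootstrap sample has more samples than features per subset so the PCA loadings fill their block.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.tree import DecisionTreeClassifier

def train_rotation_ensemble(X, y, n_members=10, n_subsets=4, seed=0):
    """Toy rotation-based ensemble: one rotation matrix per member."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    ensemble = []
    for _ in range(n_members):
        subsets = np.array_split(rng.permutation(d), n_subsets)
        R = np.zeros((d, d))
        for sub in subsets:
            # PCA on a bootstrap sample restricted to this feature subset.
            boot = rng.choice(len(X), size=int(0.75 * len(X)), replace=True)
            pca = PCA().fit(X[boot][:, sub])
            R[np.ix_(sub, sub)] = pca.components_.T    # place the loadings
        tree = DecisionTreeClassifier().fit(X @ R, y)  # base classifier in rotated space
        ensemble.append((R, tree))
    return ensemble

def predict_rotation_ensemble(ensemble, X):
    """Average the class probabilities of all members."""
    probs = np.mean([tree.predict_proba(X @ R) for R, tree in ensemble], axis=0)
    return ensemble[0][1].classes_[probs.argmax(axis=1)]

Because every member sees a different rotation of the data, the individual classifiers remain accurate (the rotation is invertible, so no information is lost) while their errors are decorrelated, which is the source of the ensemble's diversity.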
Chapter 7. Multimodal Fusion in Surveillance Applications
Abstract
The recent outbreak of vandalism, accidents and criminal activities has increased the general public’s awareness about safety and security, demanding improved security measures. Smart surveillance video systems have become a ubiquitous platform that monitors private and public environments, ensuring citizens’ well-being. Their universal deployment integrates diverse media and acquisition systems, generating an enormous amount of multimodal data daily. Nowadays, numerous surveillance applications exploit multiple types of data and features, benefitting from their uncorrelated contributions. Hence, the analysis, standardisation and fusion of complex content, especially visual content, have become a fundamental problem in enhancing surveillance systems by increasing their accuracy, robustness and reliability. In this chapter, an exhaustive survey of existing multimodal fusion techniques and their applications in surveillance is provided. Addressing some of the challenges revealed by the state of the art, the chapter focuses on the development of a multimodal fusion technique for automatic surveillance object classification. The proposed fusion technique exploits the benefits of a Bayesian inference scheme to enhance surveillance systems’ performance. The chapter ends with an evaluation of the proposed Bayesian-based multimodal object classifier against two state-of-the-art object classifiers to demonstrate the benefits of multimodal fusion in surveillance applications.
Virginia Fernandez Arguedas, Qianni Zhang, Ebroul Izquierdo
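
A Bayesian scheme for combining evidence from several modalities can be reduced, under a conditional-independence assumption, to multiplying per-modality likelihoods with the class prior and normalising. The sketch below shows that generic naive-Bayes-style combination; it is an illustration of the principle, not the chapter's specific classifier, and the class set and modality names in the comments are hypothetical.

import numpy as np

def bayesian_fusion(modality_likelihoods, priors):
    """Fuse per-modality evidence into a class posterior.

    modality_likelihoods: list of (C,) arrays, p(observation_m | class)
    priors:               (C,) array of class priors p(class)
    Assumes modalities are conditionally independent given the class.
    """
    log_post = np.log(np.asarray(priors, dtype=float))
    for lik in modality_likelihoods:
        log_post += np.log(np.asarray(lik, dtype=float) + 1e-12)
    post = np.exp(log_post - log_post.max())   # stabilised softmax-style renormalisation
    return post / post.sum()

# Hypothetical example: classes {person, vehicle, other}, two modalities
# (appearance-based and trajectory-based likelihoods).
posterior = bayesian_fusion([[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]], [0.4, 0.4, 0.2])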
Chapter 8. Multimodal Violence Detection in Hollywood Movies: State-of-the-Art and Benchmarking
Abstract
This chapter introduces a benchmark evaluation targeting the detection of violent scenes in Hollywood movies. The evaluation was implemented in 2011 and 2012 as an affect task within the international MediaEval benchmark initiative. We report on these two years of evaluation: we give a detailed description of the dataset created, characterize the state of the art by studying the results achieved by participants, and analyze in detail two of the best-performing multimodal systems. We conclude with the lessons learned after two years, providing insights for future work with an emphasis on multimodal modeling and fusion.
Claire-Hélène Demarty, Cédric Penet, Bogdan Ionescu, Guillaume Gravier, Mohammad Soleymani
Chapter 9. Fusion Techniques in Biomedical Information Retrieval
Abstract
For difficult cases, clinicians usually rely on their experience and on information found in textbooks to determine a diagnosis. Computer tools can help supply the relevant information now that much medical knowledge is available in digital form. A biomedical search system such as the one developed in the Khresmoi project (which this chapter partially reuses) aims to fulfil the information needs of physicians. This chapter concentrates on information needs for medical cases that contain a large variety of data, from free text and structured data to images. Fusion techniques are compared for combining the various information sources in order to retrieve cases similar to a given example case. This can supply physicians with answers to problems similar to the one they are analyzing and can help in diagnosis and treatment planning.
Alba García Seco de Herrera, Henning Müller
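
Late fusion of retrieval runs from different sources (e.g. a free-text run and a visual run over the same case collection) is often done with simple score combination such as CombSUM. The sketch below is a generic illustration of that standard technique, not necessarily the method evaluated in the chapter; the per-run min-max normalisation is an assumption.

from collections import defaultdict

def combsum(runs):
    """Toy CombSUM fusion of retrieval runs.

    runs: list of dicts mapping case/document id -> retrieval score
    Returns (id, fused score) pairs ranked by the fused score.
    """
    fused = defaultdict(float)
    for run in runs:
        lo, hi = min(run.values()), max(run.values())
        for doc, score in run.items():
            # Min-max normalise each run so scores are comparable, then sum.
            fused[doc] += (score - lo) / (hi - lo) if hi > lo else score
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)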
Chapter 10. Using Crowdsourcing to Capture Complexity in Human Interpretations of Multimedia Content
Abstract
Large-scale crowdsourcing platforms are a key tool allowing researchers in the area of multimedia content analysis to gain insight into how users interpret social multimedia. The goal of this article is to support this process in a practical manner that opens the path for productive exploitation of complex human interpretations of multimedia content within multimedia systems. We first discuss in detail the nature of complexity in human interpretations of multimedia, and why we, as researchers, should look outward to the crowd, rather than inward to ourselves, to determine what users consider important about the content of images and videos. Then, we present strategies and insights from our own experience in designing tasks for crowdworkers. Our techniques are useful to researchers interested in eliciting information about the elements and aspects of multimedia that are important in the contexts in which humans use social multimedia.
Martha Larson, Mark Melenhorst, María Menéndez, Peng Xu
Backmatter
Metadata
Title
Fusion in Computer Vision
Edited by
Bogdan Ionescu
Jenny Benois-Pineau
Tomas Piatrik
Georges Quénot
Copyright year
2014
Electronic ISBN
978-3-319-05696-8
Print ISBN
978-3-319-05695-1
DOI
https://doi.org/10.1007/978-3-319-05696-8