About this book

This book presents a broad selection of cutting-edge research covering both theoretical and practical aspects of reconstruction, registration, and recognition. The text provides an overview of challenging areas and descriptions of novel algorithms. Features: investigates visual features, trajectory features, and stereo matching; reviews the main challenges of semi-supervised object recognition, and presents a novel method for human action categorization; presents a framework for the visual localization of MAVs, and for the use of moment constraints in convex shape optimization; examines solutions to the co-recognition problem, and distance-based classifiers for large-scale image classification; describes how the four-color theorem can be used to speed up the solution of MRF problems; introduces a Bayesian generative model for understanding indoor environments, and a boosting approach for generalizing the k-NN rule; discusses scene-specific object detection, and an approach to temporal super-resolution video.

Table of Contents


Chapter 1. Visual Features—From Early Concepts to Modern Computer Vision

Extracting, representing, and comparing image content is one of the most important tasks in computer vision and pattern recognition. Distinctive image characteristics are often described by visual features, which serve as input for applications such as image registration, image retrieval, 3D reconstruction, navigation, object recognition, and object tracking. Awareness of the need to adequately describe visual features emerged in the 1920s in the domain of visual perception, and fundamental concepts were established to which almost every approach for feature extraction can be traced back. After the transfer of these basic ideas to the field of computer vision, much research has been carried out, including the development of new concepts and methods for extracting such features, the improvement of existing ideas, and numerous comparisons of different methods. In this chapter, a definition of visual features is derived, and different types are presented that address both the spatial and the spatio-temporal domain. This includes local image features, which are used in a variety of computer vision applications, and their evolution from early ideas to powerful feature extraction and matching methods.
Martin Weinmann

Chapter 2. Where Next in Object Recognition and How Much Supervision Do We Need?

Object class recognition is an active topic in computer vision that still presents many challenges. In most approaches, this task is addressed by supervised learning algorithms that need a large quantity of labels to perform well. This leads either to small datasets (<10,000 images) that capture only a subset of the real-world class distribution (but with a controlled and verified labeling procedure), or to large datasets that are more representative but also contain more label noise. Therefore, semi-supervised learning has been established as a promising direction for object recognition. It requires only a few labels while simultaneously making use of the vast number of images available today. In this chapter, we outline the main challenges of semi-supervised object recognition, review existing approaches, and emphasize open issues that should be addressed next to advance this research topic.
Sandra Ebert, Bernt Schiele

Chapter 3. Recognizing Human Actions by Using Effective Codebooks and Tracking

Recognizing and classifying human actions for the annotation of unconstrained video sequences has proven challenging because of variations in the environment, the appearance of actors, the ways in which different persons perform the same action, speed and duration, and the viewpoint from which the event is observed. This variability is reflected in the difficulty of defining effective descriptors and of deriving appropriate and effective codebooks for action categorization. In this chapter, we present a novel and effective solution for classifying human actions in unconstrained videos. In forming the codebook, we employ radius-based clustering with soft assignment in order to create a rich vocabulary that can account for the high variability of human actions. We show that our solution achieves very good performance with no need for parameter tuning. We also show that a strong reduction in computation time can be obtained by applying codebook-size reduction with Deep Belief Networks, with little loss of accuracy.
Lamberto Ballan, Lorenzo Seidenari, Giuseppe Serra, Marco Bertini, Alberto Del Bimbo
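The codebook construction described above can be illustrated with a small sketch. This is not the authors' implementation: the greedy leader-style clustering, the Gaussian soft-assignment kernel, and all parameter values are illustrative assumptions.

```python
import numpy as np

def radius_codebook(descriptors, radius):
    """Greedy radius-based clustering: every descriptor ends up within
    `radius` of some codeword, so dense regions do not dominate the vocabulary."""
    codewords = []
    for d in descriptors:
        if not codewords or min(np.linalg.norm(d - c) for c in codewords) > radius:
            codewords.append(d)
    return np.array(codewords)

def soft_assign(descriptors, codewords, sigma):
    """Soft assignment: each descriptor votes for all codewords with
    Gaussian weights normalized to sum to one, then votes are accumulated
    into a histogram over the vocabulary."""
    dists = np.linalg.norm(descriptors[:, None, :] - codewords[None, :, :], axis=2)
    w = np.exp(-dists**2 / (2 * sigma**2))
    w /= w.sum(axis=1, keepdims=True)
    return w.sum(axis=0)

rng = np.random.default_rng(0)
descs = rng.normal(size=(200, 16))            # toy local descriptors
cb = radius_codebook(descs, radius=5.0)
hist = soft_assign(descs, cb, sigma=2.0)      # video-level representation
```

Because each descriptor distributes exactly one unit of mass, the histogram entries sum to the number of descriptors, and ambiguous descriptors no longer commit to a single codeword.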

Chapter 4. Evaluating and Extending Trajectory Features for Activity Recognition

Trajectory features are a powerful new way to describe video data. By leveraging the spatio-temporal range and structure of trajectories, they improve activity recognition performance compared to systems based on fixed local spatio-temporal volumes. This chapter places them in context, compares a sparse, generative model of extended trajectories to a dense, discriminative model of local trajectories, and explores ways to extend the sparse system with new kinds of information.
Ross Messing, Atousa Torabi, Aaron Courville, Chris Pal

Chapter 5. Co-recognition of Images and Videos: Unsupervised Matching of Identical Object Patterns and Its Applications

In this chapter, we address the problem of detecting, matching, and segmenting all identical object-level patterns in images or videos in an unsupervised way, which we call the "co-recognition" problem. In an unsupervised setting without any prior knowledge of specific target objects, the method relies entirely on geometric and photometric relations between visual features. To solve this problem, a multi-layer match-growing framework is proposed that explores the given visual data by intra-layer expansion and inter-layer merging. We demonstrate the effectiveness of this approach on identical-object detection, image retrieval, symmetry detection, and action recognition, validating the usefulness of co-recognition for several vision problems.
Minsu Cho, Young Min Shin, Kyoung Mu Lee

Chapter 6. Stereo Matching—State-of-the-Art and Research Challenges

Stereo matching denotes the problem of finding dense correspondences in pairs of images in order to perform 3D reconstruction. In this chapter, we provide a review of stereo methods with a focus on recent developments and our own work. We start with a discussion of local methods and introduce our algorithms: geodesic stereo, cost filtering, and PatchMatch stereo. Although local algorithms have recently become very popular, they are not capable of handling large untextured regions, where a global smoothness prior is required. In the discussion of such global methods, we briefly describe standard optimization techniques. However, the real problem lies not in the optimization, but in finding an energy function that represents a good model of the stereo problem. In this context, we investigate the data and smoothness terms of standard energies to find the best-suited implementation of each. We then describe our own work on finding a good model. This includes our combined stereo and matting approach, Surface Stereo, and Object Stereo, as well as a new method that incorporates physics-based reasoning in stereo matching.
Michael Bleyer, Christian Breiteneder
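As a baseline for the local methods discussed above, the simplest possible scheme is winner-take-all block matching. The sketch below (not from the chapter; window size, cost, and test data are illustrative assumptions) aggregates a sum-of-absolute-differences cost over a square window and picks the cheapest disparity per pixel:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def sad_stereo(left, right, max_disp, win=2):
    """Winner-take-all disparity from SAD costs aggregated over a
    (2*win+1)^2 window -- the simplest local stereo method."""
    left, right = np.asarray(left, float), np.asarray(right, float)
    h, w = left.shape
    k = 2 * win + 1
    cost = np.empty((max_disp + 1, h, w))
    for d in range(max_disp + 1):
        diff = np.full((h, w), 1e9)                      # out-of-image matches
        diff[:, d:] = np.abs(left[:, d:] - right[:, :w - d])
        padded = np.pad(diff, win, mode="edge")          # box-filter aggregation
        cost[d] = sliding_window_view(padded, (k, k)).sum(axis=(2, 3))
    return cost.argmin(axis=0)                           # winner-take-all

# synthetic pair: the right image is the left image shifted by 3 pixels
rng = np.random.default_rng(0)
left = rng.random((20, 30))
right = np.zeros_like(left)
right[:, :27] = left[:, 3:]
right[:, 27:] = rng.random((20, 3))
disp = sad_stereo(left, right, max_disp=5)
```

On textured input this recovers the true shift in the image interior; its failure on untextured regions is exactly what motivates the global smoothness priors discussed above.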

Chapter 7. Visual Localization for Micro Aerial Vehicles in Urban Outdoor Environments

Accurate localization of a micro aerial vehicle (MAV) with respect to a scene is important for a wide range of applications, in particular autonomous navigation, surveillance, and inspection. In this context, visual localization in urban outdoor environments is gaining importance as common methods such as GPS positioning are often not accurate enough or even fail. We present recent approaches and results for robust 3D reconstruction of suitable visual landmarks, for the alignment in a world coordinate system, and for fast, high-accuracy monocular localization. We introduce a scalable representation of the prior knowledge about the scene and demonstrate how in-flight information can be integrated to facilitate long-term operation. Our method outperforms a state-of-the-art visual SLAM approach and achieves localization accuracies comparable to differential GPS.
Andreas Wendel, Horst Bischof

Chapter 8. Moment Constraints in Convex Optimization for Segmentation and Tracking

Convex relaxation techniques have become a popular approach to shape optimization, as they allow computing initialization-independent solutions to a variety of problems. In this chapter, we show that shape priors in terms of moment constraints can be imposed within the convex optimization framework, since they give rise to convex constraints. In particular, the lower-order moments correspond to the overall area, the centroid, and the variance or covariance of the shape, and can easily be imposed in interactive segmentation methods. The respective constraints can be imposed as hard or soft constraints. Quantitative segmentation studies on a variety of images demonstrate that the user can impose such constraints with a few mouse clicks, leading to substantial improvements of the resulting segmentation and reducing the average segmentation error from 12 % to 0.35 %. GPU-based computation times of around one second allow for interactive segmentation.
Maria Klodt, Frank Steinbrücker, Daniel Cremers
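The moments mentioned above are cheap to state explicitly. The sketch below (an illustration, not the authors' solver) computes them for a relaxed indicator function u(x) in [0, 1]; note that the area is linear in u, and a fixed-centroid constraint can be rewritten as the linear equations sum((y - my) * u) = 0 and sum((x - mx) * u) = 0, which is why such priors keep the feasible set convex:

```python
import numpy as np

def moments(u):
    """Low-order moments of a (relaxed) segmentation indicator u in [0,1]."""
    ys, xs = np.mgrid[: u.shape[0], : u.shape[1]]
    area = u.sum()                                             # 0th moment
    cy, cx = (ys * u).sum() / area, (xs * u).sum() / area      # 1st: centroid
    var = (((ys - cy) ** 2 + (xs - cx) ** 2) * u).sum() / area # 2nd: spread
    return area, (cy, cx), var

u = np.zeros((40, 40))
u[10:20, 15:30] = 1.0        # a box-shaped segmentation
area, (cy, cx), var = moments(u)
```

In an interactive setting, a mouse click can fix the centroid or a scribble can bound the area, and the segmentation energy is then minimized subject to these convex constraints.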

Chapter 9. Large Scale Metric Learning for Distance-Based Image Classification on Open Ended Data Sets

Many real-life large-scale datasets are open-ended and dynamic: new images are continuously added to existing classes, new classes appear over time, and the semantics of existing classes might evolve too. Therefore, we study large-scale image classification methods that can incorporate new classes and training images continuously over time at negligible cost. To this end, we consider two distance-based classifiers, the k-nearest neighbor (k-NN) and nearest class mean (NCM) classifiers. Since the performance of distance-based classifiers heavily depends on the distance function used, we cast the problem as one of learning a low-rank metric that is shared across all classes. For the NCM classifier, we introduce a new metric learning approach, and we also introduce an extension that allows for richer class representations.
Experiments on the ImageNet 2010 challenge dataset, which contains over one million training images of a thousand classes, show that, surprisingly, the NCM classifier compares favorably to the more flexible k-NN classifier. Moreover, the NCM performance is comparable to that of linear SVMs, which obtain current state-of-the-art performance. Experimentally, we study the generalization performance to classes that were not used to learn the metrics. Using a metric learned on 1,000 classes, we show results for the ImageNet-10K dataset, which contains 10,000 classes, and obtain performance that is competitive with the current state-of-the-art while being orders of magnitude faster.
Thomas Mensink, Jakob Verbeek, Florent Perronnin, Gabriela Csurka
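The NCM decision rule itself is compact enough to sketch. In the toy version below the low-rank projection W is simply given (learning it by metric learning, as in the chapter, is omitted), and all data are synthetic:

```python
import numpy as np

def ncm_predict(X, means, W):
    """Nearest class mean under the metric induced by W:
    assign x to argmin_c ||W x - W mu_c||^2."""
    Z = X @ W.T                    # projected samples:     (n, r)
    M = means @ W.T                # projected class means: (c, r)
    d2 = ((Z[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)       # index of the nearest class mean

# adding a new class only requires computing one more mean -- no retraining
rng = np.random.default_rng(1)
means = np.array([[0.0, 0.0], [4.0, 4.0]])
X = np.concatenate([rng.normal(m, 0.3, size=(10, 2)) for m in means])
W = np.eye(2)                      # identity metric, for illustration only
pred = ncm_predict(X, means, W)
```

This is what makes the classifier attractive for open-ended datasets: the per-class cost of growth is one mean vector, while the shared metric W is left untouched.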

Chapter 10. Top–Down Bayesian Inference of Indoor Scenes

The task of inferring the 3D layout of indoor scenes from images has seen many recent advancements. Understanding the basic 3D geometry of these environments is important for higher-level applications, such as object recognition and robot navigation. In this chapter, we present our Bayesian generative model for understanding indoor environments. We model the 3D geometry of a room and the objects within it with non-overlapping 3D boxes, which provide approximations for both the room boundary and objects like tables and beds. We separately model the imaging process (camera parameters) and an image likelihood, thus providing a complete generative statistical model for image data. A key feature of this work is the use of prior information and constraints on the 3D geometry of the scene elements, which addresses ambiguities in the imaging process in a top–down fashion. We also relate this work to other state-of-the-art approaches, and discuss techniques that have become standard in this field, such as estimating the camera pose from a triplet of vanishing points.
Luca Del Pero, Kobus Barnard
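The vanishing-point computation alluded to above is standard and can be sketched under the usual simplifying assumptions (zero skew, square pixels, known principal point): three mutually orthogonal vanishing points give the focal length from the relation (v1 - pp)·(v2 - pp) = -f², and the rotation columns, up to sign, as the normalized back-projected directions. This is a generic textbook recipe, not code from the chapter:

```python
import numpy as np

def pose_from_vps(v1, v2, v3, pp):
    """Focal length and rotation (columns up to sign) from three
    mutually orthogonal vanishing points and the principal point pp."""
    a, b = np.asarray(v1) - pp, np.asarray(v2) - pp
    f = np.sqrt(-a @ b)                        # (v1-pp).(v2-pp) = -f^2
    K_inv = np.diag([1 / f, 1 / f, 1.0])
    K_inv[:2, 2] = -np.asarray(pp) / f
    R = np.stack([K_inv @ np.append(v, 1.0) for v in (v1, v2, v3)], axis=1)
    return f, R / np.linalg.norm(R, axis=0)    # columns = world axis directions

# synthetic check: build VPs from a known camera and recover it
ca, sa, cb, sb = np.cos(0.4), np.sin(0.4), np.cos(0.5), np.sin(0.5)
Rx = np.array([[1, 0, 0], [0, ca, -sa], [0, sa, ca]])
Ry = np.array([[cb, 0, sb], [0, 1, 0], [-sb, 0, cb]])
R_true = Rx @ Ry
f_true, pp = 500.0, np.array([320.0, 240.0])
vps = [(f_true * R_true[0, i] / R_true[2, i] + pp[0],
        f_true * R_true[1, i] / R_true[2, i] + pp[1]) for i in range(3)]
f_est, R_est = pose_from_vps(*vps, pp)
```

The sign ambiguity per column reflects the fact that a vanishing point only fixes a direction, not its orientation.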

Chapter 11. Efficient Loopy Belief Propagation Using the Four Color Theorem

Recent work on early vision, such as image segmentation, image denoising, stereo matching, and optical flow, uses Markov Random Fields. Although this formulation yields an NP-hard energy minimization problem, good heuristics based on graph cuts and belief propagation have been developed. Nevertheless, both approaches still require tens of seconds to solve stereo problems on recent PCs. Such running times are impractical for optical flow and for many image segmentation and denoising problems, and we review recent techniques for speeding them up. Moreover, we show how to reduce the computational complexity of belief propagation by applying the Four Color Theorem to limit the maximum number of labels in the underlying image segmentation to at most four. We show that this provides substantial speed improvements for large inputs across a variety of vision problems, while maintaining competitive result quality.
Radu Timofte, Luc Van Gool
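The cost being reduced can be seen in a single min-sum belief-propagation message. In the sketch below (illustrative, not the authors' code) the smoothness term is a Potts model, for which the inner minimization collapses from O(L²) to O(L); capping the label set at four colors therefore directly caps the per-message work:

```python
import numpy as np

def potts_message(data_cost, incoming, lam):
    """Min-sum message under a Potts prior: with h(l) = D_p(l) + sum of the
    other incoming messages, m(l) = min(h(l), min_l' h(l') + lam)."""
    h = data_cost + sum(incoming)          # aggregate belief at pixel p
    m = np.minimum(h, h.min() + lam)       # pay lam for any label change
    return m - m.min()                     # normalize to avoid drift

labels = 4                                  # a four-color label set
D = np.array([0.0, 2.0, 3.0, 5.0])          # data cost at pixel p
msgs_in = [np.zeros(labels), np.array([1.0, 0.0, 1.0, 1.0])]
m = potts_message(D, msgs_in, lam=1.5)
```

A full loopy BP pass repeats this update for every edge of the grid in both directions until the messages converge, so per-message savings multiply across the whole image.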

Chapter 12. Boosting k-Nearest Neighbors Classification

A major drawback of the k-nearest neighbors (k-NN) rule is its high variance when dealing with sparse prototype datasets in high dimensions. Most techniques proposed for improving k-NN classification rely either on deforming the k-NN relationship by learning a distance function or on modifying the input space by means of subspace selection. Here we propose a novel boosting approach for generalizing the k-NN rule. Namely, we redefine the voting rule as a strong classifier that linearly combines predictions from the k closest prototypes. Our algorithm, called UNN (Universal Nearest Neighbors), relies on the k nearest neighbor examples as weak classifiers and learns their weights so as to minimize a surrogate risk. These weights, called leveraging coefficients, allow us to distinguish the most relevant prototypes for a given class. Results obtained on several scene categorization datasets demonstrate the ability of UNN to compete with or beat state-of-the-art methods, while achieving comparatively small training and testing times.
Paolo Piro, Richard Nock, Wafa Bel Haj Ali, Frank Nielsen, Michel Barlaud
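The leveraged voting rule described above can be sketched in a few lines. In the chapter the coefficients are learned by minimizing a surrogate risk; here they are simply given, and all names and data are illustrative:

```python
import numpy as np

def leveraged_knn(x, protos, labels, alphas, n_classes, k):
    """Score each class by the summed leveraging coefficients of its
    prototypes among the k nearest; plain k-NN voting is alphas == 1."""
    d = np.linalg.norm(protos - x, axis=1)
    nearest = np.argsort(d)[:k]
    scores = np.zeros(n_classes)
    for j in nearest:
        scores[labels[j]] += alphas[j]
    return scores.argmax()

protos = np.array([[0.0], [0.2], [0.4], [3.0], [3.2]])
labels = np.array([0, 0, 0, 1, 1])
alphas = np.array([0.1, 0.1, 0.1, 2.0, 2.0])   # class-1 prototypes weigh more
# with uniform weights the three nearest class-0 prototypes would win at x=1.6;
# leveraging can flip the vote toward the more reliable class-1 prototypes
pred = leveraged_knn(np.array([1.6]), protos, labels, alphas, n_classes=2, k=5)
```

The coefficients thus act as a per-prototype measure of reliability: uninformative prototypes receive small weights and effectively drop out of the vote.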

Chapter 13. Learning Object Detectors in Stationary Environments

The most successful approach to object detection is still the sliding-window technique, in which a pre-trained classifier is evaluated at different locations and scales. In this chapter, we examine this strategy in the context of stationary environments. In particular, with a fixed camera position observing the same scene, a lot of prior (spatio-temporal) information is available. Exploiting this scene-specific information allows for (a) improving the detection performance and (b) reducing the model complexity, both at reduced computational cost. These benefits are demonstrated for two different real-world tasks (person and car detection). In particular, we apply two different evaluation/update strategies (holistic and grid-based), with which any suitable online learner can be used. We demonstrate the proposed approaches for different applications and scenarios, clearly showing their benefits compared to generic methods.
Peter M. Roth, Sabine Sternig, Horst Bischof

Chapter 14. Video Temporal Super-resolution Based on Self-similarity

We introduce a method for producing temporally super-resolved video from a single video by exploiting the self-similarity that exists in the spatio-temporal domain of videos. Temporal super-resolution is an inherently ill-posed problem, because an infinite number of high-temporal-resolution frame sequences can produce the same low-temporal-resolution frame. The key idea in this work is to exploit self-similarity to resolve this ambiguity: similar motion-blur appearances often reappear at different temporal resolutions. Several existing methods generate plausible intermediate frames by interpolating input frames of a captured video whose frame exposure time is shorter than the inter-frame period. In contrast, our method can increase the temporal resolution of a given video in which the frame exposure time equals the inter-frame period, for instance by resolving one frame into two frames.
Mihoko Shimano, Takahiro Okabe, Imari Sato, Yoichi Sato
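The ill-posedness mentioned above follows directly from the imaging model. A toy sketch, under the common assumption that a full-exposure low-rate frame is the temporal mean of the unknown high-rate frames:

```python
import numpy as np

def observe(high_frames):
    """Low temporal-resolution frame as the temporal mean of the
    (unknown) high-rate frames -- a full-exposure imaging model."""
    return high_frames.mean(axis=0)

rng = np.random.default_rng(2)
a, b = rng.random((8, 8)), rng.random((8, 8))
low = observe(np.stack([a, b]))

# a different frame pair with the same mean yields the identical observation,
# so the inverse problem has infinitely many solutions without extra priors
delta = 0.01 * rng.standard_normal((8, 8))
low2 = observe(np.stack([a + delta, b - delta]))
```

Any prior that prefers one decomposition over the others, such as the self-similarity of motion-blur appearances across temporal resolutions, is what makes the inversion tractable.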
