Energy Minimization

Comparison of Energy Minimization Algorithms for Highly Connected Graphs

Algorithms for discrete energy minimization play a fundamental role in low-level vision. Known techniques include graph cuts, belief propagation (BP) and the recently introduced tree-reweighted message passing (TRW). So far, the standard benchmark for their comparison has been a 4-connected grid-graph arising in pixel-labelling stereo. This minimization problem, however, has been largely solved: recent work shows that for many scenes TRW finds the global optimum. Furthermore, it is known that a 4-connected grid-graph is a poor stereo model since it does not take occlusions into account.

We propose the problem of stereo with occlusions as a new test bed for minimization algorithms. This is a more challenging graph since it has much larger connectivity, and it also serves as a better stereo model. An attractive feature of this problem is that increased connectivity does not result in increased complexity of message passing algorithms. Indeed, one contribution of this paper is to show that sophisticated implementations of BP and TRW have the same time and memory complexity as that of 4-connected grid-graph stereo.

The main conclusion of our experimental study is that for our problem graph cut outperforms both TRW and BP considerably. TRW consistently achieves a lower energy than BP. However, as connectivity increases, the convergence of TRW becomes slower. Unlike 4-connected grids, the difference between the energy of the best optimization method and the lower bound of TRW appears significant. This shows the hardness of the problem and motivates future research.

Vladimir Kolmogorov, Carsten Rother

A Comparative Study of Energy Minimization Methods for Markov Random Fields

One of the most exciting advances in early vision has been the development of efficient energy minimization algorithms. Many early vision tasks require labeling each pixel with some quantity such as depth or texture. While many such problems can be elegantly expressed in the language of Markov Random Fields (MRF’s), the resulting energy minimization problems were widely viewed as intractable. Recently, algorithms such as graph cuts and loopy belief propagation (LBP) have proven to be very powerful: for example, such methods form the basis for almost all the top-performing stereo methods. Unfortunately, most papers define their own energy function, which is minimized with a specific algorithm of their choice. As a result, the tradeoffs among different energy minimization algorithms are not well understood. In this paper we describe a set of energy minimization benchmarks, which we use to compare the solution quality and running time of several common energy minimization algorithms. We investigate three promising recent methods (graph cuts, LBP, and tree-reweighted message passing) as well as the well-known older iterated conditional modes (ICM) algorithm. Our benchmark problems are drawn from published energy functions used for stereo, image stitching and interactive segmentation. We also provide a general-purpose software interface that allows vision researchers to easily switch between optimization methods with minimal overhead. We expect that the availability of our benchmarks and interface will make it significantly easier for vision researchers to adopt the best method for their specific problems. Benchmarks, code, results and images are available at http://vision.middlebury.edu/MRF.

Richard Szeliski, Ramin Zabih, Daniel Scharstein, Olga Veksler, Vladimir Kolmogorov, Aseem Agarwala, Marshall Tappen, Carsten Rother
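
As a rough illustration of the simplest method compared in the abstract above, iterated conditional modes (ICM), the following Python sketch greedily updates one pixel label at a time on a 4-connected grid. The Potts smoothness term, the weight `lam`, and the function names `potts_energy` and `icm` are assumptions made for this example; they are not taken from the benchmark software.

```python
import numpy as np


def potts_energy(labels, unary, lam):
    """Energy of a labeling: data term plus lam times the number of
    4-connected neighbour pairs whose labels differ (Potts model, assumed)."""
    h, w = labels.shape
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    data = unary[rows, cols, labels].sum()
    smooth = (labels[1:, :] != labels[:-1, :]).sum()
    smooth += (labels[:, 1:] != labels[:, :-1]).sum()
    return float(data + lam * smooth)


def icm(unary, lam, max_sweeps=20):
    """Iterated conditional modes on an (h, w, k) array of unary costs.

    Each pixel in turn takes the label that minimizes its local cost given
    the current labels of its neighbours. A label changes only when the
    local cost strictly decreases, so the total energy never increases."""
    h, w, k = unary.shape
    labels = unary.argmin(axis=2)  # start from the best data-term label
    all_labels = np.arange(k)
    for _ in range(max_sweeps):
        changed = False
        for y in range(h):
            for x in range(w):
                cost = unary[y, x].astype(float)
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        cost += lam * (all_labels != labels[ny, nx])
                best = int(cost.argmin())
                if cost[best] < cost[labels[y, x]]:
                    labels[y, x] = best
                    changed = True
        if not changed:  # local minimum reached
            break
    return labels
```

ICM only reaches a local minimum of the energy, which is why it serves as a baseline for the stronger methods listed in the abstract.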

Measuring Uncertainty in Graph Cut Solutions – Efficiently Computing Min-marginal Energies Using Dynamic Graph Cuts

In recent years the use of graph-cuts has become quite popular in computer vision. However, researchers have repeatedly asked the question whether it might be possible to compute a measure of uncertainty associated with the graph-cut solutions. In this paper we answer this particular question by showing how the min-marginals associated with the label assignments in a MRF can be efficiently computed using a new algorithm based on dynamic graph cuts. We start by reporting the discovery of a novel relationship between the min-marginal energy corresponding to a latent variable label assignment, and the flow potentials of the node representing that variable in the graph used in the energy minimization procedure. We then proceed to show how the min-marginal energy can be computed by minimizing a projection of the energy function defined by the MRF. We propose a fast and novel algorithm based on dynamic graph cuts to efficiently minimize these energy projections. The min-marginal energies obtained by our proposed algorithm are exact, as opposed to the ones obtained from other inference algorithms like loopy belief propagation and generalized belief propagation. We conclude by showing how min-marginals can be used to compute a confidence measure for label assignments in labelling problems such as image segmentation.

Pushmeet Kohli, Philip H. S. Torr
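
To make the notion of a min-marginal in the abstract above concrete, the following Python sketch computes min-marginals by brute force on a tiny chain-structured model: the min-marginal of variable i and label l is the lowest energy over all labelings with that variable fixed to that label. This is only the definition spelled out by enumeration, not the dynamic graph cut algorithm of the paper; the chain structure, the Potts-style pairwise term, and the names `energy` and `min_marginals` are assumptions for illustration.

```python
import itertools
import math


def energy(x, unary, pairwise_weight):
    """Energy of labeling x on a chain: unary costs plus a penalty for
    every pair of adjacent variables with different labels."""
    e = sum(unary[i][xi] for i, xi in enumerate(x))
    e += pairwise_weight * sum(x[i] != x[i + 1] for i in range(len(x) - 1))
    return e


def min_marginals(unary, pairwise_weight, num_labels):
    """mm[i][l] = minimum energy over all labelings with x[i] == l.
    Exhaustive enumeration, so only usable for very small models."""
    n = len(unary)
    mm = [[math.inf] * num_labels for _ in range(n)]
    for x in itertools.product(range(num_labels), repeat=n):
        e = energy(x, unary, pairwise_weight)
        for i, xi in enumerate(x):
            mm[i][xi] = min(mm[i][xi], e)
    return mm


# Three binary variables; unary[i][l] is the cost of giving variable i label l.
unary = [[0, 2], [1, 1], [3, 0]]
mm = min_marginals(unary, pairwise_weight=1, num_labels=2)
```

The smallest entry in every row of `mm` equals the global minimum energy, and the gap between the entries of a row indicates how strongly the model prefers one label over the other for that variable.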

Tracking and Motion

Tracking Dynamic Near-Regular Texture Under Occlusion and Rapid Movements

We present a dynamic near-regular texture (NRT) tracking algorithm nested in a lattice-based Markov-Random-Field (MRF) model of a 3D spatiotemporal space. One basic observation used in our work is that the lattice structure of a dynamic NRT remains invariant despite drastic variations in its geometry or appearance. On the other hand, dynamic NRT poses special computational challenges for state-of-the-art tracking algorithms, including highly ambiguous correspondences, occlusions, and drastic illumination and appearance variations. Our tracking algorithm takes advantage of the topological invariance of the dynamic NRT by combining a global lattice structure that characterizes the topological constraint among multiple textons and an image observation model that handles local geometry and appearance variations. Without any assumptions on the types of motion, camera model or lighting conditions, our tracking algorithm can effectively capture the varying underlying lattice structure of a dynamic NRT in different real world examples, including moving cloth, underwater patterns and a marching crowd.

Wen-Chieh Lin, Yanxi Liu

Simultaneous Object Pose and Velocity Computation Using a Single View from a Rolling Shutter Camera

An original concept for computing instantaneous 3D pose and 3D velocity of fast moving objects using a single view is proposed, implemented and validated. It takes advantage of the image deformations induced by rolling shutter in CMOS image sensors. First, after analysing the rolling shutter phenomenon, we introduce an original model of the image formation when using such a camera, based on a general model of moving rigid sets of 3D points. Using 2D-3D point correspondences, we derive two complementary methods, compensating for the rolling shutter deformations to deliver an accurate 3D pose and exploiting them to also estimate the full 3D velocity. The first solution is a general one based on non-linear optimization and bundle adjustment, usable for any object, while the second one is a closed-form linear solution valid for planar objects. The resulting algorithms enable us to transform a low-cost, low-power CMOS camera into an innovative and powerful velocity sensor. Finally, experimental results with real data confirm the relevance and accuracy of the approach.

Omar Ait-Aider, Nicolas Andreff, Jean Marc Lavest, Philippe Martinet

A Theory of Multiple Orientation Estimation

Estimation of local orientations in multivariate signals (including optical flow estimation as special case of orientation in space-time-volumes) is an important problem in image processing and computer vision. Modelling a signal using only a single orientation is often too restrictive, since occlusions and transparency happen frequently, thus necessitating the modelling and analysis of multiple orientations.

In this paper, we therefore develop a unifying mathematical model for multiple orientations: beyond describing an arbitrary number of orientations in multivariate vector-valued image data such as color image sequences, it allows the unified treatment of transparently and occludingly superimposed oriented structures. Based on this model, we derive novel estimation schemes for an arbitrary number of superimposed orientations in bivariate images as well as for double orientations in signals of arbitrary signal dimensionality. The estimated orientations themselves, but also features like the number of local orientations or the angles between multiple orientations (which are invariant under rotation) can be used for various inspection, tracking and segmentation problems. We evaluate the performance of our framework on both synthetic and real data.

Matthias Mühlich, Til Aach

Poster Session II

Tracking and Motion

Resolution-Aware Fitting of Active Appearance Models to Low Resolution Images

Active Appearance Models (AAM) are compact representations of the shape and appearance of objects. Fitting AAMs to images is a difficult, non-linear optimization task. Traditional approaches minimize the L2 norm error between the model instance and the input image warped onto the model coordinate frame. While this works well for high resolution data, the fitting accuracy degrades quickly at lower resolutions. In this paper, we show that a careful design of the fitting criterion can overcome many of the low resolution challenges. In our resolution-aware formulation (RAF), we explicitly account for the finite size sensing elements of digital cameras, and simultaneously model the processes of object appearance variation, geometric deformation, and image formation. As such, our Gauss-Newton gradient descent algorithm not only synthesizes model instances as a function of estimated parameters, but also simulates the formation of low resolution images in a digital camera. We compare the RAF algorithm against a state-of-the-art tracker across a variety of resolution and model complexity levels. Experimental results show that RAF considerably improves the estimation accuracy of both shape and appearance parameters when fitting to low resolution data.

Göksel Dedeoǧlu, Simon Baker, Takeo Kanade

High Accuracy Optical Flow Serves 3-D Pose Tracking: Exploiting Contour and Flow Based Constraints

Tracking the 3-D pose of an object requires correspondences between 2-D features in the image and their 3-D counterparts in the object model. A large variety of such features has been suggested in the literature. All of them have drawbacks in one situation or another, since their extraction in the image and/or the matching is prone to errors. In this paper, we propose to use two complementary types of features for pose tracking, such that one type makes up for the shortcomings of the other. Aside from the object contour, which is matched to a free-form object surface, we suggest employing the optic flow in order to compute additional point correspondences. Optic flow estimation is a mature research field with sophisticated algorithms available. Using a high-quality method here ensures reliable matching. In our experiments we demonstrate the performance of our method and in particular the improvements due to the optic flow.

Thomas Brox, Bodo Rosenhahn, Daniel Cremers, Hans-Peter Seidel

Enhancing the Point Feature Tracker by Adaptive Modelling of the Feature Support

We consider the problem of tracking a given set of point features over large sequences of image frames. A classic procedure for monitoring the tracking quality consists in requiring that the current features nicely warp towards their reference appearances. The procedure recommends focusing on features projected from planar 3D patches (planar features), by enforcing a conservative threshold on the residual of the difference between the warped current feature and the reference. However, in some important contexts, there are many features for which the planarity assumption is only partially satisfied, while the true planar features are not so abundant. This is especially true when the motion of the camera is mainly translational and parallel to the optical axis (such as when driving a car along straight sections of the road), which induces a permanent increase of the apparent feature size. Tracking features containing occluding boundaries then becomes an interesting goal, for which we propose a multi-scale monitoring solution striving to maximize the lifetime of the feature, while also detecting the tracking failures. The devised technique infers the parts of the reference which are not projected from the same 3D surface as the patch which has been consistently tracked until the present moment. The experiments on real sequences taken from cars driving through urban environments show that the technique is effective in increasing the average feature lifetimes, especially in sequences with occlusions and large photometric variations.

Siniša Šegvić, Anthony Remazeilles, François Chaumette

Tracking Objects Across Cameras by Incrementally Learning Inter-camera Colour Calibration and Patterns of Activity

This paper presents a scalable solution to the problem of tracking objects across spatially separated, uncalibrated, non-overlapping cameras. Unlike other approaches, this technique uses an incremental learning method to model both the colour variations and the posterior probability distributions of spatio-temporal links between cameras. These operate in parallel and are then used with an appearance model of the object to track across spatially separated cameras. The approach requires no pre-calibration or batch preprocessing, is completely unsupervised, and becomes more accurate over time as evidence is accumulated.

Andrew Gilbert, Richard Bowden

Monocular Tracking of 3D Human Motion with a Coordinated Mixture of Factor Analyzers

Filtering based algorithms have become popular in tracking human body pose. Such algorithms can suffer the curse of dimensionality due to the high dimensionality of the pose state space; therefore, efforts have been dedicated to either smart sampling or reducing the dimensionality of the original pose state space. In this paper, a novel formulation that employs a dimensionality reduced state space for multi-hypothesis tracking is proposed. During off-line training, a mixture of factor analyzers is learned. Each factor analyzer can be thought of as a “local dimensionality reducer” that locally approximates the pose manifold. Global coordination between local factor analyzers is achieved by learning a set of linear mixture functions that enforces agreement between local factor analyzers. The formulation allows easy bidirectional mapping between the original body pose space and the low-dimensional space. During online tracking, the clusters of factor analyzers are utilized in a multiple hypothesis tracking algorithm. Experiments demonstrate that the proposed algorithm tracks 3D body pose efficiently and accurately, even when self-occlusion, motion blur and large limb movements occur. Quantitative comparisons show that the formulation produces more accurate 3D pose estimates over time than those that can be obtained via a number of previously-proposed particle filtering based tracking algorithms.

Rui Li, Ming-Hsuan Yang, Stan Sclaroff, Tai-Peng Tian

Multiview Geometry and 3D Reconstruction

Balanced Exploration and Exploitation Model Search for Efficient Epipolar Geometry Estimation

The estimation of the epipolar geometry is especially difficult where the putative correspondences include a low percentage of inlier correspondences and/or a large subset of the inliers is consistent with a degenerate configuration of the epipolar geometry that is totally incorrect. This work presents the Balanced Exploration and Exploitation Model Search (BEEM) algorithm that works very well especially for these difficult scenes.

The BEEM algorithm handles the above two difficult cases in a unified manner. The algorithm includes the following main features: (1) Balanced use of three search techniques: global random exploration, local exploration near the current best solution and local exploitation to improve the quality of the model. (2) Exploits available prior information to accelerate the search process. (3) Uses the best found model to guide the search process, escape from degenerate models and to define an efficient stopping criterion. (4) Presents a simple and efficient method to estimate the epipolar geometry from two SIFT correspondences. (5) Uses the locality-sensitive hashing (LSH) approximate nearest neighbor algorithm for fast putative correspondences generation.

The resulting algorithm, when tested on real images with or without degenerate configurations, gives quality estimations and achieves significant speedups compared to state-of-the-art algorithms.

Liran Goshen, Ilan Shimshoni

Shape-from-Silhouette with Two Mirrors and an Uncalibrated Camera

Two planar mirrors are positioned to show five views of an object, and snapshots are captured from different viewpoints. We present closed form solutions for calculating the focal length, principal point, mirror and camera poses directly from the silhouette outlines of the object and its reflections. In the noisy case, these equations are used to form initial parameter estimates that are refined using iterative minimisation. The self-calibration allows the visual cones from each silhouette to be specified in a common reference frame so that the visual hull can be constructed. The proposed setup provides a simple method for creating 3D multimedia content that does not rely on specialised equipment. Experimental results demonstrate the reconstruction of a toy horse and a locust from real images. Synthetic images are used to quantify the sensitivity of the self-calibration to quantisation noise. In terms of the silhouette calibration ratio, degradation in silhouette quality has a greater effect on silhouette set consistency than computed calibration parameters.

Keith Forbes, Fred Nicolls, Gerhard de Jager, Anthon Voigt

Robust and Efficient Photo-Consistency Estimation for Volumetric 3D Reconstruction

Estimating photo-consistency is one of the most important ingredients for any 3D stereo reconstruction technique that is based on a volumetric scene representation. This paper presents a new, illumination invariant photo-consistency measure for high quality, volumetric 3D reconstruction from calibrated images. In contrast to current standard methods such as normalized cross-correlation it supports unconstrained camera setups and non-planar surface approximations. We show how this measure can be embedded into a highly efficient, completely hardware accelerated volumetric reconstruction pipeline by exploiting current graphics processors. We provide examples of high quality reconstructions with computation times of only a few seconds to minutes, even for large numbers of cameras and high volumetric resolutions.

Alexander Hornung, Leif Kobbelt

An Affine Invariant of Parallelograms and Its Application to Camera Calibration and 3D Reconstruction

In this work, a new affine invariant of parallelograms is introduced, and the explicit constraint equations between the intrinsic matrix of a camera and the similar invariants of a parallelogram or a parallelepiped are established using this affine invariant. Camera calibration and 3D reconstruction from parallelograms are systematically studied based on these constraints. The proposed theoretical results and algorithms have wide applicability as parallelograms and parallelepipeds are not rare in man-made scenes. Experimental results on synthetic and real images validate the proposed approaches.

F. C. Wu, F. Q. Duan, Z. Y. Hu

Nonrigid Shape and Motion from Multiple Perspective Views

We consider the problem of nonrigid shape and motion recovery from point correspondences in multiple perspective views. It is well known that the constraints among multiple views of a rigid shape are multilinear on the image points and can be reduced to bilinear (epipolar) and trilinear constraints among two and three views, respectively. In this paper, we generalize this classic result by showing that the constraints among multiple views of a nonrigid shape consisting of K shape bases can be reduced to multilinear constraints among K + ⌈(K + 1)/2⌉, ⋯, 2K + 1 views. We then present a closed form solution to the reconstruction of a nonrigid shape consisting of two shape bases. We show that point correspondences in five views are related by a nonrigid quintifocal tensor, from which one can linearly compute nonrigid shape and motion. We also demonstrate the existence of intrinsic ambiguities in the reconstruction of camera translation, shape coefficients and shape bases. Examples show the effectiveness of our method on nonrigid scenes with significant perspective effects.

René Vidal, Daniel Abretske
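
The view counts stated in the abstract above can be checked with a few lines of Python. The sketch below reads the expression as the range of integers from K + ⌈(K + 1)/2⌉ up to 2K + 1, which is an interpretation of the abstract's notation; the function name `view_counts` is hypothetical.

```python
import math


def view_counts(k):
    """Numbers of views, from k + ceil((k + 1) / 2) up to 2k + 1, among
    which the multilinear constraints for k shape bases are stated."""
    lowest = k + math.ceil((k + 1) / 2)
    highest = 2 * k + 1
    return list(range(lowest, highest + 1))


print(view_counts(1))  # [2, 3]
print(view_counts(2))  # [4, 5]
```

Under this reading, K = 1 gives two and three views, matching the bilinear and trilinear constraints of the rigid case, and K = 2 tops out at five views, consistent with the quintifocal tensor mentioned in the abstract.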

3D Surface Reconstruction Using Graph Cuts with Surface Constraints

We describe a graph cut algorithm to recover the 3D object surface using both silhouette and foreground color information. The graph cut algorithm is used for optimization on a color consistency field. Constraints are added to improve its performance. These constraints are a set of predetermined locations that the true surface of the object is likely to pass through. They are used to preserve protrusions and to pursue concavities respectively in the first and the second phase of the algorithm. We also introduce a method for dealing with silhouette uncertainties arising from background subtraction on real data. We test the approach on synthetic data with different numbers of views (8, 16, 32, 64) and on a real image set containing 30 views of a toy squirrel.

Son Tran, Larry Davis

Statistical Models and Visual Learning

Trace Quotient Problems Revisited

The formulation of trace quotient is shared by many computer vision problems; however, it was conventionally approximated by an essentially different formulation of quotient trace, which can be solved with the generalized eigenvalue decomposition approach. In this paper, we present a direct solution to the former formulation. First, considering that the feasible solutions are constrained on a Grassmann manifold, we present a necessary condition for the optimal solution of the trace quotient problem, which then naturally elicits an iterative procedure for pursuing the optimal solution. The proposed algorithm, referred to as Optimal Projection Pursuing (OPP), has the following characteristics: 1) OPP directly optimizes the trace quotient, and is theoretically optimal; 2) OPP does not suffer from the solution uncertainty issue existing in the quotient trace formulation that the objective function value is invariant under any nonsingular linear transformation, and OPP is invariant only under orthogonal transformations, which does not affect final distance measurement; and 3) OPP reveals the underlying equivalence between the trace quotient problem and the corresponding trace difference problem. Extensive experiments on face recognition validate the superiority of OPP over the solution of the corresponding quotient trace problem in both objective function value and classification capability.

Shuicheng Yan, Xiaoou Tang
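
For readers unfamiliar with the three formulations named in the abstract above, they can be written out as follows. The symbols are assumptions chosen for this sketch rather than the paper's notation: A and B are symmetric matrices (B positive definite), W is the projection matrix being sought, and λ is a scalar.

```latex
% Trace quotient, with W constrained to have orthonormal columns:
W^{*} = \arg\max_{W^{\top} W = I}
        \frac{\operatorname{tr}(W^{\top} A W)}{\operatorname{tr}(W^{\top} B W)}

% Quotient trace, the conventional surrogate solved by generalized
% eigenvalue decomposition:
W^{*} = \arg\max_{W}
        \operatorname{tr}\!\left( (W^{\top} B W)^{-1} (W^{\top} A W) \right)

% Trace difference for a given scalar \lambda:
W^{*} = \arg\max_{W^{\top} W = I}
        \operatorname{tr}\!\left( W^{\top} (A - \lambda B) W \right)
```

Replacing W by WR with any nonsingular R leaves the quotient trace objective unchanged, since tr(R^{-1} M R) = tr(M); this is the solution uncertainty the abstract refers to. The trace quotient objective is unchanged only when R is orthogonal.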

Learning Nonlinear Manifolds from Time Series

There has been growing interest in developing nonlinear dimensionality reduction algorithms for vision applications. Although progress has been made in recent years, conventional nonlinear dimensionality reduction algorithms have been designed to deal with stationary, or independent and identically distributed data. In this paper, we present a novel method that learns a nonlinear mapping from time series data to their intrinsic coordinates on the underlying manifold. Our work extends the recent advances in learning nonlinear manifolds within a global coordinate system to account for temporal correlation inherent in sequential data. We formulate the problem with a dynamic Bayesian network and propose an approximate algorithm to tackle the learning and inference problems. Numerous experiments demonstrate that the proposed method is able to learn nonlinear manifolds from time series data and, as a result of exploiting the temporal correlation, achieves superior results.

Ruei-Sung Lin, Che-Bin Liu, Ming-Hsuan Yang, Narendra Ahuja, Stephen Levinson

Accelerated Convergence Using Dynamic Mean Shift

Mean shift is an iterative mode-seeking algorithm widely used in pattern recognition and computer vision. However, its convergence is sometimes too slow to be practical. In this paper, we improve the convergence speed of mean shift by dynamically updating the sample set during the iterations, and the resultant procedure is called dynamic mean shift (DMS). When the data is locally Gaussian, it can be shown that both the standard and dynamic mean shift algorithms converge to the same optimal solution. However, while standard mean shift only has linear convergence, the dynamic mean shift algorithm has superlinear convergence. Experiments on color image segmentation show that dynamic mean shift produces results comparable to those of the standard mean shift algorithm, but can significantly reduce the number of iterations needed for convergence and takes much less time.

Kai Zhang, James T. Kwok, Ming Tang
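
For reference, the standard mean shift iteration that the abstract above sets out to accelerate can be sketched in Python as follows. The Gaussian kernel, the `bandwidth` parameter, and the function name `mean_shift_mode` are assumptions for this example; the dynamic variant, which updates the sample set during the iterations, is not reproduced here.

```python
import numpy as np


def mean_shift_mode(start, samples, bandwidth, tol=1e-6, max_iter=500):
    """Standard mean shift with a Gaussian kernel (assumed for this sketch).

    Starting from `start`, repeatedly move to the kernel-weighted mean of
    the fixed sample set until the step is smaller than `tol`."""
    samples = np.asarray(samples, dtype=float)
    x = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        sq_dist = np.sum((samples - x) ** 2, axis=1)
        weights = np.exp(-sq_dist / (2.0 * bandwidth ** 2))
        x_new = (weights[:, None] * samples).sum(axis=0) / weights.sum()
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x


# Two well-separated 1-D clusters; a start near the first one
# converges to the mode of that cluster.
samples = np.array([[-0.5], [0.0], [0.5], [9.5], [10.0], [10.5]])
mode = mean_shift_mode([1.0], samples, bandwidth=1.0)
```

Because the sample set stays fixed, each step contracts the distance to the mode by a roughly constant factor, which is the linear convergence the abstract contrasts with the superlinear behaviour of the dynamic variant.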

Efficient Belief Propagation with Learned Higher-Order Markov Random Fields

Belief propagation (BP) has become widely used for low-level vision problems and various inference techniques have been proposed for loopy graphs. These methods typically rely on ad hoc spatial priors such as the Potts model. In this paper we investigate the use of learned models of image structure, and demonstrate the improvements obtained over previous ad hoc models for the image denoising problem. In particular, we show how both pairwise and higher-order Markov random fields with learned clique potentials capture rich image structures that better represent the properties of natural images. These models are learned using the recently proposed Fields-of-Experts framework. For such models, however, traditional BP is computationally expensive. Consequently we propose some approximation methods that make BP with learned potentials practical. In the case of pairwise models we propose a novel approximation of robust potentials using a finite family of quadratics. In the case of higher order MRFs, with 2×2 cliques, we use an adaptive state space to handle the increased complexity. Extensive experiments demonstrate the power of learned models, the benefits of higher-order MRFs and the practicality of BP for these problems with the use of simple principled approximations.

Xiangyang Lan, Stefan Roth, Daniel Huttenlocher, Michael J. Black

Non Linear Temporal Textures Synthesis: A Monte Carlo Approach

In this paper we consider the problem of temporal texture modeling and synthesis. A temporal texture (or dynamic texture) is seen as the output of a dynamical system driven by white noise. Experimental evidence shows that linear models such as those introduced in earlier work are sometimes inadequate to fully describe the time evolution of the dynamic scene. Extending upon recent work which is available in the literature, we tackle the synthesis using non-linear dynamical models. The non-linear model is never given explicitly but rather we describe a methodology to generate samples from the model. The method requires estimating the “state” distribution and a linear dynamical model from the original clip which are then used respectively as target distribution and proposal mechanism in a rejection sampling step. We also report extensive experimental results comparing the proposed approach with the results obtained using linear models (Doretto et al.) and the “closed-loop” approach presented at ECCV 2004 by Yuan et al.

Andrea Masiero, Alessandro Chiuso
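
The rejection sampling step mentioned in the abstract above pairs a target distribution with a proposal mechanism. The following Python sketch shows the generic technique on a one-dimensional toy problem; it does not model the texture state distribution or the linear dynamical system of the paper, and the function name `rejection_sample` and all of its parameters are assumptions for illustration.

```python
import numpy as np


def rejection_sample(target_pdf, proposal_sample, proposal_pdf, bound,
                     n_draws, rng):
    """Generic rejection sampling.

    Draw candidates from the proposal and keep candidate x with probability
    target_pdf(x) / (bound * proposal_pdf(x)). This is valid when
    target_pdf(x) <= bound * proposal_pdf(x) for every x."""
    candidates = proposal_sample(rng, n_draws)
    u = rng.uniform(size=n_draws)
    accept = u * bound * proposal_pdf(candidates) < target_pdf(candidates)
    return candidates[accept]


# Toy example: target density 2x on [0, 1], uniform proposal, bound 2.
rng = np.random.default_rng(0)
accepted = rejection_sample(
    target_pdf=lambda x: 2.0 * x,
    proposal_sample=lambda r, n: r.uniform(size=n),
    proposal_pdf=lambda x: np.ones_like(x),
    bound=2.0,
    n_draws=20000,
    rng=rng,
)
```

The accepted samples follow the target density even though every candidate came from the proposal, which is what lets a simple proposal stand in for a model that is never written down explicitly.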

Low-Level Vision, Image Features

Curvature-Preserving Regularization of Multi-valued Images Using PDE’s

We are interested in diffusion PDE’s for smoothing multi-valued images in an anisotropic manner. By pointing out the pros and cons of existing tensor-driven regularization methods, we introduce a new constrained diffusion PDE that regularizes image data while taking curvatures of image structures into account. Our method has a direct link with a continuous formulation of the Line Integral Convolutions, allowing us to design a very fast and stable algorithm for its implementation. Moreover, our smoothing scheme performs numerically with sub-pixel accuracy and is therefore able to preserve very thin image structures, in contrast to classical PDE discretizations based on finite difference approximations. We illustrate our method with different applications on color images.

David Tschumperlé

Higher Order Image Pyramids

The scale invariant property of an ensemble of natural images is examined, which motivates a new early visual representation termed the higher order pyramid. The representation is a non-linear generalization of the Laplacian pyramid and is tuned to the type of scale invariance exhibited by natural imagery as opposed to other scale invariant images such as 1/f correlated noise and the step edge. The transformation of an image to a higher order pyramid is simple to compute and straightforward to invert. Because the representation is invertible it is shown that the higher order pyramid can be truncated and quantized with little loss of visual quality. Images coded in this representation have much less redundancy than the raw image pixels and decorrelating transformations such as the Laplacian pyramid. This is demonstrated by showing statistical independence between pairs of coefficients. Because the representation is tuned to the ensemble redundancies, the coefficients of the higher order pyramid are more efficient at capturing the variation within the ensemble, which leads to improved matching results. This is demonstrated on two recognition tasks: face recognition with illumination changes and object recognition with viewpoint changes.

Joshua Gluckman
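
The abstract above describes its representation as a non-linear generalization of the Laplacian pyramid. As background, the Python sketch below builds and exactly inverts a plain Laplacian pyramid; the higher order pyramid itself is not reproduced. The 2×2 block averaging and nearest-neighbour upsampling are simplifying assumptions for this example (the classical construction uses a smoothing kernel), and the function names are hypothetical.

```python
import numpy as np


def downsample(img):
    """Halve each dimension by averaging 2x2 blocks (assumed filter)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def upsample(img):
    """Double each dimension by repeating every pixel."""
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)


def build_laplacian_pyramid(img, levels):
    """Return `levels` band-pass images followed by the coarsest image.
    Image sides must be divisible by 2 ** levels."""
    pyramid = []
    current = np.asarray(img, dtype=float)
    for _ in range(levels):
        smaller = downsample(current)
        pyramid.append(current - upsample(smaller))  # detail lost by coarsening
        current = smaller
    pyramid.append(current)
    return pyramid


def reconstruct(pyramid):
    """Invert the pyramid by adding each band back onto the upsampled
    coarser level, from coarsest to finest."""
    current = pyramid[-1]
    for band in reversed(pyramid[:-1]):
        current = upsample(current) + band
    return current
```

Each band stores exactly what the coarser level cannot represent, so the inversion is exact; this is the property that makes truncating or quantizing individual levels a controlled approximation.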

Image Specific Feature Similarities

Calculating a reliable similarity measure between pixel features is essential for many computer vision and image processing applications. We propose a similarity measure (affinity) between pixel features, which depends on the feature space histogram of the image. We use the observation that clusters in the feature space histogram are typically smooth and roughly convex. Given two feature points we adjust their similarity according to the bottleneck in the histogram values on the straight line between them. We call our new similarities Bottleneck Affinities. These measures are computed efficiently, and we demonstrate superior segmentation results compared to the use of the Euclidean metric.

Ido Omer, Michael Werman

Coloring Local Feature Extraction

Although color is commonly experienced as an indispensable quality in describing the world around us, state-of-the-art local feature-based representations are mostly based on shape description, and ignore color information. The description of color is hampered by the large number of variations that cause the measured color values to vary significantly. In this paper we aim to extend the description of local features with color information. To achieve wide applicability, the color descriptor should be robust to (1) photometric changes commonly encountered in the real world, and (2) varying image quality, from high quality images to snap-shot photo quality and compressed internet images. Based on these requirements we derive a set of color descriptors. The proposed descriptors are compared by extensive testing on multiple application areas, namely matching, retrieval and classification, and on a wide variety of image qualities. The results show that color descriptors remain reliable under photometric and geometrical changes, and with decreasing image quality. For all experiments a combination of color and shape outperforms a pure shape-based approach.

Joost van de Weijer, Cordelia Schmid

Defocus Inpainting

In this paper, we propose a method to restore a single image affected by space-varying blur. The main novelty of our method is the use of recurring patterns as regularization during the restoration process. We postulate that restored patterns in the deblurred image should resemble other sharp details in the input image. To this purpose, we establish the correspondence of regions that are similar up to Gaussian blur. When two regions are in correspondence, one can perform deblurring by using the sharpest of the two as a proposal. Our solution consists of two steps: First, estimate correspondence of similar patches and their relative amount of blurring; second, restore the input image by imposing the similarity of such recurring patterns as a prior. Our approach has been successfully tested on both real and synthetic data.

Paolo Favaro, Enrico Grisan

Viewpoint Induced Deformation Statistics and the Design of Viewpoint Invariant Features: Singularities and Occlusions

We study the set of domain deformations induced on images of three-dimensional scenes by changes of the vantage point. We parametrize such deformations and derive empirical statistics on the parameters, which show a kurtotic behavior similar to that of natural image and range statistics. Such a behavior would suggest that most deformations are locally smooth, and therefore could be captured by simple parametric maps, such as affine ones. However, we show that deformations induced by singularities and occluding boundaries, although rare, are highly salient, thus warranting the development of dedicated descriptors. We therefore illustrate the development of viewpoint invariant descriptors for singularities, as well as for occluding boundaries. We test their performance on scenes where the current state of the art based on affine-invariant region descriptors fails to establish correspondence, highlighting the features and shortcomings of our approach.

Andrea Vedaldi, Stefano Soatto

Face/Gesture/Action Detection and Recognition

Spatio-temporal Embedding for Statistical Face Recognition from Video

This paper addresses the problem of how to learn an appropriate feature representation from video to benefit video-based face recognition. By simultaneously exploiting the spatial and temporal information, the problem is posed as learning a Spatio-Temporal Embedding (STE) from raw video. The STE of a video sequence is defined as its condensed version capturing the essence of the space-time characteristics of the video. Relying on the co-occurrence statistics and supervised signatures provided by training videos, STE preserves the intrinsic temporal structures hidden in the video volume while encoding the discriminative cues into the spatial domain. To conduct STE, we propose two novel techniques, Bayesian keyframe learning and nonparametric discriminant embedding (NDE), for temporal and spatial learning, respectively. In terms of learned STEs, we derive a statistical formulation of the recognition problem with a probabilistic fusion model. On a large face video database containing more than 200 training and testing sequences, our approach consistently outperforms state-of-the-art methods, achieving perfect recognition accuracy.

Wei Liu, Zhifeng Li, Xiaoou Tang

Super-Resolution of 3D Face

Super-resolution is a technique to restore detailed information from degraded data. Much previous work addresses 2D images, while super-resolution of 3D models has received little attention. This paper focuses on the super-resolution of 3D human faces. We first extend the 2D image pyramid model to the progressive resolution chain (PRC) model in the 3D domain, to describe the variation in detail as resolution decreases. Then a consistent planar representation of 3D faces is presented, which enables the analysis and comparison of the features of the same facial part for the subsequent restoration process. Finally, formulated as solving an iterative quadratic system by maximizing a posteriori, a 3D restoration algorithm using PRC features is given. The experimental results on the USF HumanID 3D face database demonstrate the effectiveness of the proposed approach.

Gang Pan, Shi Han, Zhaohui Wu, Yueming Wang

Estimating Gaze Direction from Low-Resolution Faces in Video

In this paper we describe a new method for automatically estimating where a person is looking in images where the head is typically in the range 20 to 40 pixels high. We use a feature vector based on skin detection to estimate the orientation of the head, which is discretised into 8 different orientations, relative to the camera. A fast sampling method returns a distribution over previously-seen head-poses. The overall body pose relative to the camera frame is approximated using the velocity of the body, obtained via automatically-initiated colour-based tracking in the image sequence. We show that, by combining direction and head-pose information, gaze is determined more robustly than by using each feature alone. We demonstrate this technique on surveillance and sports footage.

Neil Robertson, Ian Reid

Learning Effective Intrinsic Features to Boost 3D-Based Face Recognition

3D image data provide several advantages over 2D data for face recognition and overcome many problems of methods based on 2D intensity images. In this paper, we propose a novel approach to 3D-based face recognition. First, a novel representation, called intrinsic features, is presented to encode local 3D shapes. It describes complementary non-relational features to provide an intrinsic representation of faces. This representation is extracted after alignment, and is invariant to translation, rotation and scale. Without reduction, tens of thousands of intrinsic features can be produced for a face, but not all of them are useful and equally important. Therefore, in the second part of the work, we introduce a learning method for selecting the most effective local features and combining them into a strong classifier using an AdaBoost learning procedure. Experiments are performed on a large 3D face database obtained with complex illumination, pose and expression variations. The results demonstrate that the proposed approach produces consistently better results than existing methods.

Chenghua Xu, Tieniu Tan, Stan Li, Yunhong Wang, Cheng Zhong

Human Detection Using Oriented Histograms of Flow and Appearance

Detecting humans in films and videos is a challenging problem owing to the motion of the subjects, the camera and the background and to variations in pose, appearance, clothing, illumination and background clutter. We develop a detector for standing and moving people in videos with possibly moving cameras and backgrounds, testing several different motion coding schemes and showing empirically that orientated histograms of differential optical flow give the best overall performance. These motion-based descriptors are combined with our Histogram of Oriented Gradient appearance descriptors. The resulting detector is tested on several databases including a challenging test set taken from feature films and containing wide ranges of pose, motion and background variations, including moving cameras and backgrounds. We validate our results on two challenging test sets containing more than 4400 human examples. The combined detector reduces the false alarm rate by a factor of 10 relative to the best appearance-based detector, for example giving false alarm rates of 1 per 20,000 windows tested at 8% miss rate on our Test Set 1.

Navneet Dalal, Bill Triggs, Cordelia Schmid

Cyclostationary Processes on Shape Spaces for Gait-Based Recognition

We present a novel approach to gait recognition that considers gait sequences as cyclostationary processes on a shape space of simple closed curves. Consequently, gait analysis reduces to quantifying differences between statistics underlying these stochastic processes. The main steps in the proposed approach are: (i) off-line extraction of human silhouettes from IR video data, (ii) use of piecewise-geodesic paths, connecting the observed shapes, to smoothly interpolate between them, (iii) computation of an average gait cycle within class (i.e. associated with a person) using average shapes, (iv) registration of average cycles using linear and nonlinear time scaling, (v) comparisons of average cycles using geodesic lengths between the corresponding registered shapes. We illustrate this approach on infrared video clips involving 26 subjects.

David Kaziska, Anuj Srivastava

Segmentation and Grouping

Multiclass Image Labeling with Semidefinite Programming

We propose a semidefinite relaxation technique for multiclass image labeling problems. In this context, we consider labeling as a special case of supervised classification with a predefined number of classes and known but arbitrary dissimilarities between each image element and each class. Using Markov random fields to model pairwise relationships, this leads to a global energy minimization problem. In order to handle its combinatorial complexity, we apply Lagrangian relaxation to derive a semidefinite program, which has several advantageous properties over alternative methods like graph cuts. In particular, there are no restrictions concerning the form of the pairwise interactions, which e.g. allows us to incorporate a basic shape concept into the energy function. Based on the solution matrix of our convex relaxation, a suboptimal solution of the original labeling problem can be easily computed. Statistical ground-truth experiments and several examples of multiclass image labeling and restoration problems show that high quality solutions are obtained with this technique.

Jens Keuchel

Automatic Image Segmentation by Positioning a Seed

We present a method that automatically partitions a single image into non-overlapping regions coherent in texture and colour. We assume that each textured or coloured region can be represented by a small template, called the seed. Positioning the seed across the input image gives many possible sub-segmentations of the image having the same texture and colour properties as the pixels behind the seed. A probability map constructed during the sub-segmentations helps to assign each pixel to its single most probable region and to produce the final pyramid representing various detailed segmentations at each level. Each sub-segmentation is obtained as the min-cut/max-flow in the graph built from the image and the seed. One segment may consist of several isolated parts. Compared to other methods our approach does not need a learning process or a priori information about the textures in the image. Performance of the method is evaluated on images from the Berkeley database.

Branislav Mičušík, Allan Hanbury

Patch-Based Texture Edges and Segmentation

A novel technique for extracting texture edges is introduced. It is based on the combination of two ideas: the patch-based approach, and non-parametric tests of distributions.

Our method can reliably detect texture edges using only local information. Therefore, it can be computed as a preprocessing step prior to segmentation, and can be very easily combined with parametric deformable models. These models furnish our system with smooth boundaries and globally salient structures.

Lior Wolf, Xiaolei Huang, Ian Martin, Dimitris Metaxas

Unsupervised Texture Segmentation with Nonparametric Neighborhood Statistics

This paper presents a novel approach to unsupervised texture segmentation that relies on a very general nonparametric statistical model of image neighborhoods. The method models image neighborhoods directly, without the construction of intermediate features. It does not rely on using specific descriptors that work for certain kinds of textures, but is rather based on a more generic approach that tries to adaptively capture the core properties of textures. It exploits the fundamental description of textures as images derived from stationary random fields and models the associated higher-order statistics nonparametrically. This general formulation enables the method to easily adapt to various kinds of textures. The method minimizes an entropy-based metric on the probability density functions of image neighborhoods to give an optimal segmentation. The entropy minimization drives a very fast level-set scheme that uses threshold dynamics, which allows for a very rapid evolution towards the optimal segmentation during the initial iterations. The method does not rely on a training stage and, hence, is unsupervised. It automatically tunes its important internal parameters based on the information content of the data. The method generalizes in a straightforward manner from the two-region case to an arbitrary number of regions and incorporates an efficient multi-phase level-set framework. This paper presents numerous results, for both the two-texture and multiple-texture cases, using synthetic and real images that include electron-microscopy images.

Suyash P. Awate, Tolga Tasdizen, Ross T. Whitaker

Detecting Symmetry and Symmetric Constellations of Features

A novel and efficient method is presented for grouping feature points on the basis of their underlying symmetry and characterising the symmetries present in an image. We show how symmetric pairs of features can be efficiently detected, how the symmetry bonding each pair is extracted and evaluated, and how these can be grouped into symmetric constellations that specify the dominant symmetries present in the image. Symmetries over all orientations and radii are considered simultaneously, and the method is able to detect local or global symmetries, locate symmetric figures in complex backgrounds, detect bilateral or rotational symmetry, and detect multiple incidences of symmetry.

Gareth Loy, Jan-Olof Eklundh

Discovering Texture Regularity as a Higher-Order Correspondence Problem

Understanding texture regularity in real images is a challenging computer vision task. We propose a higher-order feature matching algorithm to discover the lattices of near-regular textures in real images. The underlying lattice of a near-regular texture identifies all of the texels as well as the global topology among the texels. A key contribution of this paper is to formulate lattice-finding as a correspondence problem. The algorithm finds a plausible lattice by iteratively proposing texels and assigning neighbors between the texels. Our matching algorithm seeks assignments that maximize both pair-wise visual similarity and higher-order geometric consistency. We approximate the optimal assignment using a recently developed spectral method. We successfully discover the lattices of a diverse set of unsegmented, real-world textures with significant geometric warping and large appearance variation among texels.

James Hays, Marius Leordeanu, Alexei A. Efros, Yanxi Liu

Object Recognition, Retrieval and Indexing

Exploiting Model Similarity for Indexing and Matching to a Large Model Database

This paper proposes a novel method to exploit model similarity in model-based 3D object recognition. The scenario consists of a large 3D model database of vehicles, and rapid indexing and matching needs to be done without sequential model alignment. In this scenario, the competition amongst shape features from similar models may pose a serious challenge to recognition. To solve the problem, we propose to use a metric to quantitatively measure model similarities. For each model, we use similarity measures to define a model-centric class (MCC), which contains a group of similar models and the pose transformations between the model and its class members. Similarity information embedded in an MCC is used to boost matching hypotheses generation so that the correct model gains more opportunities to be hypothesized and identified. The algorithm is implemented and extensively tested on 1100 real LADAR scans of vehicles with a model database containing over 360 models.

Yi Tan, Bogdan C. Matei, Harpreet Sawhney

Shift-Invariant Dynamic Texture Recognition

We address the problem of recognition of natural motions such as water, smoke and wind-blown vegetation. Such dynamic scenes exhibit characteristic stochastic motions, and we ask whether the scene contents can be recognized using motion information alone. Previous work on this problem has considered only the case where the texture samples have sufficient overlap to allow registration, so that the visual content of the scene is very similar between examples. In this paper we investigate the recognition of entirely non-overlapping views of the same underlying motion, specifically excluding appearance-based cues.

We describe the scenes with time-series models—specifically multivariate autoregressive (AR) models—so the recognition problem becomes one of measuring distances between AR models. We show that existing techniques, when applied to non-overlapping sequences, have significantly lower performance than on static-camera data. We propose several new schemes, and show that some outperform the existing methods.

Franco Woolfe, Andrew Fitzgibbon

Modeling 3D Objects from Stereo Views and Recognizing Them in Photographs

Local appearance models in the neighborhood of salient image features, together with local and/or global geometric constraints, serve as the basis for several recent and effective approaches to 3D object recognition from photographs. However, these techniques typically either fail to explicitly account for the strong geometric constraints associated with multiple images of the same 3D object, or require a large set of training images with much overlap to construct relatively sparse object models. This paper proposes a simple new method for automatically constructing 3D object models consisting of dense assemblies of small surface patches and affine-invariant descriptions of the corresponding texture patterns from a few (7 to 12) stereo pairs. Similar constraints are used to effectively identify instances of these models in highly cluttered photographs taken from arbitrary and unknown viewpoints. Experiments with a dataset consisting of 80 test images of 9 objects, including comparisons with a number of baseline algorithms, demonstrate the promise of the proposed approach.

Akash Kushal, Jean Ponce

A Boundary-Fragment-Model for Object Detection

The objective of this work is the detection of object classes, such as airplanes or horses. Instead of using a model based on salient image fragments, we show that object class detection is also possible using only the object’s boundary. To this end, we develop a novel learning technique to extract class-discriminative boundary fragments. In addition to their shape, these “codebook” entries also determine the object’s centroid (in the manner of Leibe et al. [19]). Boosting is used to select discriminative combinations of boundary fragments (weak detectors) to form a strong “Boundary-Fragment-Model” (BFM) detector. The generative aspect of the model is used to determine an approximate segmentation.

We demonstrate the following results: (i) the BFM detector is able to represent and detect object classes principally defined by their shape, rather than their appearance; and (ii) in comparison with other published results on several object classes (airplanes, cars-rear, cows) the BFM detector is able to exceed previous performances, and to achieve this with less supervision (such as the number of training images).

Andreas Opelt, Axel Pinz, Andrew Zisserman

Region Covariance: A Fast Descriptor for Detection and Classification

We describe a new region descriptor and apply it to two problems, object detection and texture classification. The covariance of d-features, e.g., the three-dimensional color vector, the norm of first and second derivatives of intensity with respect to x and y, etc., characterizes a region of interest. We describe a fast method for computation of covariances based on integral images. The idea presented here is more general than the previously published image sums or histograms: with a series of integral images, the covariances are obtained by a few arithmetic operations. Covariance matrices do not lie in a Euclidean space; therefore we use a distance metric involving generalized eigenvalues, which also follows from the Lie group structure of positive definite matrices. Feature matching is a simple nearest neighbor search under the distance metric and is performed extremely rapidly using the integral images. We show that the performance of the covariance features is superior to other methods, and that large rotations and illumination changes are also absorbed by the covariance matrix.

Oncel Tuzel, Fatih Porikli, Peter Meer
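
As a rough illustration of how integral images can yield region covariances in a few arithmetic operations, as the abstract above describes, the sketch below keeps one integral image per feature sum and one per feature product. It is not taken from the paper: the function names `build_integrals` and `region_covariance`, the nested-list data layout and the half-open rectangle convention are assumptions, and the sample covariance with an n - 1 denominator is one common choice.

```python
def build_integrals(features):
    """features[y][x] is a list of d feature values for pixel (x, y).

    Returns P and Q, where P[y][x][i] is the sum of feature i over all
    pixels above and to the left of (x, y), and Q[y][x][i][j] is the
    corresponding sum of products of features i and j.
    """
    h, w, d = len(features), len(features[0]), len(features[0][0])
    P = [[[0.0] * d for _ in range(w + 1)] for _ in range(h + 1)]
    Q = [[[[0.0] * d for _ in range(d)] for _ in range(w + 1)]
         for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            f = features[y][x]
            for i in range(d):
                P[y + 1][x + 1][i] = (f[i] + P[y][x + 1][i]
                                      + P[y + 1][x][i] - P[y][x][i])
                for j in range(d):
                    Q[y + 1][x + 1][i][j] = (f[i] * f[j] + Q[y][x + 1][i][j]
                                             + Q[y + 1][x][i][j] - Q[y][x][i][j])
    return P, Q


def region_covariance(P, Q, x0, y0, x1, y1):
    """Sample covariance of the features over x0 <= x < x1, y0 <= y < y1.

    The region must contain at least two pixels.
    """
    d = len(P[0][0])
    n = (x1 - x0) * (y1 - y0)

    def box(table, *idx):
        # Sum over the rectangle from four corner lookups.
        def at(y, x):
            value = table[y][x]
            for k in idx:
                value = value[k]
            return value
        return at(y1, x1) - at(y0, x1) - at(y1, x0) + at(y0, x0)

    s = [box(P, i) for i in range(d)]
    return [[(box(Q, i, j) - s[i] * s[j] / n) / (n - 1) for j in range(d)]
            for i in range(d)]


# Toy 4x3 image with two features per pixel.
feats = [[[float(x), float(x * y + 1)] for x in range(4)] for y in range(3)]
P, Q = build_integrals(feats)
C = region_covariance(P, Q, 1, 0, 4, 2)
```

Once `P` and `Q` are built, the covariance of any axis-aligned rectangle costs a fixed number of lookups per matrix entry, independent of the rectangle's size.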

Segmentation

Affine-Invariant Multi-reference Shape Priors for Active Contours

We present a new way of constraining the evolution of a region-based active contour with respect to a set of reference shapes. The approach is based on a description of shapes by the Legendre moments computed from their characteristic function. This provides a region-based representation that can handle arbitrary shape topologies. Moreover, exploiting the properties of moments, it is possible to include intrinsic affine invariance in the descriptor, which solves the issue of shape alignment without increasing the number of degrees of freedom of the initial problem and allows introducing geometric shape variabilities. Our new shape prior is based on a distance between the descriptors of the evolving curve and a reference shape. The proposed model naturally extends to the case where multiple reference shapes are simultaneously considered. Minimizing the shape energy leads to a geometric flow that does not rely on any particular representation of the contour and can be implemented with any contour evolution algorithm. We introduce our prior into a two-class segmentation functional, showing its benefits on segmentation results in the presence of severe occlusions and clutter. Examples illustrate the ability of the model to deal with large affine deformation and to take into account a set of reference shapes of different topologies.

Alban Foulonneau, Pierre Charbonnier, Fabrice Heitz

Figure/Ground Assignment in Natural Images

Figure/ground assignment is a key step in perceptual organization which assigns contours to one of the two abutting regions, providing information about occlusion and allowing high-level processing to focus on non-accidental shapes of figural regions. In this paper, we develop a computational model for figure/ground assignment in complex natural scenes. We utilize a large dataset of images annotated with human-marked segmentations and figure/ground labels for training and quantitative evaluation.

We operationalize the concept of familiar configuration by constructing prototypical local shapes, i.e. shapemes, from image data. Shapemes automatically encode mid-level visual cues to figure/ground assignment such as convexity and parallelism. Based on the shapeme representation, we train a logistic classifier to locally predict figure/ground labels. We also consider a global model using a conditional random field (CRF) to enforce global figure/ground consistency at T-junctions. We use loopy belief propagation to perform approximate inference on this model and learn maximum likelihood parameters from ground-truth labels.

We find that the local shapeme model achieves an accuracy of 64% in predicting the correct figural assignment. This compares favorably to previous studies using classical figure/ground cues [1]. We evaluate the global model using either a set of contours extracted from a low-level edge detector or the set of contours given by human segmentations. The global CRF model significantly improves the performance over the local model, most notably when using human-marked boundaries (78%). These promising experimental results show that this is a feasible approach to bottom-up figure/ground assignment in natural images.

Xiaofeng Ren, Charless C. Fowlkes, Jitendra Malik

Background Cut

In this paper, we introduce background cut, a high quality and real-time foreground layer extraction algorithm. From a single video sequence with a moving foreground object and stationary background, our algorithm combines background subtraction, color and contrast cues to extract a foreground layer accurately and efficiently. The key idea in background cut is background contrast attenuation, which adaptively attenuates the contrasts in the background while preserving the contrasts across foreground/background boundaries. Our algorithm builds upon a key observation that the contrast (or more precisely, color image gradient) in the background is dissimilar to the contrast across foreground/background boundaries in most cases. Using background cut, the layer extraction errors caused by background clutter can be substantially reduced. Moreover, we present an adaptive mixture model of global and per-pixel background colors to improve the robustness of our system under various background changes. Experimental results of high quality composite video demonstrate the effectiveness of our background cut algorithm.

Jian Sun, Weiwei Zhang, Xiaoou Tang, Heung-Yeung Shum

PoseCut: Simultaneous Segmentation and 3D Pose Estimation of Humans Using Dynamic Graph-Cuts

We present a novel algorithm for performing integrated segmentation and 3D pose estimation of a human body from multiple views. Unlike other related state of the art techniques which focus on either segmentation or pose estimation individually, our approach tackles these two tasks together. Normally, when optimizing for pose, it is traditional to use some fixed set of features, e.g. edges or chamfer maps. In contrast, our novel approach consists of optimizing a cost function based on a Markov Random Field (MRF). This has the advantage that we can use all the information in the image: edges, background and foreground appearances, as well as the prior information on the shape and pose of the subject and combine them in a Bayesian framework. Previously, optimizing such a cost function would have been computationally infeasible. However, our recent research in dynamic graph cuts allows this to be done much more efficiently than before. We demonstrate the efficacy of our approach on challenging motion sequences. Note that although we target the human pose inference problem in the paper, our method is completely generic and can be used to segment and infer the pose of any specified rigid, deformable or articulated object.

Matthieu Bray, Pushmeet Kohli, Philip H. S. Torr
