
2006 | Book

Computer Vision – ECCV 2006

9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006, Proceedings, Part IV

Edited by: Aleš Leonardis, Horst Bischof, Axel Pinz

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


Table of Contents

Frontmatter

Face Detection and Recognition

Robust Multi-view Face Detection Using Error Correcting Output Codes

This paper presents a novel method for solving the multi-view face detection problem using Error Correcting Output Codes (ECOC). The motivation is that face patterns can be divided into separate classes across views, and the ECOC multi-class method can improve the robustness of multi-view face detection over view-based methods because of its inherent error-tolerant ability. One key issue with an ECOC-based multi-class classifier is how to construct effective binary classifiers. Beyond applying ECOC to multi-view face detection, this paper focuses on designing efficient binary classifiers by learning informative features through minimizing the error rate of the ensemble ECOC multi-class classifier. To this end, we employ spatial histograms as the representation, which provide an over-complete set of candidate features that can be computed efficiently from the original images. In addition, the binary classifier is constructed as a coarse-to-fine procedure using fast histogram matching followed by an accurate Support Vector Machine (SVM). The experimental results show that the proposed method is robust to multi-view faces and achieves performance comparable to that of state-of-the-art approaches to multi-view face detection.

Hongming Zhang, Wen Gao, Xilin Chen, Shiguang Shan, Debin Zhao
Inter-modality Face Recognition

Recently, the wide deployment of practical face recognition systems has given rise to the inter-modality face recognition problem, in which the face images in the database and the query images captured on the spot are acquired under quite different conditions or even with different equipment. Conventional approaches either treat the samples in a uniform model or introduce an intermediate conversion stage, both of which lead to severe performance degradation due to the great discrepancies between modalities. In this paper, we propose a novel algorithm called Common Discriminant Feature Extraction, specially tailored to the inter-modality problem. In the algorithm, two transforms are simultaneously learned to map the samples in both modalities to a common feature space. We formulate the learning objective by incorporating both the empirical discriminative power and the local smoothness of the feature transformation. By explicitly controlling the model complexity through the smoothness constraint, we can effectively reduce the risk of overfitting and enhance the generalization capability. Furthermore, to cope with non-Gaussian distributions and diverse variations in the sample space, we develop two nonlinear extensions of the algorithm: one based on kernelization, the other a multi-mode framework. These extensions substantially improve the recognition performance in complex situations. Extensive experiments are conducted to test our algorithms in two application scenarios: optical-infrared image recognition and photo-sketch recognition. Our algorithms show excellent performance in the experiments.

Dahua Lin, Xiaoou Tang
Face Recognition from Video Using the Generic Shape-Illumination Manifold

In spite of over two decades of intense research, illumination and pose invariance remain prohibitively challenging aspects of face recognition for most practical applications. The objective of this work is to recognize faces using video sequences both for training and recognition input, in a realistic, unconstrained setup in which lighting, pose and user motion pattern have a wide variability and face images are of low resolution. In particular there are three areas of novelty: (i) we show how a photometric model of image formation can be combined with a statistical model of generic face appearance variation, learnt offline, to generalize in the presence of extreme illumination changes; (ii) we use the smoothness of geodesically local appearance manifold structure and a robust same-identity likelihood to achieve invariance to unseen head poses; and (iii) we introduce an accurate video sequence “reillumination” algorithm to achieve robustness to face motion patterns in video. We describe a fully automatic recognition system based on the proposed method and an extensive evaluation on 171 individuals and over 1300 video sequences with extreme illumination, pose and head motion variation. On this challenging data set our system consistently demonstrated a nearly perfect recognition rate (over 99.7%), significantly outperforming state-of-the-art commercial software and methods from the literature.

Ognjen Arandjelović, Roberto Cipolla

Illumination and Reflectance Modelling

A Theory of Spherical Harmonic Identities for BRDF/Lighting Transfer and Image Consistency

We develop new mathematical results based on the spherical harmonic convolution framework for reflection from a curved surface. We derive novel identities, which are the angular frequency domain analogs to common spatial domain invariants such as reflectance ratios. They apply in a number of canonical cases, including single and multiple images of objects under the same and different lighting conditions. One important case we consider is two different glossy objects in two different lighting environments. Denote the spherical harmonic coefficients by $B_{lm}^{\text{light},\text{material}}$, where the subscripts refer to the spherical harmonic indices, and the superscripts to the lighting (1 or 2) and object or material (again 1 or 2). We derive a basic identity, $B^{1,1}_{lm} B^{2,2}_{lm} = B^{1,2}_{lm} B^{2,1}_{lm}$, independent of the specific lighting configurations or BRDFs. While this paper is primarily theoretical, it has the potential to lay the mathematical foundations for two important practical applications. First, we can develop more general algorithms for inverse rendering problems, which can directly relight and change material properties by transferring the BRDF or lighting from another object or illumination. Second, we can check the consistency of an image, to detect tampering or image splicing.

Dhruv Mahajan, Ravi Ramamoorthi, Brian Curless
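
The basic identity above follows in one line from the spherical harmonic convolution theorem the framework is built on; a minimal derivation, assuming its simplest form (homogeneous objects with radially symmetric BRDFs), reads:

```latex
% Convolution theorem for reflection: the reflected-light coefficients
% factor into a BRDF filter A_l (one per material) and lighting
% coefficients L_lm (one set per environment).
B^{i,j}_{lm} = A^{j}_{l}\, L^{i}_{lm},
\qquad i \in \{1,2\}\ \text{(lighting)},\quad j \in \{1,2\}\ \text{(material)}.
% The product identity is then immediate, since the scalar factors commute:
B^{1,1}_{lm}\, B^{2,2}_{lm}
  = \bigl(A^{1}_{l} L^{1}_{lm}\bigr)\bigl(A^{2}_{l} L^{2}_{lm}\bigr)
  = \bigl(A^{2}_{l} L^{1}_{lm}\bigr)\bigl(A^{1}_{l} L^{2}_{lm}\bigr)
  = B^{1,2}_{lm}\, B^{2,1}_{lm}.
```
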
Covariant Derivatives and Vision

We describe a new theoretical approach to image processing and vision. Expressed in mathematical terminology, in our formalism image space is a fibre bundle, and the image itself is the graph of a section on it. This mathematical model has advantages over the conventional view of the image as a function on the plane: based on the new method, we are able to process the image as viewed by the human visual system, which includes adaptation and perceptual correctness of the results. Our formalism is invariant to relighting and handles illumination changes seamlessly. It also explains simultaneous contrast visual illusions, which are intrinsically related to the new covariant approach.

Examples include Poisson image editing, Inpainting, gradient domain HDR compression, and others.

Todor Georgiev
Retexturing Single Views Using Texture and Shading

We present a method for retexturing non-rigid objects from a single viewpoint. Without reconstructing 3D geometry, we create realistic video with shape cues at two scales. At a coarse scale, a track of the deforming surface in 2D allows us to erase the old texture and overwrite it with a new texture. At a fine scale, estimates of the local irradiance provide strong cues of fine-scale structure in the actual lighting environment. Computing irradiance from explicit correspondence is difficult and unreliable, so we limit our reconstructions to screen printing, a common printing technique that uses a finite number of colors. Our irradiance estimates are computed in a local manner: pixels are classified according to color, then irradiance is computed given the color. We demonstrate results in two situations: on a special shirt designed for easy retexturing and on natural clothing with screen prints. Because of the quality of the results, we believe that this technique has wide applications in special effects and advertising.

Ryan White, David Forsyth

Poster Session IV: Tracking and Motion

Feature Points Tracking: Robustness to Specular Highlights and Lighting Changes

Since the precise modeling of reflection is a difficult task, most feature point trackers assume that objects are Lambertian and that no lighting change occurs. To some extent, a few approaches address these issues by computing an affine photometric model or by performing a photometric normalization. Through a study based on specular reflection models, we make explicit the assumptions on which these techniques are based. We then propose a tracker that compensates for specular highlights and lighting variations more effectively when small windows of interest are considered. Experimental results on image sequences demonstrate the robustness and accuracy of this technique in comparison with existing trackers. Moreover, the computation time of the tracking is not significantly increased.

Michèle Gouiffès, Christophe Collewet, Christine Fernandez-Maloigne, Alain Trémeau
A General Framework for Motion Segmentation: Independent, Articulated, Rigid, Non-rigid, Degenerate and Non-degenerate

We cast motion segmentation of feature trajectories as a linear manifold finding problem and propose a general framework for motion segmentation under affine projections that exploits two properties of trajectory data: a geometric constraint and locality. The geometric constraint states that trajectories of the same motion lie in a low-dimensional linear manifold and that different motions give rise to different linear manifolds; locality, by which we mean that in a transformed space a data point and its neighbors tend to lie in the same linear manifold, provides a cue for efficient estimation of these manifolds. Our algorithm estimates a number of linear manifolds, whose dimensions are unknown beforehand, and segments the trajectories accordingly. It first transforms and normalizes the trajectories; second, for each trajectory it estimates a local linear manifold through local sampling; it then derives an affinity matrix based on the principal subspace angles between these estimated linear manifolds; finally, spectral clustering is applied to the matrix to give the segmentation result. Our algorithm is general, with no restriction on the number of linear manifolds and no prior knowledge of their dimensions. We demonstrate in our experiments that it can segment a wide range of motions, including independent, articulated, rigid, non-rigid, degenerate, non-degenerate, or any combination of them. In some highly challenging cases where other state-of-the-art motion segmentation algorithms may fail, our algorithm gives the expected results.

Jingyu Yan, Marc Pollefeys
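
The pipeline in the abstract (local subspace estimation per trajectory, affinities from principal subspace angles, spectral clustering) can be sketched compactly. All sizes, the affinity form, and the fixed local dimension below are illustrative choices, not the paper's; in particular the paper estimates manifold dimensions rather than fixing them.

```python
# Sketch: subspace-angle affinities between locally estimated trajectory
# manifolds, followed by spectral clustering.
import numpy as np
from scipy.linalg import subspace_angles
from sklearn.cluster import SpectralClustering

def segment_trajectories(W, n_motions, k_neighbors=8, dim=4):
    """W: 2F x P matrix of P feature trajectories over F frames."""
    P = W.shape[1]
    X = W / np.linalg.norm(W, axis=0)           # normalize trajectories
    bases = []
    for i in range(P):                          # local subspace per trajectory
        d = np.linalg.norm(X - X[:, [i]], axis=0)
        nbrs = np.argsort(d)[:k_neighbors + 1]  # local sample incl. itself
        U, _, _ = np.linalg.svd(X[:, nbrs], full_matrices=False)
        bases.append(U[:, :dim])
    A = np.zeros((P, P))
    for i in range(P):                          # affinity from principal angles
        for j in range(i, P):
            theta = subspace_angles(bases[i], bases[j])
            A[i, j] = A[j, i] = np.exp(-np.sum(np.sin(theta) ** 2))
    return SpectralClustering(n_clusters=n_motions,
                              affinity='precomputed').fit_predict(A)
```
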
Robust Visual Tracking for Multiple Targets

We address the problem of robust multi-target tracking within the application of hockey player tracking. The particle filter technique is adopted and modified to fit the multi-target tracking framework. A rectification technique is employed to find the correspondence between the video frame coordinates and the standard hockey rink coordinates, so that the system can compensate for camera motion and better model the players' dynamics. A global nearest neighbor data association algorithm is introduced to assign detections from a boosted detector to the existing tracks for the proposal distribution in the particle filters. The mean-shift algorithm is embedded into the particle filter framework to stabilize the trajectories of the targets for robust tracking during mutual occlusion. Experimental results show that our system is able to automatically and robustly track a variable number of targets and correctly maintain their identities regardless of background clutter, camera motion and frequent mutual occlusion between targets.

Yizheng Cai, Nando de Freitas, James J. Little
Multivalued Default Logic for Identity Maintenance in Visual Surveillance

Recognition of complex activities from surveillance video requires detection and temporal ordering of their constituent "atomic" events. It also requires the capacity to robustly track individuals and maintain their identities across single as well as multiple camera views. Identity maintenance is a primary source of uncertainty for activity recognition and has traditionally been addressed via different appearance matching approaches. However, these approaches, by themselves, are inadequate. In this paper, we propose a prioritized, multivalued, default-logic-based framework that allows reasoning about the identities of individuals. This is achieved by augmenting traditional appearance matching with contextual information about the environment and self-identifying traits of certain actions. The framework also encodes qualitative confidence measures for the identity decisions it takes and, finally, uses this information to reason about the occurrence of certain predefined activities in video.

Vinay D. Shet, David Harwood, Larry S. Davis
A Multiview Approach to Tracking People in Crowded Scenes Using a Planar Homography Constraint

Occlusion and lack of visibility in dense crowded scenes make it very difficult to track individual people correctly and consistently. This problem is particularly hard to tackle in single-camera systems. We present a multi-view approach to tracking people in crowded scenes where people may be partially or completely occluding each other. Our approach is to use multiple views in synergy, so that information from all views is combined to detect objects. To achieve this we present a novel planar homography constraint to resolve occlusions and robustly determine locations on the ground plane corresponding to the feet of the people. To find tracks we obtain feet regions over a window of frames and stack them, creating a space-time volume. Feet regions belonging to the same person form contiguous spatio-temporal regions that are clustered using a graph cuts segmentation approach. Each cluster is the track of a person, and a slice in time of this cluster gives the tracked location. Experimental results are shown in scenes of dense crowds where severe occlusions are quite common. The algorithm is able to accurately track people in all views while maintaining correct correspondences across views. Our algorithm is ideally suited for conditions where occlusions between people would seriously hamper tracking performance or where there simply are not enough features to distinguish between different people.

Saad M. Khan, Mubarak Shah

Multiview Geometry and 3D Methods

Uncalibrated Factorization Using a Variable Symmetric Affine Camera

In order to reconstruct 3-D Euclidean shape by the Tomasi-Kanade factorization, one needs to specify an affine camera model such as orthographic, weak perspective, or paraperspective. We present a new method that does not require any such specific model. We show that a minimal requirement for an affine camera to mimic perspective projection leads to a unique camera model, called the symmetric affine camera, which has two free functions. We determine their values from the input images by linear computation and demonstrate by experiments that an appropriate camera model is automatically selected.

Kenichi Kanatani, Yasuyuki Sugaya, Hanno Ackermann
Dense Photometric Stereo by Expectation Maximization

We formulate a robust method using Expectation Maximization (EM) to address the problem of dense photometric stereo. Previous approaches based on Markov Random Fields (MRFs) used a dense set of noisy photometric images to estimate an initial normal that encodes the matching cost at each pixel, followed by normal refinement over the pixel's neighborhood. In this paper, we argue that they did not fully exploit the inherent data redundancy in the dense set and that doing so leads to considerable improvement. Using the same noisy and dense input, this paper contributes a unifying EM framework that learns which observations are relevant, recovers accurate normals and very good surface albedos, and infers optimal parameters; the framework converges to an optimal solution and has no free user-supplied parameter to set. Experiments show that our EM approach for dense photometric stereo outperforms the previous approaches on the same input.

Tai-Pang Wu, Chi-Keung Tang
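
A per-pixel sketch of the core idea, robust Lambertian estimation with EM-style inlier weighting, is below. The Gaussian-inlier/uniform-outlier mixture and all parameter values are illustrative assumptions; the paper's full model, initialization, and parameter inference are richer than this.

```python
# Sketch: EM-weighted Lambertian photometric stereo at a single pixel.
import numpy as np

def em_photometric_stereo(I, L, n_iter=20, sigma=0.05, outlier_lik=0.5):
    """I: m intensities at one pixel; L: m x 3 normalized light directions."""
    w = np.ones_like(I, dtype=float)          # inlier responsibilities
    for _ in range(n_iter):
        # M-step: weighted least squares for the scaled normal b = albedo * n
        sw = np.sqrt(w)
        b, *_ = np.linalg.lstsq(L * sw[:, None], sw * I, rcond=None)
        # E-step: responsibilities under a Gaussian-inlier / uniform-outlier mix
        r = I - L @ b
        inlier = np.exp(-0.5 * (r / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
        w = inlier / (inlier + outlier_lik)
    albedo = np.linalg.norm(b)
    return b / max(albedo, 1e-12), albedo     # (unit normal, albedo)
```
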
Space-Time-Scale Registration of Dynamic Scene Reconstructions

The paper presents a method for multi-dimensional registration of two video streams. The sequences are captured by two hand-held cameras moving independently of each other, both observing one object that moves rigidly and independently of the background. The method is based on uncalibrated Structure-from-Motion (SfM) to extract 3D models of the foreground object and the background, as well as their relative motion. It fixes the relative scales between the scene parts within and between the videos. It also provides the registration between all partial 3D models and the temporal synchronization between the videos. The crux is that not a single point on the foreground or background needs to be in common between the two video streams. Extensions to more than two cameras and multiple foreground objects are possible.

Kemal E. Ozden, Kurt Cornelis, Luc Van Gool
Self-calibration of a General Radially Symmetric Distortion Model

We present a new approach for self-calibrating the distortion function and the distortion center of cameras with general radially symmetric distortion. In contrast to most current models, we propose a model encompassing fisheye lenses as well as catadioptric cameras with a view angle larger than 180°.

Rather than representing distortion as an image displacement, we model it as a varying focal length, which is a function of the distance to the distortion center. This function can be discretized, acting as a general model, or represented with e.g. a polynomial expression.

We present two flexible approaches for calibrating the distortion function. The first one is a plumbline-type method; images of line patterns are used to formulate linear constraints on the distortion function parameters. This linear system can be solved up to an unknown scale factor (a global focal length), which is sufficient for image rectification. The second approach is based on the first one and performs self-calibration from images of a textured planar object of unknown structure. We also show that by restricting the camera motion, self-calibration is possible from images of a completely unknown, non-planar scene.

The analysis of rectified images, obtained using the computed distortion functions, shows very good results compared to other approaches and models, even those relying on non-linear optimization.

Jean-Philippe Tardif, Peter Sturm, Sébastien Roy
A Simple Solution to the Six-Point Two-View Focal-Length Problem

This paper presents a simple and practical solution to the 6-point 2-view focal-length estimation problem. Based on the hidden-variable technique, we derive a 15th-degree polynomial in the unknown focal length; in the course of the derivation, a simple and constructive algorithm is established. To make use of multiple redundant measurements and select the best solution, we suggest a kernel-voting scheme. The algorithm has been tested on both synthetic data and real images, with satisfactory results in both cases. For reference purposes we include our Matlab implementation in the paper, which is quite concise, consisting of only 20 lines of code. The result of this paper will make a small but useful module in many computer vision systems.

Hongdong Li
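
The kernel-voting step is straightforward to sketch: each 6-point sample yields the real positive roots of the focal-length polynomial, and the estimate is the peak of a kernel density over all collected roots. The polynomial construction itself is the paper's contribution and is not reproduced here; the bandwidth below is whatever scipy's default rule gives.

```python
# Sketch: kernel voting over candidate focal lengths from many 6-point samples.
import numpy as np
from scipy.stats import gaussian_kde

def vote_focal_length(candidate_focals):
    """candidate_focals: real, positive polynomial roots pooled over samples."""
    f = np.asarray([c for c in candidate_focals if np.isfinite(c) and c > 0])
    kde = gaussian_kde(f)                       # smooth the vote histogram
    grid = np.linspace(f.min(), f.max(), 2000)
    return grid[np.argmax(kde(grid))]           # focal length with most votes
```
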
Iterative Extensions of the Sturm/Triggs Algorithm: Convergence and Nonconvergence

We show that SIESTA, the simplest iterative extension of the Sturm/Triggs algorithm, descends an error function. However, we prove that SIESTA does not converge to usable results. The iterative extension of Mahamud et al. has similar problems, and experiments with "balanced" iterations show that they can fail to converge. We present CIESTA, an algorithm which avoids these problems. It is identical to SIESTA except for one extra, simple stage of computation. We prove that CIESTA descends an error function and approaches fixed points. Under weak assumptions, it converges. The CIESTA error can be minimized using a standard descent method such as Gauss–Newton, combining quadratic convergence with the advantage of minimizing over the projective depths.

John Oliensis, Richard Hartley

Low-Level Vision, Image Features

An Efficient Method for Tensor Voting Using Steerable Filters

In many image analysis applications there is a need to extract curves in noisy images. To achieve a more robust extraction, one can exploit correlations of oriented features over a spatial context in the image. Tensor voting is an existing technique to extract features in this way. In this paper, we present a new computational scheme for tensor voting on a dense field of rank-2 tensors. Using steerable filter theory, it is possible to rewrite the tensor voting operation as a linear combination of complex-valued convolutions. This approach has computational advantages since convolutions can be implemented efficiently. We provide speed measurements to indicate the gain in speed, and illustrate the use of steerable tensor voting on medical applications.

Erik Franken, Markus van Almsick, Peter Rongen, Luc Florack, Bart ter Haar Romeny
Interpolating Orientation Fields: An Axiomatic Approach

We develop an axiomatic approach to vector field interpolation, which is useful as a feature extraction preprocessing step. Two operators are singled out: the curvature operator, which appears in total variation minimisation for image restoration and inpainting/disocclusion, and the Absolutely Minimizing Lipschitz Extension (AMLE), already known as a robust and coherent scalar image interpolation technique, provided the axioms are slightly relaxed. Numerical results, using a multiresolution scheme, show that they produce fields in accordance with the human perception of edges.

Anatole Chessel, Frederic Cao, Ronan Fablet
Alias-Free Interpolation

In this paper we study the possibility of removing aliasing in a scene from a single observation by designing an alias-free upsampling scheme. We generate the unknown high-frequency components of the given partially aliased (low-resolution) image by minimizing the total variation of the interpolant, subject to the constraint that the unaliased part of the spectral components in the low-resolution observation is known precisely, and under the assumption of sparsity in the data. This provides a mathematical basis for exact reproduction of the high-frequency components, with probability approaching one, from their aliased observation. The primary application of the given approach would be in super-resolution imaging.

C. V. Jiji, Prakash Neethu, Subhasis Chaudhuri
An Intensity Similarity Measure in Low-Light Conditions

In low-light conditions, Poisson noise and quantization noise are known to become the dominant sources of noise. Intensity difference is usually measured by Euclidean distance, but this often breaks down due to the non-negligible amount of uncertainty in observations caused by noise. In this paper, we develop a new noise model based upon Poisson noise and quantization noise. We then propose a new intensity similarity function built upon the proposed noise model. The similarity measure is derived by maximum likelihood estimation based on the nature of Poisson noise and of the quantization process in digital imaging systems, and it deals with the uncertainty embedded in observations. The proposed intensity similarity measure is useful in many computer vision applications that involve intensity differencing, e.g., block matching, optical flow, and image alignment. We verified the correctness of the proposed noise model by comparison with real-world noise data and confirmed the superior robustness of the proposed similarity measure compared with the standard Euclidean norm.

François Alter, Yasuyuki Matsushita, Xiaoou Tang
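
For intuition, a hedged sketch in the spirit of the abstract: a generalized likelihood-ratio "distance" between two photon counts under a pure Poisson model, asking whether they share one underlying rate. This covers only the Poisson part; the paper's measure additionally models the quantization step.

```python
# Sketch: likelihood-ratio distance between two Poisson-distributed counts.
import numpy as np

def poisson_distance(n1, n2):
    """Generalized log-likelihood ratio: do counts n1, n2 share one rate?"""
    n1 = np.asarray(n1, dtype=float)
    n2 = np.asarray(n2, dtype=float)
    m = np.maximum(0.5 * (n1 + n2), 1e-12)    # ML estimate of the common rate
    def xlogx(n):
        return np.where(n > 0, n * np.log(np.maximum(n, 1e-12) / m), 0.0)
    return 2.0 * (xlogx(n1) + xlogx(n2))      # ~0 when n1 = n2, grows otherwise
```
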
Direct Energy Minimization for Super-Resolution on Nonlinear Manifolds

We address the problem of single-image super-resolution by exploiting manifold properties. Given a set of low-resolution image patches and their corresponding high-resolution patches, we assume they reside on two non-linear manifolds that have similar locally-linear structure. This manifold correlation can be realized by a three-layer Markov network that casts super-resolution as energy minimization. The main advantage of our approach is that by working directly with the network model, there is no need to actually construct mappings for the underlying manifolds. To achieve such efficiency, we establish an energy minimization model for the network that directly accounts for the expected property entailed by the manifold assumption. The resulting energy function has two nice properties for super-resolution. First, the function is convex, so the optimization can be done efficiently. Second, it can be shown to be an upper bound on the reconstruction error of our algorithm. Thus, minimizing the energy function automatically guarantees a lower reconstruction error, an important characteristic for stable super-resolution results.

Tien-Lung Chang, Tyng-Luh Liu, Jen-Hui Chuang
Wavelet-Based Super-Resolution Reconstruction: Theory and Algorithm

We present a theoretical analysis and a new algorithm for the problem of super-resolution imaging: the reconstruction of HR (high-resolution) images from a sequence of LR (low-resolution) images. Super-resolution imaging entails solutions to two problems. One is the alignment of image frames. The other is the reconstruction of an HR image from multiple aligned LR images. Our analysis of the latter problem reveals insights into the theoretical limits of super-resolution reconstruction. We find that at best we can reconstruct an HR image blurred by a specific low-pass filter. Based on the analysis we present a new wavelet-based iterative reconstruction algorithm which is very robust to noise. Furthermore, it has a computationally efficient built-in denoising scheme with a nearly optimal risk bound. Roughly speaking, our method could be described as a better-conditioned iterative back-projection scheme with a fast and optimal regularization criterion in each iteration step. Experiments with both simulated and real data demonstrate that our approach has significantly better performance than existing super-resolution methods. It can remove even large amounts of mixed noise without creating smoothing artifacts.

Hui Ji, Cornelia Fermüller
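
For reference, the classical iterative back-projection scheme that the abstract positions itself against can be sketched as below, for pre-aligned LR frames and a simple box-blur-plus-decimation imaging model (both assumptions made here for brevity; the paper's contribution is the wavelet-domain conditioning and denoising that replaces this naive update).

```python
# Sketch: classical iterative back-projection for multi-frame super-resolution.
import numpy as np
from scipy.ndimage import uniform_filter, zoom

def back_project(lr_frames, scale, n_iter=30, step=1.0):
    """lr_frames: list of aligned HxW low-res images; scale: integer factor."""
    hr = zoom(np.mean(lr_frames, axis=0), scale, order=1)      # initial guess
    for _ in range(n_iter):
        correction = np.zeros_like(hr)
        for lr in lr_frames:
            sim = uniform_filter(hr, size=scale)[::scale, ::scale]  # simulate LR
            correction += zoom(lr - sim, scale, order=1)       # back-project error
        hr += step * correction / len(lr_frames)
    return hr
```
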

Face/Gesture/Action Detection and Recognition

Extending Kernel Fisher Discriminant Analysis with the Weighted Pairwise Chernoff Criterion

Many linear discriminant analysis (LDA) and kernel Fisher discriminant analysis (KFD) methods are based on the restrictive assumption that the data are homoscedastic. In this paper, we propose a new KFD method called heteroscedastic kernel weighted discriminant analysis (HKWDA) which has several appealing characteristics. First, like all kernel methods, it can handle nonlinearity efficiently in a disciplined manner. Second, by incorporating a weighting function that can capture heteroscedastic data distributions into the discriminant criterion, it can work under more realistic situations and hence can further enhance the classification accuracy in many real-world applications. Moreover, it can effectively deal with the small sample size problem. We have performed some face recognition experiments to compare HKWDA with several linear and nonlinear dimensionality reduction methods, showing that HKWDA consistently gives the best results.

Guang Dai, Dit-Yan Yeung, Hong Chang
Face Authentication Using Adapted Local Binary Pattern Histograms

In this paper, we propose a novel generative approach for face authentication based on a Local Binary Pattern (LBP) description of the face. A generic face model is considered as a collection of LBP histograms. A client-specific model is then obtained from this generic model by an adaptation technique under a probabilistic framework. We compare the proposed approach to standard state-of-the-art face authentication methods on two benchmark databases, namely XM2VTS and BANCA, using their associated experimental protocols. We also compare our approach to two state-of-the-art LBP-based face recognition techniques, which we have adapted to the verification task.

Yann Rodriguez, Sébastien Marcel
An Integrated Model for Accurate Shape Alignment

In this paper, we propose a two-level integrated model for accurate face shape alignment. At the low level, the shape is split into a set of line segments which serve as the nodes in the hidden layer of a Markov Network. At the high level, all the line segments are constrained by a global Gaussian point distribution model. Furthermore, points already accurately aligned at the low level are detected and constrained using a constrained regularization algorithm. By analyzing the regularization result, a mask image of local minima is generated to guide the distribution of Markov Network states, which makes our algorithm more robust. Extensive experiments demonstrate the accuracy and effectiveness of our proposed approach.

Lin Liang, Fang Wen, Xiaoou Tang, Ying-qing Xu
Robust Player Gesture Spotting and Recognition in Low-Resolution Sports Video

Determining the player's gestures and actions in sports video is a key task in automating high-level analysis of the video material. In many sports views, the camera covers a large part of the sports arena, so the resolution of the player's region is low. This makes determining the player's gestures and actions a challenging task, especially if there is large camera motion. To overcome these problems, we propose a method based on curvature scale space templates of the player's silhouette. The use of curvature scale space makes the method robust to noise, and it also tolerates significant shape corruption of part of the player's silhouette. We further propose a new recognition method which is robust to noisy sequences of data and needs only a small amount of training data.

Myung-Cheol Roh, Bill Christmas, Joseph Kittler, Seong-Whan Lee
Recognition and Segmentation of 3-D Human Action Using HMM and Multi-class AdaBoost

Our goal is to automatically segment and recognize basic human actions, such as stand, walk and wave hands, from a sequence of joint positions or pose angles. Such recognition is difficult due to the high dimensionality of the data and large spatial and temporal variations in the same action. We decompose the high-dimensional 3-D joint space into a set of feature spaces where each feature corresponds to the motion of a single joint or a combination of related joints. For each feature, the dynamics of each action class is learned with one HMM. Given a sequence, the observation probability is computed in each HMM, and a weak classifier for that feature is formed based on those probabilities. The weak classifiers with strong discriminative power are then combined by the multi-class AdaBoost (AdaBoost.M2) algorithm. A dynamic programming algorithm is applied to segment and recognize actions simultaneously. Results of recognizing 22 actions on a large number of motion capture sequences, as well as several annotated and automatically tracked sequences, show the effectiveness of the proposed algorithms.

Fengjun Lv, Ramakant Nevatia
Segmenting Highly Articulated Video Objects with Weak-Prior Random Forests

We address the problem of segmenting highly articulated video objects in a wide variety of poses. The main idea of our approach is to model the prior information of object appearance via random forests. To automatically extract an object from a video sequence, we first build a random forest based on image patches sampled from the initial template. Owing to the use of a randomized technique and simple features, the modeled prior information is considered weak, but on the other hand appropriate for our application. Furthermore, the random forest can be dynamically updated to generate prior probabilities about the configurations of the object in subsequent image frames. The algorithm then combines the prior probabilities with low-level region information to produce a sequence of figure-ground segmentations. Overall, the proposed segmentation technique is useful and flexible in that one can easily integrate different cues and efficiently select discriminating features to model object appearance and handle various articulations.

Hwann-Tzong Chen, Tyng-Luh Liu, Chiou-Shann Fuh

Segmentation and Grouping

SpatialBoost: Adding Spatial Reasoning to AdaBoost

SpatialBoost extends AdaBoost to incorporate spatial reasoning. We demonstrate the effectiveness of SpatialBoost on the problem of interactive image segmentation. Our application takes as input a tri-map of the original image, trains SpatialBoost on the pixels of the object and the background, and uses the trained classifier to classify the unlabeled pixels. The spatial reasoning is introduced in the form of weak classifiers that attempt to infer a pixel's label from the labels of surrounding pixels after each boosting iteration; we call this variant of AdaBoost SpatialBoost. We then extend the application to work with "GrabCut". In GrabCut the user casually marks a rectangle around the object, instead of tediously marking a tri-map, and we pose the segmentation as a problem of learning with outliers, where we know that only positive pixels (i.e. pixels that are assumed to belong to the object) might be outliers and in fact should belong to the background.

Shai Avidan
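
A compressed sketch of the idea: run AdaBoost with decision stumps over per-pixel features, and after every round append a spatially smoothed map of the current predictions as an extra feature channel, so later weak learners can vote on a pixel from its neighbors' labels. The color features, stump learner, window size, and schedule are all illustrative assumptions, not the paper's exact design.

```python
# Sketch: AdaBoost with an appended "smoothed current labels" spatial feature.
import numpy as np
from scipy.ndimage import uniform_filter
from sklearn.tree import DecisionTreeClassifier

def spatial_boost(image, labels, n_rounds=20):
    """image: HxWx3 floats; labels: HxW in {1, -1, 0 (unlabeled)}."""
    H, W, _ = image.shape
    app = image.reshape(-1, 3)                     # appearance features
    mask = labels.ravel() != 0
    y = labels.ravel()[mask]
    w = np.ones(y.size) / y.size                   # AdaBoost sample weights
    score = np.zeros(H * W)
    spatial = np.zeros((H * W, 1))                 # spatial feature channel
    for _ in range(n_rounds):
        X = np.hstack([app, spatial])
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X[mask], y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(np.sum(w * (pred[mask] != y)), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * pred[mask])
        w /= w.sum()
        score += alpha * pred
        # spatial reasoning: smoothed map of the current soft labels
        spatial = uniform_filter(np.sign(score).reshape(H, W),
                                 size=5).reshape(-1, 1)
    return np.sign(score).reshape(H, W)            # final segmentation
```
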
Database-Guided Simultaneous Multi-slice 3D Segmentation for Volumetric Data

Automatic delineation of anatomical structures in 3-D volumetric data is a challenging task due to the complexity of the object appearance as well as the quantity of information to be processed. This makes it increasingly difficult to encode prior knowledge about the object segmentation in a traditional formulation as a perceptual grouping task. We introduce a fast shape segmentation method for 3-D volumetric data by extending the 2-D database-guided segmentation paradigm, which directly exploits expert annotations of the object of interest in large medical databases. Rather than dealing with the 3-D data directly, we take advantage of the observation that the position and appearance of a 3-D shape can be characterized by a set of 2-D slices. Cutting these multiple slices simultaneously from the 3-D shape allows us to represent and process 3-D data as efficiently as 2-D images while keeping most of the information about the 3-D shape. To cut slices consistently for all shapes, an iterative 3-D non-rigid shape alignment method is also proposed for building local coordinates for each shape. Features from all the slices are jointly used to learn to discriminate between object appearance and background, and to learn the association between appearance and shape. The resulting procedure performs shape segmentation in only a few seconds. Extensive experiments on cardiac ultrasound images demonstrate the algorithm's accuracy and robustness in the presence of large amounts of noise.

Wei Hong, Bogdan Georgescu, Xiang Sean Zhou, Sriram Krishnan, Yi Ma, Dorin Comaniciu
Density Estimation Using Mixtures of Mixtures of Gaussians

In this paper we present a new density estimation algorithm using mixtures of mixtures of Gaussians. The new algorithm overcomes limitations of the popular Expectation Maximization algorithm. The paper first introduces a new model selection criterion, the Penalty-less Information Criterion, based on the Jensen-Shannon divergence. Mean-shift is used to automatically initialize the means and covariances for Expectation Maximization in order to obtain better structure inference. Finally, a locally linear search is performed using the Penalty-less Information Criterion to infer the underlying density of the data. The validity of the algorithm is verified using real color images.

Wael Abd-Almageed, Larry S. Davis
Example Based Non-rigid Shape Detection

Since it is hard to handcraft the prior knowledge in a shape detection framework, machine learning methods are preferred, to exploit the expert annotation of the target shape in a database. In previous approaches [1,2], an optimal similarity transformation is exhaustively searched for to maximize the response of a trained classification model. At best, these approaches give only a rough estimate of the position of a non-rigid shape. In this paper, we propose a novel machine learning based approach that achieves a refined shape detection result. We train a model that has the largest response on a reference shape and smaller responses on other shapes. During shape detection, we search for an optimal non-rigid deformation that maximizes the response of the trained model on the deformed image block. Since exhaustive search is infeasible in a high-dimensional non-rigid deformation space, example-based search is currently used instead. Experiments on two applications, left ventricle endocardial border detection and facial feature detection, demonstrate the robustness of our approach. It outperforms the well-known ASM and AAM approaches on challenging samples.

Yefeng Zheng, Xiang Sean Zhou, Bogdan Georgescu, Shaohua Kevin Zhou, Dorin Comaniciu
Towards Safer, Faster Prenatal Genetic Tests: Novel Unsupervised, Automatic and Robust Methods of Segmentation of Nuclei and Probes

In this paper we present two new segmentation methods that we developed for nuclei and chromosomic probes, core objects in cytometry medical imaging. Our nucleus segmentation method is mathematically grounded in a novel parametric model of the image histogram, which accounts simultaneously for the background noise, the nucleic textures, and the nuclei's alterations of the background. We adapted an Expectation-Maximisation algorithm to fit this model to the histograms of each image and subregion, in a coarse-to-fine approach. The probe segmentation uses a new dome-detection algorithm, insensitive to background and foreground noise, which detects probes of any intensity. We detail our two segmentation methods and our EM algorithm, and discuss the strengths of our techniques compared with state-of-the-art approaches. Both our segmentation methods are unsupervised, automatic, and require neither training nor tuning: as a result, they are directly applicable to a wide range of medical images. We have used them as part of a large-scale project for improving prenatal diagnosis of genetic diseases, and tested them on more than 2,100 images with nearly 14,000 nuclei. We report 99.3% accuracy for each of our segmentation methods, with a robustness to different laboratory conditions not previously reported.

Christophe Restif

Object Recognition, Retrieval and Indexing

Fast Memory-Efficient Generalized Belief Propagation

Generalized Belief Propagation (GBP) has proven to be a promising technique for performing inference on Markov random fields (MRFs). However, its heavy computational cost and large memory requirements have restricted its application to problems with small state spaces. We present methods for reducing both run time and storage needed by GBP for a large class of pairwise potentials of the MRF. Further, we show how the problem of subgraph matching can be formulated using this class of MRFs and thus solved efficiently using our approach. Our results significantly outperform the state-of-the-art method. We also obtain excellent results for the related problem of matching pictorial structures for object recognition.

M. Pawan Kumar, P. H. S. Torr
Adapted Vocabularies for Generic Visual Categorization

Several state-of-the-art Generic Visual Categorization (GVC) systems are built around a vocabulary of visual terms and characterize images with one histogram of visual word counts. We propose a novel and practical approach to GVC based on a universal vocabulary, which describes the content of all the considered classes of images, and class vocabularies obtained through the adaptation of the universal vocabulary using class-specific data. An image is characterized by a set of histograms – one per class – where each histogram describes whether the image content is best modeled by the universal vocabulary or the corresponding class vocabulary. It is shown experimentally on three very different databases that this novel representation outperforms those approaches which characterize an image with a single histogram.

Florent Perronnin, Christopher Dance, Gabriela Csurka, Marco Bressan
Identification of Highly Similar 3D Objects Using Model Saliency

We present a novel approach for identifying 3D objects from a database of models highly similar in shape, using range data acquired in unconstrained settings from a limited number of viewing directions. We also address the challenging case of identifying targets not present in the database. The method is based on learning offline saliency tests for each object in the database, by maximizing an objective measure of discriminability with respect to other similar models. Our notion of model saliency differs from the traditionally used structural saliency, which weakly characterizes the uniqueness of a region by the amount of 3D texture available: we instead link discriminability directly to the Bhattacharyya distance between the distributions of errors of the target against its corresponding ground truth and against other similar models. Our approach was evaluated on thousands of queries obtained by different sensors and acquired in various operating conditions, using a database of hundreds of models. The results show a significant improvement in recognition performance when using saliency compared to the global point-to-point mismatch errors traditionally used in matching and verification algorithms.

Bogdan C. Matei, Harpreet S. Sawhney, Clay D. Spence
Sampling Strategies for Bag-of-Features Image Classification

Bag-of-features representations have recently become popular for content-based image classification owing to their simplicity and good performance. They evolved from texton methods in texture analysis. The basic idea is to treat images as loose collections of independent patches: sample a representative set of patches from the image, evaluate a visual descriptor vector for each patch independently, and use the resulting distribution of samples in descriptor space as a characterization of the image. The four main implementation choices are thus how to sample patches, how to describe them, how to characterize the resulting distributions, and how to classify images based on the result. We concentrate on the first issue, showing experimentally that, for a representative selection of commonly used test databases and for moderate to large numbers of samples, random sampling gives equal or better classifiers than the sophisticated multiscale interest operators that are in common use. Although interest operators work well for small numbers of samples, the single most important factor governing performance is the number of patches sampled from the test image, and ultimately interest operators cannot provide enough patches to compete. We also study the influence of other factors, including codebook size and creation method, histogram normalization method, and minimum scale for feature extraction.

Eric Nowak, Frédéric Jurie, Bill Triggs
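
A minimal bag-of-features pipeline with the random sampling strategy the abstract advocates might look as follows; the patch size, sample counts, codebook size, and linear SVM are illustrative choices rather than the paper's settings.

```python
# Sketch: randomly sampled raw patches -> k-means codebook -> word histograms.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def random_patches(img, n=1000, size=16, rng=np.random.default_rng(0)):
    """Sample n raw patches at uniformly random positions in a 2D image."""
    ys = rng.integers(0, img.shape[0] - size, n)
    xs = rng.integers(0, img.shape[1] - size, n)
    return np.stack([img[y:y+size, x:x+size].ravel() for y, x in zip(ys, xs)])

def bof_histograms(images, codebook):
    hists = []
    for img in images:
        words = codebook.predict(random_patches(img))
        h = np.bincount(words, minlength=codebook.n_clusters)
        hists.append(h / h.sum())              # normalized word histogram
    return np.array(hists)

# Usage sketch (train_imgs / train_labels assumed grayscale float arrays):
# codebook = KMeans(n_clusters=1000).fit(
#     np.vstack([random_patches(i) for i in train_imgs]))
# clf = LinearSVC().fit(bof_histograms(train_imgs, codebook), train_labels)
```
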
Maximally Stable Local Description for Scale Selection

Scale- and affine-invariant local features have shown excellent performance in image matching and in object and texture recognition. This paper optimizes keypoint detection to achieve stable local descriptors and, therefore, an improved image representation. The technique performs scale selection based on a region descriptor, here SIFT, and chooses regions for which this descriptor is maximally stable. Maximal stability is obtained when the difference between descriptors extracted at consecutive scales reaches a minimum. This scale selection technique is applied to multi-scale Harris and Laplacian points. Affine invariance is achieved by an integrated affine adaptation process based on the second moment matrix. An experimental evaluation compares our detectors to Harris-Laplace and the Laplacian in the context of image matching as well as category and texture classification. The comparison shows the improved performance of our detector.

Gyuri Dorkó, Cordelia Schmid
Scene Classification Via pLSA

Given a set of images of scenes containing multiple object categories (e.g. grass, roads, buildings) our objective is to discover these objects in each image in an unsupervised manner, and to use this object distribution to perform scene classification. We achieve this discovery using probabilistic Latent Semantic Analysis (pLSA), a generative model from the statistical text literature, here applied to a bag of visual words representation for each image. The scene classification on the object distribution is carried out by a k-nearest neighbour classifier.

We investigate the classification performance under changes in the visual vocabulary and the number of latent topics learnt, and develop a novel vocabulary using colour SIFT descriptors. Classification performance is compared to the supervised approaches of Vogel & Schiele [19] and Oliva & Torralba [11], and the semi-supervised approach of Fei-Fei & Perona [3], using their own datasets and testing protocols. In all cases the combination of (unsupervised) pLSA followed by (supervised) nearest neighbour classification achieves superior results. We show applications of this method to image retrieval with relevance feedback and to scene classification in videos.

Anna Bosch, Andrew Zisserman, Xavier Muñoz
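
The pLSA stage of this pipeline is a small EM loop over a visual-word/document count matrix; a compact sketch is below (vocabulary construction and the k-NN classifier over the resulting P(z|d) vectors are separate steps, and the random initialization is an arbitrary choice).

```python
# Sketch: EM for pLSA on a visual-word count matrix.
import numpy as np

def plsa(N, n_topics, n_iter=100, rng=np.random.default_rng(0)):
    """N: W x D matrix of visual-word counts. Returns P(w|z), P(z|d)."""
    W, D = N.shape
    p_w_z = rng.random((W, n_topics)); p_w_z /= p_w_z.sum(0)
    p_z_d = rng.random((n_topics, D)); p_z_d /= p_z_d.sum(0)
    for _ in range(n_iter):
        denom = p_w_z @ p_z_d + 1e-12            # P(w|d), shape W x D
        for z in range(n_topics):
            # E-step responsibilities P(z|w,d) folded into expected counts
            R = N * np.outer(p_w_z[:, z], p_z_d[z]) / denom
            p_w_z[:, z] = R.sum(1)               # M-step accumulators
            p_z_d[z] = R.sum(0)
        p_w_z /= p_w_z.sum(0)                    # renormalize per topic
        p_z_d /= p_z_d.sum(0)                    # renormalize per document
    return p_w_z, p_z_d
```
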
Probabilistic Linear Discriminant Analysis

Linear dimensionality reduction methods, such as LDA, are often used in object recognition for feature extraction, but do not address the problem of how to use these features for recognition. In this paper, we propose Probabilistic LDA, a generative probability model with which we can both extract the features and combine them for recognition. The latent variables of PLDA represent both the class of the object and the view of the object within a class. By making examples of the same class share the class variable, we show how to train PLDA and use it for recognition on previously unseen classes. The usual LDA features are derived as a result of training PLDA, but in addition have a probability model attached to them, which automatically gives more weight to the more discriminative features. With PLDA, we can build a model of a previously unseen class from a single example, and can combine multiple examples for a better representation of the class. We show applications to classification, hypothesis testing, class inference, and clustering, on classes not observed during training.

Sergey Ioffe
A New 3-D Model Retrieval System Based on Aspect-Transition Descriptor

In this paper, we propose a new 3-D model retrieval system using the Aspect-Transition Descriptor, which builds on the aspect graph representation approach [1,2]. The proposed method differs from the conventional aspect graph representation in that we utilize transitions as well as aspects. The process of generating the Aspect-Transition Descriptor is as follows. First, uniformly sampled views of a 3-D model are separated into a stable and an unstable view set according to the local variation of their 2-D shape. Next, adjacent stable views and unstable views are grouped into clusters, and we select the characteristic aspects and transitions by finding the representative view of each cluster. The 2-D descriptors of the selected characteristic aspects and transitions are concatenated to form the 3-D descriptor. Matching Aspect-Transition Descriptors is done using a modified Hausdorff distance. To evaluate the proposed 3-D descriptor, we measured retrieval performance on the Princeton benchmark database [3] and found that our method outperforms other retrieval techniques.

Soochahn Lee, Sehyuk Yoon, Il Dong Yun, Duck Hoon Kim, Kyoung Mu Lee, Sang Uk Lee

Low-Level Vision, Segmentation and Grouping

Unsupervised Patch-Based Image Regularization and Representation

A novel adaptive and patch-based approach is proposed for image regularization and representation. The method is unsupervised and based on a pointwise selection of small image patches of fixed size in a variable neighborhood of each pixel. The main idea is to associate with each pixel the weighted sum of data points within an adaptive neighborhood and to use image patches to take into account complex spatial interactions in images. In this paper, we consider the problem of selecting the adaptive neighborhood in a manner that balances the accuracy of the estimator and the stochastic error at each spatial position. Moreover, we propose a practical algorithm for image regularization with no hidden parameter, no library of image patches and no training algorithm. The method is applied to both artificially corrupted and real images, and its performance is very close to, and in some cases even surpasses, that of the best published denoising methods.

Charles Kervrann, Jérôme Boulanger
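
A fixed-neighborhood patch-based estimator in this spirit (essentially non-local means restricted to a search window) is sketched below; the paper's key contribution, selecting the neighborhood adaptively per pixel to balance bias and stochastic error, is deliberately omitted, and all window sizes and the bandwidth h are illustrative.

```python
# Sketch: patch-based weighted averaging within a fixed search window.
import numpy as np

def patch_denoise(I, patch=3, search=7, h=0.05):
    """I: 2D image in [0,1]; average pixels whose local patches are similar."""
    p, s = patch // 2, search // 2
    Ipad = np.pad(I, p + s, mode='reflect')
    out = np.zeros_like(I)
    H, W = I.shape
    for y in range(H):
        for x in range(W):
            ref = Ipad[y + s:y + s + patch, x + s:x + s + patch]
            num = den = 0.0
            for dy in range(-s, s + 1):          # search window around pixel
                for dx in range(-s, s + 1):
                    cand = Ipad[y + s + dy:y + s + dy + patch,
                                x + s + dx:x + s + dx + patch]
                    w = np.exp(-np.sum((ref - cand) ** 2) / h ** 2)
                    num += w * Ipad[y + s + dy + p, x + s + dx + p]
                    den += w
            out[y, x] = num / den
    return out
```
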
A Fast Approximation of the Bilateral Filter Using a Signal Processing Approach

The bilateral filter is a nonlinear filter that smoothes a signal while preserving strong edges. It has demonstrated great effectiveness for a variety of problems in computer vision and computer graphics, and a fast version has been proposed. Unfortunately, little is known about the accuracy of such acceleration. In this paper, we propose a new signal-processing analysis of the bilateral filter which complements the recent studies that analyzed it as a PDE or as a robust statistics estimator. Importantly, this signal-processing perspective allows us to develop a novel acceleration of the bilateral filter using downsampling in space and intensity. This affords a principled expression of the accuracy in terms of bandwidth and sampling. The key to our analysis is to express the filter in a higher-dimensional space where the signal intensity is added to the original domain dimensions. The bilateral filter can then be expressed as simple linear convolutions in this augmented space, followed by two simple nonlinearities. This allows us to derive simple criteria for downsampling the key operations and to achieve a significant acceleration of the bilateral filter. We show that, for the same running time, our method is significantly more accurate than previous acceleration techniques.

Sylvain Paris, Frédo Durand
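
A condensed sketch of this construction: splat the image into a coarse 3D space-intensity grid, run a small Gaussian convolution on the grid, and slice back with trilinear interpolation. Grid resolutions tied to the filter bandwidths are exactly the downsampling the analysis justifies; the nearest-neighbor splatting and the specific parameters here are simplifications.

```python
# Sketch: bilateral filtering via a downsampled 3D space-intensity grid.
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def fast_bilateral(I, sigma_s=16, sigma_r=0.1):
    """I: 2D image in [0,1]. Returns an approximate bilateral filtering."""
    H, W = I.shape
    yy, xx = np.mgrid[0:H, 0:W]
    gy, gx, gz = yy / sigma_s, xx / sigma_s, I / sigma_r   # grid coordinates
    shape = (int(H / sigma_s) + 3, int(W / sigma_s) + 3, int(1 / sigma_r) + 3)
    data = np.zeros(shape)
    weight = np.zeros(shape)
    idx = (gy.round().astype(int), gx.round().astype(int),
           gz.round().astype(int))
    np.add.at(data, idx, I)                      # splat intensities
    np.add.at(weight, idx, 1.0)                  # splat a homogeneous weight
    data = gaussian_filter(data, 1.0)            # 3D Gaussian on the grid
    weight = gaussian_filter(weight, 1.0)
    coords = np.stack([gy.ravel(), gx.ravel(), gz.ravel()])
    num = map_coordinates(data, coords, order=1)   # trilinear slicing
    den = map_coordinates(weight, coords, order=1)
    return (num / np.maximum(den, 1e-10)).reshape(H, W)
```
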
Learning to Combine Bottom-Up and Top-Down Segmentation

Bottom-up segmentation based only on low-level cues is a notoriously difficult problem. This difficulty has led to recent top-down segmentation algorithms that are based on class-specific image information. Despite the success of top-down algorithms, they often give coarse segmentations that can be significantly refined using low-level cues. This raises the question of how to combine top-down and bottom-up cues in a principled manner.

In this paper we approach this problem using supervised learning. Given a training set of ground truth segmentations, we train a fragment-based segmentation algorithm which takes into account both bottom-up and top-down cues simultaneously, in contrast to most existing algorithms which train top-down and bottom-up modules separately. We formulate the problem in the framework of Conditional Random Fields (CRF) and derive a novel feature induction algorithm for CRF, which allows us to efficiently search over thousands of candidate fragments. Whereas pure top-down algorithms often require hundreds of fragments, our simultaneous learning procedure yields algorithms with a handful of fragments that are combined with low-level cues to efficiently compute high quality segmentations.

Anat Levin, Yair Weiss
Multi-way Clustering Using Super-Symmetric Non-negative Tensor Factorization

We consider the problem of clustering data into $k \geq 2$ clusters given complex relations — going beyond pairwise — between the data points. The complex $n$-wise relations are modeled by an $n$-way array where each entry corresponds to an affinity measure over an $n$-tuple of data points. We show that a probabilistic assignment of data points to clusters is equivalent, under mild conditional independence assumptions, to a super-symmetric non-negative factorization of the closest hyper-stochastic version of the input $n$-way affinity array. We derive an algorithm for finding a local minimum solution to the factorization problem whose computational complexity is proportional to the number of $n$-tuple samples drawn from the data. We apply the algorithm to a number of visual interpretation problems including 3D multi-body segmentation and illumination-based clustering of human faces.

Amnon Shashua, Ron Zass, Tamir Hazan
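
An illustrative pairwise ($n = 2$) instance of this scheme: make the affinity matrix doubly stochastic (the hyper-stochastic condition for $n = 2$) by Sinkhorn iterations, then compute a symmetric non-negative factorization $A \approx G G^T$ whose rows act as soft cluster assignments. The damped multiplicative update is a standard symmetric-NMF rule, used here as a stand-in for the paper's sampling-based algorithm.

```python
# Sketch: doubly stochastic normalization + symmetric non-negative factorization.
import numpy as np

def sinkhorn(A, n_iter=200):
    """Scale a positive affinity matrix toward doubly stochastic."""
    A = A.copy()
    for _ in range(n_iter):                      # alternate row/column scaling
        A /= A.sum(1, keepdims=True)
        A /= A.sum(0, keepdims=True)
    return 0.5 * (A + A.T)                       # re-symmetrize

def symmetric_nmf(A, k, n_iter=500, rng=np.random.default_rng(0)):
    """Find non-negative G (n x k) with A ~ G G^T."""
    G = rng.random((A.shape[0], k))
    for _ in range(n_iter):
        num = A @ G
        den = G @ (G.T @ G) + 1e-12
        G *= 0.5 + 0.5 * num / den               # damped multiplicative update
    return G

# Usage sketch: labels = symmetric_nmf(sinkhorn(affinity), k).argmax(axis=1)
```
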
Backmatter
Metadata
Title
Computer Vision – ECCV 2006
Edited by
Aleš Leonardis
Horst Bischof
Axel Pinz
Copyright Year
2006
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-33839-0
Print ISBN
978-3-540-33838-3
DOI
https://doi.org/10.1007/11744085
