
## About this book

Welcome to the proceedings of the 8th European Conference on Computer Vision! Following a very successful ECCV 2002, the response to our call for papers was almost equally strong – 555 papers were submitted. We accepted 41 papers for oral and 149 papers for poster presentation. Several innovations were introduced into the review process. First, the number of program committee members was increased to reduce their review load. We managed to assign no more than 12 papers to each program committee member. Second, we adopted a paper ranking system. Program committee members were asked to rank all the papers assigned to them, even those that were reviewed by additional reviewers. Third, we allowed authors to respond to the reviews consolidated in a discussion involving the area chair and the reviewers. Fourth, the reports, the reviews, and the responses were made available to the authors as well as to the program committee members. Our aim was to provide the authors with maximal feedback and to let the program committee members know how authors reacted to their reviews and how their reviews were or were not reflected in the final decision. Finally, we reduced the length of reviewed papers from 15 to 12 pages. The preparation of ECCV 2004 went smoothly thanks to the efforts of the organizing committee, the area chairs, the program committee, and the reviewers. We are indebted to Anders Heyden, Mads Nielsen, and Henrik J. Nielsen for passing on ECCV traditions and to Dominique Asselineau from ENST/TSI who kindly provided his GestRFIA conference software. We thank Jan-Olof Eklundh and Andrew Zisserman for encouraging us to organize ECCV 2004 in Prague.

## Table of contents

### A Unified Algebraic Approach to 2-D and 3-D Motion Segmentation

We present an analytic solution to the problem of estimating multiple 2-D and 3-D motion models from two-view correspondences or optical flow. The key to our approach is to view the estimation of multiple motion models as the estimation of a single multibody motion model. This is possible thanks to two important algebraic facts. First, we show that all the image measurements, regardless of their associated motion model, can be fit with a real or complex polynomial. Second, we show that the parameters of the motion model associated with an image measurement can be obtained from the derivatives of the polynomial at the measurement. This leads to a novel motion segmentation algorithm that applies to most of the two-view motion models adopted in computer vision. Our experiments show that the proposed algorithm outperforms existing algebraic methods in terms of efficiency and robustness, and provides a good initialization for iterative techniques, such as EM, which is strongly dependent on correct initialization.

René Vidal, Yi Ma
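
The two algebraic facts in the abstract can be illustrated in a simplified setting. The sketch below assumes, purely for illustration, that each of the $n$ motion models imposes a linear constraint $b_i^\top x = 0$ on a measurement $x$; the symbols $b_i$, $x$, and $p$ are notation introduced here and are not taken from the paper. Every measurement, whichever model it belongs to, then satisfies a single polynomial equation, and the gradient of that polynomial at a measurement recovers the parameters of its model up to scale:

$$
p(x) = \prod_{i=1}^{n} \left(b_i^\top x\right) = 0,
\qquad
\nabla p(x) = \sum_{i=1}^{n} b_i \prod_{j \neq i} \left(b_j^\top x\right).
$$

If $x$ belongs to model $k$ only, then $b_k^\top x = 0$ and every term with $i \neq k$ contains that vanishing factor, so

$$
\nabla p(x) = b_k \prod_{j \neq k} \left(b_j^\top x\right) \;\propto\; b_k .
$$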

### Enhancing Particle Filters Using Local Likelihood Sampling

Particle filters provide a means to track the state of an object even when the dynamics and the observations are non-linear/non-Gaussian. However, they can be very inefficient when the observation noise is low compared to the system noise, as is often the case in visual tracking applications. In this paper we propose a new two-stage sampling procedure to boost the performance of particle filters under this condition. We provide conditions under which the new procedure is proven to reduce the variance of the weights. Synthetic and real-world visual tracking experiments are used to confirm the validity of the theoretical analysis.

Péter Torma, Csaba Szepesvári
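
The inefficiency mentioned in the abstract can be made concrete with the effective sample size (ESS) of the importance weights, a standard diagnostic for particle filters. The following is a minimal sketch, not the authors' two-stage procedure: the 1-D Gaussian setup, the function name `effective_sample_size`, and all numeric values are assumptions chosen for illustration.

```python
import math
import random


def effective_sample_size(particles, observation, obs_std):
    """Return ESS = 1 / sum(w_i^2) for normalized Gaussian-likelihood weights."""
    # Work with log-weights and subtract the maximum to avoid underflow.
    log_w = [-0.5 * ((observation - x) / obs_std) ** 2 for x in particles]
    m = max(log_w)
    w = [math.exp(lw - m) for lw in log_w]
    total = sum(w)
    return 1.0 / sum((wi / total) ** 2 for wi in w)


# Particles drawn from a broad prior (standing in for large system noise).
rng = random.Random(0)
particles = [rng.gauss(0.0, 1.0) for _ in range(1000)]

# Same particles, weighted under low and high observation noise (assumed values).
ess_low_noise = effective_sample_size(particles, observation=0.5, obs_std=0.01)
ess_high_noise = effective_sample_size(particles, observation=0.5, obs_std=1.0)
print(ess_low_noise, ess_high_noise)
```

With a sharply peaked likelihood only a few particles carry appreciable weight, so the ESS is a small fraction of the particle count; this is the regime the abstract targets.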

### A Boosted Particle Filter: Multitarget Detection and Tracking

The problem of tracking a varying number of non-rigid objects has two major difficulties. First, the observation models and target distributions can be highly non-linear and non-Gaussian. Second, the presence of a large, varying number of objects creates complex interactions with overlap and ambiguities. To surmount these difficulties, we introduce a vision system that is capable of learning, detecting and tracking the objects of interest. The system is demonstrated in the context of tracking hockey players using video sequences. Our approach combines the strengths of two successful algorithms: mixture particle filters and Adaboost. The mixture particle filter [17] is ideally suited to multi-target tracking as it assigns a mixture component to each player. The crucial design issues in mixture particle filters are the choice of the proposal distribution and the treatment of objects leaving and entering the scene. Here, we construct the proposal distribution using a mixture model that incorporates information from the dynamic models of each player and the detection hypotheses generated by Adaboost. The learned Adaboost proposal distribution allows us to quickly detect players entering the scene, while the filtering process enables us to keep track of the individual players. The result of interleaving Adaboost with mixture particle filters is a simple, yet powerful and fully automatic multiple object tracking system.

Kenji Okuma, Ali Taleghani, Nando de Freitas, James J. Little, David G. Lowe

### Simultaneous Object Recognition and Segmentation by Image Exploration

Methods based on local, viewpoint invariant features have proven capable of recognizing objects in spite of viewpoint changes, occlusion and clutter. However, these approaches fail when these factors are too strong, due to the limited repeatability and discriminative power of the features. As additional shortcomings, the objects need to be rigid and only their approximate location is found. We present a novel object recognition approach which overcomes these limitations. An initial set of feature correspondences is first generated. The method anchors on it and then gradually explores the surrounding area, trying to construct more and more matching features, increasingly farther from the initial ones. The resulting process covers the object with matches, and simultaneously separates the correct matches from the wrong ones. Hence, recognition and segmentation are achieved at the same time. Only very few correct initial matches suffice for reliable recognition. The experimental results demonstrate the stronger power of the presented method in dealing with extensive clutter, dominant occlusion, and large scale and viewpoint changes. Moreover, non-rigid deformations are explicitly taken into account, and the approximate contours of the object are produced. The approach can extend any viewpoint invariant feature extractor.

Vittorio Ferrari, Tinne Tuytelaars, Luc Van Gool

### Recognition by Probabilistic Hypothesis Construction

We present a probabilistic framework for recognizing objects in images of cluttered scenes. Hundreds of objects may be considered and searched in parallel. Each object is learned from a single training image and modeled by the visual appearance of a set of features, and their position with respect to a common reference frame. The recognition process computes identity and position of objects in the scene by finding the best interpretation of the scene in terms of learned objects. Features detected in an input image are either paired with database features, or marked as clutter. Each hypothesis is scored using a generative model of the image which is defined using the learned objects and a model for clutter. While the space of possible hypotheses is enormously large, one may find the best hypothesis efficiently – we explore some heuristics to do so. Our algorithm compares favorably with state-of-the-art recognition systems.

Pierre Moreels, Michael Maire, Pietro Perona

### Human Detection Based on a Probabilistic Assembly of Robust Part Detectors

We describe a novel method for human detection in single images which can detect full bodies as well as close-up views in the presence of clutter and occlusion. Humans are modeled as flexible assemblies of parts, and robust part detection is the key to the approach. The parts are represented by co-occurrences of local features which capture the spatial layout of the part’s appearance. Feature selection and the part detectors are learnt from training images using AdaBoost. The detection algorithm is very efficient as (i) all part detectors use the same initial features, (ii) a coarse-to-fine cascade approach is used for part detection, and (iii) a part assembly strategy reduces the number of spurious detections and the search space. The results outperform existing human detectors.

Krystian Mikolajczyk, Cordelia Schmid, Andrew Zisserman

### Model Selection for Range Segmentation of Curved Objects

In the present paper, we address the problem of recovering the true underlying model of a surface while performing the segmentation. A novel criterion for surface (model) selection is introduced and its performance for selecting the underlying model of various surfaces has been tested and compared with many other existing techniques. Using this criterion, we then present a range data segmentation algorithm capable of segmenting complex objects with planar and curved surfaces. The algorithm simultaneously identifies the type (order and geometric shape) of surface and separates all the points that are part of that surface from the rest in a range image. The paper includes the segmentation results of a large collection of range images obtained from objects with planar and curved surfaces.

### High-Contrast Color-Stripe Pattern for Rapid Structured-Light Range Imaging

For structured-light range imaging, color stripes can be used for increasing the number of distinguishable light patterns compared to binary BW stripes. Therefore, an appropriate use of color patterns can reduce the number of light projections, and range imaging is achievable in a single video frame or in “one shot”. On the other hand, the reliability and range resolution attainable from color stripes are generally lower than those from multiply projected binary BW patterns since color contrast is affected by object color reflectance and ambient light. This paper presents new methods for selecting stripe colors and designing multiple-stripe patterns for “one-shot” and “two-shot” imaging. We show that maximizing color contrast between the stripes in one-shot imaging reduces the ambiguities resulting from colored object surfaces and limitations in sensor/projector resolution. Two-shot imaging adds an extra video frame and maximizes the color contrast between the first and second video frames to diminish the ambiguities even further. Experimental results demonstrate the effectiveness of the presented one-shot and two-shot color-stripe imaging schemes.

Changsoo Je, Sang Wook Lee, Rae-Hong Park

### Using Inter-feature-Line Consistencies for Sequence-Based Object Recognition

An image sequence-based framework for appearance-based object recognition is proposed in this paper. Compared with methods that use a single view for object recognition, a sequence-based method can exploit inter-frame consistencies, so that a better recognition performance can be achieved. We use the nearest feature line method (NFL) [8] to model each object. The NFL method is extended in this paper by further integrating motion-continuity information between feature lines in a probabilistic framework. The associated recognition task is formulated as maximizing an a posteriori probability measure. The recognition problem is then further transformed to a shortest-path searching problem, and a dynamic-programming technique is used to solve it.

Jiun-Hung Chen, Chu-Song Chen
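
The last step of the abstract, solving a shortest-path problem by dynamic programming, can be sketched generically. The code below is a Viterbi-style recursion over a layered graph with one layer per frame; it is not the authors' formulation. The function `shortest_path`, the per-frame costs, and the transition penalty are hypothetical stand-ins (for instance, costs could be negative log-probabilities).

```python
def shortest_path(node_costs, transition_cost):
    """Find the minimum-cost state sequence through a layered graph.

    node_costs[t][s] is the cost of choosing state s at frame t;
    transition_cost(p, s) is the cost of moving from state p to state s.
    Returns (total_cost, path) where path has one state index per frame.
    """
    n_states = len(node_costs[0])
    best = list(node_costs[0])  # best cost of any path ending in each state
    back = []                   # back-pointers, one list per later frame
    for t in range(1, len(node_costs)):
        new_best, pointers = [], []
        for s in range(n_states):
            prev = min(range(n_states),
                       key=lambda p: best[p] + transition_cost(p, s))
            new_best.append(best[prev] + transition_cost(prev, s)
                            + node_costs[t][s])
            pointers.append(prev)
        best = new_best
        back.append(pointers)
    # Trace the optimal path backwards from the cheapest final state.
    end = min(range(n_states), key=lambda s: best[s])
    path = [end]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[end], path


# Hypothetical example: two candidate states over three frames.
costs = [[1, 5], [4, 3], [1, 6]]
switch_penalty = lambda p, s: 0 if p == s else 2  # stands in for motion continuity
total, path = shortest_path(costs, switch_penalty)
print(total, path)  # prints: 6 [0, 0, 0]
```

In this example the per-frame minima would select state 1 in the middle frame, but the continuity penalty makes staying in state 0 cheaper overall, which is the kind of inter-frame consistency a sequence-based method exploits.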

### Discriminant Analysis on Embedded Manifold

Previous manifold learning algorithms mainly focus on uncovering the low dimensional geometry structure from a set of samples that lie on or nearly on a manifold in an unsupervised manner. However, the representations from unsupervised learning are not always optimal in discriminating capability. In this paper, a novel algorithm is introduced to conduct discriminant analysis in terms of the embedded manifold structure. We propose a novel clustering algorithm, called Intra-Cluster Balanced K-Means (ICBKM), which ensures that there are balanced samples for the classes in a cluster; the local discriminative features for all clusters are then calculated simultaneously by following the global Fisher criterion. Compared to the traditional linear/kernel discriminant analysis algorithms, ours has the following characteristics: 1) it is approximately a locally linear yet globally nonlinear discriminant analyzer; 2) it can be considered a special Kernel-DA with a geometry-adaptive kernel, in contrast to traditional KDA whose kernel is independent of the samples; and 3) its computation and memory cost are reduced a great deal compared to traditional KDA, especially for cases with a large number of samples. It does not need to store the original samples for computing the low dimensional representation for new data. The evaluation on a toy problem shows that it is effective in deriving discriminative representations for a problem with a nonlinear classification hyperplane. When applied to the face recognition problem, it is shown that, compared with LDA and traditional KDA on YALE and PIE databases, the proposed algorithm significantly outperforms LDA and

Shuicheng Yan, Hongjiang Zhang, Yuxiao Hu, Benyu Zhang, Qiansheng Cheng

### Multiscale Inverse Compositional Alignment for Subdivision Surface Maps

We propose an efficient alignment method for textured Doo-Sabin subdivision surface templates. A variation of the inverse compositional image alignment is derived by introducing smooth adjustments in the parametric space of the surface and relating them to the control point increments. The convergence properties of the proposed method are improved by a coarse-to-fine multiscale matching. The method is applied to real-time tracking of specially marked surfaces from a single camera view.

Igor Guskov

### A Fourier Theory for Cast Shadows

Cast shadows can be significant in many computer vision applications such as lighting-insensitive recognition and surface reconstruction. However, most algorithms neglect them, primarily because they involve non-local interactions in non-convex regions, making formal analysis difficult. While general cast shadowing situations can be arbitrarily complex, many real instances map closely to canonical configurations like a wall, a V-groove type structure, or a pitted surface. In particular, we experiment on 3D textures like moss, gravel and a kitchen sponge, whose surfaces include canonical cast shadowing situations like V-grooves. This paper shows theoretically that many shadowing configurations can be mathematically analyzed using convolutions and Fourier basis functions. Our analysis exposes the mathematical convolution structure of cast shadows, and shows strong connections to recently developed signal-processing frameworks for reflection and illumination. An analytic convolution formula is derived for a 2D V-groove, which is shown to correspond closely to many common shadowing situations, especially in 3D textures. Numerical simulation is used to extend these results to general 3D textures. These results also provide evidence that a common set of illumination basis functions may be appropriate for representing lighting variability due to cast shadows in many 3D textures. We derive a new analytic basis suited for 3D textures to represent illumination on the hemisphere, with some advantages over commonly used Zernike polynomials and spherical harmonics. New experiments on analyzing the variability in appearance of real 3D textures with illumination motivate and validate our theoretical analysis. Empirical results show that illumination eigenfunctions often correspond closely to Fourier bases, while the eigenvalues drop off significantly more slowly than those for irradiance on a Lambertian curved surface. These new empirical results are explained in this paper, based on our theory.

Ravi Ramamoorthi, Melissa Koudelka, Peter Belhumeur

### Surface Reconstruction by Propagating 3D Stereo Data in Multiple 2D Images

We present a novel approach to surface reconstruction from multiple images. The central idea is to explore the integration of both 3D stereo data and 2D calibrated images. This is motivated by the fact that only robust and accurate feature points that survived the geometry scrutiny of multiple images are reconstructed in space. The density insufficiency and the inevitable holes in the stereo data should be filled in by using information from multiple images. The idea is therefore to first construct small surface patches from stereo points, then to progressively propagate only reliable patches in their neighborhood from images into the whole surface using a best-first strategy. The problem reduces to searching for an optimal local surface patch going through a given set of stereo points from images. This constrained optimization for a surface patch could be handled by a local graph-cut that we develop. Real experiments demonstrate the usability and accuracy of the approach.

Gang Zeng, Sylvain Paris, Long Quan, Maxime Lhuillier

### Visibility Analysis and Sensor Planning in Dynamic Environments

We analyze visibility from static sensors in a dynamic scene with moving obstacles (people). Such analysis is considered in a probabilistic sense in the context of multiple sensors, so that visibility from even one sensor might be sufficient. Additionally, we analyze worst-case scenarios for high-security areas where targets are non-cooperative. Such visibility analysis provides important performance characterization of multi-camera systems. Furthermore, maximization of visibility in a given region of interest yields the optimum number and placement of cameras in the scene. Our analysis has applications in surveillance – manual or automated – and can be utilized for sensor planning in places like museums, shopping malls, subway stations and parking lots. We present several example scenes – simulated and real – for which interesting camera configurations were obtained using the formal analysis developed in the paper.

Anurag Mittal, Larry S. Davis
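
The remark that visibility from even one sensor might be sufficient can be illustrated with a small calculation. The sketch below assumes, for illustration only, that occlusion events at different sensors are independent; the abstract does not state this, and the paper's analysis may model dependencies between sensors. The function name and the probabilities are hypothetical.

```python
def prob_visible_from_any(per_sensor_probs):
    """Probability that at least one sensor sees the target.

    Assumes independent visibility events (an illustrative simplification):
    P(any) = 1 - prod_i (1 - p_i).
    """
    p_none = 1.0
    for p in per_sensor_probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
        p_none *= 1.0 - p
    return 1.0 - p_none


# Three hypothetical cameras, none of which is reliable on its own.
combined = prob_visible_from_any([0.6, 0.5, 0.2])
print(round(combined, 2))  # prints: 0.84
```

A placement search of the kind the abstract describes could score candidate camera configurations with such a quantity over a region of interest.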

### Camera Calibration from the Quasi-affine Invariance of Two Parallel Circles

In this paper, we propose a new camera calibration algorithm based on the quasi-affine invariance of two parallel circles. Here, two parallel circles are two circles lying in one plane or in two parallel planes; such configurations are common in everyday scenes. We establish a quasi-affine invariance between two parallel circles and their images under a perspective projection. In particular, if the images are separate, we find an interesting distribution of the images and their virtual intersections, and prove that it is a quasi-affine invariance. This invariance is used to identify the images of the circular points. Once the images of the circular points are identified, linear equations on the intrinsic parameters are established, from which a camera calibration algorithm follows. We perform both simulated and real experiments to verify it. The results validate the method and show its accuracy and robustness. Compared with previous methods in the literature, this calibration method has the following advantages: it uses a minimal number of parallel circles; it is simple by virtue of the proposed quasi-affine invariance; and it does not need any matching. Besides camera calibration, the proposed quasi-affine invariance can also be used to remove the ambiguity in recovering the geometry of single axis motions by the conic fitting method in [8] and [9]. In those two papers, three conics are needed to remove the ambiguity. Two conics suffice if they are separate and the proposed quasi-affine invariance is taken into account.

Yihong Wu, Haijiang Zhu, Zhanyi Hu, Fuchao Wu

### Texton Correlation for Recognition

We study the problem of object, in particular face, recognition under varying imaging conditions. Objects are represented using local characteristic features called textons. Appearance variations due to changing conditions are encoded by the correlations between the textons. We propose two solutions to model these correlations. The first one assumes locational independence. We call it the conditional texton distribution model. The second captures the second order variations across locations using Fisher linear discriminant analysis. We call it the Fisher texton model. Our two models are effective in the problem of face recognition from a single image across a wide range of illuminations, poses, and time.

Thomas Leung

### Multiple View Feature Descriptors from Image Sequences via Kernel Principal Component Analysis

We present a method for learning feature descriptors using multiple images, motivated by the problems of mobile robot navigation and localization. The technique uses the relative simplicity of small baseline tracking in image sequences to develop descriptors suitable for the more challenging task of wide baseline matching across significant viewpoint changes. The variations in the appearance of each feature are learned using kernel principal component analysis (KPCA) over the course of image sequences. An approximate version of KPCA is applied to reduce the computational complexity of the algorithms and yield a compact representation. Our experiments demonstrate robustness to wide appearance variations on non-planar surfaces, including changes in illumination, viewpoint, scale, and geometry of the scene.

Jason Meltzer, Ming-Hsuan Yang, Rakesh Gupta, Stefano Soatto

### An Affine Invariant Salient Region Detector

In this paper we describe a novel technique for detecting salient regions in an image. The detector is a generalization to affine invariance of the method introduced by Kadir and Brady [10]. The detector deems a region salient if it exhibits unpredictability in both its attributes and its spatial scale. The detector has significantly different properties to operators based on kernel convolution, and we examine three aspects of its behaviour: invariance to viewpoint change; insensitivity to image perturbations; and repeatability under intra-class variation. Previous work has, on the whole, concentrated on viewpoint invariance. A second contribution of this paper is to propose a performance test for evaluating the two other aspects. We compare the performance of the saliency detector to other standard detectors including an affine invariant interest point detector. It is demonstrated that the saliency detector has comparable viewpoint invariance performance, but superior insensitivity to perturbations and intra-class variation performance for images of certain object classes.

### A Visual Category Filter for Google Images

We extend the constellation model to include heterogeneous parts which may represent either the appearance or the geometry of a region of the object. The parts and their spatial configuration are learnt simultaneously and automatically, without supervision, from cluttered images. We describe how this model can be employed for ranking the output of an image search engine when searching for object categories. It is shown that visual consistencies in the output images can be identified, and then used to rank the images according to their closeness to the visual object category. Although the proportion of good images may be small, the algorithm is designed to be robust and is capable of learning in either a totally unsupervised manner, or with a very limited amount of supervision. We demonstrate the method on image sets returned by Google’s image search for a number of object categories including bottles, camels, cars, horses, tigers and zebras.

Robert Fergus, Pietro Perona, Andrew Zisserman

### Scene and Motion Reconstruction from Defocused and Motion-Blurred Images via Anisotropic Diffusion

We propose a solution to the problem of inferring the depth map, radiance and motion of a scene from a collection of motion-blurred and defocused images. We model motion-blur and defocus as an anisotropic diffusion process, whose initial conditions depend on the radiance and whose diffusion tensor encodes the shape of the scene, the motion field and the optics parameters. We show that this model is well-posed and propose an efficient algorithm to infer the unknowns of the model. Inference is performed by minimizing the discrepancy between the measured blurred images and the ones synthesized via forward diffusion. Since the problem is ill-posed, we also introduce additional Tikhonov regularization terms. The resulting method is fast and robust to noise as shown by experiments with both synthetic and real data.

Paolo Favaro, Martin Burger, Stefano Soatto

### Semantics Discovery for Image Indexing

To bridge the gap between low-level features and high-level semantic queries in image retrieval, detecting meaningful visual entities (e.g. faces, sky, foliage, buildings etc) based on trained pattern classifiers has become an active research trend. However, a drawback of the supervised learning approach is the human effort to provide labeled regions as training samples. In this paper, we propose a new three-stage hybrid framework to discover local semantic patterns and generate their samples for training with minimal human intervention. Support vector machines (SVM) are first trained on local image blocks from a small number of images labeled as several semantic categories. Then to bootstrap the local semantics, image blocks that produce high SVM outputs are grouped into Discovered Semantic Regions (DSRs) using fuzzy c-means clustering. The training samples for these DSRs are automatically induced from cluster memberships and subject to support vector machine learning to form local semantic detectors for DSRs. An image is then indexed as a tessellation of DSR histograms and matched using histogram intersection. We evaluate our method against the linear fusion of color and texture features using 16 semantic queries on 2400 heterogeneous consumer photos. The DSR models achieved a promising 26% improvement in average precision over that of the feature fusion approach.

Joo-Hwee Lim, Jesse S. Jin
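
The matching step named in the abstract, histogram intersection, is simple enough to sketch. The code below shows the standard measure for two normalized histograms; the function name and the example bin values are assumptions, and how the paper combines scores across the tiles of the tessellation is not specified in the abstract and is not modeled here.

```python
def histogram_intersection(h1, h2):
    """Similarity of two histograms: the sum of bin-wise minima.

    For histograms that each sum to 1, the result lies in [0, 1],
    and equals 1 only when the histograms are identical.
    """
    if len(h1) != len(h2):
        raise ValueError("histograms must have the same number of bins")
    return sum(min(a, b) for a, b in zip(h1, h2))


# Hypothetical 3-bin histograms for one tile of a query and a database image.
query_tile = [0.5, 0.3, 0.2]
database_tile = [0.2, 0.3, 0.5]
score = histogram_intersection(query_tile, database_tile)
print(round(score, 2))  # prints: 0.7
```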

### Hand Gesture Recognition within a Linguistics-Based Framework

An approach to recognizing hand gestures from a monocular temporal sequence of images is presented. Of particular concern is the representation and recognition of hand movements that are used in single handed American Sign Language (ASL). The approach exploits previous linguistic analysis of manual languages that decompose dynamic gestures into their static and dynamic components. The first level of decomposition is in terms of three sets of primitives, hand shape, location and movement. Further levels of decomposition involve the lexical and sentence levels and are part of our plan for future work. We propose and demonstrate that given a monocular gesture sequence, kinematic features can be recovered from the apparent motion that provide distinctive signatures for 14 primitive movements of ASL. The approach has been implemented in software and evaluated on a database of 592 gesture sequences with an overall recognition rate of 86.00% for fully automated processing and 97.13% for manually initialized processing.

Konstantinos G. Derpanis, Richard P. Wildes, John K. Tsotsos

### Line Geometry for 3D Shape Understanding and Reconstruction

We understand and reconstruct special surfaces from 3D data with line geometry methods. Based on estimated surface normals we use approximation techniques in line space to recognize and reconstruct rotational, helical, developable and other surfaces, which are characterized by the configuration of locally intersecting surface normals. For the computational solution we use a modified version of the Klein model of line space. Obvious applications of these methods lie in reverse engineering. We have tested our algorithms on real world data obtained from objects such as antique pottery, gear wheels, and a surface of the ankle joint.

Helmut Pottmann, Michael Hofer, Boris Odehnal, Johannes Wallner

### Extending Interrupted Feature Point Tracking for 3-D Affine Reconstruction

Feature point tracking over a video sequence fails when the points go out of the field of view or behind other objects. In this paper, we extend such interrupted tracking by imposing the constraint that under the affine camera model all feature trajectories should be in an affine space. Our method consists of iterations for optimally extending the trajectories and for optimally estimating the affine space, coupled with an outlier removal process. Using real video images, we demonstrate that our method can restore a sufficient number of trajectories for detailed 3-D reconstruction.

Yasuyuki Sugaya, Kenichi Kanatani

### Many-to-Many Feature Matching Using Spherical Coding of Directed Graphs

In recent work, we presented a framework for many-to-many matching of multi-scale feature hierarchies, in which features and their relations were captured in a vertex-labeled, edge-weighted directed graph. The algorithm was based on a metric-tree representation of labeled graphs and their metric embedding into normed vector spaces, using the embedding algorithm of Matousek [13]. However, the method was limited by the fact that two graphs to be matched were typically embedded into vector spaces with different dimensionality. Before the embeddings could be matched, a dimensionality reduction technique (PCA) was required, which was both costly and prone to error. In this paper, we introduce a more efficient embedding procedure based on a spherical coding of directed graphs. The advantage of this novel embedding technique is that it prescribes a single vector space into which both graphs are embedded. This reduces the problem of directed graph matching to the problem of geometric point matching, for which efficient many-to-many matching algorithms exist, such as the Earth Mover’s Distance. We apply the approach to the problem of multi-scale, view-based object recognition, in which an image is decomposed into a set of blobs and ridges with automatic scale selection.

M. Fatih Demirci, Ali Shokoufandeh, Sven Dickinson, Yakov Keselman, Lars Bretzner

### Coupled-Contour Tracking through Non-orthogonal Projections and Fusion for Echocardiography

Existing methods for incorporating subspace model constraints in contour tracking use only partial information from the measurements and model distribution. We propose a complete fusion formulation for robust contour tracking, optimally resolving uncertainties from heteroscedastic measurement noise, system dynamics, and a subspace model. The resulting non-orthogonal subspace projection is a natural extension of the traditional model constraint using orthogonal projection. We build models for coupled double-contours, and exploit information from the ground truth initialization through a strong model adaptation. Our framework is applied for tracking in echocardiograms where the noise is heteroscedastic, each heart has distinct shape, and the relative motions of epi- and endocardial borders reveal crucial diagnostic features. The proposed method significantly outperforms the traditional shape-space-constrained tracking algorithm. Due to the joint fusion of heteroscedastic uncertainties, the strong model adaptation, and the coupled tracking of double-contours, robust performance is observed even on the most challenging cases.

Xiang Sean Zhou, Dorin Comaniciu, Sriram Krishnan

### A Statistical Model for General Contextual Object Recognition

We consider object recognition as the process of attaching meaningful labels to specific regions of an image, and propose a model that learns spatial relationships between objects. Given a set of images and their associated text (e.g. keywords, captions, descriptions), the objective is to segment an image, in either a crude or sophisticated fashion, then to find the proper associations between words and regions. Previous models are limited by the scope of the representation. In particular, they fail to exploit spatial context in the images and words. We develop a more expressive model that takes this into account. We formulate a spatially consistent probabilistic mapping between continuous image feature vectors and the supplied word tokens. By learning both word-to-region associations and object relations, the proposed model augments scene segmentations due to smoothing implicit in spatial consistency. Context introduces cycles to the undirected graph, so we cannot rely on a straightforward implementation of the EM algorithm for estimating the model parameters and densities of the unknown alignment variables. Instead, we develop an approximate EM algorithm that uses loopy belief propagation in the inference step and iterative scaling on the pseudo-likelihood approximation in the parameter update step. The experiments indicate that our approximate inference and learning algorithm converges to good local solutions. Experiments on a diverse array of images show that spatial context considerably improves the accuracy of object recognition. Most significantly, spatial context combined with a nonlinear discrete object representation allows our models to cope well with over-segmented scenes.

Peter Carbonetto, Nando de Freitas, Kobus Barnard

### Reconstruction from Projections Using Grassmann Tensors

In this paper a general method is given for reconstruction of a set of feature points in an arbitrary dimensional projective space from their projections into lower dimensional spaces. The method extends the methods applied in the well-studied problem of reconstruction of a set of scene points in $\mathcal {P}^3$ given their projections in a set of images. In this case, the bifocal, trifocal and quadrifocal tensors are used to carry out this computation. It is shown that similar methods will apply in a much more general context, and hence may be applied to projections from $\mathcal {P}^n$ to $\mathcal {P}^m$, which have been used in the analysis of dynamic scenes. For sufficiently many generic projections, reconstruction of the scene is shown to be unique up to projectivity, except in the case of projections onto one-dimensional image spaces (lines).

Richard I. Hartley, Fred Schaffalitzky

### Co-operative Multi-target Tracking and Classification

This paper describes a real-time system for multi-target tracking and classification in image sequences from a single stationary camera. Several targets can be tracked simultaneously in spite of splits and merges amongst the foreground objects and the presence of clutter in the segmentation results. In our results we show tracking of up to 17 targets simultaneously. The algorithm combines Kalman filter-based motion and shape tracking with an efficient pattern matching algorithm. The latter facilitates the use of a dynamic programming strategy to efficiently solve the data association problem in the presence of multiple splits and merges. The system is fully automatic and requires no manual input of any kind for initialization of tracking. The initialization for tracking is done using attributed graphs. The algorithm gives stable and noise-free track initialization. The image-based tracking results are used as inputs to a Bayesian network-based classifier to classify the targets into different categories. After classification, a simple 3D model for each class is used along with camera calibration to obtain 3D tracking results for the targets. We present results on a large number of real-world image sequences, and accurate 3D tracking results compared with the readings from the speedometer of the vehicle. The complete tracking system, including segmentation of moving targets, works at about 25 Hz for 352×288 resolution color images on a 2.8 GHz Pentium 4 desktop.

Pankaj Kumar, Surendra Ranganath, Kuntal Sengupta, Huang Weimin
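The Kalman filter-based motion tracking mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a constant-velocity model for a single target centroid; the function names, the state layout `[x, y, vx, vy]`, and the noise levels `q` and `r` are illustrative assumptions, not the authors' system, which also handles shape, splits, merges, and data association.

```python
import numpy as np

def make_cv_model(dt=1.0, q=1e-2, r=1.0):
    """Constant-velocity model (an assumption); state = [x, y, vx, vy]."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]], dtype=float)   # state transition
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)   # only position is measured
    Q = q * np.eye(4)                            # process noise (hypothetical level)
    R = r * np.eye(2)                            # measurement noise (hypothetical level)
    return F, H, Q, R

def predict(x, P, F, Q):
    """Propagate the state estimate and its covariance one time step."""
    return F @ x, F @ P @ F.T + Q

def update(x, P, z, H, R):
    """Correct the prediction with a position measurement z."""
    innovation = z - H @ x
    S = H @ P @ H.T + R
    K = P @ H.T @ np.linalg.inv(S)               # Kalman gain
    x_new = x + K @ innovation
    P_new = (np.eye(len(x)) - K @ H) @ P
    return x_new, P_new

def track(measurements, dt=1.0):
    """Filter a sequence of (x, y) centroids; returns one state per update."""
    F, H, Q, R = make_cv_model(dt)
    x = np.array([measurements[0][0], measurements[0][1], 0.0, 0.0])
    P = 10.0 * np.eye(4)                         # large initial uncertainty
    estimates = []
    for z in measurements[1:]:
        x, P = predict(x, P, F, Q)
        x, P = update(x, P, np.asarray(z, dtype=float), H, R)
        estimates.append(x.copy())
    return np.array(estimates)
```

Calling `track` on a list of per-frame centroids returns the filtered positions and velocities; a multi-target system would run one such filter per target and decide which measurement feeds which filter.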

### A Linguistic Feature Vector for the Visual Interpretation of Sign Language

This paper presents a novel approach to sign language recognition that provides extremely high classification rates on minimal training data. Key to this approach is a two-stage classification procedure in which an initial classification stage extracts a high-level description of hand shape and motion. This high-level description is based upon sign linguistics and describes actions at a conceptual level easily understood by humans. Moreover, such a description broadly generalises temporal activities, naturally overcoming variability of people and environments. A second stage of classification is then used to model the temporal transitions of individual signs using a classifier bank of Markov chains combined with Independent Component Analysis. We demonstrate classification rates as high as 97.67% for a lexicon of 43 words using only single-instance training, outperforming previous approaches where thousands of training examples are required.

### Fast Object Detection with Occlusions

We describe a new framework, based on boosting algorithms and cascade structures, to efficiently detect objects/faces with occlusions. While our approach is motivated by the work of Viola and Jones, several techniques have been developed for establishing a more general system, including (i) a robust boosting scheme, to select useful weak learners and to avoid overfitting; (ii) reinforcement training, to reduce false-positive rates via a more effective training procedure for boosted cascades; and (iii) cascading with evidence, to extend the system to handle occlusions without compromising detection speed. Experimental results on detecting faces under various situations are provided to demonstrate the performance of the proposed method.

Yen-Yu Lin, Tyng-Luh Liu, Chiou-Shann Fuh

### Pose Estimation of Free-Form Objects

In this contribution we present an approach for 2D-3D pose estimation of 3D free-form surface models. In our scenario we observe a free-form object in an image of a calibrated camera. Pose estimation means estimating the relative position and orientation of the 3D object with respect to the reference camera system. The object itself is modeled as a two-parametric 3D surface and extended by one-parametric contour parts of the object. A twist representation, which is equivalent to a Fourier representation, allows for a low-pass approximation of the object model, which is advantageously applied to regularize the pose problem. The experiments show that our algorithms are fast (200 ms/frame) and accurate (1° rotational error/frame).

Bodo Rosenhahn, Gerald Sommer

### Interactive Image Segmentation Using an Adaptive GMMRF Model

The problem of interactive foreground/background segmentation in still images is of great practical importance in image editing. The state of the art in interactive segmentation is probably represented by the graph cut algorithm of Boykov and Jolly (ICCV 2001). Its underlying model uses both colour and contrast information, together with a strong prior for region coherence. Estimation is performed by solving a graph cut problem for which very efficient algorithms have recently been developed. However, the model depends on parameters which must be set by hand, and the aim of this work is for those constants to be learned from image data. First, a generative, probabilistic formulation of the model is set out in terms of a “Gaussian Mixture Markov Random Field” (GMMRF). Secondly, a pseudolikelihood algorithm is derived which jointly learns the colour mixture and coherence parameters for foreground and background respectively. Error rates for GMMRF segmentation are calculated throughout using a new image database, available on the web, with ground truth provided by a human segmenter. The graph cut algorithm, using the learned parameters, generates good object-segmentations with little interaction. However, pseudolikelihood learning proves to be frail, which limits the complexity of usable models, and hence also the achievable error rate.

Andrew Blake, Carsten Rother, M. Brown, Patrick Perez, Philip Torr

### Can We Consider Central Catadioptric Cameras and Fisheye Cameras within a Unified Imaging Model

There are two kinds of omnidirectional cameras often used in computer vision: central catadioptric cameras and fisheye cameras. Previous literature uses different imaging models to describe them separately. In this paper, however, a unified imaging model is presented. It can be considered an extension of the unified imaging model for central catadioptric cameras proposed by Geyer and Daniilidis. We show that our unified model covers some existing models for fisheye cameras and fits well many actual fisheye cameras used in previous literature. Under our unified model, central catadioptric cameras and fisheye cameras can be classified by the model’s characteristic parameter, and a fisheye image can be transformed into a central catadioptric one, and vice versa. An important merit of our new unified model is that existing calibration methods for central catadioptric cameras can be directly applied to fisheye cameras. Furthermore, metric calibration from a single fisheye image using only projections of lines becomes possible via our unified model, whereas the existing methods for fisheye cameras in the literature are all non-metric under the same conditions. Experimental results of calibration from some central catadioptric and fisheye images confirm the validity and usefulness of our new unified model.

Xianghua Ying, Zhanyi Hu

### Image Clustering with Metric, Local Linear Structure, and Affine Symmetry

This paper addresses the problem of clustering images of objects seen from different viewpoints. That is, given an unlabelled set of images of n objects, we seek an unsupervised algorithm that can group the images into n disjoint subsets such that each subset only contains images of a single object. We formulate this clustering problem under a very broad geometric framework. The theme is the interplay between the geometry of appearance manifolds and the symmetry of the 2D affine group. Specifically, we identify three important notions for image clustering: the L2 distance metric of the image space, the local linear structure of the appearance manifolds, and the action of the 2D affine group in the image space. Based on these notions, we propose a new image clustering algorithm. In broad outline, the algorithm uses the metric to determine a neighborhood structure in the image space for each input image. Using local linear structure, comparisons (affinities) between images are computed only among the neighbors. These local comparisons are agglomerated into an affinity matrix, and a spectral clustering algorithm is used to yield the final clustering result. The technical part of the algorithm is to make all of these compatible with the action of the 2D affine group. Using human face images and images from the COIL database, we demonstrate experimentally that our algorithm is effective in clustering images (according to object identity) where there is a large range of pose variation.

Jongwoo Lim, Jeffrey Ho, Ming-Hsuan Yang, Kuang-chih Lee, David Kriegman
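The pipeline outlined in the abstract above (metric neighbourhoods, local affinities, affinity matrix, spectral clustering) can be sketched in a heavily simplified form. The sketch below assumes plain Gaussian affinities on L2 distances between k nearest neighbours and a standard normalized-Laplacian spectral clustering; it omits the local linear structure and the 2D affine group action, which are the paper's actual contributions. All function names and the parameters `k` and `sigma` are hypothetical.

```python
import numpy as np

def knn_affinity(X, k=5, sigma=1.0):
    """Gaussian affinities between each sample and its k nearest L2 neighbours."""
    d2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.zeros_like(d2)
    for i in range(len(X)):
        nbrs = np.argsort(d2[i])[1:k + 1]        # skip the sample itself
        W[i, nbrs] = np.exp(-d2[i, nbrs] / (2.0 * sigma ** 2))
    return np.maximum(W, W.T)                    # symmetrize

def spectral_embedding(W, n_clusters):
    """Rows of the leading eigenvectors of the normalized graph Laplacian."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)                  # eigenvalues in ascending order
    U = vecs[:, :n_clusters]
    return U / np.maximum(np.linalg.norm(U, axis=1, keepdims=True), 1e-12)

def kmeans(U, n_clusters, n_iter=20):
    """Small k-means with deterministic farthest-point initialization."""
    centers = [U[0]]
    for _ in range(1, n_clusters):
        dist = np.min([np.sum((U - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(U[np.argmax(dist)])
    centers = np.array(centers)
    for _ in range(n_iter):
        labels = np.argmin(((U[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(n_clusters):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels

def cluster(X, n_clusters, k=5, sigma=1.0):
    """X holds one flattened image (or feature vector) per row."""
    W = knn_affinity(X, k, sigma)
    return kmeans(spectral_embedding(W, n_clusters), n_clusters)
```

Restricting affinities to nearest neighbours keeps the affinity matrix sparse in effect, so that only local comparisons influence the final grouping.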

### Face Recognition with Local Binary Patterns

In this work, we present a novel approach to face recognition which considers both shape and texture information to represent face images. The face area is first divided into small regions from which Local Binary Pattern (LBP) histograms are extracted and concatenated into a single, spatially enhanced feature histogram efficiently representing the face image. The recognition is performed using a nearest neighbour classifier in the computed feature space with Chi square as a dissimilarity measure. Extensive experiments clearly show the superiority of the proposed scheme over all considered methods (PCA, Bayesian Intra/extrapersonal Classifier and Elastic Bunch Graph Matching) on FERET tests which include testing the robustness of the method against different facial expressions, lighting and aging of the subjects. In addition to its efficiency, the simplicity of the proposed method allows for very fast feature extraction.

Timo Ahonen, Abdenour Hadid, Matti Pietikäinen
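The descriptor and matching rule described in the abstract above can be sketched as follows. This is a minimal illustration assuming the basic 3×3 LBP operator, a fixed grid of regions, and unweighted chi-square distances; the operator variant, grid size, and any region weighting used in the paper may differ, and all function names here are hypothetical.

```python
import numpy as np

def lbp_image(gray):
    """Basic 8-neighbour LBP code for every interior pixel of a 2-D array."""
    g = np.asarray(gray, dtype=float)
    h, w = g.shape
    center = g[1:-1, 1:-1]
    # Neighbour offsets in a fixed clockwise order; each contributes one bit.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = g[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbour >= center).astype(np.uint8) << np.uint8(bit)
    return codes

def spatial_histogram(gray, grid=(4, 4)):
    """Concatenate normalized LBP histograms of grid[0] x grid[1] regions."""
    codes = lbp_image(gray)
    hists = []
    for rows in np.array_split(codes, grid[0], axis=0):
        for block in np.array_split(rows, grid[1], axis=1):
            hist = np.bincount(block.ravel(), minlength=256).astype(float)
            hists.append(hist / max(hist.sum(), 1.0))
    return np.concatenate(hists)

def chi_square(a, b, eps=1e-10):
    """Chi-square dissimilarity between two histograms."""
    return float(np.sum((a - b) ** 2 / (a + b + eps)))

def nearest_neighbour(query, gallery):
    """Index of the gallery descriptor closest to the query descriptor."""
    return int(np.argmin([chi_square(query, g) for g in gallery]))
```

Keeping one histogram per region, rather than one for the whole face, is what preserves the spatial layout of the texture in the final descriptor.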

### Steering in Scale Space to Optimally Detect Image Structures

Detecting low-level image features such as edges and ridges with spatial filters is improved if the scale of the features is known a priori. Scale-space representations and wavelet pyramids address the problem by using filters over multiple scales. However, the scales of the filters are still fixed beforehand and the number of scales is limited by computational power. The filtering operations are thus not adapted to detect image structures at their optimal or intrinsic scales. We adopt the steering approach to obtain filter responses at arbitrary scales from a small set of filters at scales chosen to accurately sample the “scale space” within a given range. In particular, we use the Moore-Penrose inverse to learn the steering coefficients, which we then regress by polynomial function fitting to the scale parameter in order to steer the filter responses continuously across scales. We show that the extrema of the polynomial steering functions can be easily computed to detect interesting features such as phase-independent energy maxima. Such points of energy maxima in our α-scale-space correspond to the intrinsic scale of the filtered image structures. We apply the technique to several well-known images to segment image structures which are mostly characterised by their intrinsic scale.

Jeffrey Ng, Anil A. Bharath
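The two steps named in the abstract above, learning steering coefficients with the Moore-Penrose inverse and regressing them on the scale parameter with polynomials, can be sketched for a toy case. The sketch assumes 1-D Gaussian filters as the basis, which is an illustrative choice and not the filter family of the paper; the function names, the basis scales, and the polynomial degree are likewise assumptions.

```python
import numpy as np

def gaussian_filter_1d(scale, x):
    """Sampled 1-D Gaussian, normalized to unit sum (illustrative filter family)."""
    g = np.exp(-x ** 2 / (2.0 * scale ** 2))
    return g / g.sum()

def learn_steering_polynomials(basis_scales, train_scales, x, degree=3):
    """Fit one polynomial in the scale parameter per basis filter."""
    B = np.stack([gaussian_filter_1d(s, x) for s in basis_scales], axis=1)
    G = np.stack([gaussian_filter_1d(s, x) for s in train_scales], axis=1)
    # Least-squares steering coefficients for every training scale.
    C = np.linalg.pinv(B) @ G                    # shape (n_basis, n_train)
    # Polynomial regression of each coefficient against the scale.
    P = np.polyfit(train_scales, C.T, degree)    # shape (degree + 1, n_basis)
    return B, P

def steering_coefficients(P, scale):
    """Evaluate the fitted polynomials at an arbitrary scale."""
    degree = P.shape[0] - 1
    powers = scale ** np.arange(degree, -1, -1)  # highest power first, as in polyfit
    return powers @ P

def steered_filter(B, P, scale):
    """Approximate the filter at `scale` as a combination of the basis filters."""
    return B @ steering_coefficients(P, scale)
```

Because filtering is linear, the same coefficients can be applied to the responses of the basis filters instead of the filters themselves, so responses at intermediate scales come at the cost of a weighted sum.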

### Hand Motion from 3D Point Trajectories and a Smooth Surface Model

A method is proposed to track the full hand motion from 3D points reconstructed using a stereoscopic set of cameras. This approach combines the advantages of methods that use 2D motion (e.g. optical flow), and those that use a 3D reconstruction at each time frame to capture the hand motion. Matching either contours or a 3D reconstruction against a 3D hand model is usually very difficult due to self-occlusions and the locally-cylindrical structure of each phalanx in the model, but our use of 3D point trajectories constrains the motion and overcomes these problems. Our tracking procedure uses both the 3D point matches between two time frames and a smooth surface model of the hand, built with implicit surfaces. We used animation techniques to faithfully represent the skin motion, especially near joints. Robustness is obtained by using an EM version of the ICP algorithm for matching points between consecutive frames, and the tracked points are then registered to the surface of the hand model. Results are presented on a stereoscopic sequence of a moving hand, and are evaluated using a side view of the sequence.

Guillaume Dewaele, Frédéric Devernay, Radu Horaud

### A Robust Probabilistic Estimation Framework for Parametric Image Models

Models of spatial variation in images are central to a large number of low-level computer vision problems including segmentation, registration, and 3D structure detection. Often, images are represented using parametric models to characterize (noise-free) image variation and additive noise. However, the noise model may be unknown and parametric models may only be valid on individual segments of the image. Consequently, we model noise using a nonparametric kernel density estimation framework and use a locally or globally linear parametric model to represent the noise-free image pattern. This results in a novel, robust, redescending, M-parameter estimator for the above image model which we call the Kernel Maximum Likelihood estimator (KML). We also provide a provably convergent, iterative algorithm for the resultant optimization problem. The estimation framework is empirically validated on synthetic data and applied to the task of range image segmentation.

Maneesh Singh, Himanshu Arora, Narendra Ahuja
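To give a feel for how a kernel-based, redescending estimator of the kind described above behaves, here is a minimal sketch for a straight-line model. It assumes a Gaussian kernel on the residuals and a standard iteratively reweighted least-squares fixed-point iteration; this is a generic illustration of the idea and not the authors' KML algorithm. The function name and the `bandwidth` parameter are hypothetical.

```python
import numpy as np

def kernel_ml_line_fit(x, y, bandwidth=1.0, n_iter=50):
    """Fit y = a*x + b by maximizing sum_i exp(-r_i^2 / (2 h^2)) over (a, b).

    Setting the gradient of this objective to zero gives weighted normal
    equations whose weights are the kernel values of the current residuals,
    so each iteration is a weighted least-squares solve.
    """
    A = np.column_stack([x, np.ones_like(x)])
    theta, *_ = np.linalg.lstsq(A, y, rcond=None)    # ordinary LS as a start
    for _ in range(n_iter):
        r = y - A @ theta
        w = np.exp(-0.5 * (r / bandwidth) ** 2)      # weights vanish for large residuals
        Aw = A * w[:, None]
        theta_new = np.linalg.solve(A.T @ Aw, Aw.T @ y)
        converged = np.allclose(theta_new, theta, atol=1e-10)
        theta = theta_new
        if converged:
            break
    return theta                                     # [slope, intercept]
```

The weights decay to zero as residuals grow, which is the redescending property: gross outliers stop influencing the fit altogether instead of merely being down-weighted.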

### Keyframe Selection for Camera Motion and Structure Estimation from Multiple Views

Estimation of camera motion and structure of rigid objects in the 3D world from multiple camera images by bundle adjustment is often performed by iterative minimization methods due to their low computational effort. These methods need a robust initialization in order to converge to the global minimum. In this paper a new criterion for keyframe selection is presented. While state-of-the-art criteria just avoid degenerate camera motion configurations, the proposed criterion selects the keyframe pairing with the lowest expected estimation error of initial camera motion and object structure. The presented results show that the convergence probability of bundle adjustment is significantly improved with the new criterion compared to the state-of-the-art approaches.

Thorsten Thormählen, Hellward Broszio, Axel Weissenfeld

### Omnidirectional Vision: Unified Model Using Conformal Geometry

It has been proven that a catadioptric projection can be modeled by an equivalent spherical projection. In this paper we present an extension and improvement of those ideas using the conformal geometric algebra, a modern framework for the projective space of hyper-spheres. Using this mathematical system, the analysis of diverse catadioptric mirrors becomes transparent and computationally simpler. As a result, the algebraic burden is reduced, allowing the user to work in a much more effective framework for the development of algorithms for omnidirectional vision. This paper includes complementary experimental analysis related to omnidirectional vision guided robot navigation.

Eduardo Bayro-Corrochano, Carlos López-Franco

### A Robust Algorithm for Characterizing Anisotropic Local Structures

This paper proposes a robust estimation and validation framework for characterizing local structures in a positive multi-variate continuous function approximated by a Gaussian-based model. The new solution is robust against data with large deviations from the model and margin-truncations induced by neighboring structures. To this end, it unifies robust statistical estimation for parametric model fitting and multi-scale analysis based on continuous scale-space theory. The unification is realized by formally extending the mean shift-based density analysis towards continuous signals whose local structure is characterized by an anisotropic fully-parameterized covariance matrix. A statistical validation method based on analyzing residual error of the chi-square fitting is also proposed to complement this estimation framework. The strength of our solution is the aforementioned robustness. Experiments with synthetic 1D and 2D data clearly demonstrate this advantage in comparison with the γ-normalized Laplacian approach [12] and the standard sample estimation approach [13, p.179]. The new framework is applied to 3D volumetric analysis of lung tumors. A 3D implementation is evaluated with high-resolution CT images of 14 patients with 77 tumors, including 6 part-solid or ground-glass opacity nodules that are highly non-Gaussian and clinically significant. Our system accurately estimated 3D anisotropic spread and orientation for 82% of the total tumors and also correctly rejected all the failures without any false rejection or false acceptance. This system processes each 32-voxel volume-of-interest in an average of two seconds with a 2.4 GHz Intel CPU. Our framework is generic and can be applied for the analysis of blob-like structures in various other applications.

Kazunori Okada, Dorin Comaniciu, Navneet Dalal, Arun Krishnan

### Dimensionality Reduction by Canonical Contextual Correlation Projections

A linear, discriminative, supervised technique for reducing feature vectors extracted from image data to a lower-dimensional representation is proposed. It is derived from classical Fisher linear discriminant analysis (LDA) and useful, for example, in supervised segmentation tasks in which a high-dimensional feature vector describes the local structure of the image. In general, the main idea of the technique is applicable in discriminative and statistical modelling that involves contextual data. LDA is a basic, well-known and useful technique in many applications. Our contribution is that we extend the use of LDA to cases where there is dependency between the output variables, i.e., the class labels, and not only between the input variables. The latter can be dealt with in standard LDA. The principal idea is that where standard LDA merely takes into account a single class label for every feature vector, the new technique incorporates class labels of its neighborhood in its analysis as well. In this way, the spatial class label configuration in the vicinity of every feature vector is accounted for, resulting in a technique suitable for e.g. image data. This spatial LDA is derived from a formulation of standard LDA in terms of canonical correlation analysis. The linear dimension reduction transformation thus obtained is called the canonical contextual correlation projection. An additional drawback of LDA is that it cannot extract more features than the number of classes minus one. In the two-class case this means that only a reduction to one dimension is possible. Our contextual LDA approach can avoid such extreme deterioration of the classification space and retain more than one dimension. The technique is exemplified on a pixel-based segmentation problem. An illustrative experiment on a medical image segmentation task shows the performance improvements possible employing the canonical contextual correlation projection.

Marco Loog, Bram van Ginneken, Robert P. W. Duin
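The limitation of standard LDA noted in the abstract above (at most "number of classes minus one" features) is easy to see in the two-class case, where classical Fisher LDA yields a single projection direction. The sketch below shows only that classical baseline, not the contextual extension proposed in the paper; the function name and the small ridge term `reg` are assumptions added for illustration and numerical stability.

```python
import numpy as np

def fisher_direction(X0, X1, reg=1e-8):
    """Two-class Fisher LDA: the single direction w = Sw^{-1} (m1 - m0).

    X0 and X1 hold one feature vector per row for class 0 and class 1.
    """
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter: sum of the per-class scatter matrices.
    Sw = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    Sw += reg * np.eye(Sw.shape[0])      # hypothetical ridge term for stability
    w = np.linalg.solve(Sw, m1 - m0)
    return w / np.linalg.norm(w)

def project(X, w):
    """Reduce every feature vector in X to one discriminative feature."""
    return X @ w
```

With two classes the between-class scatter matrix has rank one, so no second discriminative direction exists however high-dimensional the input features are.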

### Accuracy of Spherical Harmonic Approximations for Images of Lambertian Objects under Far and Near Lighting

Various problems in Computer Vision become difficult due to a strong influence of lighting on the images of an object. Recent work showed analytically that the set of all images of a convex, Lambertian object can be accurately approximated by the low-dimensional linear subspace constructed using spherical harmonic functions. In this paper we present two major contributions: first, we extend previous analysis of spherical harmonic approximation to the case of arbitrary objects; second, we analyze its applicability for near light. We begin by showing that under distant lighting, with uniform distribution of light sources, the average accuracy of spherical harmonic representation can be bounded from below. This bound holds for objects of arbitrary geometry and color, and for general illuminations (consisting of any number of light sources). We further examine the case when light is coming from above and provide an analytic expression for the accuracy obtained in this case. Finally, we show that low-dimensional representations using spherical harmonics provide an accurate approximation also for fairly near light. Our analysis assumes Lambertian reflectance and accounts for attached, but not for cast shadows. We support this analysis by simulations and real experiments, including an example of a 3D shape reconstruction by photometric stereo under very close, unknown lighting.

Darya Frolova, Denis Simakov, Ronen Basri

### Characterization of Human Faces under Illumination Variations Using Rank, Integrability, and Symmetry Constraints

Photometric stereo algorithms use a Lambertian reflectance model with a varying albedo field and involve the appearances of only one object. This paper extends photometric stereo algorithms to handle all the appearances of all the objects in a class, in particular the class of human faces. Similarity among all facial appearances motivates a rank constraint on the albedos and surface normals in the class. This leads to a factorization of an observation matrix that consists of exemplar images of different objects under different illuminations, which is beyond what can be analyzed using bilinear analysis. Bilinear analysis requires exemplar images of different objects under the same illuminations. To fully recover the class-specific albedos and surface normals, integrability and face symmetry constraints are employed. The proposed linear algorithm takes into account the effects of the varying albedo field by approximating the integrability terms using only the surface normals. As an application, face recognition under illumination variation is presented. The rank constraint enables an algorithm to separate the illumination source from the observed appearance and keep the illuminant-invariant information that is appropriate for recognition. Good recognition results have been obtained using the PIE dataset.

S. Kevin Zhou, Rama Chellappa, David W. Jacobs

### User Assisted Separation of Reflections from a Single Image Using a Sparsity Prior

When we take a picture through transparent glass the image we obtain is often a linear superposition of two images: the image of the scene beyond the glass plus the image of the scene reflected by the glass. Decomposing the single input image into two images is a massively ill-posed problem: in the absence of additional knowledge about the scene being viewed there are an infinite number of valid decompositions. In this paper we focus on an easier problem: user assisted separation in which the user interactively labels a small number of gradients as belonging to one of the layers. Even given labels on part of the gradients, the problem is still ill-posed and additional prior knowledge is needed. Following recent results on the statistics of natural images we use a sparsity prior over derivative filters. We first approximate this sparse prior with a Laplacian prior and obtain a simple, convex optimization problem. We then use the solution with the Laplacian prior as an initialization for a simple, iterative optimization for the sparsity prior. Our results show that using a prior derived from the statistics of natural images gives a far superior performance compared to a Gaussian prior and it enables good separations from a small number of labeled gradients.

Anat Levin, Yair Weiss

### The Quality of Catadioptric Imaging – Application to Omnidirectional Stereo

We investigate the influence of the mirror shape on the imaging quality of catadioptric sensors. For axially symmetrical mirrors we calculate the locations of the virtual image points considering incident quasi-parallel light rays. Using second order approximations, we give analytical expressions for the two limiting surfaces of this “virtual image zone”. This is different from numerical or ray tracing approaches for the estimation of the blur region, e.g. [1]. We show how these equations can be used to estimate the image blur caused by the shape of the mirror. As examples, we present two different omnidirectional stereo sensors with a single camera and equi-angular mirrors that are used on mobile robots. To obtain a larger stereo baseline, one of these sensors consists of two separate mirrors of the same angular magnification and differs from a similar configuration proposed by Ollis et al. [2]. We calculate the caustic surfaces and show that this stereo configuration can be approximated by two single view points yielding an effective vertical stereo baseline of approx. 3.7 cm. An example of panoramic disparity computation using a physiologically motivated stereo algorithm is given.

Wolfgang Stürzl, Hansjürgen Dahmen, Hanspeter A. Mallot

### Backmatter
