About this book

The 2010 edition of the European Conference on Computer Vision was held in Heraklion, Crete. The call for papers attracted an absolute record of 1,174 submissions. We describe here the selection of the accepted papers:

Thirty-eight Area Chairs were selected, coming from Europe (18), USA and Canada (16), and Asia (4). Their selection was based on the following criteria: (1) Researchers who had served at least two times as Area Chairs within the past two years at major vision conferences were excluded; (2) Researchers who served as Area Chairs at the 2010 Computer Vision and Pattern Recognition were also excluded (exception: ECCV 2012 Program Chairs); (3) Minimization of overlap introduced by Area Chairs being former students and advisors; (4) 20% of the Area Chairs had never served before in a major conference; (5) The Area Chair selection process made all possible efforts to achieve a reasonable geographic distribution between countries, thematic areas and trends in computer vision.

Each Area Chair was assigned between 28 and 32 papers by the Program Chairs. Based on paper content, the Area Chair recommended up to seven potential reviewers per paper. Such assignment was made using all reviewers in the database, including the conflicting ones. The Program Chairs manually entered the missing conflict domains of approximately 300 reviewers. Based on the recommendation of the Area Chairs, three reviewers were selected per paper (with at least one being of the top three suggestions), with 99.



Computational Imaging

Guided Image Filtering

In this paper, we propose a novel type of explicit image filter: the guided filter. Derived from a local linear model, the guided filter generates the filtering output by considering the content of a guidance image, which can be the input image itself or another different image. The guided filter can perform as an edge-preserving smoothing operator like the popular bilateral filter [1], but has better behavior near the edges. It also has a theoretical connection with the matting Laplacian matrix [2], so is a more generic concept than a smoothing operator and can better utilize the structures in the guidance image. Moreover, the guided filter has a fast and non-approximate linear-time algorithm, whose computational complexity is independent of the filtering kernel size. We demonstrate that the guided filter is both effective and efficient in a great variety of computer vision and computer graphics applications including noise reduction, detail smoothing/enhancement, HDR compression, image matting/feathering, haze removal, and joint upsampling.

Kaiming He, Jian Sun, Xiaoou Tang
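The linear-time property described in this abstract can be illustrated in one dimension. The following is a minimal sketch, assuming the usual box-filter formulation of the guided filter (per-window linear coefficients computed from windowed means, variances and covariances). The names `box_mean` and `guided_filter_1d` and the parameters `r` (window radius) and `eps` (regularization) are illustrative choices, not taken from the abstract.

```python
from itertools import accumulate

def box_mean(x, r):
    """Mean of x over the window [i - r, i + r], clipped at the borders.

    A prefix sum makes the cost O(n) regardless of the radius r.
    """
    n = len(x)
    prefix = [0.0] + list(accumulate(x))
    out = []
    for i in range(n):
        lo, hi = max(0, i - r), min(n, i + r + 1)
        out.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return out

def guided_filter_1d(guide, src, r, eps):
    """Filter src using guide under a local linear model q = a * guide + b."""
    mean_i = box_mean(guide, r)
    mean_p = box_mean(src, r)
    corr_ii = box_mean([g * g for g in guide], r)
    corr_ip = box_mean([g * s for g, s in zip(guide, src)], r)
    # Per-window coefficients: a = cov(I, p) / (var(I) + eps), b = mean_p - a * mean_I.
    a = [(cip - mi * mp) / (cii - mi * mi + eps)
         for cip, cii, mi, mp in zip(corr_ip, corr_ii, mean_i, mean_p)]
    b = [mp - ak * mi for mp, ak, mi in zip(mean_p, a, mean_i)]
    # Average the coefficients of all windows that cover each sample.
    mean_a = box_mean(a, r)
    mean_b = box_mean(b, r)
    return [ma * g + mb for ma, mb, g in zip(mean_a, mean_b, guide)]

# Self-guided smoothing of a step edge: flat regions stay flat, the edge stays sharp.
step = [0.0] * 10 + [1.0] * 10
smoothed = guided_filter_1d(step, step, r=2, eps=1e-4)
```

Because every stage is a box mean computed from prefix sums, increasing `r` changes only the window bounds and not the number of operations per sample.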

Analysis of Motion Blur with a Flutter Shutter Camera for Non-linear Motion

Motion blurs confound many computer vision problems. The fluttered shutter (FS) camera [1] tackles the motion deblurring problem by emulating invertible broadband blur kernels. However, existing FS methods assume known constant velocity motions, e.g., via user specifications. In this paper, we extend the FS technique to general 1D motions and develop an automatic motion-from-blur framework by analyzing the image statistics under the FS.

We first introduce a fluttered-shutter point-spread-function (FS-PSF) to uniformly model the blur kernel under general motions. We show that many commonly used motions have closed-form FS-PSFs. To recover the FS-PSF from the blurred image, we present a new method by analyzing image power spectrum statistics. We show that the Modulation Transfer Function of the 1D FS-PSF is statistically correlated to the blurred image power spectrum along the motion direction. We then recover the FS-PSF by finding the motion parameters that maximize the correlation. We demonstrate our techniques on a variety of motions including constant velocity, constant acceleration, and harmonic rotation. Experimental results show that our method can automatically and accurately recover the motion from the blurs captured under the fluttered shutter.

Yuanyuan Ding, Scott McCloskey, Jingyi Yu
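The "invertible broadband blur kernels" mentioned in this abstract can be made concrete with a toy comparison. The sketch below is only an illustration under simple assumptions: a conventional shutter is modeled as a box kernel, and the short binary code `[1, 1, 0, 1]` is a hypothetical example chosen for readability, not a code from the paper. The helper `dft_magnitudes` is likewise an assumed name.

```python
import cmath

def dft_magnitudes(kernel, n):
    """Magnitudes of the n-point DFT of a kernel zero-padded to length n."""
    padded = list(kernel) + [0.0] * (n - len(kernel))
    return [
        abs(sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, v in enumerate(padded)))
        for k in range(n)
    ]

# Conventional exposure: shutter open for four consecutive time slots.
box = [1, 1, 1, 1]
# Toy coded exposure (hypothetical): shutter open, open, closed, open.
coded = [1, 1, 0, 1]

box_min = min(dft_magnitudes(box, 16))
coded_min = min(dft_magnitudes(coded, 16))
```

The box kernel has an exact spectral zero (at frequency index 4 for this padding), so the corresponding image frequencies are lost and deconvolution is ill-posed. The toy code has no zeros on the unit circle, so its spectrum stays bounded away from zero at every frequency.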

Error-Tolerant Image Compositing

Gradient-domain compositing is an essential tool in computer vision and its applications, e.g., seamless cloning, panorama stitching, shadow removal, scene completion and reshuffling. While easy to implement, these gradient-domain techniques often generate bleeding artifacts where the composited image regions do not match. One option is to modify the region boundary to minimize such mismatches. However, this option may not always be sufficient or applicable, e.g., the user or algorithm may not allow the selection to be altered. We propose a new approach to gradient-domain compositing that is robust to inaccuracies and prevents color bleeding without changing the boundary location. Our approach improves standard gradient-domain compositing in two ways. First, we define the boundary gradients such that the produced gradient field is nearly integrable. Second, we control the integration process to concentrate residuals where they are less conspicuous. We show that our approach can be formulated as a standard least-squares problem that can be solved with a sparse linear system akin to the classical Poisson equation. We demonstrate results on a variety of scenes. The visual quality and run-time complexity compare favorably to other approaches.

Michael W. Tao, Micah K. Johnson, Sylvain Paris
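The idea of steering integration residuals toward less conspicuous places can be sketched in one dimension, where the weighted least-squares problem has a closed form. This is a simplified analogue under stated assumptions, not the authors' formulation: the signal has fixed endpoint values, the target gradients need not sum to the endpoint difference, and each gradient sample carries a hypothetical weight (high where errors would be visible). The function name `integrate_gradients_1d` is illustrative.

```python
def integrate_gradients_1d(grads, start, end, weights):
    """Minimize sum(w_i * (d_i - g_i)**2) subject to sum(d_i) == end - start.

    d_i are the output differences. A Lagrange-multiplier argument gives
    d_i = g_i + residual * (1 / w_i) / sum_j(1 / w_j), so the mismatch is
    distributed in proportion to the inverse weights.
    """
    residual = (end - start) - sum(grads)
    inverse = [1.0 / w for w in weights]
    total = sum(inverse)
    values = [start]
    for g, inv in zip(grads, inverse):
        values.append(values[-1] + g + residual * inv / total)
    return values

# A flat target gradient field that cannot match the endpoints 0 and 1.
uniform = integrate_gradients_1d([0, 0, 0, 0], 0.0, 1.0, [1, 1, 1, 1])
steered = integrate_gradients_1d([0, 0, 0, 0], 0.0, 1.0, [1000, 1000, 1, 1000])
```

With uniform weights the mismatch is smeared evenly across the signal, which corresponds to the bleeding seen in standard compositing. With a low weight on one sample, almost the whole jump is placed there and the rest of the signal stays close to its target gradients.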

Blind Reflectometry

We address the problem of inferring homogeneous reflectance (BRDF) from a single image of a known shape in an unknown real-world lighting environment. With appropriate representations of lighting and reflectance, the image provides bilinear constraints on the two signals, and our task is to blindly isolate the latter. We achieve this by leveraging the statistics of real-world illumination and estimating the reflectance that is most likely under a distribution of probable illumination environments. Experimental results with a variety of real and synthetic images suggest that useable reflectance information can be inferred in many cases, and that these estimates are stable under changes in lighting.

Fabiano Romeiro, Todd Zickler

Photometric Stereo for Dynamic Surface Orientations

We present a photometric stereo method for non-rigid objects of unknown and spatially varying materials. The prior art uses time-multiplexed illumination but assumes constant surface normals across several frames, fundamentally limiting the accuracy of the estimated normals. We explicitly account for time-varying surface orientations, and show that for unknown Lambertian materials, five images are sufficient to recover surface orientation in one frame. Our optimized system implementation exploits the physical properties of typical cameras and LEDs to reduce the required number of images to just three, and also facilitates frame-to-frame image alignment using standard optical flow methods, despite varying illumination. We demonstrate the system’s performance by computing surface orientations for several different moving, deforming objects.

Hyeongwoo Kim, Bennett Wilburn, Moshe Ben-Ezra

Fully Isotropic Fast Marching Methods on Cartesian Grids

The existing Fast Marching methods which are used to solve the Eikonal equation use a locally continuous model to estimate the accumulated cost, but a discontinuous (discretized) model for the traveling cost around each grid point. Because the accumulated cost and the traveling (local) cost are treated differently, the estimate of the accumulated cost at any point will vary based on the direction of the arriving front. Instead we propose to estimate the traveling cost at each grid point based on a locally continuous model, where we will interpolate the traveling cost along the direction of the propagating front. We further choose an interpolation scheme that is not biased by the direction of the front, thus making the fast marching process truly isotropic. We show the significance of removing the directional bias in the computation of the cost in certain applications of the fast marching method. We also compare the accuracy and computation times of our proposed methods with the existing state-of-the-art fast marching techniques to demonstrate the superiority of our method.

Vikram Appia, Anthony Yezzi

Spotlights and Posters M1

Descattering Transmission via Angular Filtering

We describe a single-shot method to differentiate unscattered and scattered components of light transmission through a heterogeneous translucent material. Directly-transmitted components travel in a straight line from the light source, while scattered components originate from multiple scattering centers in the volume. Computer vision methods deal with participating media via 2D contrast enhancing software techniques. On the other hand, optics techniques treat scattering as noise and use elaborate methods to reduce the scattering or its impact on the direct unscattered component. We observe the scattered component on its own provides useful information because the angular variation is low frequency. We propose a method to strategically capture angularly varying scattered light and compute the unscattered direct component. We capture the scattering from a single light source via a lenslet array placed close to the image plane. As an application, we demonstrate enhanced tomographic reconstruction of scattering objects using estimated direct transmission images.

Jaewon Kim, Douglas Lanman, Yasuhiro Mukaigawa, Ramesh Raskar

Flexible Voxels for Motion-Aware Videography

The goal of this work is to build video cameras whose spatial and temporal resolutions can be changed post-capture depending on the scene. Building such cameras is difficult for two reasons. First, current video cameras allow the same spatial resolution and frame rate for the entire captured spatio-temporal volume. Second, both these parameters are fixed before the scene is captured. We propose different components of video camera design: a sampling scheme, processing of captured data and hardware that offer post-capture variable spatial and temporal resolutions, independently at each image location. Using the motion information in the captured data, the correct resolution for each location is decided automatically. Our techniques make it possible to capture fast moving objects without motion blur, while simultaneously preserving high spatial resolution for static scene parts within the same video sequence. Our sampling scheme requires a fast per-pixel shutter on the sensor array, which we have implemented using a co-located camera-projector system.

Mohit Gupta, Amit Agrawal, Ashok Veeraraghavan, Srinivasa G. Narasimhan

Learning PDEs for Image Restoration via Optimal Control

Partial differential equations (PDEs) have been successfully applied to many computer vision and image processing problems. However, designing PDEs requires high mathematical skills and good insight into the problems. In this paper, we show that the design of PDEs could be made easier by borrowing the learning strategy from machine learning. In our learning-based PDE (L-PDE) framework for image restoration, there are two terms in our PDE model: (i) a regularizer which encodes the prior knowledge of the image model and (ii) a linear combination of differential invariants, which is data-driven and can effectively adapt to different problems and complex conditions. The L-PDE is learnt from some input/output pairs of training samples via an optimal control technique. The effectiveness of our L-PDE framework for image restoration is demonstrated with two exemplary applications: image denoising and inpainting, where the PDEs are obtained easily and the produced results are comparable to or better than those of traditional PDEs, which were elaborately designed.

Risheng Liu, Zhouchen Lin, Wei Zhang, Zhixun Su

Compressive Acquisition of Dynamic Scenes

Compressive sensing (CS) is a new approach for the acquisition and recovery of sparse signals and images that enables sampling rates significantly below the classical Nyquist rate. Despite significant progress in the theory and methods of CS, little headway has been made in compressive video acquisition and recovery. Video CS is complicated by the ephemeral nature of dynamic events, which makes direct extensions of standard CS imaging architectures and signal models infeasible. In this paper, we develop a new framework for video CS for dynamic textured scenes that models the evolution of the scene as a linear dynamical system (LDS). This reduces the video recovery problem to first estimating the model parameters of the LDS from compressive measurements, from which the image frames are then reconstructed. We exploit the low-dimensional dynamic parameters (the state sequence) and high-dimensional static parameters (the observation matrix) of the LDS to devise a novel compressive measurement strategy that measures only the dynamic part of the scene at each instant and accumulates measurements over time to estimate the static parameters. This enables us to lower the compressive measurement rate considerably. We validate our approach with a range of experiments including classification experiments that highlight the effectiveness of the proposed approach.

Aswin C. Sankaranarayanan, Pavan K. Turaga, Richard G. Baraniuk, Rama Chellappa

Scene Carving: Scene Consistent Image Retargeting

Image retargeting algorithms often create visually disturbing distortion. We introduce the property of scene consistency, which is held by images which contain no object distortion and have the correct object depth ordering. We present two new image retargeting algorithms that preserve scene consistency. These algorithms make use of a user-provided relative depth map, which can be created easily using a simple GrabCut-style interface. Our algorithms generalize seam carving. We decompose the image retargeting procedure into (a) removing image content with minimal distortion and (b) re-arrangement of known objects within the scene to maximize their visibility. Our algorithms optimize objectives (a) and (b) jointly. However, they differ considerably in how they achieve this. We discuss this in detail and present examples illustrating the rationale of preserving scene consistency in retargeting.

Alex Mansfield, Peter Gehler, Luc Van Gool, Carsten Rother
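Since the abstract states that the proposed algorithms generalize seam carving, it may help to recall the baseline. The following is a minimal sketch of classical seam carving only (a dynamic program for the minimum-energy vertical seam); it does not model depth ordering or object re-arrangement, and the function names `min_vertical_seam` and `remove_seam` are illustrative.

```python
def min_vertical_seam(energy):
    """Return one column index per row forming the minimum-energy vertical seam.

    Consecutive seam pixels may differ by at most one column.
    """
    rows, cols = len(energy), len(energy[0])
    cost = [list(energy[0])]
    for r in range(1, rows):
        prev = cost[-1]
        row = []
        for c in range(cols):
            lo, hi = max(0, c - 1), min(cols, c + 2)
            row.append(energy[r][c] + min(prev[lo:hi]))
        cost.append(row)
    # Backtrack from the cheapest cell in the last row.
    c = min(range(cols), key=cost[-1].__getitem__)
    seam = [c]
    for r in range(rows - 1, 0, -1):
        lo, hi = max(0, c - 1), min(cols, c + 2)
        c = min(range(lo, hi), key=cost[r - 1].__getitem__)
        seam.append(c)
    seam.reverse()
    return seam

def remove_seam(image, seam):
    """Delete the seam pixel from every row, narrowing the image by one column."""
    return [row[:c] + row[c + 1:] for row, c in zip(image, seam)]

energy = [
    [9, 1, 9],
    [9, 9, 1],
    [9, 1, 9],
]
seam = min_vertical_seam(energy)
narrowed = remove_seam(energy, seam)
```

In this toy energy map the seam follows the low-energy cells, so only the least important pixel of each row is removed. Plain seam carving has no notion of objects, which is the source of the distortions that scene-consistent retargeting aims to avoid.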

Two-Phase Kernel Estimation for Robust Motion Deblurring

We discuss a few new motion deblurring problems that are significant to kernel estimation and non-blind deconvolution. We found that strong edges do not always profit kernel estimation, but instead under certain circumstances degrade it. This finding leads to a new metric to measure the usefulness of image edges in motion deblurring and a gradient selection process to mitigate their possible adverse effect. We also propose an efficient and high-quality kernel estimation method based on using the spatial prior and the iterative support detection (ISD) kernel refinement, which avoids hard thresholding of the kernel elements to enforce sparsity. We employ the TV-ℓ₁ deconvolution model, solved with a new variable substitution scheme to robustly suppress noise.

Li Xu, Jiaya Jia

Single Image Deblurring Using Motion Density Functions

We present a novel single image deblurring method to estimate spatially non-uniform blur that results from camera shake. We use existing spatially invariant deconvolution methods in a local and robust way to compute initial estimates of the latent image. The camera motion is represented as a Motion Density Function (MDF) which records the fraction of time spent in each discretized portion of the space of all possible camera poses. Spatially varying blur kernels are derived directly from the MDF. We show that 6D camera motion is well approximated by 3 degrees of motion (in-plane translation and rotation) and analyze the scope of this approximation. We present results on both synthetic and captured data. Our system outperforms current approaches which make the assumption of spatially invariant blur.

Ankit Gupta, Neel Joshi, C. Lawrence Zitnick, Michael Cohen, Brian Curless

An Iterative Method with General Convex Fidelity Term for Image Restoration

We propose a convergent iterative regularization procedure based on the square of a dual norm for image restoration with general (quadratic or non-quadratic) convex fidelity terms. Convergent iterative regularization methods have been employed for image deblurring or denoising in the presence of Gaussian noise, which use L² [1] and L¹ [2] fidelity terms. Iusem-Resmerita [3] proposed a proximal point method using inexact Bregman distance for minimizing a general convex function defined on a general non-reflexive Banach space which is the dual of a separable Banach space. Based on this, we investigate several approaches for image restoration (denoising-deblurring) with different types of noise. We test the behavior of the proposed algorithms on synthetic and real images. We compare the results with other state-of-the-art iterative procedures as well as the corresponding existing one-step gradient descent implementations. The numerical experiments indicate that the iterative procedure yields high-quality reconstructions, with results superior to those obtained by one-step gradient descent and similar to those of other iterative methods.

Miyoun Jung, Elena Resmerita, Luminita Vese

One-Shot Optimal Exposure Control

We introduce an algorithm to estimate the optimal exposure parameters from the analysis of a single, possibly under- or over-exposed, image. This algorithm relies on a new quantitative measure of exposure quality, based on the average rendering error, that is, the difference between the original irradiance and its reconstructed value after processing and quantization. In order to estimate the exposure quality in the presence of saturated pixels, we fit a log-normal distribution to the brightness data, computed from the unsaturated pixels. Experimental results are presented comparing the estimated vs. “ground truth” optimal exposure parameters under various illumination conditions.

David Ilstrup, Roberto Manduchi

Analyzing Depth from Coded Aperture Sets

Computational depth estimation is a central task in computer vision and graphics. A large variety of strategies have been introduced in the past relying on viewpoint variations, defocus changes and general aperture codes. However, the tradeoffs between such designs are not well understood. Depth estimation from computational camera measurements is a highly non-linear process and therefore most research attempts to evaluate depth estimation strategies rely on numerical simulations. Previous attempts to design computational cameras with good depth discrimination optimized highly non-linear and non-convex scores, and hence it is not clear if the constructed designs are optimal. In this paper we address the problem of depth discrimination from J images captured using J arbitrary codes placed within one fixed lens aperture. We analyze the desired properties of discriminative codes under a geometric optics model and propose an upper bound on the best possible discrimination. We show that under a multiplicative noise model, the half ring codes discovered by Zhou et al. [1] are near-optimal. When a large number of images are allowed, a multi-aperture camera [2] dividing the aperture into multiple annular rings provides near-optimal discrimination. In contrast, the plenoptic camera of [5] which divides the aperture into compact support circles can achieve at most 50% of the optimal discrimination bound.

Anat Levin

We Are Family: Joint Pose Estimation of Multiple Persons

We present a novel multi-person pose estimation framework, which extends pictorial structures (PS) to explicitly model interactions between people and to estimate their poses jointly. Interactions are modeled as occlusions between people. First, we propose an occlusion probability predictor, based on the location of persons automatically detected in the image, and incorporate the predictions as occlusion priors into our multi-person PS model. Moreover, our model includes an inter-people exclusion penalty, preventing body parts from different people from occupying the same image region. Thanks to these elements, our model has a global view of the scene, resulting in better pose estimates in group photos, where several persons stand nearby and occlude each other. In a comprehensive evaluation on a new, challenging group photo dataset we demonstrate the benefits of our multi-person model over a state-of-the-art single-person pose estimator which treats each person independently.

Marcin Eichner, Vittorio Ferrari

Joint People, Event, and Location Recognition in Personal Photo Collections Using Cross-Domain Context

We present a framework for vision-assisted tagging of personal photo collections using context. Whereas previous efforts mainly focus on tagging people, we develop a unified approach to jointly tag across multiple domains (specifically people, events, and locations). The heart of our approach is a generic probabilistic model of context that couples the domains through a set of cross-domain relations. Each relation models how likely the instances in two domains are to co-occur. Based on this model, we derive an algorithm that simultaneously estimates the cross-domain relations and infers the unknown tags in a semi-supervised manner. We conducted experiments on two well-known datasets and obtained significant performance improvements in both people and location recognition. We also demonstrated the ability to infer event labels with missing timestamps (i.e. with no event features).

Dahua Lin, Ashish Kapoor, Gang Hua, Simon Baker

Chrono-Gait Image: A Novel Temporal Template for Gait Recognition

In this paper, we propose a novel temporal template, called Chrono-Gait Image (CGI), to describe the spatio-temporal walking pattern for human identification by gait. The CGI temporal template encodes the temporal information among gait frames via color mapping to improve the recognition performance. Our method starts with the extraction of the contour in each gait image, followed by utilizing a color mapping function to encode each of the gait contour images in the same gait sequence and compositing them into a single CGI. We also obtain the CGI-based real templates by generating a CGI for each period of one gait sequence and utilize contour distortion to generate the CGI-based synthetic templates. In addition to independent recognition using either of the individual templates, we combine the real and synthetic temporal templates to refine the performance of human recognition. Extensive experiments on the USF HumanID database indicate that, compared with recently published gait recognition approaches, our CGI-based approach attains better performance in gait recognition with considerable robustness to gait period detection.

Chen Wang, Junping Zhang, Jian Pu, Xiaoru Yuan, Liang Wang

Robust Face Recognition Using Probabilistic Facial Trait Code

Recently, Facial Trait Code (FTC) was proposed for solving face recognition, and was reported with promising recognition rates. However, several simplifications in the FTC encoding make it unable to handle the most rigorous face recognition scenario in which only one facial image per individual is available for enrollment in the gallery set and the probe set includes faces under variations caused by illumination, expression, pose or misalignment. In this study, we propose the Probabilistic Facial Trait Code (PFTC) with a novel encoding scheme and a probabilistic codeword distance measure. We also propose the Pattern-Specific Subspace Learning (PSSL) scheme that encodes and recognizes faces robustly under the aforementioned variations. The proposed PFTC was evaluated and compared with state-of-the-art algorithms, including the FTC, the algorithm using sparse representation, and the one using Local Binary Pattern. Our experimental study considered factors such as the number of enrollments allowed in the gallery and the variation among the gallery or probe set, and reported results for both identification and verification problems. The proposed PFTC yielded significantly better recognition rates in most of the scenarios than all the state-of-the-art algorithms evaluated in this study.

Ping-Han Lee, Gee-Sern Hsu, Szu-Wei Wu, Yi-Ping Hung

A 2D Human Body Model Dressed in Eigen Clothing

Detection, tracking, segmentation and pose estimation of people in monocular images are widely studied. Two-dimensional models of the human body are extensively used; however, they are typically fairly crude, representing the body either as a rough outline or in terms of articulated geometric primitives. We describe a new 2D model of the human body contour that combines an underlying naked body with a low-dimensional clothing model. The naked body is represented as a Contour Person that can take on a wide variety of poses and body shapes. Clothing is represented as a deformation from the underlying body contour. This deformation is learned from training examples using principal component analysis to produce eigen clothing. We find that the statistics of clothing deformations are skewed and we model the a priori probability of these deformations using a Beta distribution. The resulting generative model captures realistic human forms in monocular images and is used to infer 2D body shape and pose under clothing. We also use the coefficients of the eigen clothing to recognize different categories of clothing on dressed people. The method is evaluated quantitatively on synthetic and real images and achieves better accuracy than previous methods for estimating body shape under clothing.

Peng Guan, Oren Freifeld, Michael J. Black

Self-Adapting Feature Layers

This paper presents a new approach for fitting a 3D morphable model to images of faces, using self-adapting feature layers (SAFL). The algorithm integrates feature detection into an iterative analysis-by-synthesis framework, combining the robustness of feature search with the flexibility of model fitting. Templates for facial features are created and updated while the fitting algorithm converges, so the templates adapt to the pose, illumination, shape and texture of the individual face. Unlike most existing feature-based methods, the algorithm does not search for the image locations with maximum response, which may be prone to errors, but forms a tradeoff between feature likeness, global feature configuration and image reconstruction error.

The benefit of the proposed method is an increased robustness of model fitting with respect to errors in the initial feature point positions. Such residual errors are a problem when feature detection and model fitting are combined to form a fully automated face reconstruction or recognition system. We analyze the robustness in a face recognition scenario on images from two databases: FRGC and FERET.

Pia Breuer, Volker Blanz

Face Recognition with Patterns of Oriented Edge Magnitudes

This paper addresses the question of computationally inexpensive yet discriminative and robust feature sets for real-world face recognition. The proposed descriptor named Patterns of Oriented Edge Magnitudes (POEM) has desirable properties: POEM (1) is an oriented, spatial multi-resolution descriptor capturing rich information about the original image; (2) is a multi-scale self-similarity based structure that results in robustness to exterior variations; and (3) is of low complexity and is therefore practical for real-time applications. Briefly speaking, for every pixel, the POEM feature is built by applying a self-similarity based structure on oriented magnitudes, calculated by accumulating a local histogram of gradient orientations over all pixels of image cells, centered on the considered pixel. The robustness and discriminative power of the POEM descriptor are evaluated for face recognition on both constrained (FERET) and unconstrained (LFW) datasets. Experimental results show that our algorithm achieves better performance than the state-of-the-art representations. More impressively, the computational cost of extracting the POEM descriptor is so low that it runs around 20 times faster than just the first step of the methods based upon Gabor filters. Moreover, its data storage requirements are 13 and 27 times smaller than those of the LGBP (Local Gabor Binary Patterns) and HGPP (Histogram of Gabor Phase Patterns) descriptors respectively.

Ngoc-Son Vu, Alice Caplier

Spatial-Temporal Granularity-Tunable Gradients Partition (STGGP) Descriptors for Human Detection

This paper presents a novel descriptor for human detection in video sequences. It is referred to as spatial-temporal granularity-tunable gradients partition (STGGP), which is an extension of granularity-tunable gradients partition (GGP) from the still image domain to the spatial-temporal domain. Specifically, the moving human body is considered as a 3-dimensional entity in the spatial-temporal domain. Then in 3D Hough space, we define the generalized plane as a primitive to parse the structure of this 3D entity. The advantage of the generalized plane is that it can tolerate imperfect planes with a certain level of uncertainty in rotation and translation. The robustness to the uncertainty is controlled quantitatively by the granularity parameters defined explicitly in the generalized plane. This property endows the STGGP descriptors with a versatile ability to represent both the deterministic structures and the statistical summarizations of the object. Moreover, the STGGP descriptor encodes heterogeneous information such as the gradients’ strength, position, and distribution, as well as their temporal motion, to enrich its representation ability. We evaluate the STGGP on human detection in video sequences on public datasets, and very promising results have been achieved.

Yazhou Liu, Shiguang Shan, Xilin Chen, Janne Heikkila, Wen Gao, Matti Pietikainen

Being John Malkovich

Given a photo of person A, we seek a photo of person B with similar pose and expression. Solving this problem enables a form of puppetry, in which one person appears to control the face of another. When deployed on a webcam-equipped computer, our approach enables a user to control another person’s face in real-time. This image-retrieval-inspired approach employs a fully-automated pipeline of face analysis techniques, and is extremely general: we can puppet anyone directly from their photo collection or videos in which they appear. We show several examples using images and videos of celebrities from the Internet.

Ira Kemelmacher-Shlizerman, Aditya Sankar, Eli Shechtman, Steven M. Seitz

Facial Contour Labeling via Congealing

It is a challenging vision problem to discover non-rigid shape deformation for an image ensemble belonging to a single object class, in an automatic or semi-supervised fashion. The conventional semi-supervised approach [1] uses a congealing-like process to propagate manual landmark labels from a few images to a large ensemble. Although effective on an inter-person database with a large population, there is potential for increased labeling accuracy. With the goal of providing highly accurate labels, in this paper we present a parametric curve representation for each of the seven major facial contours. The appearance information along the curve, named curve descriptor, is extracted and used for congealing. Furthermore, we demonstrate that advanced features such as Histogram of Oriented Gradient (HOG) can be utilized in the proposed congealing framework, which operates in a dual-curve congealing manner for the case of a closed contour. With extensive experiments on a 300-image ensemble that exhibits moderate variation in facial pose and shape, we show that substantial progress has been achieved in the labeling accuracy compared to the previous state-of-the-art approach.

Xiaoming Liu, Yan Tong, Frederick W. Wheeler, Peter H. Tu

Cascaded Confidence Filtering for Improved Tracking-by-Detection

We propose a novel approach to increase the robustness of object detection algorithms in surveillance scenarios. The cascaded confidence filter successively incorporates constraints on the size of the objects, on the preponderance of the background and on the smoothness of trajectories. In fact, the continuous detection confidence scores are analyzed locally to adapt the generic detector to the specific scene. The approach does not learn specific object models, reason about complete trajectories or scene structure, nor use multiple cameras. Therefore, it can serve as a preprocessing step to robustify many tracking-by-detection algorithms. Our real-world experiments show significant improvements, especially in the case of partial occlusions, changing backgrounds, and similar distractors.

Severin Stalder, Helmut Grabner, Luc Van Gool

Inter-camera Association of Multi-target Tracks by On-Line Learned Appearance Affinity Models

We propose a novel system for associating multi-target tracks across multiple non-overlapping cameras by an on-line learned discriminative appearance affinity model. Collecting reliable training samples is a major challenge in on-line learning since supervised correspondence is not available at runtime. To alleviate the inevitable ambiguities in these samples, Multiple Instance Learning (MIL) is applied to learn an appearance affinity model which effectively combines three complementary image descriptors and their corresponding similarity measurements. Based on the spatial-temporal information and the proposed appearance affinity model, we present an improved inter-camera track association framework to solve the “target handover” problem across cameras. Our evaluations indicate that our method has higher discrimination between different targets than previous methods.

Cheng-Hao Kuo, Chang Huang, Ram Nevatia

Multi-person Tracking with Sparse Detection and Continuous Segmentation

This paper presents an integrated framework for mobile street-level tracking of multiple persons. In contrast to classic tracking-by-detection approaches, our framework employs an efficient level-set tracker in order to follow individual pedestrians over time. This low-level tracker is initialized and periodically updated by a pedestrian detector and is kept robust through a series of consistency checks. In order to cope with drift and to bridge occlusions, the resulting tracklet outputs are fed to a high-level multi-hypothesis tracker, which performs longer-term data association. This design has the advantage of simplifying short-term data association, resulting in higher-quality tracks that can be maintained even in situations where the pedestrian detector no longer yields good detections. In addition, it requires the pedestrian detector to be active only part of the time, resulting in computational savings. We quantitatively evaluate our approach on several challenging sequences and show that it achieves state-of-the-art performance.

Dennis Mitzel, Esther Horbert, Andreas Ess, Bastian Leibe

Closed-Loop Adaptation for Robust Tracking

Model updating is a critical problem in tracking. Inaccurate extraction of the foreground and background information in model adaptation would cause the model to drift and degrade the tracking performance. The most direct, yet difficult, solution to the drift problem is to obtain accurate boundaries of the target. We approach such a solution by proposing a novel closed-loop model adaptation framework based on the combination of matting and tracking. In our framework, the scribbles for matting are all automatically generated, which makes matting applicable in a tracking system. Meanwhile, accurate boundaries of the target can be obtained from matting results even when the target has large deformation. An effective model is further constructed and successfully updated based on such accurate boundaries. Extensive experiments show that our closed-loop adaptation scheme largely avoids model drift and significantly outperforms other discriminative tracking models as well as video matting approaches.

Jialue Fan, Xiaohui Shen, Ying Wu

Gaussian-Like Spatial Priors for Articulated Tracking

We present an analysis of the spatial covariance structure of an articulated motion prior in which joint angles have a known covariance structure. From this, a well-known, but often ignored, deficiency of the kinematic skeleton representation becomes clear: spatial variance not only depends on limb lengths, but also increases as the kinematic chains are traversed. We then present two similar Gaussian-like motion priors that are explicitly expressed spatially and as such avoid any variance coming from the representation. The resulting priors are both simple and easy to implement, yet they provide superior predictions.

Søren Hauberg, Stefan Sommer, Kim Steenstrup Pedersen

Dense Point Trajectories by GPU-Accelerated Large Displacement Optical Flow

Dense and accurate motion tracking is an important requirement for many video feature extraction algorithms. In this paper we provide a method for computing point trajectories based on a fast parallel implementation of a recent optical flow algorithm that tolerates fast motion. The parallel implementation of large displacement optical flow runs about 78× faster than the serial C++ version. This makes it practical to use in a variety of applications, among them point tracking. In the course of obtaining the fast implementation, we also proved that the fixed point matrix obtained in the optical flow technique is positive semi-definite. We compare the point tracking to the most commonly used motion tracker - the KLT tracker - on a number of sequences with ground truth motion. Our resulting technique tracks up to three orders of magnitude more points and is 46% more accurate than the KLT tracker. It also provides a tracking density of 48% and has an occlusion error of 3% compared to a density of 0.1% and occlusion error of 8% for the KLT tracker. Compared to the Particle Video tracker, we achieve 66% better accuracy, retain the ability to handle large displacements, and run an order of magnitude faster.

Narayanan Sundaram, Thomas Brox, Kurt Keutzer

Improving Data Association by Joint Modeling of Pedestrian Trajectories and Groupings

We consider the problem of data association in a multi-person tracking context. In semi-crowded environments, people are still discernible as individually moving entities that undergo many interactions with other people in their direct surroundings. Finding the correct association is therefore difficult, but higher-order social factors, such as group membership, are expected to ease the problem. However, estimating group membership is a chicken-and-egg problem: knowing pedestrian trajectories, it is rather easy to find out possible groupings in the data, but in crowded scenes, it is often difficult to estimate closely interacting trajectories without further knowledge about groups. To this end, we propose a third-order graphical model that is able to jointly estimate correct trajectories and group memberships over a short time window. A set of experiments on challenging data underlines the importance of joint reasoning for data association in crowded scenarios.

Stefano Pellegrini, Andreas Ess, Luc Van Gool

Globally Optimal Multi-target Tracking on a Hexagonal Lattice

We propose a global optimisation approach to multi-target tracking. The method extends recent work which casts tracking as an integer linear program, by discretising the space of target locations. Our main contribution is to show how dynamic models can be integrated in such an approach. The dynamic model, which encodes prior expectations about object motion, has been an important component of tracking systems for a long time, but has recently been dropped to achieve globally optimisable objective functions. We re-introduce it by formulating the optimisation problem such that deviations from the prior can be measured independently for each variable. Furthermore, we propose to sample the location space on a hexagonal lattice to achieve smoother, more accurate trajectories in spite of the discrete setting. Finally, we argue that non-maxima suppression in the measured evidence should be performed during tracking, when the temporal context and the motion prior are available, rather than as a preprocessing step on a per-frame basis. Experiments on five different recent benchmark sequences demonstrate the validity of our approach.

Anton Andriyenko, Konrad Schindler
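
The hexagonal sampling of the location space mentioned in the abstract above can be illustrated with a minimal sketch. The function names `hex_lattice` and `neighbours`, the rectangular bounds, and the `spacing` parameter are assumptions made for illustration and are not taken from the paper. The sketch shows only the lattice geometry, in which every interior site has six equidistant neighbours, and not the integer linear program itself.

```python
import math

def hex_lattice(width, height, spacing):
    """Return (x, y) sites of a hexagonal lattice covering [0, width] x [0, height]."""
    row_height = spacing * math.sqrt(3) / 2  # vertical distance between rows
    points = []
    row = 0
    while row * row_height <= height:
        # Odd rows are shifted by half a spacing, which gives the hexagonal layout.
        offset = spacing / 2 if row % 2 else 0.0
        col = 0
        while offset + col * spacing <= width:
            points.append((offset + col * spacing, row * row_height))
            col += 1
        row += 1
    return points

def neighbours(site, points, spacing, tol=1e-9):
    """Return the lattice sites that lie exactly one spacing away from `site`."""
    return [p for p in points if abs(math.dist(site, p) - spacing) < tol]

sites = hex_lattice(width=4, height=4, spacing=1.0)
# Pick an interior site (the lattice site closest to an arbitrary interior point).
centre = min(sites, key=lambda p: math.dist(p, (2.0, 1.7)))
centre_neighbours = neighbours(centre, sites, spacing=1.0)
```

On a square grid the four axis neighbours and the four diagonal neighbours lie at different distances, whereas here all six neighbours of an interior site are equally far away, which is one reason a hexagonal lattice can yield smoother discrete trajectories.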

Discriminative Spatial Attention for Robust Tracking

A major cause of tracking failure is spatial distractions that exhibit visual appearances similar to the target: they also generate good matches to the target and thus distract the tracker. It is in general very difficult to handle this situation. In a selective attention tracking paradigm, this paper advocates a new approach of discriminative spatial attention that identifies some special regions on the target, called attentional regions (ARs). The ARs show strong discriminative power in their discriminative domains where they do not observe similar things. This paper presents an efficient two-stage method that divides the discriminative domain into a local and a semi-local one. In the local domain, the visual appearance of an attentional region is locally linearized and its discriminative power is closely related to the property of the associated linear manifold, so that a gradient-based search is designed to locate the set of local ARs. Based on that, the set of semi-local ARs is identified through an efficient branch-and-bound procedure that guarantees optimality. Extensive experiments show that such discriminative spatial attention leads to superior performance in many challenging target tracking tasks.

Jialue Fan, Ying Wu, Shengyang Dai

Object, Scene and Actions: Combining Multiple Features for Human Action Recognition

In many cases, human actions can be identified not only by the singular observation of the human body in motion, but also by properties of the surrounding scene and the related objects. In this paper, we look into this problem and propose an approach for human action recognition that integrates multiple feature channels from several entities such as objects, scenes and people. We formulate the problem in a multiple instance learning (MIL) framework, based on multiple feature channels. By using a discriminative approach, we join multiple feature channels embedded in the MIL space. Our experiments over the large YouTube dataset show that scene and object information can be used to complement person features for human action recognition.

Nazli Ikizler-Cinbis, Stan Sclaroff

Representing Pairwise Spatial and Temporal Relations for Action Recognition

The popular bag-of-words paradigm for action recognition tasks is based on building histograms of quantized features, typically at the cost of discarding all information about relationships between them. However, although the beneficial nature of including these relationships seems obvious, in practice finding good representations for feature relationships in video is difficult. We propose a simple and computationally efficient method for expressing pairwise relationships between quantized features that combines the power of discriminative representations with key aspects of Naïve Bayes. We demonstrate how our technique can augment both appearance- and motion-based features, and that it significantly improves performance on both types of features.

Pyry Matikainen, Martial Hebert, Rahul Sukthankar

Compact Video Description for Copy Detection with Precise Temporal Alignment

This paper introduces a very compact yet discriminative video description, which allows example-based search in a large number of frames corresponding to thousands of hours of video. Our description extracts one descriptor per indexed video frame by aggregating a set of local descriptors. These frame descriptors are encoded using a time-aware hierarchical indexing structure. A modified temporal Hough voting scheme is used to rank the retrieved database videos and estimate segments in them that match the query. If we use a dense temporal description of the videos, matched video segments are localized with excellent precision.

Experimental results on the TRECVID 2008 copy detection task and a set of 38000 videos from YouTube show that our method offers an excellent trade-off between search accuracy, efficiency and memory usage.

Matthijs Douze, Hervé Jégou, Cordelia Schmid, Patrick Pérez
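
The abstract above refers to a modified temporal Hough voting scheme. The sketch below shows only the basic, unmodified idea and is not the authors' implementation: each frame match votes for a time offset between query and database video, and a true copy produces a peak because its frames share one offset. The tuple layout `(video_id, t_query, t_db, score)`, the function name `temporal_hough`, and `bin_width` are hypothetical.

```python
import math
from collections import defaultdict

def temporal_hough(matches, bin_width=1.0):
    """Accumulate frame-match scores per (video, time-offset bin), best first."""
    votes = defaultdict(float)
    for video_id, t_query, t_db, score in matches:
        # Frames of a true copy share one offset t_db - t_query, so their votes pile up.
        offset_bin = math.floor((t_db - t_query) / bin_width)
        votes[(video_id, offset_bin)] += score
    return sorted(votes.items(), key=lambda item: item[1], reverse=True)

# Hypothetical frame matches: "v1" is consistently shifted by 10 s, "v2" is not.
matches = [
    ("v1", 0.0, 10.0, 1.0),
    ("v1", 1.0, 11.0, 1.0),
    ("v1", 2.0, 12.0, 1.0),
    ("v2", 0.0, 3.0, 1.0),
    ("v2", 1.0, 7.0, 1.0),
]
ranking = temporal_hough(matches)
best_hypothesis, best_score = ranking[0]
```

The winning bin gives both the retrieved video and the temporal alignment; the query and database time stamps of the matches that voted for it delimit the matching segment.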

Modeling the Temporal Extent of Actions

In this paper, we present a framework for estimating what portions of videos are most discriminative for the task of action recognition. We explore the impact of the temporal cropping of training videos on the overall accuracy of an action recognition system, and we formalize what makes a set of croppings optimal. In addition, we present an algorithm to determine the best set of croppings for a dataset, and experimentally show that our approach increases the accuracy of various state-of-the-art action recognition techniques.

Scott Satkin, Martial Hebert

Content-Based Retrieval of Functional Objects in Video Using Scene Context

Functional object recognition in video is an emerging problem for visual surveillance and video understanding. By functional objects, we mean objects with a specific purpose, such as a postman or a delivery truck, which are defined more by their actions and behaviors than by appearance. In this work, we present an approach for content-based learning and recognition of the function of moving objects given video-derived tracks. In particular, we show that semantic behaviors of movers can be captured in a location-independent manner by attributing them with features which encode their relations and actions w.r.t. scene contexts. By scene context, we mean local scene regions with different functionalities, such as doorways and parking spots, which moving objects often interact with. Based on these representations, functional models are learned from examples and novel instances are then identified in unseen data. Furthermore, recognition in the presence of track fragmentation, due to imperfect tracking, is addressed by a boosting-based track linking classifier. Our experimental results highlight both promising and practical aspects of our approach.

Sangmin Oh, Anthony Hoogs, Matthew Turek, Roderic Collins

Anomalous Behaviour Detection Using Spatiotemporal Oriented Energies, Subset Inclusion Histogram Comparison and Event-Driven Processing

This paper proposes a novel approach to anomalous behaviour detection in video. The approach is comprised of three key components. First, distributions of spatiotemporal oriented energy are used to model behaviour. This representation can capture a wide range of naturally occurring visual spacetime patterns and has not previously been applied to anomaly detection. Second, a novel method is proposed for comparing an automatically acquired model of normal behaviour with new observations. The method accounts for situations when only a subset of the model is present in the new observation, as when multiple activities are acceptable in a region yet only one is likely to be encountered at any given instant. Third, event driven processing is employed to automatically mark portions of the video stream that are most likely to contain deviations from the expected and thereby focus computational efforts. The approach has been implemented with real-time performance. Quantitative and qualitative empirical evaluation on a challenging set of natural image videos demonstrates the approach’s superior performance relative to various alternatives.

Andrei Zaharescu, Richard Wildes

Tracklet Descriptors for Action Modeling and Video Analysis

We present spatio-temporal feature descriptors that can be inferred from video and used as building blocks in action recognition systems. They capture the evolution of “elementary action elements” under a set of assumptions on the image-formation model and are designed to be insensitive to nuisance variability (absolute position, contrast), while retaining discriminative statistics due to the fine-scale motion and the local shape in compact regions of the image. Despite their simplicity, these descriptors, used in conjunction with basic classifiers, attain state-of-the-art performance in the recognition of actions in benchmark datasets.

Michalis Raptis, Stefano Soatto

Word Spotting in the Wild

We present a method for spotting words in the wild, i.e., in real images taken in unconstrained environments. Text found in the wild has a surprising range of difficulty. At one end of the spectrum, Optical Character Recognition (OCR) applied to scanned pages of well formatted printed text is one of the most successful applications of computer vision to date. At the other extreme lie visual CAPTCHAs – text that is constructed explicitly to fool computer vision algorithms. Both tasks involve recognizing text, yet one is nearly solved while the other remains extremely challenging. In this work, we argue that the appearance of words in the wild spans this range of difficulties and propose a new word recognition approach based on state-of-the-art methods from generic object recognition, in which we consider object categories to be the words themselves. We compare performance of leading OCR engines – one open source and one proprietary – with our new approach on the ICDAR Robust Reading data set and a new word spotting data set we introduce in this paper: the Street View Text data set. We show improvements of up to 16% on the data sets, demonstrating the feasibility of a new approach to a seemingly old problem.

Kai Wang, Serge Belongie

A Stochastic Graph Evolution Framework for Robust Multi-target Tracking

Maintaining the stability of tracks on multiple targets in video over extended time periods remains a challenging problem. A few methods which have recently shown encouraging results in this direction rely on learning context models or the availability of training data. However, this may not be feasible in many application scenarios. Moreover, tracking methods should be able to work across different scenarios (e.g. multiple resolutions of the video) making such context models hard to obtain. In this paper, we consider the problem of long-term tracking in video in application domains where context information is not available a priori, nor can it be learned online. We build our solution on the hypothesis that most existing trackers can obtain reasonable short-term tracks (tracklets). By analyzing the statistical properties of these tracklets, we develop associations between them so as to come up with longer tracks. This is achieved through a stochastic graph evolution step that considers the statistical properties of individual tracklets, as well as the statistics of the targets along each proposed long-term track. On multiple real-life video sequences spanning low and high resolution data, we show the ability to accurately track over extended time periods (results are shown on many minutes of continuous video).

Bi Song, Ting-Yueh Jeng, Elliot Staudt, Amit K. Roy-Chowdhury

Spotlights and Posters M2

Backprojection Revisited: Scalable Multi-view Object Detection and Similarity Metrics for Detections

Hough transform based object detectors learn a mapping from the image domain to a Hough voting space. Within this space, object hypotheses are formed by local maxima. The votes contributing to a hypothesis are called support. In this work, we investigate the use of the support and its backprojection to the image domain for multi-view object detection. To this end, we create a shared codebook with training and matching complexities independent of the number of quantized views. We show that, since backprojection encodes enough information about the viewpoint, all views can be handled together. In our experiments, we demonstrate that superior accuracy and efficiency can be achieved in comparison to the popular one-vs-the-rest detectors by treating views jointly, especially with few training examples and no view annotations. Furthermore, we go beyond the detection case and, based on the support, introduce a part-based similarity measure between two arbitrary detections which naturally takes spatial relationships of parts into account and is insensitive to partial occlusions. We also show that backprojection can be used to efficiently measure the similarity of a detection to all training examples. Finally, we demonstrate how these metrics can be used to estimate continuous object parameters like human pose and object viewpoint. In our experiments, we achieve state-of-the-art performance for view-classification on the PASCAL VOC’06 dataset.

Nima Razavi, Juergen Gall, Luc Van Gool

Multiple Instance Metric Learning from Automatically Labeled Bags of Faces

Metric learning aims at finding a distance that approximates a task-specific notion of semantic similarity. Typically, a Mahalanobis distance is learned from pairs of data labeled as being semantically similar or not. In this paper, we learn such metrics in a weakly supervised setting where “bags” of instances are labeled with “bags” of labels. We formulate the problem as a multiple instance learning (MIL) problem over pairs of bags. If two bags share at least one label, we label the pair positive, and negative otherwise. We propose to learn a metric using those labeled pairs of bags, leading to MildML, for multiple instance logistic discriminant metric learning. MildML iterates between updates of the metric and selection of putative positive pairs of examples from positive pairs of bags. To evaluate our approach, we introduce a large and challenging data set, Labeled Yahoo! News, which we have manually annotated and which contains 31147 detected faces of 5873 different people in 20071 images. We group the faces detected in an image into a bag, and group the names detected in the caption into a corresponding set of labels. When the labels come from manual annotation, we find that MildML using the bag-level annotation performs as well as fully supervised metric learning using instance-level annotation. We also consider performance in the case of automatically extracted labels for the bags, where some of the bag labels do not correspond to any example in the bag. In this case MildML works substantially better than relying on noisy instance-level annotations derived from the bag-level annotation by resolving face-name associations in images with their captions.

Matthieu Guillaumin, Jakob Verbeek, Cordelia Schmid
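
The bag-pair labeling rule stated in the abstract above (a pair of bags is positive if the two bags share at least one label, negative otherwise) is simple enough to sketch directly. The function name `label_bag_pairs` and the example names are hypothetical; the metric update and the selection of putative positive instance pairs are not shown.

```python
from itertools import combinations

def label_bag_pairs(bag_labels):
    """Map each unordered pair of bag indices to True (positive) or False (negative)."""
    return {
        (i, j): bool(set(bag_labels[i]) & set(bag_labels[j]))
        for i, j in combinations(range(len(bag_labels)), 2)
    }

# Hypothetical caption names for three images; each set labels one bag of faces.
caption_names = [{"Alice", "Bob"}, {"Bob"}, {"Carol"}]
pair_labels = label_bag_pairs(caption_names)
```

Note that a positive pair of bags only says that some pair of faces across the two images shows the same person, not which one, which is why the learning problem is a multiple instance one.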

Partition Min-Hash for Partial Duplicate Image Discovery

In this paper, we propose Partition min-Hash (PmH), a novel hashing scheme for discovering partial duplicate images from a large database. Unlike the standard min-Hash algorithm that assumes a bag of words image representation, our approach utilizes the fact that duplicate regions among images are often localized. By theoretical analysis, simulation, and empirical study, we show that PmH outperforms standard min-Hash in terms of precision and recall, while being orders of magnitude faster. When combined with the state-of-the-art Geometric min-Hash algorithm, our approach speeds up hashing by 10 times without losing precision or recall. When given a fixed time budget, our method achieves much higher recall than the state of the art.

David C. Lee, Qifa Ke, Michael Isard
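
For readers unfamiliar with min-Hash, the following is a minimal sketch of the standard scheme on sets of visual-word ids, together with a simplified per-partition variant. It assumes that hash functions are modeled as random permutations of the vocabulary and that an image has already been divided into partitions, each given as a set of word ids. The function names are hypothetical and the partition variant is an illustration of the localization idea, not the PmH algorithm as published.

```python
import random

def make_permutations(num_funcs, vocab_size, seed=0):
    """Model each hash function as a random permutation of the visual vocabulary."""
    rng = random.Random(seed)
    perms = []
    for _ in range(num_funcs):
        perm = list(range(vocab_size))
        rng.shuffle(perm)
        perms.append(perm)
    return perms

def min_hash(words, perms):
    """Sketch of a word set: the smallest permuted id under each hash function."""
    return tuple(min(perm[w] for w in words) for perm in perms)

def partition_min_hash(partitions, perms):
    """One sketch per non-empty partition, all using the same hash functions."""
    return [min_hash(words, perms) for words in partitions if words]

perms = make_permutations(num_funcs=3, vocab_size=100)
# Two hypothetical images that share one localized region (words 1, 2, 3).
image_a = [{1, 2, 3}, {10, 11}]
image_b = [{40, 41}, {1, 2, 3}]
sketches_a = partition_min_hash(image_a, perms)
sketches_b = partition_min_hash(image_b, perms)
collisions = set(sketches_a) & set(sketches_b)
```

Because the shared region falls entirely inside one partition of each image, its sketch collides exactly, whereas a single sketch over the whole bag of words would also depend on the unrelated words in the rest of the image.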

Automatic Attribute Discovery and Characterization from Noisy Web Data

It is common to use domain-specific terminology – attributes – to describe the visual appearance of objects. In order to scale the use of these describable visual attributes to a large number of categories, especially those not well studied by psychologists or linguists, it will be necessary to find alternative techniques for identifying attribute vocabularies and for learning to recognize attributes without hand-labeled training data. We demonstrate that it is possible to accomplish both these tasks automatically by mining text and image data sampled from the Internet. The proposed approach also characterizes attributes according to their visual representation: global or local, and type: color, texture, or shape. This work focuses on discovering attributes and their visual appearance, and is as agnostic as possible about the textual description.

Tamara L. Berg, Alexander C. Berg, Jonathan Shih

Learning to Recognize Objects from Unseen Modalities

In this paper we investigate the problem of exploiting multiple sources of information for object recognition tasks when additional modalities that are not present in the labeled training set are available for inference. This scenario is common to many robotics sensing applications and is in contrast with the assumption made by existing approaches that require at least some labeled examples for each modality. To leverage the previously unseen features, we make use of the unlabeled data to learn a mapping from the existing modalities to the new ones. This allows us to predict the missing data for the labeled examples and exploit all modalities using multiple kernel learning. We demonstrate the effectiveness of our approach on several multi-modal tasks including object recognition from multi-resolution imagery, grayscale and color images, as well as images and text. Our approach outperforms multiple kernel learning on the original modalities, as well as nearest-neighbor and bootstrapping schemes.

C. Mario Christoudias, Raquel Urtasun, Mathieu Salzmann, Trevor Darrell

Building Compact Local Pairwise Codebook with Joint Feature Space Clustering

This paper presents a simple, yet effective method of building a codebook for pairs of spatially close SIFT descriptors. Integrating such a codebook into the popular bag-of-words model encodes local spatial information which otherwise cannot be represented with just individual SIFT descriptors. Many previous pairing techniques first quantize the descriptors to learn a set of visual words before they are actually paired. Our approach contrasts with theirs in that each pair of spatially close descriptors is represented as a data point in a joint feature space first and then clustering is applied to build a codebook called Local Pairwise Codebook (LPC). It is advantageous over the previous approaches in that feature selection over a quadratic number of possible pairs of visual words is not required and feature aggregation is implicitly performed to achieve a compact codebook. This is all done in an unsupervised manner. Experimental results on challenging datasets, namely 15 Scenes, 67 Indoors, Caltech-101, Caltech-256 and MSRCv2 demonstrate that LPC outperforms the baselines and performs competitively against the state-of-the-art techniques in scene and object categorization tasks where a large number of categories need to be recognized.

Nobuyuki Morioka, Shin’ichi Satoh
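
The pair-first, cluster-second order described in the abstract above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: descriptors are short tuples rather than 128-dimensional SIFT vectors, "spatially close" is taken to mean within a fixed `radius`, the two descriptors of a pair are sorted so that the joint feature does not depend on pair order, and a plain k-means stands in for whatever clustering the paper uses. All function names are hypothetical.

```python
import math
import random

def pair_descriptors(keypoints, descriptors, radius):
    """Concatenate descriptors of every keypoint pair closer than `radius`."""
    joint = []
    for i in range(len(keypoints)):
        for j in range(i + 1, len(keypoints)):
            if math.dist(keypoints[i], keypoints[j]) <= radius:
                # Sorting is an assumption: it makes the joint feature order-invariant.
                first, second = sorted((tuple(descriptors[i]), tuple(descriptors[j])))
                joint.append(first + second)
    return joint

def kmeans(points, k, iters=10, seed=0):
    """Plain k-means on tuples; the returned centres play the role of the codebook."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centres[c]))
            clusters[nearest].append(p)
        # Recompute each centre as the mean of its cluster; keep it if the cluster is empty.
        centres = [
            tuple(sum(col) / len(cluster) for col in zip(*cluster)) if cluster else centres[c]
            for c, cluster in enumerate(clusters)
        ]
    return centres

keypoints = [(0, 0), (1, 0), (50, 50), (51, 50)]
descriptors = [(0.0, 0.0), (0.1, 0.0), (1.0, 1.0), (0.9, 1.0)]
joint_features = pair_descriptors(keypoints, descriptors, radius=2)
codebook = kmeans(joint_features, k=2)
```

Clustering in the joint space means that no vocabulary of single visual words has to be learned first, so there is no quadratic set of word pairs to select from.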

Image-to-Class Distance Metric Learning for Image Classification

Image-To-Class (I2C) distance was first used in the Naive-Bayes Nearest-Neighbor (NBNN) classifier for image classification and has successfully handled datasets with large intra-class variances. However, the performance of this distance relies heavily on a large number of local features in the training set and test image, which incurs a heavy computation cost for nearest-neighbor (NN) search in the testing phase. If a small number of local features is used to accelerate the NN search, the performance is poor.

In this paper, we propose a large margin framework to improve the discrimination of I2C distance, especially for a small number of local features, by learning Per-Class Mahalanobis metrics. Our I2C distance adapts to each class by incorporating the metric learned for that class. These multiple Per-Class metrics are learned simultaneously by forming a convex optimization problem with the constraints that the I2C distance from each training image to the class it belongs to should be less than the distance to other classes by a large margin. A gradient descent method is applied to efficiently solve this optimization problem. To improve efficiency and performance, we also adopt the ideas of spatial pyramid restriction and learning an I2C distance function. We show in experiments that the proposed method can significantly outperform the original NBNN on several prevalent image datasets, and our best results achieve state-of-the-art performance on most datasets.

Zhengxiang Wang, Yiqun Hu, Liang-Tien Chia
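
As a minimal sketch of the quantity discussed in the abstract above, the following computes an image-to-class distance as the sum, over the local features of an image, of the squared Mahalanobis distance to the nearest feature of a class, with one metric matrix per class. The metrics here are simply set to the identity, which reduces to the Euclidean NBNN case; learning them with the large margin framework is not shown. Searching the nearest neighbour under the class metric, as well as all function and variable names, are assumptions made for illustration.

```python
def mahalanobis_sq(x, y, metric):
    """Squared Mahalanobis distance (x - y)^T M (x - y) for a square matrix M."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    return sum(diff[i] * metric[i][j] * diff[j] for i in range(n) for j in range(n))

def i2c_distance(image_feats, class_feats, metric):
    """Sum over image features of the distance to the nearest feature of the class."""
    return sum(
        min(mahalanobis_sq(f, c, metric) for c in class_feats)
        for f in image_feats
    )

def classify(image_feats, class_feats_by_name, metric_by_name):
    """Assign the image to the class with the smallest I2C distance."""
    return min(
        class_feats_by_name,
        key=lambda name: i2c_distance(
            image_feats, class_feats_by_name[name], metric_by_name[name]
        ),
    )

identity = [[1.0, 0.0], [0.0, 1.0]]  # placeholder for a learned per-class metric
training = {"a": [(0.0, 0.1), (1.0, 0.1)], "b": [(5.0, 5.0)]}
metrics = {"a": identity, "b": identity}
query = [(0.0, 0.0), (1.0, 0.0)]
predicted = classify(query, training, metrics)
```

The nested minimum makes the cost of the nearest-neighbour search proportional to the number of local features on both sides, which is the cost the abstract seeks to reduce by working with fewer features.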

Extracting Structures in Image Collections for Object Recognition

Many computer vision methods rely on annotated image sets without taking advantage of the increasing number of unlabeled images available. This paper explores an alternative approach involving unsupervised structure discovery and semi-supervised learning (SSL) in image collections. Focusing on object classes, the first part of the paper contributes an extensive evaluation of state-of-the-art image representations. Thus, it underlines the decisive influence of the local neighborhood structure and its direct consequences on SSL results and the importance of developing powerful object representations. In a second part, we propose and explore promising directions to improve results by looking at the local topology between images and feature combination strategies.

Sandra Ebert, Diane Larlus, Bernt Schiele

Size Does Matter: Improving Object Recognition and 3D Reconstruction with Cross-Media Analysis of Image Clusters

Most of the recent work on image-based object recognition and 3D reconstruction has focused on improving the underlying algorithms. In this paper we present a method to automatically improve the quality of the reference database, which, as we will show, also affects recognition and reconstruction performance significantly. Starting out from a reference database of clustered images we expand small clusters. This is done by exploiting cross-media information, which allows for crawling of additional images. For large clusters redundant information is removed by scene analysis. We show how these techniques make object recognition and 3D reconstruction both more efficient and more precise - we observed up to 14.8% improvement for the recognition task. Furthermore, the methods are completely data-driven and fully automatic.

Stephan Gammeter, Till Quack, David Tingdahl, Luc van Gool

Avoiding Confusing Features in Place Recognition

We seek to recognize the place depicted in a query image using a database of “street side” images annotated with geolocation information. This is a challenging task due to changes in scale, viewpoint and lighting between the query and the images in the database. One of the key problems in place recognition is the presence of objects such as trees or road markings, which frequently occur in the database and hence cause significant confusion between different places. As the main contribution, we show how to avoid features leading to confusion of particular places by using geotags attached to database images as a form of supervision. We develop a method for automatic detection of image-specific and spatially-localized groups of confusing features, and demonstrate that suppressing them significantly improves place recognition performance while reducing the database size. We show the method combines well with the state-of-the-art bag-of-features model including query expansion, and demonstrate place recognition that generalizes over a wide range of viewpoints and lighting conditions. Results are shown on a geotagged database of over 17K images of Paris downloaded from Google Street View.

Jan Knopp, Josef Sivic, Tomas Pajdla

Semantic Label Sharing for Learning with Many Categories

In an object recognition scenario with tens of thousands of categories, even a small number of labels per category leads to a very large number of total labels required. We propose a simple method of label sharing between semantically similar categories. We leverage the WordNet hierarchy to define semantic distance between any two categories and use this semantic distance to share labels. Our approach can be used with any classifier. Experimental results on a range of datasets, up to 80 million images and 75,000 categories in size, show that despite the simplicity of the approach, it leads to significant improvements in performance.

Rob Fergus, Hector Bernal, Yair Weiss, Antonio Torralba
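To make the label sharing idea in the abstract above concrete, here is a minimal sketch. Everything in it is an assumption for illustration and not the authors' implementation: the toy `PARENT` tree stands in for the WordNet hierarchy, the similarity is taken as the normalised overlap of root-to-node paths, and the exponential weighting with parameter `kappa` is one plausible way to turn similarity into a shared soft label.

```python
import math

# Hypothetical toy hierarchy (child -> parent); a stand-in for WordNet.
PARENT = {
    "entity": None,
    "animal": "entity",
    "vehicle": "entity",
    "dog": "animal",
    "wolf": "animal",
    "car": "vehicle",
}


def path_to_root(category):
    """Return the nodes from the root down to `category`."""
    path = []
    node = category
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path[::-1]


def semantic_similarity(a, b):
    """Length of the shared root path prefix, normalised to [0, 1]."""
    pa, pb = path_to_root(a), path_to_root(b)
    shared = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        shared += 1
    return shared / max(len(pa), len(pb))


def shared_labels(labelled_category, categories, kappa=10.0):
    """Soft label weight for every category, given one labelled example.

    The labelled category gets weight 1; semantically distant
    categories decay towards 0 (kappa is an assumed tuning parameter).
    """
    return {
        c: math.exp(-kappa * (1.0 - semantic_similarity(labelled_category, c)))
        for c in categories
    }
```

With this toy tree, a single "dog" label contributes more strongly to "wolf" than to "car", so the resulting soft label vector can be fed to any classifier that accepts real-valued targets.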

Efficient Object Category Recognition Using Classemes

We introduce a new descriptor for images which allows the construction of efficient and compact classifiers with good accuracy on object category recognition. The descriptor is the output of a large number of weakly trained object category classifiers on the image. The trained categories are selected from an ontology of visual concepts, but the intention is not to encode an explicit decomposition of the scene. Rather, we accept that existing object category classifiers often encode not the category per se but ancillary image characteristics; and that these ancillary characteristics can combine to represent visual classes unrelated to the constituent categories’ semantic meanings.

The advantage of this descriptor is that it allows object-category queries to be made against image databases using efficient classifiers (efficient at test time) such as linear support vector machines, and allows these queries to be for novel categories. Even when the representation is reduced to 200 bytes per image, classification accuracy on object category recognition is comparable with the state of the art (36% versus 42%), but at orders of magnitude lower computational cost.

Lorenzo Torresani, Martin Szummer, Andrew Fitzgibbon
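The descriptor described in the abstract above can be illustrated with a small sketch. It is not the authors' code: the weak classifiers are assumed here to be simple linear scorers given as hypothetical `(weights, bias)` pairs, their outputs are thresholded to one bit each, and the bits are packed into bytes to show how such a descriptor can be stored compactly. A novel category is then scored with a single dot product over the descriptor.

```python
def classeme_descriptor(features, weak_classifiers):
    """Binary descriptor: one bit per weak classifier (1 if score > 0)."""
    bits = []
    for weights, bias in weak_classifiers:
        score = sum(w * f for w, f in zip(weights, features)) + bias
        bits.append(1 if score > 0 else 0)
    return bits


def pack_bits(bits):
    """Pack a list of 0/1 values into bytes, most significant bit first."""
    padded = bits + [0] * (-len(bits) % 8)
    out = bytearray()
    for i in range(0, len(padded), 8):
        byte = 0
        for bit in padded[i:i + 8]:
            byte = (byte << 1) | bit
        out.append(byte)
    return bytes(out)


def linear_score(descriptor, weights, bias=0.0):
    """Test-time cost of a linear classifier: one dot product."""
    return sum(w * d for w, d in zip(weights, descriptor)) + bias


# Hypothetical weak classifiers over a 2-D low-level feature vector.
weak = [([1.0, 0.0], 0.0), ([0.0, 1.0], -1.0), ([-1.0, 1.0], 0.0)]
descriptor = classeme_descriptor([0.5, 0.2], weak)
packed = pack_bits(descriptor)
```

As a matter of arithmetic, 1,600 binary outputs pack into 200 bytes; how many classifiers the paper actually uses is not stated in the abstract.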

Practical Autocalibration

As has been noted several times in the literature, the difficult part of autocalibration efforts resides in the structural non-linearity of the search for the plane at infinity. In this paper we present a robust and versatile autocalibration method based on the enumeration of the inherently bounded space of the intrinsic parameters of two cameras in order to find the collineation of space that upgrades a given projective reconstruction to Euclidean. Each sample of the search space (which reduces to a finite subset of ℝ² under mild assumptions) defines a consistent plane at infinity. This in turn produces a tentative, approximate Euclidean upgrade of the whole reconstruction which is then scored according to the expected intrinsic parameters of a Euclidean camera. This approach has been compared with several other algorithms on both synthetic and concrete cases, obtaining favourable results.

Riccardo Gherardi, Andrea Fusiello

