2008 | Book

Computer Vision – ECCV 2008

10th European Conference on Computer Vision, Marseille, France, October 12-18, 2008, Proceedings, Part IV

Edited by: David Forsyth, Philip Torr, Andrew Zisserman

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science

About this Book

Welcome to the 2008 European Conference on Computer Vision. These proceedings are the result of a great deal of hard work by many people. To produce them, a total of 871 papers were reviewed. Forty were selected for oral presentation and 203 were selected for poster presentation, yielding acceptance rates of 4.6% for oral, 23.3% for poster, and 27.9% in total.

We applied three principles. First, since we had a strong group of Area Chairs, the final decisions to accept or reject a paper rested with the Area Chair, who would be informed by reviews and could act only in consensus with another Area Chair. Second, we felt that authors were entitled to a summary that explained how the Area Chair reached a decision for a paper. Third, we were very careful to avoid conflicts of interest. Each paper was assigned to an Area Chair by the Program Chairs, and each Area Chair received a pool of about 25 papers. The Area Chairs then identified and ranked appropriate reviewers for each paper in their pool, and a constrained optimization allocated three reviewers to each paper. We are very proud that every paper received at least three reviews. At this point, authors were able to respond to reviews. The Area Chairs then needed to reach a decision. We used a series of procedures to ensure careful review and to avoid conflicts of interest. Program Chairs did not submit papers. The Area Chairs were divided into three groups so that no Area Chair in the group was in conflict with any paper assigned to any Area Chair in the group.
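
The acceptance figures quoted above can be checked directly; a quick sanity check of the arithmetic (using only the totals stated in the preface):

```python
# Sanity check of the acceptance rates quoted in the preface.
reviewed = 871
oral, poster = 40, 203

print(f"oral:   {oral / reviewed:.1%}")              # ~4.6%
print(f"poster: {poster / reviewed:.1%}")            # ~23.3%
print(f"total:  {(oral + poster) / reviewed:.1%}")   # ~27.9%
```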

Table of Contents

Frontmatter

Segmentation

Image Segmentation in the Presence of Shadows and Highlights

The segmentation method proposed in this paper is based on the observation that a single physical reflectance can have many different image values. We call the set of all these values a dominant colour. These variations are caused by shadows, shading and highlights, and by varying object geometry. The main idea is that dominant colours trace connected ridges in the chromatic histogram. To capture them, we propose a new Ridge based Distribution Analysis (RAD) to find the set of ridges representative of the dominant colour. First, a multilocal creaseness technique followed by a ridge extraction algorithm is proposed. Afterwards, a flooding procedure is performed to find the dominant colours in the histogram. Qualitative results illustrate the ability of our method to obtain excellent results in the presence of shadow and highlight edges. Quantitative results obtained on the Berkeley data set show that our method outperforms state-of-the-art segmentation methods at low computational cost.

Eduard Vazquez, Joost van de Weijer, Ramon Baldrich
Image Segmentation by Branch-and-Mincut

Efficient global optimization techniques such as graph cut exist for energies corresponding to binary image segmentation from low-level cues. However, introducing a high-level prior such as a shape prior or a color-distribution prior into the segmentation process typically results in an energy that is much harder to optimize. The main contribution of the paper is a new global optimization framework for a wide class of such energies. The framework is built upon two powerful techniques: graph cut and branch-and-bound. These techniques are unified through the derivation of lower bounds on the energies. Being computable via graph cut, these bounds are used to prune branches within a branch-and-bound search.

We demonstrate that the new framework can compute globally optimal segmentations for a variety of segmentation scenarios in a reasonable time on a modern CPU. These scenarios include unsupervised segmentation of an object undergoing 3D pose change, category-specific shape segmentation, and the segmentation under intensity/color priors defined by Chan-Vese and GrabCut functionals.
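
A minimal sketch of the branch-and-bound control loop described above, with the graph-cut-computable lower bound left as a placeholder (`lower_bound`, `split`, `is_leaf` and `leaf_energy` are hypothetical callables standing in for the paper's machinery, not the authors' code):

```python
import heapq

def branch_and_mincut(root, lower_bound, split, is_leaf, leaf_energy):
    """Best-first branch-and-bound over a space of high-level priors.

    `lower_bound(node)` must underestimate the best energy in the node's
    subtree (in the paper such bounds are evaluated with a single graph
    cut); `split(node)` partitions a node into child subsets of priors.
    """
    best_energy, best_leaf = float("inf"), None
    heap = [(lower_bound(root), 0, root)]   # (bound, tie-breaker, node)
    tie = 1
    while heap:
        bound, _, node = heapq.heappop(heap)
        if bound >= best_energy:            # prune the whole subtree
            continue
        if is_leaf(node):
            e = leaf_energy(node)
            if e < best_energy:
                best_energy, best_leaf = e, node
        else:
            for child in split(node):
                heapq.heappush(heap, (lower_bound(child), tie, child))
                tie += 1
    return best_leaf, best_energy
```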

Victor Lempitsky, Andrew Blake, Carsten Rother
What Is a Good Image Segment? A Unified Approach to Segment Extraction

There is a huge diversity of definitions of “visually meaningful” image segments, ranging from simple uniformly colored and textured segments, through symmetric patterns, up to complex semantically meaningful objects. This diversity has led to a wide range of different approaches for image segmentation. In this paper we present a single unified framework for addressing this problem – “Segmentation by Composition”. We define a good image segment as one which can be easily composed using its own pieces, but is difficult to compose using pieces from other parts of the image. This non-parametric approach captures a large diversity of segment types, yet requires no pre-definition or modelling of segment types, nor prior training. Based on this definition, we develop a segment extraction algorithm – i.e., given a single point-of-interest, provide the “best” image segment containing that point. This induces a figure-ground image segmentation, which applies to a range of different segmentation tasks: single image segmentation, simultaneous co-segmentation of several images, and class-based segmentations.

Shai Bagon, Oren Boiman, Michal Irani

Computational Photography

Light-Efficient Photography

We consider the problem of imaging a scene with a given depth of field at a given exposure level in the shortest amount of time possible. We show that by (1) collecting a sequence of photos and (2) controlling the aperture, focus and exposure time of each photo individually, we can span the given depth of field in less total time than it takes to expose a single narrower-aperture photo. Using this as a starting point, we obtain two key results. First, for lenses with continuously-variable apertures, we derive a closed-form solution for the globally optimal capture sequence, i.e., the sequence that collects light from the specified depth of field in the most efficient way possible. Second, for lenses with discrete apertures, we derive an integer programming problem whose solution is the optimal sequence. Our results are applicable to off-the-shelf cameras and typical photography conditions, and advocate the use of dense, wide-aperture photo sequences as a light-efficient alternative to single-shot, narrow-aperture photography.

Samuel W. Hasinoff, Kiriakos N. Kutulakos
Flexible Depth of Field Photography

The range of scene depths that appear focused in an image is known as the depth of field (DOF). Conventional cameras are limited by a fundamental trade-off between depth of field and signal-to-noise ratio (SNR). For a dark scene, the aperture of the lens must be opened up to maintain SNR, which causes the DOF to reduce. Also, today’s cameras have DOFs that correspond to a single slab that is perpendicular to the optical axis. In this paper, we present an imaging system that enables one to control the DOF in new and powerful ways. Our approach is to vary the position and/or orientation of the image detector during the integration time of a single photograph. Even when the detector motion is very small (tens of microns), a large range of scene depths (several meters) is captured both in and out of focus.

Our prototype camera uses a micro-actuator to translate the detector along the optical axis during image integration. Using this device, we demonstrate three applications of flexible DOF. First, we describe extended DOF, where a large depth range is captured with a very wide aperture (low noise) but with nearly depth-independent defocus blur. Applying deconvolution to a captured image gives an image with extended DOF and yet high SNR. Next, we show the capture of images with discontinuous DOFs. For instance, near and far objects can be imaged with sharpness while objects in between are severely blurred. Finally, we show that our camera can capture images with tilted DOFs (Scheimpflug imaging) without tilting the image detector. We believe flexible DOF imaging can open a new creative dimension in photography and lead to new capabilities in scientific imaging, vision, and graphics.
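
After capture, the extended-DOF application above reduces to a single deconvolution with the (nearly) depth-independent blur kernel. A minimal Wiener-deconvolution sketch of that step, assuming the integrated PSF is known (`snr` is a hypothetical tuning constant, not a value from the paper):

```python
import numpy as np

def wiener_deconvolve(image, psf, snr=100.0):
    """Recover an extended-DOF image by Wiener deconvolution.

    `psf` is the integrated, (nearly) depth-independent blur kernel,
    assumed anchored at pixel (0, 0) (apply np.fft.ifftshift to a
    centred kernel first).
    """
    H = np.fft.fft2(psf, s=image.shape)
    G = np.conj(H) / (np.abs(H) ** 2 + 1.0 / snr)   # Wiener filter
    return np.real(np.fft.ifft2(np.fft.fft2(image) * G))
```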

Hajime Nagahara, Sujit Kuthirummal, Changyin Zhou, Shree K. Nayar
Priors for Large Photo Collections and What They Reveal about Cameras

A large photo collection downloaded from the internet spans a wide range of scenes, cameras, and photographers. In this paper we introduce several novel priors for statistics of such large photo collections that are independent of these factors. We then propose that properties of these factors can be recovered by examining the deviation between these statistical priors and the statistics of a slice of the overall photo collection that holds one factor constant. Specifically, we recover the radiometric properties of a particular camera model by collecting numerous images captured by it, and examining the deviation of this collection’s statistics from that of a broader photo collection whose camera-specific effects have been removed. We show that using this approach we can recover both a camera model’s non-linear response function and the spatially-varying vignetting of the camera’s different lens settings. All this is achieved using publicly available photographs, without requiring images captured under controlled conditions or physical access to the cameras. We also apply this concept to identify bad pixels on the detectors of specific camera instances. We conclude with a discussion of future applications of this general approach to other common computer vision problems.

Sujit Kuthirummal, Aseem Agarwala, Dan B Goldman, Shree K. Nayar
Understanding Camera Trade-Offs through a Bayesian Analysis of Light Field Projections

Computer vision has traditionally focused on extracting structure, such as depth, from images acquired using thin-lens or pinhole optics. The development of computational imaging is broadening this scope; a variety of unconventional cameras do not directly capture a traditional image anymore, but instead require the joint reconstruction of structure and image information. For example, recent coded aperture designs have been optimized to facilitate the joint reconstruction of depth and intensity. The breadth of imaging designs requires new tools to understand the tradeoffs implied by different strategies.

This paper introduces a unified framework for analyzing computational imaging approaches. Each sensor element is modeled as an inner product over the 4D light field. The imaging task is then posed as Bayesian inference: given the observed noisy light field projections and a prior on light field signals, estimate the original light field. Under common imaging conditions, we compare the performance of various camera designs using 2D light field simulations. This framework allows us to better understand the tradeoffs of each camera type and analyze their limitations.
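
Since each sensor element is an inner product over the light field, the measurement model is linear, y = Ax + noise, and under a Gaussian prior the MAP estimate is a regularized least-squares solve. A minimal sketch under those assumptions (the projection matrix `A` and `prior_cov` are placeholders for a particular camera design and light field prior):

```python
import numpy as np

def map_light_field(A, y, prior_cov, noise_var):
    """MAP estimate of a discretized light field x from noisy
    projections y = A @ x + n, with prior x ~ N(0, prior_cov) and
    noise n ~ N(0, noise_var * I).

    The posterior mean solves (A^T A / s2 + C^-1) x = A^T y / s2.
    """
    C_inv = np.linalg.inv(prior_cov)
    lhs = A.T @ A / noise_var + C_inv
    rhs = A.T @ y / noise_var
    return np.linalg.solve(lhs, rhs)
```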

Anat Levin, William T. Freeman, Frédo Durand

Poster Session IV

CenSurE: Center Surround Extremas for Realtime Feature Detection and Matching

We explore the suitability of different feature detectors for the task of image registration, and in particular for visual odometry, using two criteria: stability (persistence across viewpoint change) and accuracy (consistent localization across viewpoint change). In addition to the now-standard SIFT, SURF, FAST, and Harris detectors, we introduce a suite of scale-invariant center-surround detectors (CenSurE) that outperform the other detectors, yet have better computational characteristics than other scale-space detectors, and are capable of real-time implementation.
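
The center-surround responses underlying such detectors can be computed in constant time per pixel and scale using integral images; a difference-of-boxes sketch of that core idea (illustrative only, not the paper's exact bi-level filters):

```python
import numpy as np

def integral_image(img):
    return np.cumsum(np.cumsum(img.astype(np.float64), axis=0), axis=1)

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] in O(1) from integral image `ii`."""
    s = ii[r1 - 1, c1 - 1]
    if r0 > 0: s -= ii[r0 - 1, c1 - 1]
    if c0 > 0: s -= ii[r1 - 1, c0 - 1]
    if r0 > 0 and c0 > 0: s += ii[r0 - 1, c0 - 1]
    return s

def center_surround_response(ii, r, c, inner, outer):
    """Difference-of-boxes response at (r, c): inner box minus the
    surrounding ring, each normalized by its area. The caller must
    keep both boxes inside the image."""
    ci = box_sum(ii, r - inner, c - inner, r + inner + 1, c + inner + 1)
    co = box_sum(ii, r - outer, c - outer, r + outer + 1, c + outer + 1)
    ring = co - ci
    area_i = (2 * inner + 1) ** 2
    area_o = (2 * outer + 1) ** 2 - area_i
    return ci / area_i - ring / area_o
```

Because the integral image is computed once, responses at all positions and box scales cost O(1) each, which is what makes such detectors attractive for real-time use.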

Motilal Agrawal, Kurt Konolige, Morten Rufus Blas
Searching the World’s Herbaria: A System for Visual Identification of Plant Species

We describe a working computer vision system that aids in the identification of plant species. A user photographs an isolated leaf on a blank background, and the system extracts the leaf shape and matches it to the shape of leaves of known species. In a few seconds, the system displays the top matching species, along with textual descriptions and additional images. This system is currently in use by botanists at the Smithsonian Institution National Museum of Natural History. The primary contributions of this paper are: a description of a working computer vision system and its user interface for an important new application area; the introduction of three new datasets containing thousands of single leaf images, each labeled by species and verified by botanists at the US National Herbarium; recognition results for two of the three leaf datasets; and descriptions throughout of practical lessons learned in constructing this system.

Peter N. Belhumeur, Daozheng Chen, Steven Feiner, David W. Jacobs, W. John Kress, Haibin Ling, Ida Lopez, Ravi Ramamoorthi, Sameer Sheorey, Sean White, Ling Zhang
A Column-Pivoting Based Strategy for Monomial Ordering in Numerical Gröbner Basis Calculations

This paper presents a new fast approach to improving stability in polynomial equation solving. Gröbner basis techniques for equation solving have been applied successfully to several geometric computer vision problems. However, in many cases these methods are plagued by numerical problems. An interesting approach to stabilising the computations is to study basis selection for the quotient space ℂ[x]/I. In this paper, the exact matrix computations involved in the solution procedure are clarified, and using this knowledge we propose a new fast basis selection scheme based on QR-factorization with column pivoting. We also propose an adaptive scheme for truncation of the Gröbner basis to further improve stability. The new basis selection strategy is studied on some of the latest reported uses of Gröbner basis methods in computer vision, and we demonstrate a fourfold increase in speed and nearly as good overall precision as the previous SVD-based method. Moreover, we typically get similar or better reduction of the largest errors.
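
The basis-selection idea can be illustrated with SciPy's pivoted QR: column pivoting reorders the columns so that the best-conditioned ones come first, giving a numerically motivated choice of monomials (a toy sketch of the selection step, not the paper's full solver):

```python
import numpy as np
from scipy.linalg import qr

def select_basis_columns(C, k):
    """Pick k well-conditioned columns of a coefficient matrix C using
    QR with column pivoting; the pivot order ranks columns by how much
    new 'energy' each one contributes."""
    Q, R, piv = qr(C, pivoting=True)
    return piv[:k]          # indices of the k selected columns

rng = np.random.default_rng(0)
C = rng.standard_normal((8, 12))    # stand-in coefficient matrix
print(select_basis_columns(C, 5))
```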

Martin Byröd, Klas Josephson, Kalle Åström
Co-recognition of Image Pairs by Data-Driven Monte Carlo Image Exploration

We introduce a new concept of ‘co-recognition’ for object-level image matching between an arbitrary image pair. Our method augments putative local region matches to reliable object-level correspondences without any supervision or prior knowledge of common objects. It provides the number of reliable common objects and the dense correspondences between the image pair. In this paper, a generative model for co-recognition is presented. For inference, we propose data-driven Monte Carlo image exploration, which clusters and propagates local region matches by Markov chain dynamics. The global optimum is achieved by the guiding force of our data-driven sampling and posterior probability model. In the experiments, we demonstrate its power and utility on image retrieval and on unsupervised recognition and segmentation of multiple common objects.

Minsu Cho, Young Min Shin, Kyoung Mu Lee
Movie/Script: Alignment and Parsing of Video and Text Transcription

Movies and TV are a rich source of diverse and complex video of people, objects, actions and locales “in the wild”. Harvesting automatically labeled sequences of actions from video would enable creation of large-scale and highly-varied datasets. To enable such collection, we focus on the task of recovering scene structure in movies and TV series for object tracking and action retrieval. We present a weakly supervised algorithm that uses the screenplay and closed captions to parse a movie into a hierarchy of shots and scenes. Scene boundaries in the movie are aligned with screenplay scene labels, and shots are reordered into a sequence of long continuous tracks or threads which allow for more accurate tracking of people, actions and objects. Scene segmentation, alignment, and shot threading are formulated as inference in a unified generative model, and a novel hierarchical dynamic programming algorithm that can handle alignment and jump-limited reorderings in linear time is presented. We present quantitative and qualitative results on movie alignment and parsing, and use the recovered structure to improve character naming and retrieval of common actions in several episodes of popular TV series.

Timothee Cour, Chris Jordan, Eleni Miltsakaki, Ben Taskar
Using 3D Line Segments for Robust and Efficient Change Detection from Multiple Noisy Images

In this paper, we propose a new approach to change detection that is based on the appearance or disappearance of 3D lines, which may be short, as seen in a new image. These 3D lines are estimated automatically and quickly from a set of previously-taken learning images from arbitrary viewpoints and under arbitrary lighting conditions. 3D change detection traditionally involves unsupervised estimation of scene geometry and the associated BRDF at each observable voxel in the scene, and the comparison of a new image with its prediction. If a significant number of pixels differ in the two aligned images, a change in the 3D scene is assumed to have occurred. The importance of our approach is that by comparing images of lines rather than of gray levels, we avoid the computationally intensive, and sometimes impossible, tasks of estimating 3D surfaces and their associated BRDFs in the model-building stage. Instead, we estimate 3D lines, arising from 3D ridges or BRDF ridges, which are computationally much less costly to obtain and more reliably detected. Our method is widely applicable since man-made structures consisting of 3D line segments are the main focus of most applications. The contributions of this paper are: change detection based on appropriate interpretation of line appearance and disappearance in a new image; and unsupervised estimation of “short” 3D lines from multiple images such that the required computation is manageable and the estimation accuracy is high.

Ibrahim Eden, David B. Cooper
Action Recognition with a Bio–inspired Feedforward Motion Processing Model: The Richness of Center-Surround Interactions

Here we show that reproducing the functional properties of MT cells with various center-surround interactions enriches motion representation and improves action recognition performance. To do so, we propose a simplified bio-inspired model of the motion pathway in primates: it is a feedforward model restricted to the V1-MT cortical layers; cortical cells cover the visual space with a foveated structure; and, more importantly, we reproduce some of the richness of the center-surround interactions of MT cells. Interestingly, as observed in neurophysiology, our MT cells not only behave like simple velocity detectors, but also respond to several kinds of motion contrasts. Results show that this diversity of motion representation at the MT level is a major advantage for an action recognition task. Defining motion maps as our feature vectors, we used a standard classification method on the Weizmann database and obtained an average recognition rate of 98.9%, which is superior to the recent results by Jhuang et al. (2007). These promising results encourage us to further develop bio-inspired models incorporating other brain mechanisms and cortical layers in order to deal with more complex videos.

Maria-Jose Escobar, Pierre Kornprobst
Linking Pose and Motion

Algorithms designed to estimate 3D pose in video sequences enforce temporal consistency but typically overlook an important source of information: The 3D pose of an object, be it rigid or articulated, has a direct influence on its direction of travel.

In this paper, we use the cases of an airplane performing aerobatic maneuvers and of pedestrians walking and turning to demonstrate that this information can and should be used to increase the accuracy and reliability of pose estimation algorithms.

Andrea Fossati, Pascal Fua
Automated Delineation of Dendritic Networks in Noisy Image Stacks

We present a novel approach to 3D delineation of dendritic networks in noisy image stacks. We achieve a level of automation beyond that of state-of-the-art systems, which model dendrites as continuous tubular structures and postulate simple appearance models. Instead, we learn models from the data itself, which make them better suited to handle noise and deviations from expected appearance.

From very little expert-labeled ground truth, we train both a classifier to recognize individual dendrite voxels and a density model to classify segments connecting pairs of points as dendrite-like or not. Given these models, we can then trace the dendritic trees of neurons automatically by enforcing the tree structure of the resulting graph. We show that our approach performs better than traditional techniques on brightfield image stacks.

Germán González, François Fleuret, Pascal Fua
Calibration from Statistical Properties of the Visual World

What does a blind entity need in order to determine the geometry of the set of photocells that it carries through a changing lightfield? In this paper, we show that very crude knowledge of some statistical properties of the environment is sufficient for this task.

We show that some dissimilarity measures between pairs of signals produced by photocells are strongly related to the angular separation between the photocells. Based on real-world data, we model this relation quantitatively, using dissimilarity measures based on correlation and conditional entropy. We show that this model allows us to estimate the angular separation from the dissimilarity. Although the resulting estimators are not very accurate, they maintain their performance throughout different visual environments, suggesting that the model encodes a very general property of our visual world.

Finally, leveraging this method to estimate angles from signal pairs, we show how distance geometry techniques allow us to recover the complete sensor geometry.
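
The pipeline the abstract outlines — map pairwise signal dissimilarity to an angle estimate, then recover sensor directions with distance-geometry techniques — can be sketched with classical multidimensional scaling. `angle_from_dissimilarity` is a hypothetical stand-in for the calibrated model, and using angles directly as distances is an approximation:

```python
import numpy as np

def classical_mds(D, dim=3):
    """Embed points from a pairwise-distance matrix D (classical MDS)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J              # double-centred Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

def recover_sensor_geometry(signals, angle_from_dissimilarity):
    """Estimate photocell directions from their signals alone.

    `signals` is (n_photocells, n_samples); `angle_from_dissimilarity`
    maps 1 - correlation to an angular separation (a placeholder for
    the empirically fitted model)."""
    C = np.corrcoef(signals)
    angles = angle_from_dissimilarity(1.0 - C)
    return classical_mds(angles, dim=3)      # approximate embedding
```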

Etienne Grossmann, José António Gaspar, Francesco Orabona
Regular Texture Analysis as Statistical Model Selection

An approach to the analysis of images of regular texture is proposed in which lattice hypotheses are used to define statistical models. These models are then compared in terms of their ability to explain the image. A method based on this approach is described in which lattice hypotheses are generated using analysis of peaks in the image autocorrelation function, statistical models are based on Gaussian or Gaussian mixture clusters, and model comparison is performed using the marginal likelihood as approximated by the Bayes Information Criterion (BIC). Experiments on public domain regular texture images and a commercial textile image archive demonstrate substantially improved accuracy compared to two competing methods. The method is also used for classification of texture images as regular or irregular. An application to thumbnail image extraction is discussed.
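
The model-comparison step is easy to make concrete: each lattice hypothesis yields a fitted log-likelihood, and BIC trades that fit against the parameter count. A generic sketch of the criterion (not the paper's Gaussian-mixture texture models):

```python
import numpy as np

def bic(log_likelihood, n_params, n_samples):
    """Bayes Information Criterion; lower is better in this form."""
    return n_params * np.log(n_samples) - 2.0 * log_likelihood

def best_lattice(hypotheses, n_samples):
    """Pick the lattice hypothesis with minimal BIC.

    Each hypothesis is a tuple (name, fitted log-likelihood, #params).
    """
    scores = {name: bic(ll, k, n_samples) for name, ll, k in hypotheses}
    return min(scores, key=scores.get), scores
```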

Junwei Han, Stephen J. McKenna, Ruixuan Wang
Higher Dimensional Affine Registration and Vision Applications

Affine registration has a long and venerable history in the computer vision literature, and extensive work has been done for affine registrations in ℝ² and ℝ³. In this paper, we study affine registrations in ℝᵐ for m > 3, and to justify breaking this dimension barrier, we show two interesting types of matching problems that can be formulated and solved as affine registration problems in dimensions higher than three: stereo correspondence under motion and image set matching. More specifically, for an object undergoing non-rigid motion that can be linearly modelled using a small number of shape basis vectors, the stereo correspondence problem can be solved by affine registering points in ℝ³ⁿ. And given two collections of images related by an unknown linear transformation of the image space, the correspondences between images in the two collections can be recovered by solving an affine registration problem in ℝᵐ, where m is the dimension of a PCA subspace. The algorithm proposed in this paper estimates the affine transformation between two point sets in ℝᵐ. It does not require continuous optimization, and our analysis shows that, in the absence of data noise, the algorithm will recover the exact affine transformation for almost all point sets with a worst-case time complexity of O(mk²), where k is the size of the point set. We validate the proposed algorithm on a variety of synthetic point sets in different dimensions with varying degrees of deformation and noise, and we also show experimentally that the two types of matching problems can indeed be solved satisfactorily using the proposed affine registration algorithm.
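
Once point correspondences are fixed, fitting the affine map in ℝᵐ reduces to linear least squares; a sketch under that assumption (the paper's algorithm also recovers the correspondence, which this omits):

```python
import numpy as np

def fit_affine(X, Y):
    """Least-squares affine map Y ~ X @ A.T + t for corresponding
    point sets X, Y of shape (k, m)."""
    k, m = X.shape
    Xh = np.hstack([X, np.ones((k, 1))])         # homogeneous coords
    M, *_ = np.linalg.lstsq(Xh, Y, rcond=None)   # (m+1, m) solution
    return M[:m].T, M[m]                         # A is (m, m), t is (m,)

# Noise-free check in m = 5 dimensions: exact recovery is expected.
rng = np.random.default_rng(1)
A_true = rng.standard_normal((5, 5)); t_true = rng.standard_normal(5)
X = rng.standard_normal((40, 5))
Y = X @ A_true.T + t_true
A_est, t_est = fit_affine(X, Y)
print(np.allclose(A_est, A_true), np.allclose(t_est, t_true))
```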

Yu-Tseh Chi, S. M. Nejhum Shahed, Jeffrey Ho, Ming-Hsuan Yang
Semantic Concept Classification by Joint Semi-supervised Learning of Feature Subspaces and Support Vector Machines

The scarcity of labeled training data relative to the high-dimensional multi-modal features is one of the major obstacles for semantic concept classification of images and videos. Semi-supervised learning leverages the large amount of unlabeled data in developing effective classifiers. Feature subspace learning finds optimal feature subspaces for representing data and helping classification. In this paper, we present a novel algorithm, Locality Preserving Semi-supervised Support Vector Machines (LPSSVM), to jointly learn an optimal feature subspace as well as a large margin SVM classifier. Over both labeled and unlabeled data, an optimal feature subspace is learned that maintains the smoothness of local neighborhoods as well as being discriminative for classification. Simultaneously, an SVM classifier is optimized in the learned feature subspace to have large margin. The resulting classifier can be readily used to handle unseen test data. Additionally, we show that the LPSSVM algorithm can be used in a Reproducing Kernel Hilbert Space for nonlinear classification. We extensively evaluate the proposed algorithm over four types of data sets: a toy problem, two UCI data sets, the Caltech 101 data set for image classification, and the challenging Kodak’s consumer video data set for semantic concept detection. Promising results are obtained which clearly confirm the effectiveness of the proposed method.

Wei Jiang, Shih-Fu Chang, Tony Jebara, Alexander C. Loui
Learning from Real Images to Model Lighting Variations for Face Images

For robust face recognition, lighting variation is considered one of the greatest challenges. Since the nine points of light (9PL) subspace is an appropriate low-dimensional approximation to the illumination cone, it has yielded good face recognition results under a wide range of difficult lighting conditions. However, building the 9PL subspace for a subject requires nine gallery images under specific lighting conditions, which are not always available in practice. Instead, we propose a statistical model for performing face recognition under variable illumination. Through this model, the nine basis images of a face can be recovered via maximum-a-posteriori (MAP) estimation with only one gallery image of that face. Furthermore, the training procedure requires only some real images and avoids tedious processing such as SVD or the use of geometric (3D) or albedo information of a surface. With the recovered nine-dimensional lighting subspace, recognition experiments were performed extensively on three publicly available databases which include images under single and multiple distant point light sources. Our approach yields better results than current ones. Even under extreme lighting conditions, the estimated subspace can still represent lighting variation well, and it retains the main characteristics of the 9PL subspace. Thus, the proposed algorithm can be applied to recognition under variable lighting conditions.

Xiaoyue Jiang, Yuk On Kong, Jianguo Huang, Rongchun Zhao, Yanning Zhang
Toward Global Minimum through Combined Local Minima

There are many local and greedy algorithms for energy minimization over Markov Random Fields (MRFs), such as iterated conditional modes (ICM) and various gradient descent methods. Local minima solutions can be obtained with simple implementations and usually require less computational time than global algorithms. Also, methods such as ICM can be readily implemented for various difficult problems that may involve larger-than-pairwise-clique MRFs. However, their shortcomings are evident in comparison to newer methods such as graph cut and belief propagation: the local minimum depends largely on the initial state, which is the fundamental problem of this class of methods. In this paper, the disadvantages of local minima techniques are addressed by proposing ways to combine multiple local solutions. First, multiple ICM solutions are obtained using different initial states. The solutions are then combined with a random-partitioning-based greedy algorithm called Combined Local Minima (CLM). There are numerous MRF problems that cannot be efficiently implemented with graph cut and belief propagation, and so by introducing ways to effectively combine local solutions, we present a method to dramatically improve many of the pre-existing local minima algorithms. The proposed approach is shown to be effective on pairwise stereo MRFs compared with graph cut and sequential tree-reweighted belief propagation (TRW-S). Additionally, we tested our algorithm against belief propagation (BP) over randomly generated 30×30 MRFs with 2×2 clique potentials, and we experimentally illustrate CLM’s advantage over message passing algorithms in computational complexity and performance.
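
The combination step can be sketched generically: run ICM from several random initializations, then greedily splice the resulting labelings over random partitions, keeping a swap only when it lowers the energy. In this sketch `energy` is a placeholder for the MRF energy at hand, and the partitioning scheme is a simplification of the paper's:

```python
import numpy as np

def combine_local_minima(solutions, energy, n_rounds=100, seed=0):
    """Greedy CLM-style fusion of several labelings of the same MRF.

    `solutions` is a list of 1-D label arrays (e.g. ICM results from
    different initial states); `energy(labels)` evaluates the MRF
    energy. Each round copies a random block from a random solution
    into the current best labeling, keeping the move only if the
    energy decreases."""
    rng = np.random.default_rng(seed)
    best = min(solutions, key=energy).copy()
    best_e = energy(best)
    n = best.size
    for _ in range(n_rounds):
        donor = solutions[rng.integers(len(solutions))]
        i, j = sorted(rng.integers(0, n, size=2))
        cand = best.copy()
        cand[i:j + 1] = donor[i:j + 1]     # splice a random partition
        e = energy(cand)
        if e < best_e:
            best, best_e = cand, e
    return best, best_e
```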

Ho Yub Jung, Kyoung Mu Lee, Sang Uk Lee
Differential Spatial Resection - Pose Estimation Using a Single Local Image Feature

Robust local image features have been used successfully in robot localization and camera pose estimation; region tracking using affine warps has been considered state of the art for many years. Although such correspondences provide a warp of the local image region and are quite powerful, in direct pose estimation they are so far only considered as points, and therefore three of them are required to construct a camera pose. In this contribution we show how it is possible to directly compute a pose based upon one such feature, given the plane in space where it lies. This differential correspondence concept exploits the texture warp and has recently gained attention in estimation of conjugate rotations. The approach can also be considered as the limiting case of the well-known spatial resection problem when the three 3D points approach each other infinitesimally closely. We show that the differential correspondence is more powerful than conic correspondences, while its exploitation requires nothing more complicated than the roots of a third-order polynomial. We give a detailed sensitivity analysis, a comparison against state-of-the-art pose estimators, and demonstrate real-world applicability of the algorithm based on automatic region recognition.

Kevin Köser, Reinhard Koch
Riemannian Anisotropic Diffusion for Tensor Valued Images

Tensor-valued images, for instance originating from diffusion tensor magnetic resonance imaging (DT-MRI), have become more and more important over the last couple of years. Due to the nonlinear structure of such data it is nontrivial to adapt well-established image processing techniques to them. In this contribution we derive anisotropic diffusion equations for tensor-valued images based on the intrinsic Riemannian geometric structure of the space of symmetric positive definite tensors. In contrast to anisotropic diffusion approaches proposed so far, which are based on the Euclidean metric, our approach accounts for the nonlinear structure of positive definite tensors by means of the intrinsic Riemannian metric. Together with an intrinsic numerical scheme, our approach overcomes a main drawback of previously proposed anisotropic diffusion approaches, the so-called eigenvalue swelling effect. Experiments on synthetic data as well as real DT-MRI data demonstrate the value of a sound differential geometric formulation of diffusion processes for tensor-valued data.

Kai Krajsek, Marion I. Menzel, Michael Zwanger, Hanno Scharr
FaceTracer: A Search Engine for Large Collections of Images with Faces

We have created the first image search engine based entirely on faces. Using simple text queries such as “smiling men with blond hair and mustaches,” users can search through over 3.1 million faces which have been automatically labeled on the basis of several facial attributes. Faces in our database have been extracted and aligned from images downloaded from the internet using a commercial face detector, and the number of images and attributes continues to grow daily. Our classification approach uses a novel combination of Support Vector Machines and Adaboost which exploits the strong structure of faces to select and train on the optimal set of features for each attribute. We show state-of-the-art classification results compared to previous works, and demonstrate the power of our architecture through a functional, large-scale face search engine. Our framework is fully automatic, easy to scale, and computes all labels off-line, leading to fast on-line search performance. In addition, we describe how our system can be used for a number of applications, including law enforcement, social networks, and personal photo management. Our search engine will soon be made publicly available.

Neeraj Kumar, Peter Belhumeur, Shree Nayar
What Does the Sky Tell Us about the Camera?

As the main observed illuminant outdoors, the sky is a rich source of information about the scene. However, it is yet to be fully explored in computer vision because its appearance depends on the sun position, weather conditions, photometric and geometric parameters of the camera, and the location of capture. In this paper, we propose the use of a physically-based sky model to analyze the information available within the visible portion of the sky, observed over time. By fitting this model to an image sequence, we show how to extract camera parameters such as the focal length, and the zenith and azimuth angles. In short, the sky serves as a geometric calibration target. Once the camera parameters are recovered, we show how to use the same model in two applications: 1) segmentation of the sky and cloud layers, and 2) data-driven sky matching across different image sequences based on a novel similarity measure defined on sky parameters. This measure, combined with a rich appearance database, allows us to model a wide range of sky conditions.

Jean-François Lalonde, Srinivasa G. Narasimhan, Alexei A. Efros
Three Dimensional Curvilinear Structure Detection Using Optimally Oriented Flux

This paper proposes a novel curvilinear structure detector, called Optimally Oriented Flux (OOF). OOF finds an optimal axis on which image gradients are projected in order to compute the image gradient flux. The computation of OOF is localized at the boundaries of local spherical regions. It avoids considering closely located adjacent structures. The main advantage of OOF is its robustness against the disturbance induced by closely located adjacent objects. Moreover, the analytical formulation of OOF introduces no additional computation load as compared to the calculation of the Hessian matrix which is widely used for curvilinear structure detection. It is experimentally demonstrated that OOF delivers accurate and stable curvilinear structure detection responses under the interference of closely located adjacent structures as well as image noise.

Max W. K. Law, Albert C. S. Chung
Scene Segmentation for Behaviour Correlation

This paper presents a novel framework for detecting abnormal pedestrian and vehicle behaviour by modelling cross-correlation among different co-occurring objects both locally and globally in a given scene. We address this problem by first segmenting a scene into semantic regions according to how object events occur globally in the scene, and second modelling concurrent correlations among regional object events both locally (within the same region) and globally (across different regions). Instead of tracking objects, the model represents behaviour based on classification of atomic video events, designed to be more suitable for analysing crowded scenes. The proposed system works in an unsupervised manner throughout using automatic model order selection to estimate its parameters given video data of a scene for a brief training period. We demonstrate the effectiveness of this system with experiments on public road traffic data.

Jian Li, Shaogang Gong, Tao Xiang
Robust Visual Tracking Based on an Effective Appearance Model

Most existing appearance models for visual tracking construct a pixel-based representation of object appearance, and so are incapable of fully capturing both the global and local spatial layout information of object appearance. In order to address this problem, we propose a novel spatial Log-Euclidean appearance model (referred to as SLAM) under the recently introduced Log-Euclidean Riemannian metric [23]. SLAM is capable of capturing both the global and local spatial layout information of object appearance by constructing a block-based Log-Euclidean eigenspace representation. Specifically, the process of learning the proposed SLAM consists of five steps: appearance block division, online Log-Euclidean eigenspace learning, local spatial weighting, global spatial weighting, and likelihood evaluation. Furthermore, a novel online Log-Euclidean Riemannian subspace learning algorithm (IRSL) [14] is applied to incrementally update the proposed SLAM. Tracking is then guided by the Bayesian state inference framework, in which a particle filter is used for propagating sample distributions over time. Theoretical analysis and experimental evaluations demonstrate the promise and effectiveness of the proposed SLAM.

Xi Li, Weiming Hu, Zhongfei Zhang, Xiaoqin Zhang
Key Object Driven Multi-category Object Recognition, Localization and Tracking Using Spatio-temporal Context

In this paper we address the problem of recognizing, localizing and tracking multiple objects of different categories in meeting room videos. Difficulties such as lack of detail and multi-object co-occurrence make it hard to directly apply traditional object recognition methods. Under such circumstances, we show that incorporating object-level spatio-temporal relationships can lead to significant improvements in inference of object category and state. Contextual relationships are modeled by a dynamic Markov random field, in which recognition, localization and tracking are done simultaneously. Further, we define the human as the key object of the scene, which can be detected relatively robustly and is therefore used to guide the inference of other objects. Experiments are done on the CHIL meeting video corpus. Performance is evaluated in terms of object detection and false alarm rates, object recognition confusion matrix and pixel-level accuracy of object segmentation.

Yuan Li, Ram Nevatia
A Pose-Invariant Descriptor for Human Detection and Segmentation

We present a learning-based, sliding window-style approach for the problem of detecting humans in still images. Instead of traditional concatenation-style image location-based feature encoding, a global descriptor more invariant to pose variation is introduced. Specifically, we propose a principled approach to learning and classifying human/non-human image patterns by simultaneously segmenting human shapes and poses, and extracting articulation-insensitive features. The shapes and poses are segmented by an efficient, probabilistic hierarchical part-template matching algorithm, and the features are collected in the context of poses by tracing around the estimated shape boundaries. Histograms of oriented gradients are used as a source of low-level features from which our pose-invariant descriptors are computed, and kernel SVMs are adopted as the test classifiers. We evaluate our detection and segmentation approach on two public pedestrian datasets.

Zhe Lin, Larry S. Davis
Texture-Consistent Shadow Removal

This paper presents an approach to shadow removal that preserves texture consistency between the original shadow and lit area. Illumination reduction in the shadow area not only darkens that area, but also changes the texture characteristics there. We achieve texture-consistent shadow removal by constructing a shadow-free and texture-consistent gradient field. First, we estimate an illumination change surface which causes the shadow and remove the gradients it induces. We approximate the illumination change surface with illumination change splines across the shadow boundary. We formulate estimating these splines as an optimization problem which balances the smoothness between the neighboring splines and their fitness to the image data. Second, we sample the shadow effect on the texture characteristics in the umbra and lit area near the shadow boundary, and remove it by transforming the gradients inside the shadow area to be compatible with the lit area. Experiments on photos from Flickr demonstrate the effectiveness of our method.
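
A standard way to get from an edited gradient field back to an image is a Poisson solve; a small FFT-based sketch assuming periodic boundaries (illustrative of the gradient-domain step; the paper does not specify this particular reconstruction):

```python
import numpy as np

def poisson_reconstruct(gx, gy):
    """Recover an image (up to an additive constant) whose gradients
    best match (gx, gy), assuming forward differences with periodic
    boundaries. Solves lap(I) = div(g) in the Fourier domain."""
    H, W = gx.shape
    # Backward-difference divergence, adjoint to forward-diff gradients.
    div = (gx - np.roll(gx, 1, axis=1)) + (gy - np.roll(gy, 1, axis=0))
    u = np.fft.fftfreq(W); v = np.fft.fftfreq(H)
    denom = (2 * np.cos(2 * np.pi * u)[None, :] +
             2 * np.cos(2 * np.pi * v)[:, None] - 4)  # Laplacian spectrum
    denom[0, 0] = 1.0            # avoid division by zero at DC
    I_hat = np.fft.fft2(div) / denom
    I_hat[0, 0] = 0.0            # fix the free additive constant
    return np.real(np.fft.ifft2(I_hat))
```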

Feng Liu, Michael Gleicher
Scene Discovery by Matrix Factorization

What constitutes a scene? Defining a meaningful vocabulary for scene discovery is a challenging problem that has important consequences for object recognition. We consider scenes to depict correlated objects and to present visual similarity. We introduce a max-margin factorization model that finds a low dimensional subspace with high discriminative power for correlated annotations. We postulate this space should allow us to discover a large number of scenes in unsupervised data; we show scene discrimination results on par with supervised approaches. This model also produces state-of-the-art word prediction results, including good annotation completion.

Nicolas Loeff, Ali Farhadi
Simultaneous Detection and Registration for Ileo-Cecal Valve Detection in 3D CT Colonography

Object detection and recognition have achieved significant progress in recent years. However, robust 3D object detection and segmentation in noisy 3D data volumes remain challenging. Localizing an object generally requires its spatial configuration (i.e., pose, size) to be aligned with the trained object model, while estimation of an object’s spatial configuration is only valid at locations where the object appears. Detecting an object while exhaustively searching its spatial parameters is computationally prohibitive due to the high dimensionality of the 3D search space. In this paper, we circumvent this computational complexity by proposing a novel framework capable of incrementally learning the object parameters (IPL) of location, pose and scale. This method is based on a sequence of binary encodings of the projected true positives from the original 3D object annotations (i.e., the projections of the global optima from the global space into the sections of subspaces). The training samples in each projected subspace are labeled as positive or negative according to their spatial registration distances towards the annotations serving as ground truth. Each encoding process can be considered as a general binary classification problem and is implemented using the probabilistic boosting tree algorithm. We validate our approach with extensive experiments and performance evaluations for Ileo-Cecal Valve (ICV) detection in both clean and tagged 3D CT colonography scans. Our final ICV detection system also includes an optional prior learning procedure for IPL which further speeds up the detection.

Le Lu, Adrian Barbu, Matthias Wolf, Jianming Liang, Luca Bogoni, Marcos Salganicoff, Dorin Comaniciu
Constructing Category Hierarchies for Visual Recognition

Class hierarchies are commonly used to reduce the complexity of the classification problem. This is crucial when dealing with a large number of categories. In this work, we evaluate class hierarchies currently constructed for visual recognition. We show that top-down as well as bottom-up approaches, which are commonly used to automatically construct hierarchies, incorporate assumptions about the separability of classes. Those assumptions do not hold for visual recognition of a large number of object categories. We therefore propose a modification which is appropriate for most top-down approaches. It allows us to construct class hierarchies that postpone decisions in the presence of uncertainty and thus provide higher recognition accuracy. We also compare our method to a one-against-all approach and show how to control the speed-for-accuracy trade-off with our method. For the experimental evaluation, we use the Caltech-256 visual object classes dataset and compare to state-of-the-art methods.

Marcin Marszałek, Cordelia Schmid
Sample Sufficiency and PCA Dimension for Statistical Shape Models

Statistical shape modelling (SSM) is a popular technique in computer vision applications, where the variation in shape of a given structure is modelled by principal component analysis (PCA) on a set of training samples. The issue of sample size sufficiency is not generally considered. In this paper, we propose a framework to investigate the sources of SSM inaccuracy. Based on this framework, we propose a procedure to determine sample size sufficiency by testing whether the training data stabilises the SSM. Also, the number of principal modes to retain (the PCA dimension) is usually chosen using rules that aim to cover a percentage of the total variance or to limit the residual to a threshold. However, an ideal rule should retain modes that correspond to real structural variation and discard those that are dominated by noise. We show that these commonly used rules are not reliable, and we propose a new rule that uses bootstrap stability analysis on mode directions to determine the PCA dimension.

For validation we use synthetic 3D face datasets generated using a known number of structural modes with added noise. A 4-way ANOVA is applied to the model reconstruction accuracy over sample size, shape vector dimension, PCA dimension, and noise level. It shows that there is no universal sample size guideline for SSM, nor is there a simple relationship to the shape vector dimension (p-value = 0.2932). Validation of our rule for retaining structural modes showed that it detected the correct number of modes to retain where the conventional methods failed. The methods were also tested on real 2D (22 points) and 3D (500 points) face data, retaining 24 and 70 modes with sample sufficiency being reached at approximately 50 and 150 samples respectively. We provide a foundation for appropriate selection of PCA dimension and determination of sample size sufficiency in statistical shape modelling.
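
The bootstrap-stability rule can be sketched directly: resample the training shapes, recompute the PCA modes, and keep only the leading modes whose directions are reproducible. A toy sketch (the thresholds are illustrative, and modes are matched by index for simplicity, not the paper's calibrated procedure):

```python
import numpy as np

def pca_modes(X, k):
    """Top-k principal directions of data matrix X (samples x dims)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k]

def stable_mode_count(X, k_max, n_boot=50, thresh=0.9, seed=0):
    """Count leading PCA modes whose directions are reproducible under
    bootstrap resampling (|cosine| with the full-data mode > thresh;
    abs() handles the sign ambiguity of eigenvectors)."""
    rng = np.random.default_rng(seed)
    ref = pca_modes(X, k_max)
    agree = np.zeros(k_max)
    for _ in range(n_boot):
        idx = rng.integers(0, len(X), size=len(X))
        boot = pca_modes(X[idx], k_max)
        agree += np.abs(np.sum(ref * boot, axis=1)) > thresh
    stable = agree / n_boot > 0.5
    # Retain the leading run of stable modes.
    return k_max if stable.all() else int(np.argmin(stable))
```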

Lin Mei, Michael Figl, Ara Darzi, Daniel Rueckert, Philip Edwards
Locating Facial Features with an Extended Active Shape Model

We make some simple extensions to the Active Shape Model of Cootes et al. [4], and use it to locate features in frontal views of upright faces. We show on independent test data that with the extensions the Active Shape Model compares favorably with more sophisticated methods. The extensions are: (i) fitting more landmarks than are actually needed; (ii) selectively using two- instead of one-dimensional landmark templates; (iii) adding noise to the training set; (iv) relaxing the shape model where advantageous; (v) trimming covariance matrices by setting most entries to zero; and (vi) stacking two Active Shape Models in series.

Stephen Milborrow, Fred Nicolls
Dynamic Integration of Generalized Cues for Person Tracking

We present an approach for the dynamic combination of multiple cues in a particle filter-based tracking framework. The proposed algorithm is based on a combination of democratic integration and layered sampling. It is capable of dealing with deficiencies of single features as well as partial occlusion using the very same dynamic fusion mechanism. A set of simple but fast cues is defined, which allow us to cope with limited computational resources. The system is capable of automatic track initialization by means of a dedicated attention tracker permanently scanning the surroundings.

Kai Nickel, Rainer Stiefelhagen
Extracting Moving People from Internet Videos

We propose a fully automatic framework to detect and extract arbitrary human motion volumes from real-world videos collected from YouTube. Our system is composed of two stages. A person detector is first applied to provide crude information about the possible locations of humans. Then a constrained clustering algorithm groups the detections and rejects false positives based on appearance similarity and spatio-temporal coherence. In the second stage, we apply a top-down pictorial structure model to complete the extraction of the humans in arbitrary motion. During this procedure, a density propagation technique based on a mixture of Gaussians is employed to propagate temporal information in a principled way. This method greatly reduces the search space for the measurement in the inference stage. We demonstrate the initial success of this framework both quantitatively and qualitatively using a number of YouTube videos.

Juan Carlos Niebles, Bohyung Han, Andras Ferencz, Li Fei-Fei
Multiple Instance Boost Using Graph Embedding Based Decision Stump for Pedestrian Detection

Pedestrian detection in still images must handle large appearance and stance variations arising from the articulated structure and varied clothing of humans, as well as from changing viewpoints. In this paper, we address this problem from a view which utilizes multiple instances to represent the variations in a multiple instance learning (MIL) framework. Specifically, logistic multiple instance boost (LMIBoost) is advocated to learn the pedestrian appearance model. To efficiently use the histogram feature, we propose a graph-embedding-based decision stump for data with non-Gaussian distributions. First, the topology structure of the examples is carefully designed to keep between-class distances far and within-class distances close. Second, the K-means algorithm is adopted to quickly locate the multiple decision planes for the weak classifier. Experiments show the improved accuracy of the proposed approach in comparison with existing pedestrian detection methods on two public test sets: INRIA and VOC2006’s person detection subtask [1].

Junbiao Pang, Qingming Huang, Shuqiang Jiang
Object Detection from Large-Scale 3D Datasets Using Bottom-Up and Top-Down Descriptors

We propose an approach for detecting objects in large-scale range datasets that combines bottom-up and top-down processes. In the bottom-up stage, fast-to-compute local descriptors are used to detect potential target objects. The object hypotheses are verified after alignment in a top-down stage using global descriptors that capture larger scale structure information. We have found that the combination of spin images and Extended Gaussian Images, as local and global descriptors respectively, provides a good trade-off between efficiency and accuracy. We present results on real outdoors scenes containing millions of scanned points and hundreds of targets. Our results compare favorably to the state of the art by being applicable to much larger scenes captured under less controlled conditions, by being able to detect object classes and not specific instances, and by being able to align the query with the best matching model accurately, thus obtaining precise segmentation.

Alexander Patterson IV, Philippos Mordohai, Kostas Daniilidis
Making Background Subtraction Robust to Sudden Illumination Changes

Modern background subtraction techniques can handle gradual illumination changes but can easily be confused by rapid ones. We propose a technique that overcomes this limitation by relying on a statistical model, not of the pixel intensities, but of the illumination effects. Because they tend to affect whole areas of the image as opposed to individual pixels, low-dimensional models are appropriate for this purpose and make our method extremely robust to illumination changes, whether slow or fast.

We will demonstrate its performance by comparing it to two representative implementations of state-of-the-art methods, and by showing its effectiveness for occlusion handling in a real-time Augmented Reality context.
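
A low-dimensional illumination model of this flavor can be sketched with "eigenbackgrounds": project each frame onto a small subspace learned from background frames and flag pixels poorly explained by the projection, since global illumination changes live in the subspace while foreground objects do not. This is a named, standard stand-in, not the authors' exact statistical model:

```python
import numpy as np

class EigenBackground:
    """Low-dimensional background/illumination model.

    Frames are flattened to vectors; the top `k` principal components
    capture whole-image illumination effects, so reconstructing a frame
    from the subspace absorbs sudden lighting changes while genuine
    foreground objects leave large residuals.
    """
    def __init__(self, frames, k=5):
        F = frames.reshape(len(frames), -1).astype(np.float64)
        self.mean = F.mean(axis=0)
        _, _, Vt = np.linalg.svd(F - self.mean, full_matrices=False)
        self.basis = Vt[:k]                       # (k, H*W)

    def foreground_mask(self, frame, thresh=30.0):
        f = frame.reshape(-1).astype(np.float64) - self.mean
        recon = self.basis.T @ (self.basis @ f)   # subspace projection
        return (np.abs(f - recon) > thresh).reshape(frame.shape)
```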

Julien Pilet, Christoph Strecha, Pascal Fua
Closed-Form Solution to Non-rigid 3D Surface Registration

We present a closed-form solution to the problem of recovering the 3D shape of a non-rigid inelastic surface from 3D-to-2D correspondences. This lets us detect and reconstruct such a surface by matching individual images against a reference configuration, which is in contrast to all existing approaches that require initial shape estimates and track deformations from image to image.

We represent the surface as a mesh, and write the constraints provided by the correspondences as a linear system whose solution we express as a weighted sum of eigenvectors. Obtaining the weights then amounts to solving a set of quadratic equations accounting for inextensibility constraints between neighboring mesh vertices. Since available closed-form solutions to quadratic systems fail when there are too many variables, we reduce the number of unknowns by expressing the deformations as a linear combination of modes. The overall closed-form solution then becomes tractable even for complex deformations that require many modes.

Mathieu Salzmann, Francesc Moreno-Noguer, Vincent Lepetit, Pascal Fua
Implementing Decision Trees and Forests on a GPU

We describe a method for implementing the evaluation and training of decision trees and forests entirely on a GPU, and show how this method can be used in the context of object recognition.

Our strategy for evaluation involves mapping the data structure describing a decision forest to a 2D texture array. We navigate through the forest for each point of the input data in parallel using an efficient, non-branching pixel shader. For training, we compute the responses of the training data to a set of candidate features, and scatter the responses into a suitable histogram using a vertex shader. The histograms thus computed can be used in conjunction with a broad range of tree learning algorithms.
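
The texture-array layout can be mimicked on a CPU to see why evaluation parallelizes so well: each tree becomes a set of flat arrays walked with index arithmetic rather than pointer chasing. A sketch of that data layout (illustrative only; the actual pixel-shader code differs):

```python
import numpy as np

def eval_forest(feature_idx, threshold, left_child, leaf_value, x):
    """Evaluate a decision forest stored as flat (trees, nodes) arrays,
    the CPU analogue of a 2D-texture layout.

    Per node i of tree t: split on x[feature_idx[t, i]] < threshold[t, i];
    the children sit at left_child[t, i] and left_child[t, i] + 1, and
    left_child[t, i] == -1 marks a leaf predicting leaf_value[t, i].
    On a GPU this loop's branch is replaced by pure index arithmetic.
    """
    votes = []
    for t in range(feature_idx.shape[0]):
        i = 0
        while left_child[t, i] != -1:
            go_right = x[feature_idx[t, i]] >= threshold[t, i]
            i = left_child[t, i] + int(go_right)
        votes.append(leaf_value[t, i])
    return np.mean(votes)
```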

We demonstrate results for object recognition that are identical to those obtained on a CPU, but computed in about 1% of the time.

To our knowledge, this is the first time a method has been proposed which is capable of evaluating or training decision trees on a GPU. Our method leverages the full parallelism of the GPU.

Although we use features common to computer vision to demonstrate object recognition, our framework can accommodate other kinds of features for more general utility within computer science.

Toby Sharp
General Imaging Geometry for Central Catadioptric Cameras

Catadioptric cameras are a popular type of omnidirectional imaging system. Their imaging and multi-view geometry has been extensively studied; epipolar geometry, for instance, is, geometrically speaking, well understood. However, the existence of a bilinear matching constraint and an associated fundamental matrix has so far only been shown for the special case of para-catadioptric cameras (consisting of a paraboloidal mirror and an orthographic camera). The main goal of this work is to obtain such results for all central catadioptric cameras. Our main result is to show the existence of a general 15×15 fundamental matrix. This is based on and complemented by a number of other results, e.g. the formulation of general catadioptric projection matrices and plane homographies.

Peter Sturm, João P. Barreto
Estimating Radiometric Response Functions from Image Noise Variance

We propose a method for estimating radiometric response functions from the observation of image noise variance, rather than from the profile of its distribution. The relationship between radiance intensity and noise variance is affine, but due to the non-linearity of response functions, this affinity is not maintained in the observation domain. We use the non-affine relationship between the observed intensity and noise variance to estimate radiometric response functions. In addition, we theoretically derive how the response function alters the intensity-variance relationship. Since our method uses noise variance as input, it is fundamentally robust against noise. Unlike prior approaches, our method does not require images taken with different and known exposures. Real-world experiments demonstrate the effectiveness of our method.
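
The affine intensity-variance relation is easy to probe empirically: from a burst of images of a static scene, per-intensity noise variance should fit a straight line under a linear response, and systematic curvature reflects the non-linearity the method exploits. A toy measurement sketch (an illustration of the relationship, not the authors' estimator):

```python
import numpy as np

def intensity_variance_curve(images, n_bins=64):
    """Empirical noise variance as a function of observed intensity,
    computed per pixel from a burst of images of a static scene."""
    stack = np.stack([im.ravel() for im in images]).astype(np.float64)
    mean, var = stack.mean(axis=0), stack.var(axis=0)
    bins = np.linspace(mean.min(), mean.max(), n_bins + 1)
    which = np.clip(np.digitize(mean, bins) - 1, 0, n_bins - 1)
    centers = 0.5 * (bins[:-1] + bins[1:])
    curve = np.array([var[which == b].mean() if np.any(which == b)
                      else np.nan for b in range(n_bins)])
    # A straight-line curve indicates a linear radiometric response;
    # curvature reveals the non-linearity of the response function.
    return centers, curve
```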

Jun Takamatsu, Yasuyuki Matsushita, Katsushi Ikeuchi
Solving Image Registration Problems Using Interior Point Methods

This paper describes a novel approach to recovering a parametric deformation that optimally registers one image to another. The method proceeds by constructing a global convex approximation to the match function which can be optimized using interior point methods. The paper also describes how one can exploit the structure of the resulting optimization problem to develop efficient and effective matching algorithms. Results obtained by applying the proposed scheme to a variety of images are presented.

Camillo Jose Taylor, Arvind Bhusnurmath
3D Face Model Fitting for Recognition

This paper presents an automatic, efficient method to fit a statistical deformation model of the human face to 3D scan data. In a global-to-local fitting scheme, the shape parameters of the model are optimized so that the produced model instance accurately fits the 3D scan data of the input face. To increase the expressiveness of the model and produce a tighter fit, our method fits a set of predefined face components and blends these components afterwards. Quantitative evaluation shows that the fitting results improve when multiple components are used instead of one. Compared to existing methods, our fully automatic method achieves higher fitting accuracy. The accurately generated face instances are manifold meshes without noise and holes, and can be used effectively for 3D face recognition: we achieve 97.5% correct identification for 876 queries on the UND set of 3D faces. Our results show that contour-curve-based face matching outperforms landmark-based face matching.

Frank B. ter Haar, Remco C. Veltkamp
A Multi-scale Vector Spline Method for Estimating the Fluids Motion on Satellite Images

Satellite image sequences visualize important patterns of atmospheric and oceanographic circulation, so assessing motion from these data has strong potential for improving the performance of forecast models. Representing a vector field by a vector spline has proven efficient for fluid motion assessment: the vector spline formulation makes it possible to select in advance the locations where the conservation equation must hold, and it efficiently implements the second-order div-curl regularity advocated for turbulent fluids. The scientific contribution of this article is to formulate vector splines in a multiscale scheme, with the double objective of assessing motion even in the case of large displacements and capturing the spectrum of spatial scales associated with turbulent flows. The proposed method requires only the inversion of a band matrix, which is performed by an efficient numerical scheme that makes the method tractable for large satellite image sequences.
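For reference, the second-order div-curl regularity mentioned here is commonly written as follows (a standard form in our notation, not necessarily the authors' exact functional; w denotes the motion field):

\[
R(\mathbf{w}) \;=\; \int \bigl\| \nabla\,\operatorname{div}\,\mathbf{w} \bigr\|^2 + \bigl\| \nabla\,\operatorname{curl}\,\mathbf{w} \bigr\|^2 \, d\mathbf{x},
\]

which penalizes spatial variation of divergence and vorticity rather than the flow gradients themselves, and is therefore well suited to turbulent fluids.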

Till Isambert, Jean-Paul Berroir, Isabelle Herlin
Continuous Energy Minimization Via Repeated Binary Fusion

Variational problems, which are commonly used to solve low-level vision tasks, are typically minimized via a local, iterative optimization strategy, e.g. gradient descent. Since every iteration is restricted to a small, local improvement, the overall convergence can be slow and the algorithm may get stuck in an undesirable local minimum. In this paper, we propose to approximate the minimization by solving a series of binary subproblems to facilitate large optimization moves. The proposed method can be interpreted as an extension of discrete graph-cut based methods such as α-expansion or LogCut to a spatially continuous setting. In order to demonstrate the viability of the approach, we evaluated the novel optimization strategy in the context of optical flow estimation, yielding excellent results on the Middlebury optical flow datasets.

Werner Trobin, Thomas Pock, Daniel Cremers, Horst Bischof
Unified Crowd Segmentation

This paper presents a unified approach to crowd segmentation. A global solution is generated using an Expectation Maximization framework. Initially, a head and shoulder detector is used to nominate an exhaustive set of person locations and these form the person hypotheses. The image is then partitioned into a grid of small patches which are each assigned to one of the person hypotheses. A key idea of this paper is that while whole body monolithic person detectors can fail due to occlusion, a partial response to such a detector can be used to evaluate the likelihood of a single patch being assigned to a hypothesis. This captures local appearance information without having to learn specific appearance models. The likelihood of a pair of patches being assigned to a person hypothesis is evaluated based on low level image features such as uniform motion fields and color constancy. During the E-step, the single and pairwise likelihoods are used to compute a globally optimal set of assignments of patches to hypotheses. In the M-step, parameters which enforce global consistency of assignments are estimated. This can be viewed as a form of occlusion reasoning. The final assignment of patches to hypotheses constitutes a segmentation of the crowd. The resulting system provides a global solution that does not require background modeling and is robust with respect to clutter and partial occlusion.

Peter Tu, Thomas Sebastian, Gianfranco Doretto, Nils Krahnstoever, Jens Rittscher, Ting Yu
Quick Shift and Kernel Methods for Mode Seeking

We show that the complexity of the recently introduced medoid-shift algorithm for clustering N points is O(N²), with a small constant, if the underlying distance is Euclidean. This makes medoid shift considerably faster than mean shift, contrary to what was previously believed. We then exploit kernel methods to extend both mean shift and the improved medoid shift to a large family of distances, with complexity bounded by the effective rank of the resulting kernel matrix and with explicit regularization constraints. Finally, we show that, under certain conditions, medoid shift fails to cluster data points belonging to the same mode, resulting in over-fragmentation. We remedy this problem by introducing a novel, simple and extremely efficient clustering algorithm, called quick shift, that explicitly trades off under- and over-fragmentation. Like medoid shift, quick shift operates in non-Euclidean spaces in a straightforward manner. We also show that the accelerated medoid shift can be used to initialize mean shift for increased efficiency. We apply our algorithms to clustering data on manifolds, image segmentation, and the automatic discovery of visual categories.
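A minimal O(N²) sketch of the quick-shift idea under a Gaussian Parzen density estimate (the parameter names sigma and tau are ours, and this toy version omits the kernelized and accelerated variants):

import numpy as np

# Toy quick shift: link each point to its nearest neighbor of higher density;
# cutting links longer than tau yields the clusters, so tau explicitly trades
# off under- vs. over-fragmentation.
def quick_shift(X, sigma=1.0, tau=3.0):
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)     # pairwise squared distances
    density = np.exp(-d2 / (2.0 * sigma ** 2)).sum(axis=1)  # Parzen density estimate
    parent = np.arange(n)
    for i in range(n):
        higher = np.flatnonzero(density > density[i])
        if higher.size:
            j = higher[np.argmin(d2[i, higher])]             # nearest higher-density point
            if d2[i, j] <= tau ** 2:
                parent[i] = j                                # link; otherwise i stays a root
    labels = parent.copy()                                   # follow links up to the roots
    while not np.array_equal(labels, labels[labels]):
        labels = labels[labels]
    return labels                                            # equal labels = same cluster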

Andrea Vedaldi, Stefano Soatto
A Fast Algorithm for Creating a Compact and Discriminative Visual Codebook

In patch-based object recognition, using a compact visual codebook can boost computational efficiency and reduce memory cost. Nevertheless, compared with a large-sized codebook, it also risks the loss of discriminative power. Moreover, creating a compact visual codebook can be very time-consuming, especially when the number of initial visual words is large. In this paper, to minimize the loss of discriminative power, we propose an approach that builds a compact visual codebook by maximally preserving the separability of the object classes. Furthermore, we design a fast algorithm to accomplish this task efficiently, which can hierarchically merge 10,000 visual words down to 2 in ninety seconds. Experimental study shows that the compact visual codebook created in this way can achieve excellent classification performance even after a considerable reduction in size.
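To illustrate the flavor of hierarchical word merging, here is a naive O(k³) sketch that uses Jensen-Shannon similarity of class-conditional distributions as a stand-in merging criterion; the paper's contribution is precisely a fast algorithm, which this sketch does not reproduce.

import numpy as np

# Naive sketch: repeatedly merge the two visual words whose class-conditional
# distributions are most alike, a simple proxy for preserving class
# separability (not the authors' exact criterion or speed).
def merge_codebook(word_class_counts, target_size):
    counts = [row.astype(float) + 1e-9 for row in word_class_counts]  # smoothed
    while len(counts) > target_size:
        best, pair = np.inf, None
        for i in range(len(counts)):
            for j in range(i + 1, len(counts)):
                p = counts[i] / counts[i].sum()
                q = counts[j] / counts[j].sum()
                m = 0.5 * (p + q)
                # Jensen-Shannon divergence between the two class distributions
                js = 0.5 * (p * np.log(p / m)).sum() + 0.5 * (q * np.log(q / m)).sum()
                if js < best:
                    best, pair = js, (i, j)
        i, j = pair
        counts[i] = counts[i] + counts[j]    # merge word j into word i
        del counts[j]
    return counts                            # class profiles of the merged words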

Lei Wang, Luping Zhou, Chunhua Shen
A Dynamic Conditional Random Field Model for Joint Labeling of Object and Scene Classes

Object detection and pixel-wise scene labeling have both been active research areas in recent years and impressive results have been reported for both tasks separately. The integration of these different types of approaches should boost performance for both tasks as object detection can profit from powerful scene labeling and also pixel-wise scene labeling can profit from powerful object detection. Consequently, first approaches have been proposed that aim to integrate both object detection and scene labeling in one framework. This paper proposes a novel approach based on conditional random field (CRF) models that extends existing work by 1) formulating the integration as a joint labeling problem of object and scene classes and 2) by systematically integrating dynamic information for the object detection task as well as for the scene labeling task. As a result, the approach is applicable to highly dynamic scenes including both fast camera and object movements. Experiments show the applicability of the novel approach to challenging real-world video sequences and systematically analyze the contribution of different system components to the overall performance.

Christian Wojek, Bernt Schiele
Local Regularization for Multiclass Classification Facing Significant Intraclass Variations

We propose a new local learning scheme that is based on the principle of decisiveness: the learned classifier is expected to exhibit large variability in the direction of the test example. We show how this principle leads to optimization functions in which the regularization term is modified, rather than the empirical loss term as in most local learning schemes. We combine this local learning method with a Canonical Correlation Analysis based classification method, which is shown to be similar to multiclass LDA. Finally, we show that the classification function can be computed efficiently by reusing the results of previous computations. In a variety of experiments on new and existing data sets, we demonstrate the effectiveness of the CCA based classification method compared to SVM and Nearest Neighbor classifiers, and show that the newly proposed local learning method improves it even further, and outperforms conventional local learning schemes.

Lior Wolf, Yoni Donner
Saliency Based Opportunistic Search for Object Part Extraction and Labeling

We study the task of object part extraction and labeling, which seeks to understand objects beyond simply identifying their bounding boxes. We start from bottom-up segmentation of images and search for correspondences between object parts in a few shape models and segments in images. Segments comprising different object parts in the image are usually not equally salient due to uneven contrast, illumination conditions, clutter, occlusion and pose changes. Moreover, object parts may have different scales, and some parts are only distinctive and recognizable at a large scale. Therefore, we utilize a multi-scale shape representation of objects and their parts, figural contextual information of the whole object, and semantic contextual information for parts. Instead of searching over a large segmentation space, we present a saliency based opportunistic search framework to explore bottom-up segmentation by gradually expanding and bounding the search domain. We tested our approach on a challenging statue face dataset and 3 human face datasets. Results show that our approach significantly outperforms Active Shape Models while using far fewer exemplars. Our framework can be applied to other object categories.

Yang Wu, Qihui Zhu, Jianbo Shi, Nanning Zheng
Stereo Matching: An Outlier Confidence Approach

One of the major challenges in stereo matching is to handle partial occlusions. In this paper, we introduce the Outlier Confidence (OC), which dynamically measures how likely a pixel is to be occluded. The occlusion information is then softly incorporated into our model. A global optimization is applied to robustly estimate the disparities for both the occluded and non-occluded pixels. Compared to color segmentation with plane fitting, which globally partitions the image, our OC model locally infers the possible disparity values for the outlier pixels using a reliable color sample refinement scheme. Experiments on the Middlebury dataset show that the proposed two-frame stereo matching method performs satisfactorily on the stereo images.

Li Xu, Jiaya Jia
Improving Shape Retrieval by Learning Graph Transduction

Shape retrieval/matching is a very important topic in computer vision. Recent progress in this domain has been mostly driven by designing smart features that provide better similarity measures between pairs of shapes. In this paper, we provide a new perspective on this problem by considering the existing shapes as a group and studying their similarity measures to the query shape in a graph structure. Our method is general and can be built on top of any existing shape matching algorithm. It learns a better metric through graph transduction by propagating the model through existing shapes, in a way similar to computing geodesics in the shape manifold. However, the proposed method does not require learning the shape manifold explicitly, nor does it require knowing any class labels of existing shapes. The presented experimental results demonstrate that the proposed approach yields significant improvements over state-of-the-art shape matching algorithms. We obtain a retrieval rate of 91% on the MPEG-7 data set, the highest ever reported in the literature.
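A generic label-propagation sketch of graph transduction on a shape-similarity graph (this is the standard diffusion update, not necessarily the authors' exact iteration; W, y and alpha are our names):

import numpy as np

# Diffuse query similarities along the graph of database shapes, repeatedly
# blending the propagated scores with the initial query similarities.
def transduce(W, y, alpha=0.99, n_iter=100):
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))                # symmetric normalization D^-1/2 W D^-1/2
    f = y.astype(float).copy()
    for _ in range(n_iter):
        f = alpha * (S @ f) + (1.0 - alpha) * y    # propagate, then re-anchor to the query
    return f                                       # refined query-to-shape similarities

Here W holds pairwise shape similarities from any base matcher and y the base similarities of the query to each database shape, which is what lets such a scheme sit on top of an arbitrary matching algorithm.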

Xingwei Yang, Xiang Bai, Longin Jan Latecki, Zhuowen Tu
Cat Head Detection - How to Effectively Exploit Shape and Texture Features

In this paper, we focus on the problem of detecting the head of cat-like animals, adopting the cat as a test case. We show that the performance depends crucially on how effectively the shape and texture features are utilized jointly. Specifically, we propose a two-step approach for cat head detection. In the first step, we train two individual detectors on two training sets: one training set is normalized to emphasize the shape features and the other is normalized to underscore the texture features. In the second step, we train a joint shape and texture fusion classifier to make the final decision. We demonstrate that a significant improvement can be obtained by our two-step approach. In addition, we propose a set of novel features based on oriented gradients, which outperforms existing leading features, e.g., Haar, HoG, and EoH. We evaluate our approach on a well-labeled cat head data set with 10,000 images and on PASCAL 2007 cat data.

Weiwei Zhang, Jian Sun, Xiaoou Tang
Motion Context: A New Representation for Human Action Recognition

One of the key challenges in human action recognition from video sequences is how to model an action sufficiently. Therefore, in this paper we propose a novel motion-based representation called Motion Context (MC), which is insensitive to the scale and direction of an action, by employing image representation techniques. An MC captures the distribution of motion words (MWs) over relative locations in a local region of the motion image (MI) around a reference point, and thus summarizes the local motion information in a rich 3D MC descriptor. In this way, any human action can be represented as a 3D descriptor by summing up all the MC descriptors of the action. For action recognition, we propose four different recognition configurations: MW+pLSA, MW+SVM, MC+w³-pLSA (a new direct graphical model extending pLSA), and MC+SVM. We test our approach on two human action video datasets, from KTH and the Weizmann Institute of Science (WIS), and our performances are quite promising. On the KTH dataset, the proposed MC representation achieves the highest performance using the proposed w³-pLSA. On the WIS dataset, the best performance of the proposed MC is comparable to the state of the art.

Ziming Zhang, Yiqun Hu, Syin Chan, Liang-Tien Chia

Active Reconstruction

Temporal Dithering of Illumination for Fast Active Vision

Active vision techniques use programmable light sources, such as projectors, whose intensities can be controlled over space and time. We present a broad framework for fast active vision using Digital Light Processing (DLP) projectors. The digital micromirror device (DMD) in a DLP projector is capable of switching mirrors "on" and "off" at high speed (10⁶ times per second). An off-the-shelf DLP projector, however, effectively operates at much lower rates (30-60 Hz) by emitting smaller intensities that are integrated over time by a sensor (eye or camera) to produce the desired brightness value. Our key idea is to exploit this "temporal dithering" of illumination, as observed by a high-speed camera. The dithering encodes each brightness value uniquely and may be used in conjunction with virtually any active vision technique. We apply our approach to five well-known problems: (a) structured light-based range finding, (b) photometric stereo, (c) illumination de-multiplexing, (d) high frequency preserving motion blur, and (e) separation of direct and global scene components, achieving significant speedups in performance. In all our methods, the projector receives a single image as input whereas the camera acquires a sequence of frames.

Srinivasa G. Narasimhan, Sanjeev J. Koppal, Shuntaro Yamazaki
Compressive Structured Light for Recovering Inhomogeneous Participating Media

We propose a new method named compressive structured light for recovering inhomogeneous participating media. Whereas conventional structured light methods emit coded light patterns onto the surface of an opaque object to establish correspondence for triangulation, compressive structured light projects patterns into a volume of participating medium to produce images which are integral measurements of the volume density along the line of sight. For a typical participating medium encountered in the real world, the integral nature of the acquired images enables the use of compressive sensing techniques that can recover the entire volume density from only a few measurements. This makes the acquisition process more efficient and enables reconstruction of dynamic volumetric phenomena. Moreover, our method requires the projection of multiplexed coded illumination, which has the added advantage of increasing the signal-to-noise ratio of the acquisition. Finally, we propose an iterative algorithm to correct for the attenuation of the participating medium during the reconstruction process. We show the effectiveness of our method with simulations as well as experiments on the volumetric recovery of multiple translucent layers, 3D point clouds etched in glass, and the dynamic process of milk drops dissolving in water.
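The sparse recovery step can be sketched with plain iterative soft-thresholding (ISTA) for min ½‖Ax − y‖² + λ‖x‖₁ (a generic stand-in solver; the attenuation correction is omitted and all names are ours):

import numpy as np

# ISTA sketch: recover the volume density x from integral measurements
# y = A x, where A encodes the coded light patterns and m << n.
def ista(A, y, lam=0.1, n_iter=200):
    L = np.linalg.norm(A, 2) ** 2            # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        g = A.T @ (A @ x - y)                # gradient of the data term 0.5*||Ax - y||^2
        z = x - g / L
        x = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)   # soft-threshold step
    return x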

Jinwei Gu, Shree Nayar, Eitan Grinspun, Peter Belhumeur, Ravi Ramamoorthi
Passive Reflectometry

Different materials reflect light in different ways, so reflectance is a useful surface descriptor. Existing systems for measuring reflectance are cumbersome, however, and although the process can be streamlined using cameras, projectors and clever catadioptrics, it generally requires complex infrastructure. In this paper we propose a simpler method for inferring reflectance from images, one that eliminates the need for active lighting and exploits natural illumination instead. The method’s distinguishing property is its ability to handle a broad class of isotropic reflectance functions, including those that are neither radially-symmetric nor well-represented by low-parameter reflectance models. The key to the approach is a bi-variate representation of isotropic reflectance that enables a tractable inference algorithm while maintaining generality. The resulting method requires only a camera, a light probe, and as little as one HDR image of a known, curved, homogeneous surface.

Fabiano Romeiro, Yuriy Vasilyev, Todd Zickler
Fusion of Feature- and Area-Based Information for Urban Buildings Modeling from Aerial Imagery

Accurate and realistic building models of urban environments are increasingly important for applications like virtual tourism or city planning. Initiatives like Virtual Earth or Google Earth aim at offering virtual models of all major cities worldwide. The prohibitively high cost of manual generation of such models explains the need for an automatic workflow.

This paper proposes an algorithm for fully automatic building reconstruction from aerial images. Sparse line features delineating height discontinuities and dense depth data providing the roof surface are combined in an innovative manner with a global optimization algorithm based on Graph Cuts. The fusion process exploits the advantages of both information sources and thus yields superior reconstruction results compared to the individual sources. The nature of the algorithm also makes it possible to elegantly generate image-driven levels of detail of the geometry.

The algorithm is applied to a number of real world data sets encompassing thousands of buildings. The results are analyzed in detail and extensively evaluated using ground truth data.

Lukas Zebedin, Joachim Bauer, Konrad Karner, Horst Bischof
Backmatter
Metadata
Title
Computer Vision – ECCV 2008
Edited by
David Forsyth
Philip Torr
Andrew Zisserman
Copyright year
2008
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-88693-8
Print ISBN
978-3-540-88692-1
DOI
https://doi.org/10.1007/978-3-540-88693-8