2008 | Book

Computer Vision – ECCV 2008

10th European Conference on Computer Vision, Marseille, France, October 12-18, 2008, Proceedings, Part I

Edited by: David Forsyth, Philip Torr, Andrew Zisserman

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science

About this book

Welcome to the 2008 European Conference on Computer Vision. These proceedings are the result of a great deal of hard work by many people. To produce them, a total of 871 papers were reviewed. Forty were selected for oral presentation and 203 were selected for poster presentation, yielding acceptance rates of 4.6% for oral, 23.3% for poster, and 27.9% in total.

We applied three principles. First, since we had a strong group of Area Chairs, the final decisions to accept or reject a paper rested with the Area Chair, who would be informed by reviews and could act only in consensus with another Area Chair. Second, we felt that authors were entitled to a summary that explained how the Area Chair reached a decision for a paper. Third, we were very careful to avoid conflicts of interest.

Each paper was assigned to an Area Chair by the Program Chairs, and each Area Chair received a pool of about 25 papers. The Area Chairs then identified and ranked appropriate reviewers for each paper in their pool, and a constrained optimization allocated three reviewers to each paper. We are very proud that every paper received at least three reviews.

At this point, authors were able to respond to reviews. The Area Chairs then needed to reach a decision. We used a series of procedures to ensure careful review and to avoid conflicts of interest. Program Chairs did not submit papers. The Area Chairs were divided into three groups so that no Area Chair in the group was in conflict with any paper assigned to any Area Chair in the group.
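The acceptance rates quoted in the preface can be checked directly from the stated counts. The short sketch below only redoes that arithmetic; the function and variable names are illustrative and not part of the proceedings.

```python
# Counts taken from the preface: papers reviewed, orals, posters.
SUBMITTED = 871
ORAL = 40
POSTER = 203


def acceptance_rate(accepted: int, submitted: int) -> float:
    """Return the acceptance rate as a percentage."""
    return 100.0 * accepted / submitted


rates = {
    "oral": acceptance_rate(ORAL, SUBMITTED),
    "poster": acceptance_rate(POSTER, SUBMITTED),
    "total": acceptance_rate(ORAL + POSTER, SUBMITTED),
}

for name, rate in rates.items():
    # Rounded to one decimal place, as in the preface.
    print(f"{name}: {rate:.1f}%")
```

Rounded to one decimal place, the three values agree with the 4.6%, 23.3%, and 27.9% given in the preface.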

Table of contents

Frontmatter

Lecture by Prof. Jan Koenderink

Something Old, Something New, Something Borrowed, Something Blue

My first paper of a “Computer Vision” signature (on invariants related to optic flow) dates from 1975. I have published in Computer Vision (next to work in cybernetics, psychology, physics, mathematics and philosophy) till my retirement earlier this year (hence the slightly blue feeling), thus my career roughly covers the history of the field. “Vision” has diverse connotations. The fundamental dichotomy is between “optically guided action” and “visual experience”. The former applies to much of biology and computer vision and involves only concepts from science and engineering (e.g., “inverse optics”), the latter involves intention and meaning and thus additionally involves concepts from psychology and philosophy. David Marr’s notion of “vision” is an uneasy blend of the two: On the one hand the goal is to create a “representation of the scene in front of the eye” (involving intention and meaning), on the other hand the means by which this is attempted are essentially “inverse optics”. Although this has nominally become something of the “Standard Model” of CV, it is actually incoherent. It is the latter notion of “vision” that has always interested me most, mainly because one is still grappling with basic concepts. It has been my aspiration to turn it into science, although in this I failed. Yet much has happened (something old) and is happening now (something new). I will discuss some of the issues that seem crucial to me, mostly illustrated through my own work, though I shamelessly borrow from friends in the CV community where I see fit.

Jan J. Koenderink

Recognition

Learning to Localize Objects with Structured Output Regression

Sliding window classifiers are among the most successful and widely applied techniques for object localization. However, training is typically done in a way that is not specific to the localization task. First a binary classifier is trained using a sample of positive and negative examples, and this classifier is subsequently applied to multiple regions within test images. We propose instead to treat object localization in a principled way by posing it as a problem of predicting structured data: we model the problem not as binary classification, but as the prediction of the bounding box of objects located in images. The use of a joint-kernel framework allows us to formulate the training procedure as a generalization of an SVM, which can be solved efficiently. We further improve computational efficiency by using a branch-and-bound strategy for localization during both training and testing. Experimental evaluation on the PASCAL VOC and TU Darmstadt datasets shows that the structured training procedure improves performance over binary training as well as the best previously published scores.

Matthew B. Blaschko, Christoph H. Lampert
Beyond Nouns: Exploiting Prepositions and Comparative Adjectives for Learning Visual Classifiers

Learning visual classifiers for object recognition from weakly labeled data requires determining correspondence between image regions and semantic object classes. Most approaches use co-occurrence of “nouns” and image features over large datasets to determine the correspondence, but many correspondence ambiguities remain. We further constrain the correspondence problem by exploiting additional language constructs to improve the learning process from weakly labeled data. We consider both “prepositions” and “comparative adjectives” which are used to express relationships between objects. If the models of such relationships can be determined, they help resolve correspondence ambiguities. However, learning models of these relationships requires solving the correspondence problem. We simultaneously learn the visual features defining “nouns” and the differential visual features defining such “binary-relationships” using an EM-based approach.

Abhinav Gupta, Larry S. Davis
Learning Spatial Context: Using Stuff to Find Things

The sliding window approach of detecting rigid objects (such as cars) is predicated on the belief that the object can be identified from the appearance in a small region around the object. Other types of objects of amorphous spatial extent (e.g., trees, sky), however, are more naturally classified based on texture or color. In this paper, we seek to combine recognition of these two types of objects into a system that leverages “context” toward improving detection. In particular, we cluster image regions based on their ability to serve as context for the detection of objects. Rather than providing an explicit training set with region labels, our method automatically groups regions based on both their appearance and their relationships to the detections in the image. We show that our things and stuff (TAS) context model produces meaningful clusters that are readily interpretable, and helps improve our detection ability over state-of-the-art detectors. We also present a method for learning the active set of relationships for a particular dataset. We present results on object detection in images from the PASCAL VOC 2005/2006 datasets and on the task of overhead car detection in satellite images, demonstrating significant improvements over state-of-the-art detectors.

Geremy Heitz, Daphne Koller
Segmentation and Recognition Using Structure from Motion Point Clouds

We propose an algorithm for semantic segmentation based on 3D point clouds derived from ego-motion. We motivate five simple cues designed to model specific patterns of motion and 3D world structure that vary with object category. We introduce features that project the 3D cues back to the 2D image plane while modeling spatial layout and context. A randomized decision forest combines many such features to achieve a coherent 2D segmentation and recognize the object categories present. Our main contribution is to show how semantic segmentation is possible based solely on motion-derived 3D world structure. Our method works well on sparse, noisy point clouds, and unlike existing approaches, does not need appearance-based descriptors.

Experiments were performed on a challenging new video database containing sequences filmed from a moving car in daylight and at dusk. The results confirm that indeed, accurate segmentation and recognition are possible using only motion and 3D world structure. Further, we show that the motion-derived information complements an existing state-of-the-art appearance-based method, improving both qualitative and quantitative performance.

Gabriel J. Brostow, Jamie Shotton, Julien Fauqueur, Roberto Cipolla

Poster Session I

Keypoint Signatures for Fast Learning and Recognition

Statistical learning techniques have been used to dramatically speed up keypoint matching by training a classifier to recognize a specific set of keypoints. However, the training itself is usually relatively slow and performed offline. Although methods have recently been proposed to train the classifier online, they can only learn a very limited number of new keypoints. This represents a handicap for real-time applications, such as Simultaneous Localization and Mapping (SLAM), which require incremental addition of arbitrary numbers of keypoints as they become visible.

In this paper, we overcome this limitation and propose a descriptor that can be learned online fast enough to handle virtually unlimited numbers of keypoints. It relies on the fact that if we train a Randomized Tree classifier to recognize a number of keypoints extracted from an image database, all other keypoints can be characterized in terms of their response to these classification trees. This signature is fast to compute and has a discriminative power that is comparable to that of the much slower SIFT descriptor.

Michael Calonder, Vincent Lepetit, Pascal Fua
Active Matching

In the matching tasks which form an integral part of all types of tracking and geometrical vision, there are invariably priors available on the absolute and/or relative image locations of features of interest. Usually, these priors are used post-hoc in the process of resolving feature matches and obtaining final scene estimates, via ‘first get candidate matches, then resolve’ consensus algorithms such as RANSAC. In this paper we show that the dramatically different approach of using priors dynamically to guide a feature by feature matching search can achieve global matching with far fewer image processing operations and lower overall computational cost. Essentially, we put image processing into the loop of the search for global consensus. In particular, our approach is able to cope with significant image ambiguity thanks to a dynamic mixture of Gaussians treatment. In our fully Bayesian algorithm, the choice of the most efficient search action at each step is guided intuitively and rigorously by expected Shannon information gain. We demonstrate the algorithm in feature matching as part of a sequential SLAM system for 3D camera tracking. Robust, real-time matching can be achieved even in the previously unmanageable case of jerky, rapid motion necessitating weak motion modelling and large search regions.

Margarita Chli, Andrew J. Davison
Towards Scalable Dataset Construction: An Active Learning Approach

As computer vision research considers more object categories and greater variation within object categories, it is clear that larger and more exhaustive datasets are necessary. However, the process of collecting such datasets is laborious and monotonous. We consider the setting in which many images have been automatically collected for a visual category (typically by automatic internet search), and we must separate relevant images from noise. We present a discriminative learning process which employs active, online learning to quickly classify many images with minimal user input. The principal advantage of this work over previous endeavors is its scalability. We demonstrate precision which is often superior to the state-of-the-art, with scalability which exceeds previous work.

Brendan Collins, Jia Deng, Kai Li, Li Fei-Fei
GeoS: Geodesic Image Segmentation

This paper presents GeoS, a new algorithm for the efficient segmentation of n-dimensional image and video data.

The segmentation problem is cast as approximate energy minimization in a conditional random field. A new, parallel filtering operator built upon efficient geodesic distance computation is used to propose a set of spatially smooth, contrast-sensitive segmentation hypotheses. An economical search algorithm finds the solution with minimum energy within a sensible and highly restricted subset of all possible labellings.

Advantages include: i) computational efficiency with high segmentation accuracy; ii) the ability to estimate an approximation to the posterior over segmentations; iii) the ability to handle generally complex energy models. Comparison with max-flow indicates up to 60 times greater computational efficiency as well as greater memory efficiency.

GeoS is validated quantitatively and qualitatively by thorough comparative experiments on existing and novel ground-truth data. Numerous results on interactive and automatic segmentation of photographs, video and volumetric medical image data are presented.

Antonio Criminisi, Toby Sharp, Andrew Blake
Simultaneous Motion Detection and Background Reconstruction with a Mixed-State Conditional Markov Random Field

We consider the problem of motion detection by background subtraction. An accurate estimation of the background is only possible if we locate the moving objects; meanwhile, correct motion detection is achieved only if we have a good background model available. This work proposes a new direction in the way such problems are considered. The main idea is to formulate this class of problem as a single joint decision-estimation step. The goal is to exploit the way two processes interact, even if they are of a dissimilar nature (symbolic-continuous), by means of a recently introduced framework called mixed-state Markov random fields. In this paper, we describe the theory behind this novel statistical framework, which subsequently allows us to formulate the specific joint problem of motion detection and background reconstruction. Experiments on real sequences and comparisons with existing methods give significant support to our approach. Further implications for video sequence inpainting are also discussed.

Tomás Crivelli, Gwenaelle Piriou, Patrick Bouthemy, Bruno Cernuschi-Frías, Jian-feng Yao
Semidefinite Programming Heuristics for Surface Reconstruction Ambiguities

We consider the problem of reconstructing a smooth surface under constraints that have discrete ambiguities. These problems arise in areas such as shape from texture, shape from shading, photometric stereo and shape from defocus. While the problem is computationally hard, heuristics based on semidefinite programming may reveal the shape of the surface.

Ady Ecker, Allan D. Jepson, Kiriakos N. Kutulakos
Robust Optimal Pose Estimation

We study the problem of estimating the position and orientation of a calibrated camera from an image of a known scene. A common problem in camera pose estimation is the existence of false correspondences between image features and modeled 3D points. Existing techniques such as RANSAC to handle outliers have no guarantee of optimality. In contrast, we work with a natural extension of the $L_\infty$ norm to the outlier case. Using a simple result from classical geometry, we derive necessary conditions for $L_\infty$ optimality and show how to use them in a branch and bound setting to find the optimum and to detect outliers. The algorithm has been evaluated on synthetic as well as real data showing good empirical performance. In addition, for cases with no outliers, we demonstrate shorter execution times than existing optimal algorithms.

Olof Enqvist, Fredrik Kahl
Learning to Recognize Activities from the Wrong View Point

Appearance features are good at discriminating activities in a fixed view, but behave poorly when aspect is changed. We describe a method to build features that are highly stable under change of aspect. It is not necessary to have multiple views to extract our features. Our features make it possible to learn a discriminative model of activity in one view, and spot that activity in another view, for which one might possess no labeled examples at all. Our construction uses labeled examples to build activity models, and unlabeled, but corresponding, examples to build an implicit model of how appearance changes with aspect. We demonstrate our method with challenging sequences of real human motion, where discriminative methods built on appearance alone fail badly.

Ali Farhadi, Mostafa Kamali Tabrizi
Joint Parametric and Non-parametric Curve Evolution for Medical Image Segmentation

This paper proposes a new joint parametric and nonparametric curve evolution algorithm of the level set functions for medical image segmentation. Traditional level set algorithms employ non-parametric curve evolution for object matching. Although they match image boundaries accurately, they often suffer from local minima and generate incorrect segmentation of object shapes, especially for images with noise, occlusion and low contrast. On the other hand, statistical model-based segmentation methods allow parametric object shape variations subject to some shape prior constraints, and they are more robust in dealing with noise and low contrast. In this paper, we combine the advantages of both of these methods and jointly use parametric and non-parametric curve evolution in object matching. Our new joint curve evolution algorithm is as robust as the parametric methods that use shape prior information, while yielding more accurate segmentation results. Comparative results on segmenting ventricle frontal horn and putamen shapes in MR brain images confirm both robustness and accuracy of the proposed joint curve evolution algorithm.

Mahshid Farzinfar, Zhong Xue, Eam Khwang Teoh
Localizing Objects with Smart Dictionaries

We present an approach to determine the category and location of objects in images. It performs very fast categorization of each pixel in an image, a brute-force approach made feasible by three key developments: First, our method reduces the size of a large generic dictionary (on the order of ten thousand words) to the low hundreds while increasing classification performance compared to $k$-means. This is achieved by creating a discriminative dictionary tailored to the task by following the information bottleneck principle. Second, we perform feature-based categorization efficiently on a dense grid by extending the concept of integral images to the computation of local histograms. Third, we compute SIFT descriptors densely in linear time. We compare our method to the state of the art and find that it excels in accuracy and simplicity, performing better while assuming less.
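The idea of extending integral images to local histograms can be illustrated with a small sketch. This is a generic integral-histogram construction written under our own assumptions (a 2D grid of dictionary-word labels, hypothetical function names), not the authors' implementation: after one pass over the image, the histogram of any rectangular window is obtained from four table lookups per bin.

```python
from typing import List


def integral_histogram(labels: List[List[int]], n_bins: int) -> List[List[List[int]]]:
    """Build ih where ih[y][x][b] counts label b in rows < y and columns < x."""
    height, width = len(labels), len(labels[0])
    ih = [[[0] * n_bins for _ in range(width + 1)] for _ in range(height + 1)]
    for y in range(height):
        for x in range(width):
            cell = ih[y + 1][x + 1]
            for b in range(n_bins):
                # Standard integral-image recurrence, applied per bin.
                cell[b] = ih[y][x + 1][b] + ih[y + 1][x][b] - ih[y][x][b]
            cell[labels[y][x]] += 1
    return ih


def window_histogram(ih, top: int, left: int, bottom: int, right: int) -> List[int]:
    """Histogram of rows top..bottom-1 and columns left..right-1 in O(n_bins)."""
    n_bins = len(ih[0][0])
    return [
        ih[bottom][right][b] - ih[top][right][b] - ih[bottom][left][b] + ih[top][left][b]
        for b in range(n_bins)
    ]


# Hypothetical 3x4 map of per-pixel dictionary-word labels (3 words).
labels = [
    [0, 1, 1, 2],
    [0, 0, 1, 2],
    [2, 2, 1, 0],
]
ih = integral_histogram(labels, n_bins=3)
local = window_histogram(ih, top=0, left=1, bottom=2, right=3)
whole = window_histogram(ih, top=0, left=0, bottom=3, right=4)
```

The cost of a window query does not depend on the window size, which is what makes dense, per-pixel histogram classification affordable.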

Brian Fulkerson, Andrea Vedaldi, Stefano Soatto
Weakly Supervised Object Localization with Stable Segmentations

Multiple Instance Learning (MIL) provides a framework for training a discriminative classifier from data with ambiguous labels. This framework is well suited for the task of learning object classifiers from weakly labeled image data, where only the presence of an object in an image is known, but not its location. Some recent work has explored the application of MIL algorithms to the tasks of image categorization and natural scene classification. In this paper we extend these ideas in a framework that uses MIL to recognize and localize objects in images. To achieve this we employ state of the art image descriptors and multiple stable segmentations. These components, combined with a powerful MIL algorithm, form our object recognition system called MILSS. We show highly competitive object categorization results on the Caltech dataset. To evaluate the performance of our algorithm further, we introduce the challenging Landmarks-18 dataset, a collection of photographs of famous landmarks from around the world. The results on this new dataset show the great potential of our proposed algorithm.

Carolina Galleguillos, Boris Babenko, Andrew Rabinovich, Serge Belongie
A Perceptual Comparison of Distance Measures for Color Constancy Algorithms

Color constancy is the ability to measure image features independent of the color of the scene illuminant and is an important topic in color and computer vision. As many color constancy algorithms exist, different distance measures are used to compute their accuracy. In general, these distance measures are based on mathematical principles such as the angular error and Euclidean distance. However, it is unknown to what extent these distance measures correlate with human vision.

Therefore, in this paper, a taxonomy of different distance measures for color constancy algorithms is presented. The main goal is to analyze the correlation between the observed quality of the output images and the different distance measures for illuminant estimates. The output images are the resulting color corrected images using the illuminant estimates of the color constancy algorithms, and the quality of these images is determined by human observers. The distance measures are analyzed for how well they mimic the differences in color naturalness of images as judged by humans.

Based on the theoretical and experimental results on spectral and real-world data sets, it can be concluded that the perceptual Euclidean distance (PED) with weight-coefficients ($w_R = 0.26$, $w_G = 0.70$, $w_B = 0.04$) finds its roots in human vision and correlates significantly higher than all other distance measures including the angular error and Euclidean distance.
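To make the comparison of distance measures concrete, the sketch below computes an angular error and a weighted Euclidean distance using the weight-coefficients quoted above. The weighted form (square root of the weighted sum of squared channel differences) is our assumption about how such a perceptual distance is typically written; the abstract does not spell out the formula, and the illuminant triples are hypothetical.

```python
import math

# Weight-coefficients quoted in the abstract, in (R, G, B) order.
PED_WEIGHTS = (0.26, 0.70, 0.04)


def angular_error_deg(estimate, truth):
    """Angle in degrees between two illuminant vectors."""
    dot = sum(e * t for e, t in zip(estimate, truth))
    norms = math.sqrt(sum(e * e for e in estimate)) * math.sqrt(sum(t * t for t in truth))
    cosine = max(-1.0, min(1.0, dot / norms))  # clamp against rounding error
    return math.degrees(math.acos(cosine))


def weighted_euclidean(estimate, truth, weights=PED_WEIGHTS):
    """Assumed PED form: sqrt(sum_c w_c * (estimate_c - truth_c)^2)."""
    return math.sqrt(sum(w * (e - t) ** 2 for w, e, t in zip(weights, estimate, truth)))


# Hypothetical illuminant estimates: the same offset applied to G or to B.
truth = (1 / 3, 1 / 3, 1 / 3)
green_off = (1 / 3, 1 / 3 + 0.1, 1 / 3)
blue_off = (1 / 3, 1 / 3, 1 / 3 + 0.1)

ang_green = angular_error_deg(green_off, truth)
ang_blue = angular_error_deg(blue_off, truth)
ped_green = weighted_euclidean(green_off, truth)
ped_blue = weighted_euclidean(blue_off, truth)
```

In this example the angular error treats the two estimates identically, while the weighted distance penalizes the green-channel error far more than the blue-channel one, reflecting the much larger green weight.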

Arjan Gijsenij, Theo Gevers, Marcel P. Lucassen
Scale Invariant Action Recognition Using Compound Features Mined from Dense Spatio-temporal Corners

The use of sparse invariant features to recognise classes of actions or objects has become common in the literature. However, features are often “engineered” to be both sparse and invariant to transformation and it is assumed that they provide the greatest discriminative information. To tackle activity recognition, we propose learning compound features that are assembled from simple 2D corners in both space and time. Each corner is encoded in relation to its neighbours and from an overcomplete set (in excess of 1 million possible features), compound features are extracted using data mining. The final classifier, consisting of sets of compound features, can then be applied to recognise and localise an activity in real-time while providing superior performance to other state-of-the-art approaches (including those based upon sparse feature detectors). Furthermore, the approach requires only weak supervision in the form of class labels for each training sequence. No ground truth position or temporal alignment is required during training.

Andrew Gilbert, John Illingworth, Richard Bowden
Semi-supervised On-Line Boosting for Robust Tracking

Recently, on-line adaptation of binary classifiers for tracking has been investigated. On-line learning allows for simple classifiers since only the current view of the object needs to be discriminated from its surrounding background. However, on-line adaptation faces one key problem: each update of the tracker may introduce an error which, finally, can lead to tracking failure (drifting). The contribution of this paper is a novel on-line semi-supervised boosting method which significantly alleviates the drifting problem in tracking applications. This limits the drifting problem while still staying adaptive to appearance changes. The main idea is to formulate the update process in a semi-supervised fashion as a combined decision of a given prior and an on-line classifier. This comes without any parameter tuning. In the experiments, we demonstrate real-time tracking of our SemiBoost tracker on several challenging test sequences where our tracker outperforms other on-line tracking methods.

Helmut Grabner, Christian Leistner, Horst Bischof
Reformulating and Optimizing the Mumford-Shah Functional on a Graph — A Faster, Lower Energy Solution

Active contour formulations dominate current minimization of the Mumford-Shah functional (MSF) for image segmentation and filtering. Unfortunately, these formulations necessitate optimization of the contour by evolving via gradient descent, which is known for its sensitivity to initialization and the tendency to produce undesirable local minima. In order to reduce these problems, we reformulate the corresponding MSF on an arbitrary graph and apply combinatorial optimization to produce a fast, low-energy solution. The solution provided by this graph formulation is compared with the solution computed via traditional narrow-band level set methods. This comparison demonstrates that our graph formulation and optimization produce lower energy solutions than gradient descent based contour evolution methods in significantly less time. Finally, by avoiding evolution of the contour via gradient descent, we demonstrate that our optimization of the MSF is capable of evolving the contour with non-local movement.

Leo Grady, Christopher Alvino
Viewpoint Invariant Pedestrian Recognition with an Ensemble of Localized Features

Viewpoint invariant pedestrian recognition is an important yet under-addressed problem in computer vision. This is likely due to the difficulty in matching two objects with unknown viewpoint and pose. This paper presents a method of performing viewpoint invariant pedestrian recognition using an efficiently and intelligently designed object representation, the ensemble of localized features (ELF). Instead of designing a specific feature by hand to solve the problem, we define a feature space using our intuition about the problem and let a machine learning algorithm find the best representation. We show how both an object class specific representation and a discriminative recognition model can be learned using the AdaBoost algorithm. This approach allows many different kinds of simple features to be combined into a single similarity function. The method is evaluated using a viewpoint invariant pedestrian recognition dataset and the results are shown to be superior to all previous benchmarks for both recognition and reacquisition of pedestrians.

Douglas Gray, Hai Tao
Perspective Nonrigid Shape and Motion Recovery

We present a closed form solution to the nonrigid shape and motion (NRSM) problem from point correspondences in multiple perspective uncalibrated views. Under the assumption that the nonrigid object deforms as a linear combination of $K$ rigid shapes, we show that the NRSM problem can be viewed as a reconstruction problem from multiple projections from $\mathbb{P}^{3K}$ to $\mathbb{P}^{2}$. Therefore, one can linearly solve for the projection matrices by factorizing a multifocal tensor. However, this projective reconstruction in $\mathbb{P}^{3K}$ does not satisfy the constraints of the NRSM problem, because it is computed only up to a projective transformation in $\mathbb{P}^{3K}$. Our key contribution is to show that, by exploiting algebraic dependencies among the entries of the projection matrices, one can upgrade the projective reconstruction to determine the affine configuration of the points in $\mathbb{R}^{3}$, and the motion of the camera relative to their centroid. Moreover, if $K \geq 2$, then either by using calibrated cameras, or by assuming a camera with fixed internal parameters, it is possible to compute the Euclidean structure by a closed form method.

Richard Hartley, René Vidal
Shadows in Three-Source Photometric Stereo

Shadows are one of the most significant difficulties of the photometric stereo method. When four or more images are available, local surface orientation is overdetermined and the shadowed pixels can be discarded. In this paper we look at the challenging case when only three images under three different illuminations are available. In this case, when one of the three pixel intensity constraints is missing due to shadow, a 1 dof ambiguity per pixel arises. We show that using integrability one can resolve this ambiguity and use the remaining two constraints to reconstruct the geometry in the shadow regions. As the problem becomes ill-posed in the presence of noise, we describe a regularization scheme that improves the numerical performance of the algorithm while preserving data. We propose a simple MRF optimization scheme to identify and segment shadow regions in the image. Finally the paper describes how this theory applies in the framework of color photometric stereo where one is restricted to only three images. Experiments on synthetic and real image sequences are presented.

Carlos Hernández, George Vogiatzis, Roberto Cipolla
Hamming Embedding and Weak Geometric Consistency for Large Scale Image Search

This paper improves recent methods for large scale image search. State-of-the-art methods build on the bag-of-features image representation. We first analyze bag-of-features in the framework of approximate nearest neighbor search. This shows the sub-optimality of such a representation for matching descriptors and leads us to derive a more precise representation based on 1) Hamming embedding (HE) and 2) weak geometric consistency constraints (WGC). HE provides binary signatures that refine the matching based on visual words. WGC filters matching descriptors that are not consistent in terms of angle and scale. HE and WGC are integrated within the inverted file and are efficiently exploited for all images, even in the case of very large datasets. Experiments performed on a dataset of one million images show a significant improvement due to the binary signature and the weak geometric consistency constraints, as well as their efficiency. Estimation of the full geometric transformation, i.e., a re-ranking step on a short list of images, is complementary to our weak geometric consistency constraints and further improves the accuracy.
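The role of the binary signatures can be sketched as follows: two descriptors match only if they fall into the same visual word and their signatures are close in Hamming distance. This is a minimal illustration under our own assumptions; the 8-bit signatures, the threshold of 2, and the function names are hypothetical and not taken from the paper.

```python
from typing import Tuple

# A descriptor is represented here as (visual_word_id, binary_signature).
Descriptor = Tuple[int, int]


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which the two signatures differ."""
    return bin(a ^ b).count("1")


def he_match(query: Descriptor, candidate: Descriptor, threshold: int) -> bool:
    """Match if both share a visual word and the signatures are within threshold."""
    same_word = query[0] == candidate[0]
    return same_word and hamming_distance(query[1], candidate[1]) <= threshold


# Hypothetical descriptors with 8-bit signatures.
query = (7, 0b10110010)
near = (7, 0b10110110)       # same word, 1 bit differs
far = (7, 0b01001101)        # same word, all 8 bits differ
other_word = (9, 0b10110010)  # identical signature, different word

results = [he_match(query, c, threshold=2) for c in (near, far, other_word)]
```

A plain bag-of-features comparison would accept both `near` and `far` because they share a visual word; the signature test rejects `far`, which is the refinement the abstract describes.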

Herve Jegou, Matthijs Douze, Cordelia Schmid
Estimating Geo-temporal Location of Stationary Cameras Using Shadow Trajectories

Using only shadow trajectories of stationary objects in a scene, we demonstrate that a set of six or more photographs is sufficient to accurately calibrate the camera. Moreover, we present a novel application where, using only three points from the shadow trajectory of the objects, one can accurately determine the geo-location of the camera, up to a longitude ambiguity, and also the date of image acquisition without using any GPS or other special instruments. We refer to this as “geo-temporal localization”. We consider possible cases where ambiguities can be removed if additional information is available. Our method does not require any knowledge of the date or the time when the pictures are taken, and geo-temporal information is recovered directly from the images. We demonstrate the accuracy of our technique for both steps of calibration and geo-temporal localization using synthetic and real data.

Imran N. Junejo, Hassan Foroosh
An Experimental Comparison of Discrete and Continuous Shape Optimization Methods

Shape optimization arises in numerous computer vision tasks such as image segmentation and multiview reconstruction. In this paper, we focus on a certain class of binary labeling problems which can be globally optimized both in a spatially discrete setting and in a spatially continuous setting. The main contribution of this paper is to present a quantitative comparison of the reconstruction accuracy and computation times, which allows us to assess some of the strengths and limitations of both approaches. We also present a novel method to approximate length regularity in a graph cut based framework: instead of using pairwise terms we introduce higher order terms. These allow us to represent a more accurate discretization of the $L_2$-norm in the length term.

Maria Klodt, Thomas Schoenemann, Kalin Kolev, Marek Schikora, Daniel Cremers
Image Feature Extraction Using Gradient Local Auto-Correlations

In this paper, we propose a method for extracting image features which utilizes 2nd-order statistics, i.e., spatial and orientational auto-correlations of local gradients. It enables us to extract richer information from images and to obtain more discriminative power than standard histogram based methods. The image gradients are sparsely described in terms of magnitude and orientation. In addition, normal vectors on the image surface are derived from the gradients and these could also be utilized instead of the gradients. From a geometrical viewpoint, the method extracts information about not only the gradients but also the curvatures of the image surface. Experimental results for pedestrian detection and image patch matching demonstrate the effectiveness of the proposed method compared with other methods, such as HOG and SIFT.

Takumi Kobayashi, Nobuyuki Otsu
Analysis of Building Textures for Reconstructing Partially Occluded Facades

As part of an architectural modeling project, this paper investigates the problem of understanding and manipulating images of buildings. Our primary motivation is to automatically detect and seamlessly remove unwanted foreground elements from urban scenes. Without explicit handling, these objects will appear pasted as artifacts on the model. Recovering the building facade in a video sequence is relatively simple because parallax induces foreground/background depth layers, but here we consider static images only. We develop a series of methods that enable foreground removal from images of buildings or brick walls. The key insight is to use a priori knowledge about grid patterns on building facades that can be modeled as Near Regular Textures (NRT). We describe a Markov Random Field (MRF) model for such textures and introduce a Markov Chain Monte Carlo (MCMC) optimization procedure for discovering them. This simple spatial rule is then used as a starting point for inference of missing windows, facade segmentation, outlier identification, and foreground removal.

Thommen Korah, Christopher Rasmussen
Nonrigid Image Registration Using Dynamic Higher-Order MRF Model

In this paper, we propose a nonrigid registration method using the Markov Random Field (MRF) model with a higher-order spatial prior. The registration is formulated as finding a set of discrete displacement vectors on a deformable mesh, using the energy model defined by label sets relating to these vectors. This work provides two main ideas to improve the reliability and accuracy of the registration. First, we propose a new energy model which adopts a higher-order spatial prior for the smoothness cost. This model overcomes limitations of pairwise spatial priors, which cannot fully incorporate the natural smoothness of deformations. Next, we introduce a dynamic energy model to generate optimal displacements. This model works iteratively with the optimal data cost while the spatial prior preserves the smoothness cost of the previous iteration. For optimization, we convert the proposed model to a pairwise MRF model in order to apply tree-reweighted message passing (TRW). Concerning the complexity, we apply the decomposed scheme to reduce the label dimension of the proposed model and incorporate the linear constrained node (LCN) technique for efficient message passing. In experiments, we demonstrate the competitive performance of the proposed model compared with previous models, presenting both quantitative and qualitative analysis.

Dongjin Kwon, Kyong Joon Lee, Il Dong Yun, Sang Uk Lee
Tracking of Abrupt Motion Using Wang-Landau Monte Carlo Estimation

We propose a novel tracking algorithm based on the Wang-Landau Monte Carlo sampling method which efficiently deals with abrupt motions. Abrupt motions can cause conventional tracking methods to fail since they violate the motion smoothness constraint. To address this problem, we introduce the Wang-Landau algorithm, recently proposed in statistical physics, and integrate it into the Markov Chain Monte Carlo based tracking method. Our tracking method alleviates the motion smoothness constraint by utilizing both the likelihood term and the density of states term, which is estimated by the Wang-Landau algorithm. The likelihood term helps to improve the accuracy in tracking smooth motions, while the density of states term captures abrupt motions robustly. Experimental results reveal that our approach efficiently samples the object’s states even in a whole state space without loss of time. Therefore, it accurately and robustly tracks objects whose motion changes drastically.

Junseok Kwon, Kyoung Mu Lee
Surface Visibility Probabilities in 3D Cluttered Scenes

Many methods for 3D reconstruction in computer vision rely on probability models, for example, Bayesian reasoning. Here we introduce a probability model of surface visibilities in densely cluttered 3D scenes. The scenes consist of a large number of small surfaces distributed randomly in a 3D view volume. An example is the leaves or branches on a tree. We derive probabilities for surface visibility, instantaneous image velocity under egomotion, and binocular half-occlusions in these scenes. The probabilities depend on parameters such as scene depth, object size, 3D density, observer speed, and binocular baseline. We verify the correctness of our models using computer graphics simulations, and briefly discuss applications of the model to stereo and motion.

Michael S. Langer
A Generative Shape Regularization Model for Robust Face Alignment

In this paper, we present a robust face alignment system that is capable of dealing with exaggerated expressions, large occlusions, and a wide variety of image noise. The robustness comes from our shape regularization model, which incorporates a constrained nonlinear shape prior, geometric transformation, and likelihood of multiple candidate landmarks in a three-layered generative model. The inference algorithm iteratively examines the best candidate positions and updates face shape and pose. This model can effectively recover sufficient shape details from very noisy observations. We demonstrate the performance of this approach on two public domain databases and a large collection of real-world face photographs.

Leon Gu, Takeo Kanade
Modeling and Recognition of Landmark Image Collections Using Iconic Scene Graphs

This paper presents an approach for modeling landmark sites such as the Statue of Liberty based on large-scale contaminated image collections gathered from the Internet. Our system combines 2D appearance and 3D geometric constraints to efficiently extract scene summaries, build 3D models, and recognize instances of the landmark in new test images. We start by clustering images using low-dimensional global “gist” descriptors. Next, we perform geometric verification to retain only the clusters whose images share a common 3D structure. Each valid cluster is then represented by a single iconic view, and geometric relationships between iconic views are captured by an iconic scene graph. In addition to serving as a compact scene summary, this graph is used to guide structure from motion to efficiently produce 3D models of the different aspects of the landmark. The set of iconic images is also used for recognition, i.e., determining whether new test images contain the landmark. Results on three data sets consisting of tens of thousands of images demonstrate the potential of the proposed approach.

Xiaowei Li, Changchang Wu, Christopher Zach, Svetlana Lazebnik, Jan-Michael Frahm
VideoCut: Removing Irrelevant Frames by Discovering the Object of Interest

We propose a novel method for removing irrelevant frames from a video given user-provided frame-level labeling for a very small number of frames. We first hypothesize a number of candidate areas which possibly contain the object of interest, and then figure out which area(s) truly contain the object of interest. Our method enjoys several favorable properties. First, compared to approaches where a single descriptor is used to describe a whole frame, each area’s feature descriptor has the chance of genuinely describing the object of interest, hence it is less affected by background clutter. Second, by considering the temporal continuity of a video instead of treating the frames as independent, we can hypothesize the location of the candidate areas more accurately. Third, by infusing prior knowledge into the topic-motion model, we can precisely follow the trajectory of the object of interest. This allows us to largely reduce the number of candidate areas and hence reduce the chance of overfitting the data during learning. We demonstrate the effectiveness of the method by comparing it to several other semi-supervised learning approaches on challenging video clips.

David Liu, Gang Hua, Tsuhan Chen
ASN: Image Keypoint Detection from Adaptive Shape Neighborhood

We describe an accurate keypoint detector that is stable under viewpoint change. In this paper, keypoints correspond to actual junctions in the image. The principle of ASN differs fundamentally from other keypoint detectors. At each position in the image and before any detection, it systematically estimates the position of a potential junction from the local gradient field. Keypoints then appear where multiple position estimates are attracted. This approach allows the detector to adapt in shape and size to the image content. One further obtains the area where the keypoint has attracted solutions. Comparative results with other detectors show the improved accuracy and stability with viewpoint change.

Jean-Nicolas Ouellet, Patrick Hébert
Online Sparse Matrix Gaussian Process Regression and Vision Applications

We present a new Gaussian Process inference algorithm, called Online Sparse Matrix Gaussian Processes (OSMGP), and demonstrate its merits with a few vision applications. The OSMGP is based on the observation that for kernels with local support, the Gram matrix is typically sparse. Maintaining and updating the sparse Cholesky factor of the Gram matrix can be done efficiently using Givens rotations. This leads to an exact, online algorithm whose update time scales linearly with the size of the Gram matrix. Further, if approximate updates are permissible, the Cholesky factor can be maintained at a constant size using hyperbolic rotations to remove certain rows and columns corresponding to discarded training examples. We demonstrate that, using these matrix downdates, online hyperparameter estimation can be included without affecting the linear runtime complexity of the algorithm. The OSMGP algorithm is applied to head-pose estimation and visual tracking problems. Experimental results demonstrate that the proposed method is accurate, efficient and generalizes well using online learning.

Ananth Ranganathan, Ming-Hsuan Yang
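
The abstract above relies on Givens rotations to maintain a Cholesky factor. The sketch below shows only the elementary building block, a single 2x2 Givens rotation that zeroes one entry of a pair; it is not the OSMGP update itself, and the function names are chosen for this example.

```python
import math


def givens(a: float, b: float):
    """Return (c, s) such that [[c, s], [-s, c]] applied to (a, b) gives (r, 0)."""
    r = math.hypot(a, b)
    if r == 0.0:
        # Nothing to rotate; use the identity rotation.
        return 1.0, 0.0
    return a / r, b / r


def rotate(c: float, s: float, x: float, y: float):
    """Apply the rotation [[c, s], [-s, c]] to the pair (x, y)."""
    return c * x + s * y, -s * x + c * y


if __name__ == "__main__":
    c, s = givens(3.0, 4.0)
    # The second component is driven to (numerically) zero.
    print(rotate(c, s, 3.0, 4.0))
```

In a factor update, such rotations would be applied to pairs of rows so that each newly introduced off-triangular entry is eliminated in turn.
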
Multi-stage Contour Based Detection of Deformable Objects

We present an efficient multi-stage approach to detection of deformable objects in real, cluttered images given a single or a few hand-drawn examples as models. The method handles deformations of the object by first breaking the given model into segments at high curvature points. We allow bending at these points, as studies indicate that deformation typically happens at high curvature points. The broken segments are then scaled, rotated, deformed and searched independently in the gradient image. Point maps are generated for each segment that represent the locations of the matches for that segment. We then group k points from the point maps of k adjacent segments using a cost function that takes into account local scale variations as well as inter-segment orientations. These matched groups yield plausible locations for the objects. In the fine matching stage, the entire object contour in the localized regions is built from the k-segment groups and given a comprehensive score in a method that uses dynamic programming. An evaluation of our algorithm on a standard dataset yielded results that are better than published work on the same dataset. At the same time, we also evaluate our algorithm on additional images with considerable object deformations to verify the robustness of our method.

Saiprasad Ravishankar, Arpit Jain, Anurag Mittal
Brain Hallucination

In this paper, we investigate brain hallucination, or generating a high-resolution brain image from an input low-resolution image, with the help of another high-resolution brain image. Contrary to interpolation techniques, the reconstruction process is based on a physical model of image acquisition. Our contribution is a new regularization approach that uses an example-based framework integrating non-local similarity constraints to better handle repetitive structures and texture. The effectiveness of our approach is demonstrated by experiments on realistic Magnetic Resonance brain images, automatically generating high-quality hallucinated brain images from low-resolution input.

François Rousseau
Range Flow for Varying Illumination

In this paper, range flow estimation is extended to handle brightness changes in image data caused by inhomogeneous illumination. Standard range flow computes 3D velocity fields from range and intensity image sequences. To this end, it combines a depth change model and a brightness constancy model. In this contribution, the brightness constancy model is replaced by (1) a gradient constancy model, (2) a combination of gradient and brightness constancy constraints that has been used successfully for optical flow estimation in the literature, and (3) a physics-based brightness change model. Insensitivity to brightness changes can also be achieved by prefiltering the input intensity data. High-pass and homomorphic filtering are the best-known approaches from the literature. In performance tests, the well-known version and the novel versions of range flow estimation are therefore investigated on prefiltered and non-prefiltered data, using synthetic ground truth and real data from a botanical experiment.

Tobias Schuchert, Til Aach, Hanno Scharr
Some Objects Are More Equal Than Others: Measuring and Predicting Importance

We observe that everyday images contain dozens of objects, and that humans, in describing these images, give different priority to these objects. We argue that a goal of visual recognition is, therefore, not only to detect and classify objects but also to associate with each a level of priority which we call ‘importance’. We propose a definition of importance and show how this may be estimated reliably from data harvested from human observers. We conclude by showing that a first-order estimate of importance may be computed from a number of simple image region measurements and does not require access to image meaning.

Merrielle Spain, Pietro Perona
Robust Multiple Structures Estimation with J-Linkage

This paper tackles the problem of fitting multiple instances of a model to data corrupted by noise and outliers. The proposed solution is based on random sampling and conceptual data representation. Each point is represented with the characteristic function of the set of random models that fit the point. A tailored agglomerative clustering, called J-linkage, is used to group points belonging to the same model. The method does not require prior specification of the number of models, nor does it necessitate parameter tuning. Experimental results demonstrate the superior performance of the algorithm.

Roberto Toldo, Andrea Fusiello
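
The clustering idea in the abstract above can be sketched in a few lines. The abstract only states that each point is represented by the set of random models that fit it and that a tailored agglomerative clustering groups the points. The specific choices below are assumptions made for this illustration: Jaccard distance between preference sets, the intersection of preference sets as the representation of a merged cluster, and stopping once no two clusters share a model. All names are hypothetical.

```python
def jaccard_distance(a: frozenset, b: frozenset) -> float:
    """1 - |a & b| / |a | b|; defined as 1.0 when both sets are empty."""
    union = a | b
    if not union:
        return 1.0
    return 1.0 - len(a & b) / len(union)


def j_linkage_sketch(preference_sets):
    """Group point indices whose preference sets overlap (assumed merge rule).

    preference_sets[i] is the set of model ids that fit point i.
    """
    # Each cluster is a pair: (list of point indices, shared preference set).
    clusters = [([i], frozenset(p)) for i, p in enumerate(preference_sets)]
    while True:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = jaccard_distance(clusters[i][1], clusters[j][1])
                if d < 1.0 and (best is None or d < best[0]):
                    best = (d, i, j)
        if best is None:
            break  # every remaining pair is disjoint: stop merging
        _, i, j = best
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] & clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(points) for points, _ in clusters]


if __name__ == "__main__":
    # Points 0-2 agree on model 0, points 3-4 on model 3, point 5 fits nothing.
    prefs = [{0, 1}, {0, 1}, {0}, {2, 3}, {3}, set()]
    print(sorted(j_linkage_sketch(prefs)))
```

Note that the number of clusters is never supplied: it falls out of the stopping condition, which mirrors the abstract's claim that the number of models need not be specified in advance.
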
Human Activity Recognition with Metric Learning

This paper proposes a metric learning based approach for human activity recognition with two main objectives: (1) reject unfamiliar activities and (2) learn with few examples. We show that our approach outperforms all state-of-the-art methods on numerous standard datasets for the traditional action classification problem. Furthermore, we demonstrate that our method not only can accurately label activities but also can reject unseen activities and can learn from few examples with high accuracy. We finally show that our approach works well on noisy YouTube videos.

Du Tran, Alexander Sorokin
Shape Matching by Segmentation Averaging

We use segmentations to match images by shape. To address the unreliability of segmentations, we give a closed form approximation to an average over all segmentations. Our technique has many extensions, yielding new algorithms for tracking, object detection, segmentation, and edge-preserving smoothing. For segmentation, instead of a maximum a posteriori approach, we compute the “central” segmentation minimizing the average distance to all segmentations of an image. Our methods for segmentation and object detection perform competitively, and we also show promising results in tracking and edge-preserving smoothing.

Hongzhi Wang, John Oliensis
Search Space Reduction for MRF Stereo

We present an algorithm to reduce per-pixel search ranges for Markov Random Field-based stereo algorithms. Our algorithm is based on the intuitions that reliably matched pixels need less regularization in the energy minimization and that neighboring pixels should have similar disparity search ranges if their pixel values are similar. We propose a novel bi-labeling process that incorporates left-right consistency checks to classify pixels as reliable or unreliable. We then propagate the reliable disparities into unreliable regions to form a complete disparity map and construct per-pixel search ranges based on the difference between the disparity map after propagation and the one computed from a winner-take-all method. Experimental results evaluated on the Middlebury stereo benchmark show that our proposed algorithm is able to achieve a 77% average reduction rate while preserving satisfactory accuracy.

Liang Wang, Hailin Jin, Ruigang Yang
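
The abstract above mentions left-right consistency checks as one ingredient of its pixel classification. A minimal sketch of such a check on a single scanline follows. It assumes integer disparities, the common convention that a left-image pixel at column x with disparity d corresponds to column x - d in the right image, and a tolerance parameter `tol` introduced for this example. It illustrates only the generic check, not the paper's bi-labeling process.

```python
def left_right_consistent(disp_left, disp_right, tol=0):
    """Flag each left-image pixel of one scanline as consistent or not.

    disp_left[x]  : integer disparity of column x in the left image
    disp_right[x] : integer disparity of column x in the right image
    A pixel is consistent when its match lies inside the image and the two
    disparity estimates agree to within tol.
    """
    width = len(disp_left)
    flags = []
    for x, d in enumerate(disp_left):
        xr = x - d  # assumed matching column in the right image
        ok = 0 <= xr < width and abs(d - disp_right[xr]) <= tol
        flags.append(ok)
    return flags


if __name__ == "__main__":
    left = [0, 1, 1, 3, 2]
    right = [0, 1, 1, 0, 0]
    print(left_right_consistent(left, right))
```

Pixels flagged False here would be candidates for the "unreliable" label, to be filled in later by propagation from reliable neighbors.
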
Estimating 3D Face Model and Facial Deformation from a Single Image Based on Expression Manifold Optimization

Facial expression modeling is central to facial expression recognition and expression synthesis for facial animation. Previous works reported that modeling the facial expression with a low-dimensional manifold is more appropriate than using a linear subspace. In this paper, we propose a manifold-based 3D face reconstruction approach to estimating the 3D face model and the associated expression deformation from a single face image. In the training phase, we build a nonlinear 3D expression manifold from a large set of 3D facial expression models to represent the facial shape deformations due to facial expressions. Then a Gaussian mixture model in this manifold is learned to represent the distribution of expression deformation. By combining the merits of the morphable neutral face model and the low-dimensional expression manifold, we propose a new algorithm to reconstruct the 3D face geometry as well as the 3D shape deformation from a single face image with expression in an energy minimization framework. Experimental results on the CMU-PIE image database and the FG-Net video database are shown to validate the effectiveness and accuracy of the proposed algorithm.

Shu-Fan Wang, Shang-Hong Lai
3D Face Recognition by Local Shape Difference Boosting

A new approach, called Collective Shape Difference Classifier (CSDC), is proposed to improve the accuracy and computational efficiency of 3D face recognition. The CSDC learns the most discriminative local areas from the Pure Shape Difference Map (PSDM) and trains them as weak classifiers for assembling a collective strong classifier using the real-boosting approach. The PSDM is established between two 3D face models aligned by a posture normalization procedure based on facial features. The model alignment is self-dependent, which avoids registering the probe face against every different gallery face during the recognition, so that a high computational speed is obtained. The experiments, carried out on the FRGC v2 and BU-3DFE databases, yield rank-1 recognition rates better than 98%. Each recognition against a gallery with 1000 faces only needs about 3.05 seconds. These two experimental results together with the high performance recognition on partial faces demonstrate that our algorithm is not only effective but also efficient.

Yueming Wang, Xiaoou Tang, Jianzhuang Liu, Gang Pan, Rong Xiao
Efficiently Learning Random Fields for Stereo Vision with Sparse Message Passing

As richer models for stereo vision are constructed, there is a growing interest in learning model parameters. To estimate parameters in Markov Random Field (MRF) based stereo formulations, one usually needs to perform approximate probabilistic inference. Message passing algorithms based on variational methods and belief propagation are widely used for approximate inference in MRFs. Conditional Random Fields (CRFs) are discriminative versions of traditional MRFs and have recently been applied to the problem of stereo vision. However, CRF parameter training typically requires expensive inference steps for each iteration of optimization. Inference is particularly slow when there are many discrete disparity levels, due to high state space cardinality. We present a novel CRF for stereo matching with an explicit occlusion model and propose sparse message passing to dramatically accelerate the approximate inference needed for parameter optimization. We show that sparse variational message passing iteratively minimizes the KL divergence between the approximation and model distributions by optimizing a lower bound on the partition function. Our experimental results show reductions in inference time of one order of magnitude with no loss in approximation quality. Learning using sparse variational message passing improves results over prior work using graph cuts.

Jerod J. Weinman, Lam Tran, Christopher J. Pal
Recovering Light Directions and Camera Poses from a Single Sphere

This paper introduces a novel method for recovering both the light directions and camera poses from a single sphere. Traditional methods for estimating light directions using spheres either assume that both the radius and center of the sphere are known precisely, or they depend on multiple calibrated views to recover these parameters. It will be shown in this paper that the light directions can be uniquely determined from the specular highlights observed in a single view of a sphere without knowing or recovering the exact radius and center of the sphere. Moreover, if the sphere is observed by multiple cameras, its images will uniquely define the translation vector of each camera from a common world origin centered at the sphere center. It will be shown that the relative rotations between the cameras can be recovered using two or more light directions estimated from each view. Closed form solutions for recovering the light directions and camera poses are presented, and experimental results on both synthetic and real data show the practicality of the proposed method.

Kwan-Yee K. Wong, Dirk Schnieders, Shuda Li
Tracking with Dynamic Hidden-State Shape Models

Hidden State Shape Models (HSSMs) were previously proposed to represent and detect objects in images that exhibit not just deformation of their shape but also variation in their structure. In this paper, we introduce Dynamic Hidden-State Shape Models (DHSSMs) to track and recognize the non-rigid motion of such objects, for example, human hands. Our recursive Bayesian filtering method, called DP-Tracking, combines an exhaustive local search for a match between image features and model states with a dynamic programming approach to find a global registration between the model and the object in the image. Our contribution is a technique to exploit the hierarchical structure of the dynamic programming approach that on average considerably speeds up the search for matches. We also propose to embed an online learning approach into the tracking mechanism that updates the DHSSM dynamically. The learning approach ensures that the DHSSM accurately represents the tracked object and distinguishes any clutter potentially present in the image. Our experiments show that our method can recognize the digits of a hand while the fingers are being moved and curled to various degrees. The method is robust to various illumination conditions, the presence of clutter, occlusions, and some types of self-occlusions. The experiments demonstrate a significant improvement in both efficiency and accuracy of recognition compared to the non-recursive way of frame-by-frame detection.

Zheng Wu, Margrit Betke, Jingbin Wang, Vassilis Athitsos, Stan Sclaroff
Interactive Tracking of 2D Generic Objects with Spacetime Optimization

We present a continuous optimization framework for interactive tracking of 2D generic objects in a single video stream. The user begins with specifying the locations of a target object in a small set of keyframes; the system then automatically tracks locations of the objects by combining user constraints with visual measurements across the entire sequence. We formulate the problem in a spacetime optimization framework that optimizes over the whole sequence simultaneously. The resulting solution is consistent with visual measurements across the entire sequence while satisfying user constraints. We also introduce prior terms to reduce tracking ambiguity. We demonstrate the power of our algorithm on tracking objects with significant occlusions, scale and orientation changes, illumination changes, sudden movement of objects, and also simultaneous tracking of multiple objects. We compare the performance of our algorithm with alternative methods.

Xiaolin K. Wei, Jinxiang Chai
A Segmentation Based Variational Model for Accurate Optical Flow Estimation

Segmentation has gained in popularity in stereo matching. However, it is not trivial to incorporate it in optical flow estimation due to the possible non-rigid motion problem. In this paper, we describe a new optical flow scheme containing three phases. First, we partition the input images and integrate the segmentation information into a variational model where each of the segments is constrained by an affine motion. Then the errors brought in by segmentation are measured and stored in a confidence map. The final flow estimation is achieved through a global optimization phase that minimizes an energy function incorporating the confidence map. Extensive experiments show that the proposed method not only produces quantitatively accurate optical flow estimates but also preserves sharp motion boundaries, which makes the optical flow result usable in a number of computer vision applications, such as image/video segmentation and editing.

Li Xu, Jianing Chen, Jiaya Jia
Similarity Features for Facial Event Analysis

Each facial event gives rise to complex facial appearance variation. In this paper, we propose similarity features to describe the facial appearance for video-based facial event analysis. Inspired by the kernel features, for each sample, we compare it with the reference set with a similarity function, and we take the log-weighted summarization of the similarities as its similarity feature. Due to the distinctness of the apex images of facial events, we use their cluster centers as the references. In order to capture the temporal dynamics, we use the K-means algorithm to divide the similarity features into several clusters in the temporal domain, and each cluster is modeled by a Gaussian distribution. Based on the Gaussian models, we further map the similarity features into dynamic binary patterns to handle the issue of time resolution, which embed the time-warping operation implicitly. The Haar-like descriptor is used to extract the visual features of facial appearance, and AdaBoost is performed to learn the final classifiers. Extensive experiments carried out on the Cohn-Kanade database show the promising performance of the proposed method.

Peng Yang, Qingshan Liu, Dimitris Metaxas
Building a Compact Relevant Sample Coverage for Relevance Feedback in Content-Based Image Retrieval

Conventional approaches to relevance feedback in content-based image retrieval are based on the assumption that relevant images are physically close to the query image, or the query regions can be identified by a set of clustering centers. However, semantically related images are often scattered across the visual space. It is not always reliable that the refined query point or the clustering centers are capable of representing a complex query region.

In this work, we propose a novel relevance feedback approach which directly aims at extracting a set of samples to represent the query region, regardless of its underlying shape. The sample set extracted by our method is competent as well as compact for subsequent retrieval. Moreover, we integrate feature re-weighting in the process to estimate the importance of each image descriptor. Unlike most existing relevance feedback approaches in which all query points share the same feature weight distribution, our method re-weights the feature importance for each relevant image individually, so that the representative and discriminative ability for all the images can be maximized. Experimental results on two databases show the effectiveness of our approach.

Bangpeng Yao, Haizhou Ai, Shihong Lao
Discriminative Learning for Deformable Shape Segmentation: A Comparative Study

We present a comparative study on how to use discriminative learning methods such as classification, regression, and ranking to address deformable shape segmentation. Traditional generative models and energy minimization methods suffer from local minima. By casting the segmentation into a discriminative framework, the target fitting function can be steered to possess a desired shape for ease of optimization yet better characterize the relationship between shape and appearance. To address the high-dimensional learning challenge present in the learning framework, we use a multi-level approach to learning discriminative models. Our experimental results on left ventricle segmentation from ultrasound images and facial feature point localization demonstrate that the discriminative models outperform generative models and energy minimization methods by a large margin.

Jingdan Zhang, Shaohua Kevin Zhou, Dorin Comaniciu, Leonard McMillan
Discriminative Locality Alignment

Fisher’s linear discriminant analysis (LDA), one of the most popular dimensionality reduction algorithms for classification, has three particular problems: it fails to find the nonlinear structure hidden in the high dimensional data; it assumes all samples contribute equivalently to reduce dimension for classification; and it suffers from the matrix singularity problem. In this paper, we propose a new algorithm, termed Discriminative Locality Alignment (DLA), to deal with these problems. The algorithm operates in the following three stages: first, in part optimization, discriminative information is imposed over patches, each of which is associated with one sample and its neighbors; then, in sample weighting, each part optimization is weighted by the margin degree, a measure of the importance of a given sample; and finally, in whole alignment, the alignment trick is used to align all weighted part optimizations to the whole optimization. Furthermore, DLA is extended to the semi-supervised case, i.e., semi-supervised DLA (SDLA), which utilizes unlabeled samples to improve the classification performance. Thorough empirical studies on face recognition demonstrate the effectiveness of both DLA and SDLA.

Tianhao Zhang, Dacheng Tao, Jie Yang

Stereo

Efficient Dense Scene Flow from Sparse or Dense Stereo Data

This paper presents a technique for estimating the three-dimensional velocity vector field that describes the motion of each visible scene point (scene flow). The technique presented uses two consecutive image pairs from a stereo sequence. The main contribution is to decouple the position and velocity estimation steps, and to estimate dense velocities using a variational approach. We enforce the scene flow to yield consistent displacement vectors in the left and right images. The decoupling strategy has two main advantages: Firstly, we are independent in choosing a disparity estimation technique, which can yield either sparse or dense correspondences, and secondly, we can achieve frame rates of 5 fps on standard consumer hardware. The approach provides dense velocity estimates with accurate results at distances up to 50 meters.

Andreas Wedel, Clemens Rabe, Tobi Vaudrey, Thomas Brox, Uwe Franke, Daniel Cremers
Integration of Multiview Stereo and Silhouettes Via Convex Functionals on Convex Domains

We propose a convex framework for silhouette and stereo fusion in 3D reconstruction from multiple images. The key idea is to show that the reconstruction problem can be cast as one of minimizing a convex functional where the exact silhouette consistency is imposed as a convex constraint that restricts the domain of admissible functions. As a consequence, we can retain the original stereo-weighted surface area as a cost functional without heuristic modifications by balloon terms or other strategies, yet still obtain meaningful (nonempty) global minimizers. Compared to previous methods, the introduced approach does not depend on initialization and leads to a more robust numerical scheme by removing the bias near the visual hull boundary. We propose an efficient parallel implementation of this convex optimization problem on a graphics card. Based on a photoconsistency map and a set of image silhouettes we are therefore able to compute highly-accurate and silhouette-consistent reconstructions for challenging real-world data sets in less than one minute.

Kalin Kolev, Daniel Cremers
Using Multiple Hypotheses to Improve Depth-Maps for Multi-View Stereo

We propose an algorithm to improve the quality of depth-maps used for Multi-View Stereo (MVS). Many existing MVS techniques make use of a two-stage approach which estimates depth-maps from neighbouring images and then merges them to extract a final surface. Often the depth-maps used for the merging stage will contain outliers due to errors in the matching process. Traditional systems exploit redundancy in the image sequence (the surface is seen in many views) in order to make the final surface estimate robust to these outliers. In the case of sparse data sets there is often insufficient redundancy and thus performance degrades as the number of images decreases. In order to improve performance in these circumstances it is necessary to remove the outliers from the depth-maps. We identify the two main sources of outliers in a top performing algorithm: (1) spurious matches due to repeated texture and (2) matching failure due to occlusion, distortion and lack of texture. We propose two contributions to tackle these failure modes. Firstly, we store multiple depth hypotheses and use a spatial consistency constraint to extract the true depth. Secondly, we allow the algorithm to return an unknown state when a true depth estimate cannot be found. By combining these in a discrete label MRF optimisation we are able to obtain high accuracy depth-maps with low numbers of outliers. We evaluate our algorithm in a multi-view stereo framework and find that it achieves state-of-the-art performance on a par with the leading techniques, in particular on the standard sparse evaluation data sets.
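
The idea of choosing among several depth hypotheses per pixel, with an extra unknown label, can be illustrated with a deliberately simplified sketch. It is not the authors' method: the abstract describes an MRF over the image, whereas this example uses a 1D chain of pixels solved exactly by dynamic programming. The function names (`pairwise`, `select_depths`) and all cost values (`trunc`, `unknown_penalty`, `unknown_cost`) are assumptions made for illustration only.

```python
UNKNOWN = None  # label returned when no trustworthy depth exists


def pairwise(a, b, trunc=1.0, unknown_penalty=0.5):
    """Smoothness cost between neighbouring labels (assumed form)."""
    if a is UNKNOWN and b is UNKNOWN:
        return 0.0
    if a is UNKNOWN or b is UNKNOWN:
        return unknown_penalty
    return min(abs(a - b), trunc)  # truncated depth difference


def select_depths(hypotheses, unknown_cost=0.6):
    """Pick one label per pixel along a 1D chain.

    hypotheses: one list per pixel of (depth, matching_cost) pairs.
    Every pixel also gets the UNKNOWN label at a fixed cost.
    """
    labels = [list(h) + [(UNKNOWN, unknown_cost)] for h in hypotheses]

    # Forward pass: best accumulated cost for each label of each pixel.
    cost = [c for _, c in labels[0]]
    back = []
    for i in range(1, len(labels)):
        prev = labels[i - 1]
        new_cost, ptr = [], []
        for depth, unary in labels[i]:
            k = min(range(len(prev)),
                    key=lambda j: cost[j] + pairwise(prev[j][0], depth))
            new_cost.append(cost[k] + pairwise(prev[k][0], depth) + unary)
            ptr.append(k)
        back.append(ptr)
        cost = new_cost

    # Backward pass: follow the stored pointers from the cheapest end label.
    k = min(range(len(cost)), key=cost.__getitem__)
    result = [labels[-1][k][0]]
    for i in range(len(labels) - 1, 0, -1):
        k = back[i - 1][k]
        result.append(labels[i - 1][k][0])
    result.reverse()
    return result


# Pixel 1 has a spurious low-cost match at depth 5.0 (repeated texture);
# pixel 3 has only a poor, inconsistent match.
hyps = [
    [(2.0, 0.1)],
    [(2.1, 0.3), (5.0, 0.2)],
    [(2.2, 0.1)],
    [(9.0, 0.9)],
]
print(select_depths(hyps))
```

In this toy input the spatially consistent hypothesis at pixel 1 is preferred over the cheaper but isolated one, and the last pixel falls back to the unknown label because its only candidate is both poorly matched and inconsistent with its neighbour.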

Neill D. F. Campbell, George Vogiatzis, Carlos Hernández, Roberto Cipolla
Sparse Structures in L-Infinity Norm Minimization for Structure and Motion Reconstruction

This paper presents a study on how to numerically solve the feasibility test problem which is the core of the bisection algorithm for minimizing the L∞ error functions. We consider a strategy that minimizes the maximum infeasibility. The minimization can be performed using several numerical computation methods, among which the barrier method and the primal-dual method are examined. In both of the methods, the inequalities are sequentially approximated by log-barrier functions. An initial feasible solution is found easily by the construction of the feasibility problem, and Newton-style update computes the optimal solution iteratively. When we apply the methods to the problem of estimating the structure and motion, every Newton update requires solving a very large system of linear equations. We show that the sparse bundle-adjustment technique, previously developed for structure and motion estimation, can be utilized during the Newton update. In the primal-dual interior-point method, in contrast to the barrier method, the sparse structure is all destroyed due to an extra constraint introduced for finding an initial solution. However, we show that this problem can be overcome by utilizing the matrix inversion lemma which allows us to exploit the sparsity in the same manner as in the barrier method. We finally show that the sparsity appears in both of the L∞ formulations - linear programming and second-order cone programming.
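
The bisection scheme mentioned in the abstract repeatedly asks a feasibility question: for a given error bound, is there a solution whose largest residual stays within that bound? The sketch below shows that outer loop on a toy problem, estimating a single scalar from noisy observations under the L∞ norm. It is an illustration under stated assumptions, not the paper's solver: the feasibility test here is a closed-form interval intersection, whereas the paper solves it with barrier and primal-dual interior-point methods on linear or second-order cone programs. The names `feasible` and `linf_bisection` are hypothetical.

```python
def feasible(gamma, observations):
    """Feasibility test: is there an x with |x - b| <= gamma for every b?

    Each observation constrains x to [b - gamma, b + gamma]; the test
    succeeds when these intervals share a point. Returns one such x or None.
    """
    lo = max(b - gamma for b in observations)
    hi = min(b + gamma for b in observations)
    if lo <= hi:
        return 0.5 * (lo + hi)
    return None


def linf_bisection(observations, tol=1e-9):
    """Minimise max_i |x - b_i| by bisection on the error bound gamma."""
    low = 0.0
    high = max(observations) - min(observations)  # always feasible
    x_best = feasible(high, observations)
    while high - low > tol:
        mid = 0.5 * (low + high)
        x = feasible(mid, observations)
        if x is not None:
            high, x_best = mid, x   # bound is achievable: tighten it
        else:
            low = mid               # bound is too small: relax it
    return x_best, high


x, gamma = linf_bisection([1.0, 2.0, 5.0])
print(x, gamma)
```

For the observations 1, 2 and 5 the minimiser is the midrange value 3 with a maximum residual of 2, which the loop approaches to within the tolerance.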

Yongduek Seo, Hyunjung Lee, Sang Wook Lee
Backmatter
Metadata
Title
Computer Vision – ECCV 2008
Edited by
David Forsyth
Philip Torr
Andrew Zisserman
Copyright Year
2008
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-88682-2
Print ISBN
978-3-540-88681-5
DOI
https://doi.org/10.1007/978-3-540-88682-2