
2012 | Book

Computer Vision – ECCV 2012

12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part V

Edited by: Andrew Fitzgibbon, Svetlana Lazebnik, Pietro Perona, Yoichi Sato, Cordelia Schmid

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

The seven-volume set comprising LNCS volumes 7572-7578 constitutes the refereed proceedings of the 12th European Conference on Computer Vision, ECCV 2012, held in Florence, Italy, in October 2012. The 408 revised papers presented were carefully reviewed and selected from 1437 submissions. The papers are organized in topical sections on geometry, 2D and 3D shapes, 3D reconstruction, visual recognition and classification, visual features and image matching, visual monitoring: action and activities, models, optimisation, learning, visual tracking and image registration, photometry: lighting and colour, and image segmentation.

Table of contents

Frontmatter

Oral Session 5: MRFs and Early Vision

Diverse M-Best Solutions in Markov Random Fields

Much effort has been directed at algorithms for obtaining the highest probability (MAP) configuration in probabilistic (random field) models. In many situations, one could benefit from additional high-probability solutions. Current methods for computing the M most probable configurations produce solutions that tend to be very similar to the MAP solution and to each other, which is often an undesirable property. In this paper we propose an algorithm for the Diverse M-Best problem, which involves finding a diverse set of highly probable solutions under a discrete probabilistic model. Given a dissimilarity function measuring the closeness of two solutions, our formulation maximizes a linear combination of the probability and the dissimilarity to previous solutions. Our formulation generalizes the M-Best MAP problem, and we show that for certain families of dissimilarity functions these solutions can be found as easily as the MAP solution.

Dhruv Batra, Payman Yadollahpour, Abner Guzman-Rivera, Gregory Shakhnarovich
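
A minimal sketch of the diversity-augmented objective on a toy chain MRF, assuming a Hamming dissimilarity, exhaustive enumeration over labelings, and a made-up λ weight; the paper's efficient solvers, which reuse MAP machinery, are not reproduced here.

```python
import itertools
import numpy as np

# Toy chain MRF: 4 nodes, 3 labels, random unary/pairwise log-potentials.
rng = np.random.default_rng(0)
n_nodes, n_labels = 4, 3
unary = rng.normal(size=(n_nodes, n_labels))
pairwise = rng.normal(size=(n_nodes - 1, n_labels, n_labels))

def score(x):
    """Log-probability (up to a constant) of a labeling x."""
    s = sum(unary[i, x[i]] for i in range(n_nodes))
    s += sum(pairwise[i, x[i], x[i + 1]] for i in range(n_nodes - 1))
    return s

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def diverse_m_best(m, lam):
    """Greedily extract M solutions, each maximizing score + lam * (min
    Hamming distance to the solutions found so far), by exhaustive search."""
    solutions = []
    labelings = list(itertools.product(range(n_labels), repeat=n_nodes))
    for _ in range(m):
        def augmented(x):
            div = min((hamming(x, s) for s in solutions), default=0)
            return score(x) + lam * div
        solutions.append(max(labelings, key=augmented))
    return solutions

for sol in diverse_m_best(m=3, lam=0.5):
    print(sol, round(score(sol), 3))
```
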
Generic Cuts: An Efficient Algorithm for Optimal Inference in Higher Order MRF-MAP

We propose a new algorithm called Generic Cuts for computing optimal solutions to 2-label MRF-MAP problems with higher-order clique potentials satisfying submodularity. The algorithm runs in time O(2^k n^3) in the worst case (k is the clique order and n is the number of pixels). A special gadget is introduced to model flows in a higher-order clique, and a technique for building a flow graph is specified. Based on the primal-dual structure of the optimization problem, the notions of edge capacity and cut are generalized to define a flow problem. We show that in this flow graph max flow equals min cut, which is also the optimal solution to the problem when the potentials are submodular. This contrasts with all prevalent techniques for optimizing Boolean energy functions involving higher-order potentials, including those based on reductions to quadratic potential functions, which provide only approximate solutions even for submodular functions. We show experimentally that our implementation of the Generic Cuts algorithm is more than an order of magnitude faster than all competing algorithms, including reduction-based ones whose outputs on submodular potentials are near optimal.

Chetan Arora, Subhashis Banerjee, Prem Kalra, S. N. Maheshwari
Filter-Based Mean-Field Inference for Random Fields with Higher-Order Terms and Product Label-Spaces

Recently, a number of cross-bilateral filtering methods have been proposed for solving multi-label problems in computer vision, such as stereo, optical flow, and object class segmentation, that show an order-of-magnitude improvement in speed over previous methods. These methods have achieved good results despite using models with only unary and/or pairwise terms. However, previous work has shown the value of models with higher-order terms, e.g., to represent label consistency over large regions or global co-occurrence relations. We show how these higher-order terms can be formulated such that filter-based inference remains possible. We demonstrate our techniques on joint stereo and object labeling problems as well as object class segmentation, showing in addition how our method provides an efficient approach to inference in product label-spaces for joint object-stereo labeling. We show that we are able to speed up inference in these models by around 10-30 times with respect to competing graph-cut/move-making methods, while maintaining or improving accuracy in all cases. We show results on PascalVOC-10 for object class segmentation and on Leuven for joint object-stereo labeling.

Vibhav Vineet, Jonathan Warrell, Philip H. S. Torr
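
A toy illustration of filter-based mean-field inference for a fully-connected Potts CRF, where the pairwise message passing is approximated with a plain spatial Gaussian filter from SciPy rather than the cross-bilateral filters, higher-order terms, or product label-spaces of the paper; the unary energies, kernel width, and pairwise weight are arbitrary placeholders.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def meanfield_segment(unary, n_iters=5, sigma=3.0, w_pair=1.0):
    """Toy mean-field inference for a fully-connected Potts CRF.

    unary: (H, W, L) array of unary energies (lower = more likely).
    The pairwise message for each label is approximated by Gaussian
    filtering of the current marginals, which is the trick that makes
    filter-based inference fast."""
    H, W, L = unary.shape
    q = np.exp(-unary)
    q /= q.sum(axis=2, keepdims=True)          # initial marginals
    for _ in range(n_iters):
        msg = np.stack([gaussian_filter(q[..., l], sigma) for l in range(L)], axis=2)
        # Potts model: a pixel is penalized by the filtered mass of *other* labels.
        energy = unary + w_pair * (msg.sum(axis=2, keepdims=True) - msg)
        q = np.exp(-energy)
        q /= q.sum(axis=2, keepdims=True)
    return q.argmax(axis=2)

# Synthetic 2-label example: noisy unaries favouring label 1 in the image centre.
H, W = 64, 64
truth = np.zeros((H, W), dtype=int)
truth[16:48, 16:48] = 1
unary = np.zeros((H, W, 2))
unary[..., 1] = np.where(truth == 1, -1.0, 1.0) + np.random.randn(H, W)
labels = meanfield_segment(unary)
print("pixel accuracy:", (labels == truth).mean())
```
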
Continuous Markov Random Fields for Robust Stereo Estimation

In this paper we present a novel slanted-plane model which reasons jointly about occlusion boundaries as well as depth. We formulate the problem as one of inference in a hybrid MRF composed of both continuous (i.e., slanted 3D planes) and discrete (i.e., occlusion boundaries) random variables. This allows us to define potentials encoding the ownership of the pixels that compose the boundary between segments, as well as potentials encoding which junctions are physically possible. Our approach outperforms the state-of-the-art on Middlebury high resolution imagery [1] as well as in the more challenging KITTI dataset [2], while being more efficient than existing slanted plane MRF methods, taking on average 2 minutes to perform inference on high resolution imagery.

Koichiro Yamaguchi, Tamir Hazan, David McAllester, Raquel Urtasun
Good Regions to Deblur

The goal of single image deblurring is to recover both a latent clear image and an underlying blur kernel from one input blurred image. Recent works focus on exploiting natural image priors or additional image observations for deblurring, but pay less attention to the influence of image structures on estimating blur kernels. What is the useful image structure and how can one select good regions for deblurring? We formulate the problem of learning good regions for deblurring within the Conditional Random Field framework. To better compare blur kernels, we develop an effective similarity metric for labeling training samples. The learned model is able to predict good regions from an input blurred image for deblurring without user guidance. Qualitative and quantitative evaluations demonstrate that good regions can be selected by the proposed algorithms for effective image deblurring.

Zhe Hu, Ming-Hsuan Yang
Patch Complexity, Finite Pixel Correlations and Optimal Denoising

Image restoration tasks are ill-posed problems, typically solved with priors. Since the optimal prior is the exact unknown density of natural images, actual priors are only approximate and typically restricted to small patches. This raises several questions: How much may we hope to improve current restoration results with future sophisticated algorithms? And more fundamentally, even with perfect knowledge of natural image statistics, what is the inherent ambiguity of the problem? In addition, since most current methods are limited to finite-support patches or kernels, what is the relation between the patch complexity of natural images, patch size, and restoration errors? Focusing on image denoising, we make several contributions. First, in light of computational constraints, we study the relation between denoising gain and sample-size requirements in a non-parametric approach. We present a law of diminishing returns, namely that with increasing patch size, rare patches not only require a much larger dataset, but also gain little from it. This result suggests novel adaptive variable-sized patch schemes for denoising. Second, we study absolute denoising limits, regardless of the algorithm used, and the convergence rate to them as a function of patch size. Scale invariance of natural images plays a key role here and implies both a strictly positive lower bound on denoising and a power-law convergence. Extrapolating this parametric law gives a ballpark estimate of the best achievable denoising, suggesting that some improvement, although modest, is still possible.

Anat Levin, Boaz Nadler, Fredo Durand, William T. Freeman

Poster Session 6

Guaranteed Ellipse Fitting with the Sampson Distance

When faced with an ellipse fitting problem, practitioners frequently resort to algebraic ellipse fitting methods due to their simplicity and efficiency. Currently, practitioners must choose between algebraic methods that guarantee an ellipse fit but exhibit high bias, and geometric methods that are less biased but may no longer guarantee an ellipse solution. We address this limitation by proposing a method that strikes a balance between these two objectives. Specifically, we propose a fast stable algorithm for fitting a guaranteed ellipse to data using the Sampson distance as a data-parameter discrepancy measure. We validate the stability, accuracy, and efficiency of our method on both real and synthetic data. Experimental results show that our algorithm is a fast and accurate approximation of the computationally more expensive orthogonal-distance-based ellipse fitting method. In view of these qualities, our method may be of interest to practitioners who require accurate and guaranteed ellipse estimates.

Zygmunt L. Szpak, Wojciech Chojnacki, Anton van den Hengel
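
A sketch of the Sampson (gradient-weighted algebraic) distance for a general conic, minimized with a generic SciPy optimizer on synthetic points; unlike the paper's fast, stable algorithm, a generic solver does not itself guarantee an ellipse, so a discriminant check is included, and the data and initialization are invented.

```python
import numpy as np
from scipy.optimize import minimize

def sampson_cost(theta, pts):
    """Sum of Sampson distances of 2D points to the conic
    a x^2 + b x y + c y^2 + d x + e y + f = 0, theta = (a, b, c, d, e, f)."""
    a, b, c, d, e, f = theta
    x, y = pts[:, 0], pts[:, 1]
    r = a * x**2 + b * x * y + c * y**2 + d * x + e * y + f     # algebraic residual
    gx = 2 * a * x + b * y + d                                   # dr/dx
    gy = b * x + 2 * c * y + e                                   # dr/dy
    return np.sum(r**2 / (gx**2 + gy**2 + 1e-12))

def is_ellipse(theta):
    a, b, c = theta[:3]
    return b**2 - 4 * a * c < 0          # discriminant condition for an ellipse

# Noisy samples from an ellipse centred at (1, 2) with semi-axes 3 and 1.
t = np.linspace(0, 2 * np.pi, 60)
pts = np.c_[1 + 3 * np.cos(t), 2 + np.sin(t)] + 0.05 * np.random.randn(60, 2)

theta0 = np.array([1.0, 0.0, 1.0, -2.0, -4.0, 1.0])   # rough circle-like initialization
res = minimize(sampson_cost, theta0 / np.linalg.norm(theta0), args=(pts,))
theta = res.x / np.linalg.norm(res.x)
print("ellipse?", is_ellipse(theta), "cost:", sampson_cost(theta, pts))
```
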
A Locally Linear Regression Model for Boundary Preserving Regularization in Stereo Matching

We propose a novel regularization model for stereo matching that uses large neighborhood windows. The model is based on the observation that in a local neighborhood there exists a linear relationship between pixel values and disparities. Compared to the traditional boundary preserving regularization models that use adjacent pixels, the proposed model is robust to image noise and captures higher level interactions. We develop a globally optimized stereo matching algorithm based on this regularization model. The algorithm alternates between finding a quadratic upper bound of the relaxed energy function and solving the upper bound using iterative reweighted least squares. To reduce the chance of being trapped in local minima, we propose a progressive convex-hull filter to tighten the data cost relaxation. Our evaluation on the Middlebury datasets shows the effectiveness of our method in preserving boundary sharpness while keeping regions smooth. We also evaluate our method on a wide range of challenging real-world videos. Experimental results show that our method outperforms existing methods in temporal consistency.

Shengqi Zhu, Li Zhang, Hailin Jin
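
A small sketch of the core modeling assumption above (disparity is locally a linear function of intensity), with per-window coefficients estimated in closed form via box filters in the style of a guided filter; the paper's alternating global optimization and progressive convex-hull filtering are not reproduced, and the window size, regularization, and toy data are assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_linear_fit(intensity, disparity, radius=7, eps=1e-3):
    """For every pixel, fit disparity ~ a * intensity + b over a local window
    (closed-form least squares via box filters), then return the smoothed
    disparity a * I + b.  This encodes the assumption that disparity is
    locally a linear function of pixel intensity."""
    size = 2 * radius + 1
    mean_i = uniform_filter(intensity, size)
    mean_d = uniform_filter(disparity, size)
    corr_id = uniform_filter(intensity * disparity, size)
    corr_ii = uniform_filter(intensity * intensity, size)
    var_i = corr_ii - mean_i**2
    cov_id = corr_id - mean_i * mean_d
    a = cov_id / (var_i + eps)          # slope of the local linear model
    b = mean_d - a * mean_i             # intercept
    # Average the models of all windows covering each pixel before predicting.
    return uniform_filter(a, size) * intensity + uniform_filter(b, size)

# Toy example: a piecewise-constant disparity aligned with an intensity edge.
H, W = 80, 80
intensity = np.zeros((H, W)); intensity[:, 40:] = 1.0
disparity = 5.0 + 10.0 * intensity + np.random.randn(H, W)   # noisy disparities
smoothed = local_linear_fit(intensity, disparity)
print("noise std before/after:", disparity[:, :40].std().round(2),
      smoothed[:, :40].std().round(2))
```
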
A Novel Fast Method for L∞ Problems in Multiview Geometry

Optimization using the L∞ norm is an increasingly important area in multiview geometry. Previous work has shown that globally optimal solutions can be computed reliably using the formulation of generalized fractional programming, in which algorithms solve a sequence of convex problems independently to approximate the optimal L∞-norm error. We find that the convex problems in this sequence are highly related, and we propose a method to derive a Newton-like step from any given point. In our method, the feasible region of the current convex problem is contracted gradually along with the Newton-like steps, and the updated point lies on the boundary of the new feasible region. We propose an effective strategy to turn the boundary point into an interior point through one-dimensional augmentation and relaxation. Results are presented and compared to state-of-the-art algorithms on simulated and real data for several multiview geometry problems, with improved performance in both runtime and number of Newton-like iterations.

Zhijun Dai, Yihong Wu, Fengjun Zhang, Hongan Wang
Visibility Probability Structure from SfM Datasets and Applications

Large-scale reconstructions of camera matrices and point clouds have been created using structure from motion from community photo collections. Such a dataset is rich in information; it represents a sampling of the geometry and appearance of the underlying space. In this paper, we encode the visibility information between and among points and cameras as visibility probabilities. The conditional visibility probability of a set of points on a point (or a set of cameras on a camera) can rank points (or cameras) based on their mutual dependence. We combine the conditional probability with a distance measure to prioritize points for fast guided search in the image localization problem. We define the dual problem of feature triangulation as finding the 3D coordinates of a given image feature point, and use the conditional visibility probability to quickly identify a subset of cameras in which a feature is visible.

Siddharth Choudhary, P. J. Narayanan
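
A toy estimate of conditional visibility probabilities from co-visibility counts in a binary camera-point matrix, used to rank points likely to be seen together; the visibility matrix here is random, the Laplace smoothing is an assumption, and the paper's distance-weighted prioritization for localization and its feature-triangulation dual are omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
n_cams, n_pts = 50, 200
# Binary visibility matrix from an SfM dataset: V[c, p] = 1 if point p is seen by camera c.
V = (rng.random((n_cams, n_pts)) < 0.15).astype(float)

counts = V.T @ V                      # counts[i, j] = #cameras seeing both points i and j
seen = np.diag(counts)                # #cameras seeing each point
# Conditional visibility probability P(j visible | i visible), Laplace-smoothed.
cond = (counts + 1) / (seen[:, None] + 2)

def rank_covisible(query_point, k=5):
    """Points most likely to be visible together with the query point."""
    p = cond[query_point].copy()
    p[query_point] = -1               # exclude the point itself
    return np.argsort(-p)[:k]

print("points most co-visible with point 0:", rank_covisible(0))
```
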
A Generative Model for Online Depth Fusion

We present a probabilistic, online, depth map fusion framework, whose generative model for the sensor measurement process accurately incorporates both long-range visibility constraints and a spatially varying, probabilistic outlier model. In addition, we propose an inference algorithm that updates the state variables of this model in linear time each frame. Our detailed evaluation compares our approach against several others, demonstrating and explaining the improvements that this model offers, as well as highlighting a problem with all current methods: systemic bias.

Oliver J. Woodford, George Vogiatzis
Depth Recovery Using an Adaptive Color-Guided Auto-Regressive Model

This paper proposes an adaptive color-guided auto-regressive (AR) model for high-quality depth recovery from low-quality measurements captured by depth cameras. We formulate the depth recovery task as a minimization of AR prediction errors subject to measurement consistency. The AR predictor for each pixel is constructed according to both the local correlation in the initial depth map and the nonlocal similarity in the accompanying high-quality color image. Experimental results show that our method outperforms existing state-of-the-art schemes and is versatile for both mainstream depth sensors: ToF cameras and the Kinect.

Jingyu Yang, Xinchen Ye, Kun Li, Chunping Hou
Learning Hybrid Part Filters for Scene Recognition

This paper introduces a new image representation for scene recognition, where an image is described based on the response maps of object part filters. The part filters are learned from existing datasets with object location annotations, using deformable part-based models trained by latent SVM [1]. Since different objects may contain similar parts, we describe a method that uses a semantic hierarchy to automatically determine and merge filters shared by multiple objects. The merged hybrid filters are then applied to new images. Our proposed representation, called Hybrid-Parts, is generated by pooling the response maps of the hybrid filters. In contrast to previous scene recognition approaches that adopted object-level detections as feature inputs, we harness filter responses of object parts, which enables a richer and finer-grained representation. The use of the hybrid filters is important for a more compact representation, compared to directly using all the original part filters. Through extensive experiments on several scene recognition benchmarks, we demonstrate that Hybrid-Parts outperforms recent state-of-the-art methods, and that combining it with standard low-level features such as the GIST descriptor can lead to further improvements.

Yingbin Zheng, Yu-Gang Jiang, Xiangyang Xue
Parametric Manifold of an Object under Different Viewing Directions

The appearance of a 3D object depends on both the viewing direction and the illumination conditions. It has been proven that all n-pixel images of a convex object with a Lambertian surface under variable lighting from infinity form a convex polyhedral cone (the illumination cone) in n-dimensional space. This paper tries to answer the other half of the question: what is the set of images of an object under all viewing directions? A novel image representation is proposed, which transforms any n-pixel image of a 3D object into a vector in a 2n-dimensional pose space. In such a pose space, we prove that the transformed images of a 3D object under all viewing directions form a parametric manifold in a 6-dimensional linear subspace. With in-depth rotations along a single axis in particular, this manifold is an ellipse. Furthermore, we show that this parametric pose manifold of a convex object can be estimated from a few images in different poses and used to predict the object's appearance under unseen viewing directions. These results immediately suggest a number of approaches to object recognition, scene detection, and 3D modelling. Experiments on both synthetic data and real images are reported, demonstrating the validity of the proposed representation.

Xiaozheng Zhang, Yongsheng Gao, Terry Caelli
Fast Approximations to Structured Sparse Coding and Applications to Object Classification

We describe a method for fast approximation of sparse coding. A given input vector is passed through a binary tree. Each leaf of the tree contains a subset of dictionary elements. The coefficients corresponding to these dictionary elements are allowed to be nonzero, and their values are calculated quickly by multiplication with a precomputed pseudoinverse. The tree parameters, the dictionary, and the subsets of the dictionary corresponding to each leaf are learned. In the process of describing this algorithm, we discuss the more general problem of learning the groups in group-structured sparse modeling. We show that our method creates good sparse representations by using it in the object recognition framework of [1,2]. Implementing our own fast version of the SIFT descriptor, the whole system runs at 20 frames per second on 321×481 images on a laptop with a quad-core CPU, while sacrificing very little accuracy on the Caltech-101, Caltech-256, and 15-Scenes benchmarks.

Arthur Szlam, Karol Gregor, Yann LeCun
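
A tiny sketch of the encoding pipeline described above: route an input down a binary tree and solve least squares against the leaf's dictionary subset with a precomputed pseudoinverse. Here the tree splits, dictionary, and leaf subsets are random stand-ins; learning them jointly is the paper's actual contribution.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_atoms, depth, leaf_size = 32, 128, 3, 8

dictionary = rng.normal(size=(d, n_atoms))
dictionary /= np.linalg.norm(dictionary, axis=0)

n_leaves = 2 ** depth
split_dirs = rng.normal(size=(n_leaves - 1, d))          # one hyperplane per internal node
leaf_atoms = [rng.choice(n_atoms, leaf_size, replace=False) for _ in range(n_leaves)]
# Precompute the pseudoinverse of each leaf's sub-dictionary once, offline.
leaf_pinv = [np.linalg.pinv(dictionary[:, idx]) for idx in leaf_atoms]

def encode(x):
    """Route x to a leaf, then solve least squares against that leaf's atoms
    with a single matrix multiply; all other coefficients stay zero."""
    node = 0
    for _ in range(depth):
        node = 2 * node + 1 + int(split_dirs[node] @ x > 0)
    leaf = node - (n_leaves - 1)
    code = np.zeros(n_atoms)
    code[leaf_atoms[leaf]] = leaf_pinv[leaf] @ x
    return code

x = rng.normal(size=d)
z = encode(x)
print("nonzeros:", np.count_nonzero(z),
      "reconstruction error:", np.linalg.norm(dictionary @ z - x).round(3))
```
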
Displacement Template with Divide-&-Conquer Algorithm for Significantly Improving Descriptor Based Face Recognition Approaches

This paper proposes a displacement template structure for improving descriptor-based face recognition approaches. With this template structure, a face is represented by a template consisting of a set of piled blocks; each block pile consists of a few heavily overlapped blocks from the face image. An ensemble of blocks, one from each pile, is taken as a candidate image of the face. When a descriptor-based approach is used, we are able to generate a displacement description template for the face by replacing each block in the template with its local description, where a concatenation of the local descriptions of the blocks, one from each pile, is taken to be a candidate description of the face. Using the description template together with a divide-and-conquer algorithm for computing the similarities between description templates, we demonstrate significantly improved performance of LBP, TPLBP, and FPLBP templates over the original LBP, TPLBP, and FPLBP approaches in experiments on benchmark face databases.

Liang Chen, Ling Yan, Yonghuai Liu, Lixin Gao, Xiaoqin Zhang
Latent Pyramidal Regions for Recognizing Scenes

In this paper we propose a simple but efficient image representation for solving the scene classification problem. Our new representation combines the benefits of a spatial pyramid representation using nonlinear feature coding and a latent Support Vector Machine (LSVM) to train a set of Latent Pyramidal Regions (LPRs). Each of our LPRs captures a discriminative characteristic of the scenes and is trained by searching over all possible sub-windows of the images in a latent SVM training procedure. Each LPR is represented in a spatial pyramid and uses non-linear locality-constrained coding for learning both shape and texture patterns of the scene. The final responses of the LPRs form a single feature vector, which we call the LPR representation, and can be used for the classification task. We tested our proposed scene representation model on three datasets which contain a variety of scene categories (15-Scenes, UIUC-Sports, and MIT-Indoor). Our LPR representation obtains state-of-the-art results on all these datasets, which shows that it can simultaneously model the global and local scene characteristics in a single framework and is general enough to be used for both indoor and outdoor scene classification.

Fereshteh Sadeghi, Marshall F. Tappen
Augmented Attribute Representations

We propose a new learning method to infer a mid-level feature representation that combines the advantage of semantic attribute representations with the higher expressive power of non-semantic features. The idea lies in augmenting an existing attribute-based representation with additional dimensions, for which an autoencoder model is coupled with a large-margin principle. This construction allows a smooth transition between the zero-shot regime with no training examples, the unsupervised regime with training examples but without class labels, and the supervised regime with training examples and class labels. The resulting optimization problem can be solved efficiently, because several of the necessary steps have closed-form solutions. Through extensive experiments we show that the augmented representation achieves better object categorization accuracy than the semantic representation alone.

Viktoriia Sharmanska, Novi Quadrianto, Christoph H. Lampert
Exploring the Spatial Hierarchy of Mixture Models for Human Pose Estimation

Human pose estimation requires a versatile yet well-constrained spatial model for grouping locally ambiguous parts together to produce a globally consistent hypothesis. Previous works either use local deformable models deviating from a certain template, or use a global mixture representation in the pose space. In this paper, we propose a new hierarchical spatial model that can capture an exponential number of poses with a compact mixture representation on each part. Using latent nodes, it can represent high-order spatial relationships among parts with exact inference. Unlike recent hierarchical models that associate each latent node with a mixture of appearance templates (like HoG), we use the hierarchical structure as a pure spatial prior, avoiding the large and often confounding appearance space. We verify the effectiveness of this model in three ways. First, samples representing human-like poses can be drawn from our model, showing its ability to capture high-order dependencies of parts. Second, our model achieves accurate reconstruction of unseen poses compared to a nearest-neighbor pose representation. Finally, our model achieves state-of-the-art performance on three challenging datasets, and substantially outperforms recent hierarchical models.

Yuandong Tian, C. Lawrence Zitnick, Srinivasa G. Narasimhan
People Orientation Recognition by Mixtures of Wrapped Distributions on Random Trees

The recognition of people's orientation in single images is still an open issue in several real cases: when the image resolution is poor, when body parts cannot be distinguished or localized, or when motion cannot be exploited. However, the estimation of a person's orientation, even an approximate one, can be very useful to improve people tracking and re-identification systems, or to provide a coarse alignment of body models on the input images. In these situations, holistic features seem to be more effective and faster than model-based 3D reconstructions. In this paper we propose to describe people's appearance with multi-level HoG feature sets and to classify their orientation using an array of Extremely Randomized Trees classifiers trained on quantized directions. The outputs of the classifiers are then integrated into a global continuous probability density function using a Mixture of Approximated Wrapped Gaussian distributions. Experiments on the TUD Multiview Pedestrians, Sarc3D, and 3DPeS datasets confirm the efficacy of the method and the improvement with respect to state-of-the-art approaches.

Davide Baltieri, Roberto Vezzani, Rita Cucchiara
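
A small sketch of fusing per-direction scores into a continuous orientation density with a mixture of approximated wrapped Gaussians (each wrapped Gaussian is approximated by a few 2π-shifted branches); the classifier scores, number of direction bins, and bandwidth below are invented, and the HoG features and randomized-tree classifiers are not reproduced.

```python
import numpy as np

def wrapped_gaussian(theta, mu, sigma, n_wraps=3):
    """Wrapped Gaussian density on [0, 2*pi), approximated by summing a few
    2*pi-shifted Gaussian branches (the approximation used for efficiency)."""
    k = np.arange(-n_wraps, n_wraps + 1)
    diffs = theta[..., None] - mu + 2 * np.pi * k
    return np.exp(-0.5 * (diffs / sigma) ** 2).sum(-1) / (sigma * np.sqrt(2 * np.pi))

# Eight quantized directions; 'scores' stands in for the per-direction classifier outputs.
mus = np.linspace(0, 2 * np.pi, 8, endpoint=False)
scores = np.array([0.05, 0.1, 0.6, 0.9, 0.3, 0.05, 0.0, 0.0])
weights = scores / scores.sum()

theta = np.linspace(0, 2 * np.pi, 360)
density = sum(w * wrapped_gaussian(theta, mu, sigma=0.4)
              for w, mu in zip(weights, mus))
print("most probable orientation (deg):",
      np.degrees(theta[np.argmax(density)]).round(1))
```
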
Hybrid Classifiers for Object Classification with a Rich Background

The majority of current methods in object classification use the one-against-rest training scheme. We argue that when applied to a large number of classes, this strategy is problematic: as the number of classes increases, the negative class becomes a very large and complicated collection of images. The resulting classification problem then becomes extremely unbalanced, and kernel SVM classifiers trained on such sets require long training times and are slow in prediction. To address these problems, we propose to consider the negative class as a background and to characterize it by a prior distribution. Further, we propose to construct "hybrid" classifiers, which are trained to separate this distribution from the samples of the positive class. A typical classifier first projects (by a function which may be non-linear) the inputs to a one-dimensional space, and then thresholds this projection. Theoretical results and empirical evaluation suggest that, after projection, the background has a relatively simple distribution, which is much easier to parameterize and work with. Our results show that hybrid classifiers offer an advantage over SVM classifiers, both in performance and complexity, especially when the negative (background) class is large.

Margarita Osadchy, Daniel Keren, Bella Fadida-Specktor
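
A minimal illustration of the "hybrid" idea: project samples to one dimension, model the projected background with a simple parametric (Gaussian) distribution, and separate positives from that distribution. The projection direction, synthetic data, and threshold are assumptions; the paper's theory and learned projections are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 20
pos = rng.normal(loc=1.0, size=(200, d))       # positive class samples
bg = rng.normal(loc=0.0, size=(5000, d))       # large, heterogeneous background

# 1) Project to one dimension (here: the direction between class means).
w = pos.mean(0) - bg.mean(0)
w /= np.linalg.norm(w)

# 2) Parameterize the projected background with a simple Gaussian prior.
proj_bg = bg @ w
mu, sigma = proj_bg.mean(), proj_bg.std()

def classify(x, n_sigmas=3.0):
    """Label a sample positive if its projection is unlikely under the
    background distribution (more than n_sigmas above its mean)."""
    return (x @ w - mu) / sigma > n_sigmas

test = np.vstack([rng.normal(1.0, size=(100, d)), rng.normal(0.0, size=(100, d))])
truth = np.r_[np.ones(100), np.zeros(100)]
print("accuracy:", (classify(test) == truth).mean())
```
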
Unsupervised and Supervised Visual Codes with Restricted Boltzmann Machines

Recently, the coding of local features (e.g., SIFT) for image categorization tasks has been extensively studied. Incorporated within the Bag of Words (BoW) framework, these techniques optimize the projection of local features into the visual codebook, leading to state-of-the-art performance on many benchmark datasets. In this work, we propose a novel visual codebook learning approach using the restricted Boltzmann machine (RBM) as our generative model. Our contribution is three-fold. Firstly, we steer the unsupervised RBM learning using a regularization scheme, which decomposes into a combined prior for the sparsity of each feature's representation as well as the selectivity for each codeword. The codewords are then fine-tuned to be discriminative through supervised learning from top-down labels. Secondly, we evaluate the proposed method on the Caltech-101 and 15-Scenes datasets, either matching or outperforming state-of-the-art results. The codebooks are compact and inference is fast. Finally, we introduce an original method to visualize the codebooks and decipher what each visual codeword encodes.

Hanlin Goh, Nicolas Thome, Matthieu Cord, Joo-Hwee Lim
A New Biologically Inspired Color Image Descriptor

We describe a novel framework for the joint processing of color and shape information in natural images. A hierarchical non-linear spatio-chromatic operator yields spatial and chromatic opponent channels, mimicking processing in the primate visual cortex. We extend two popular object recognition systems (the HMAX hierarchical model of visual processing and a SIFT-based bag-of-words approach) to incorporate color information along with shape information. We further use the framework in combination with the GIST algorithm for scene categorization as well as the Berkeley segmentation algorithm. In all cases, the proposed approach is shown to outperform standard grayscale/shape-based descriptors as well as alternative color processing schemes on several datasets.

Jun Zhang, Youssef Barhomi, Thomas Serre
Finding Correspondence from Multiple Images via Sparse and Low-Rank Decomposition

We investigate the problem of finding correspondence from multiple images, which is a challenging combinatorial problem. In this work, we propose a robust solution by exploiting the priors that the rank of the ordered patterns from a set of linearly correlated images should be lower than that of the disordered patterns, and that the errors among the reordered patterns are sparse. This problem is equivalent to finding a set of optimal partial permutation matrices for the disordered patterns such that the rearranged patterns can be factorized as a sum of a low-rank matrix and a sparse error matrix. A scalable algorithm is proposed to approximate the solution by solving two sub-problems sequentially: minimization of the sum of the nuclear norm and the l1 norm to solve for relaxed partial permutation matrices, followed by binary integer programming to project each relaxed partial permutation matrix onto the feasible set. We verify the efficacy and robustness of the proposed method with extensive experiments on both images and videos.

Zinan Zeng, Tsung-Han Chan, Kui Jia, Dong Xu
Multidimensional Spectral Hashing

With the growing availability of very large image databases, there has been a surge of interest in methods based on "semantic hashing", i.e., compact binary codes of data points such that the Hamming distance between codewords correlates with similarity. In reviewing and comparing existing methods, we show that their relative performance can change drastically depending on the definition of ground-truth neighbors. Motivated by this finding, we propose a new formulation for learning binary codes which seeks to reconstruct the affinity between data points, rather than their distances. We show that this criterion is intractable to solve exactly, but a spectral relaxation gives an algorithm where the bits correspond to thresholded eigenvectors of the affinity matrix, and as the number of data points goes to infinity these eigenvectors converge to eigenfunctions of Laplace-Beltrami operators, similar to the recently proposed Spectral Hashing (SH) method. Unlike SH, whose performance may degrade as the number of bits increases, the optimal code under our formulation is guaranteed to faithfully reproduce the affinities as the number of bits increases. We show that the number of eigenfunctions needed may increase exponentially with dimension, but introduce a "kernel trick" that allows us to compute with an exponentially large number of bits using only memory and computation that grow linearly with dimension. Experiments show that MDSH outperforms the state of the art, especially in the challenging regime of small distance thresholds.

Yair Weiss, Rob Fergus, Antonio Torralba
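
A toy illustration of the spectral relaxation behind affinity-preserving codes: threshold the leading non-trivial eigenvectors of a Gaussian affinity matrix to obtain bits, then check that Hamming distance tracks affinity. The eigenfunction extension and the "kernel trick" that make MDSH practical at scale are not shown, and the data and bandwidth are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))                       # toy data points

# Gaussian affinity matrix between data points.
sq = ((X[:, None] - X[None]) ** 2).sum(-1)
W = np.exp(-sq / (2 * np.median(sq)))

# Spectral relaxation: bits come from thresholded leading eigenvectors of W
# (the trivial, roughly constant top eigenvector is skipped).
vals, vecs = np.linalg.eigh(W)
order = np.argsort(-vals)
n_bits = 8
codes = (vecs[:, order[1:n_bits + 1]] > 0).astype(np.uint8)

# Sanity check: Hamming distance should correlate (negatively) with affinity.
hamming = (codes[:, None] != codes[None]).sum(-1)
iu = np.triu_indices(len(X), k=1)
print("corr(affinity, -hamming):", np.corrcoef(W[iu], -hamming[iu])[0, 1].round(3))
```
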
What Makes a Good Detector? – Structured Priors for Learning from Few Examples

Transfer learning can counter the heavy-tailed nature of the distribution of training examples over object classes. Here, we study transfer learning for object class detection. Starting from the intuition that “what makes a good detector” should manifest itself in the form of repeatable statistics over existing “good” detectors, we design a low-level feature model that can be used as a prior for learning new object class models from scarce training data. Our priors are structured, capturing dependencies both on the level of individual features and spatially neighboring pairs of features. We confirm experimentally the connection between the information captured by our priors and “good” detectors as well as the connection to transfer learning from sources of different quality. We give an in-depth analysis of our priors on a subset of the challenging PASCAL VOC 2007 data set and demonstrate improved average performance over all 20 classes, achieved without manual intervention.

Tianshi Gao, Michael Stark, Daphne Koller
A Convolutional Treelets Binary Feature Approach to Fast Keypoint Recognition

Fast keypoint recognition is essential to many vision tasks. In contrast to the classification-based approaches [1,2], we directly formulate the keypoint recognition as an image patch retrieval problem, which enjoys the merit of finding the matched keypoint and its pose simultaneously. A novel convolutional treelets approach is proposed to effectively extract the binary features from the patches. A corresponding sub-signature-based locality sensitive hashing scheme is employed for the fast approximate nearest neighbor search in patch retrieval. Experiments on both synthetic data and real-world images have shown that our method performs better than state-of-the-art descriptor-based and classification-based approaches.

Chenxia Wu, Jianke Zhu, Jiemi Zhang, Chun Chen, Deng Cai
Categorizing Turn-Taking Interactions

We address the problem of categorizing turn-taking interactions between individuals. Social interactions are characterized by turn-taking and arise frequently in real-world videos. Our approach is based on the use of temporal causal analysis to decompose a space-time visual word representation of video into co-occurring independent segments, called causal sets [1]. These causal sets then serve as the input to a multiple instance learning framework to categorize turn-taking interactions. We introduce a new turn-taking interactions dataset consisting of social games and sports rallies. We demonstrate that our formulation of multiple instance learning (QP-MISVM) is better able to leverage the repetitive structure in turn-taking interactions and demonstrates superior performance relative to a conventional bag-of-words model.

Karthir Prabhakar, James M. Rehg
Local Expert Forest of Score Fusion for Video Event Classification

We address the problem of complicated event categorization from a large dataset of videos "in the wild", where multiple classifiers are applied independently to evaluate each video with a 'likelihood' score. The core contribution of this paper is a local expert forest model for meta-level score fusion for event detection under heavily imbalanced class distributions. Our motivation is to adapt to performance variations of the classifiers in different regions of the score space, using a divide-and-conquer technique. We propose a novel method to partition the likelihood space, sensitive to local label distributions in imbalanced data, and each time train a pair of locally optimized experts. Multiple pairs of experts based on different partitions ('trees') form a 'forest', balancing local adaptivity and over-fitting of the model. As a result, our model disregards classifiers in regions of the score space where their performance is poor, achieving both local source selection and fusion. We experiment with the TRECVID Multimedia Event Detection (MED) dataset, detecting 15 complicated events from around 34k video clips comprising more than 1000 hours of video, and demonstrate superior performance compared to other score-level fusion methods.

Jingchen Liu, Scott McCloskey, Yanxi Liu
View-Invariant Action Recognition Using Latent Kernelized Structural SVM

This paper goes beyond recognizing human actions from a fixed view and focuses on action recognition from an arbitrary view. A novel learning algorithm, called latent kernelized structural SVM, is proposed for the view-invariant action recognition, which extends the kernelized structural SVM framework to include latent variables. Due to the changing and frequently unknown positions of the camera, we regard the view label of action as a latent variable and implicitly infer it during both learning and inference. Motivated by the geometric correlation between different views and semantic correlation between different action classes, we additionally propose a mid-level correlation feature which describes an action video by a set of decision values from the pre-learned classifiers of all the action classes from all the views. Each decision value captures both geometric and semantic correlations between the action video and the corresponding action class from the corresponding view. After that, we combine the low-level visual cue, mid-level correlation description, and high-level label information into a novel nonlinear kernel under the latent kernelized structural SVM framework. Extensive experiments on multi-view IXMAS and MuHAVi action datasets demonstrate that our method generally achieves higher recognition accuracy than other state-of-the-art methods.

Xinxiao Wu, Yunde Jia
Trajectory-Based Modeling of Human Actions with Motion Reference Points

Human action recognition in videos is a challenging problem with wide applications. State-of-the-art approaches often adopt the popular bag-of-features representation based on isolated local patches or temporal patch trajectories, where motion patterns like object relationships are mostly discarded. This paper proposes a simple representation specifically aimed at the modeling of such motion relationships. We adopt global and local reference points to characterize motion information, so that the final representation can be robust to camera movement. Our approach operates on top of visual codewords derived from local patch trajectories, and therefore does not require accurate foreground-background separation, which is typically a necessary step to model object relationships. Through an extensive experimental evaluation, we show that the proposed representation offers very competitive performance on challenging benchmark datasets, and combining it with the bag-of-features representation leads to substantial improvement. On Hollywood2, Olympic Sports, and HMDB51 datasets, we obtain 59.5%, 80.6% and 40.7% respectively, which are the best reported results to date.

Yu-Gang Jiang, Qi Dai, Xiangyang Xue, Wei Liu, Chong-Wah Ngo
PatchMatchGraph: Building a Graph of Dense Patch Correspondences for Label Transfer

We address the problem of semantic segmentation, or multi-class pixel labeling, by constructing a graph of dense overlapping patch correspondences across large image sets. We then transfer annotations from labeled images to unlabeled images using the established patch correspondences. Unlike previous approaches to non-parametric label transfer our approach does not require an initial image retrieval step. Moreover, we operate on a graph for computing mappings between images, which avoids the need for exhaustive pairwise comparisons. Consequently, we can leverage offline computation to enhance performance at test time. We conduct extensive experiments to analyze different variants of our graph construction algorithm and evaluate multi-class pixel labeling performance on several challenging datasets.

Stephen Gould, Yuhang Zhang
A Unifying Theory of Active Discovery and Learning

For learning problems where human supervision is expensive, active query selection methods are often exploited to maximise the return of each supervision. Two problems where this has been successfully applied are active discovery – where the aim is to discover at least one instance of each rare class with few supervisions; and active learning – where the aim is to maximise a classifier’s performance with least supervision. Recently, there has been interest in optimising these tasks jointly, i.e., active learning with undiscovered classes, to support efficient interactive modelling of new domains. Mixtures of active discovery and learning and other schemes have been exploited, but perform poorly due to heuristic objectives. In this study, we show with systematic theoretical analysis how the previously disparate tasks of active discovery and learning can be cleanly unified into a single problem, and hence are able for the first time to develop a unified query algorithm to directly optimise this problem. The result is a model which consistently outperforms previous attempts at active learning in the presence of undiscovered classes, with no need to tune parameters for different datasets.

Timothy M. Hospedales, Shaogang Gong, Tao Xiang
Extracting 3D Scene-Consistent Object Proposals and Depth from Stereo Images

This work combines two active areas of research in computer vision: unsupervised object extraction from a single image, and depth estimation from a stereo image pair. A recent, successful trend in unsupervised object extraction is to exploit so-called “3D scene-consistency”, that is enforcing that objects obey underlying physical constraints of the 3D scene, such as occupancy of 3D space and gravity of objects. Our main contribution is to introduce the concept of 3D scene-consistency into stereo matching. We show that this concept is beneficial for both tasks, object extraction and depth estimation. In particular, we demonstrate that our approach is able to create a large set of 3D scene-consistent object proposals, by varying e.g. the prior on the number of objects. After automatically ranking the proposals we show experimentally that our results are considerably closer to ground truth than state-of-the-art techniques which either use stereo or monocular images. We envision that our method will build the front-end of a future object recognition system for stereo images.

Michael Bleyer, Christoph Rhemann, Carsten Rother
Repairing Sparse Low-Rank Texture

In this paper, we show how to harness both low-rank and sparse structures in regular or near-regular textures for image completion. Our method leverages recent convex optimization techniques for low-rank and sparse signal recovery and can automatically and correctly repair the global structure of a corrupted texture, even without precise information about the regions to be completed. Through extensive simulations, we show our method can complete and repair textures corrupted by errors with both random and contiguous supports better than existing low-rank matrix recovery methods. Through experimental comparisons with existing image completion systems (such as Photoshop), our method demonstrates a significant advantage over local patch-based texture synthesis techniques in dealing with large corruption, non-uniform texture, and large perspective deformation.

Xiao Liang, Xiang Ren, Zhengdong Zhang, Yi Ma
Active Frame Selection for Label Propagation in Videos

Manually segmenting and labeling objects in video sequences is quite tedious, yet such annotations are valuable for learning-based approaches to object and activity recognition. While automatic label propagation can help, existing methods simply propagate annotations from arbitrarily selected frames (e.g., the first one) and so may fail to best leverage the human effort invested. We define an active frame selection problem: select k frames for manual labeling such that automatic pixel-level label propagation can proceed with minimal expected error. We propose a solution that directly ties a joint frame selection criterion to the predicted errors of a flow-based random field propagation model. It selects the set of k frames that together minimize the total mislabeling risk over the entire sequence. We derive an efficient dynamic programming solution to optimize the criterion. Further, we show how to automatically determine how many total frames k should be labeled in order to minimize the total manual effort spent labeling and correcting propagation errors. We demonstrate our method's clear advantages over several baselines, saving hours of human effort per video.

Sudheendra Vijayanarasimhan, Kristen Grauman
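
A simplified dynamic program in the spirit of the frame selection criterion, assuming labels propagate only forward from the most recent labeled frame and that a propagation-error matrix is given; the paper's model couples forward and backward flow-based propagation and predicts these errors from optical flow, which is not reproduced here.

```python
import numpy as np

def select_frames(cost, k):
    """cost[i, j] (i <= j): expected error of propagating labels from a
    manually labeled frame i to frame j.  Choose k frames to label so the
    total propagation error over the whole sequence is minimized, assuming
    every frame is propagated from the latest labeled frame before it."""
    T = cost.shape[0]
    INF = float("inf")
    # seg[i, j]: error of covering frames i..j-1 when frame i is labeled.
    seg = np.full((T, T + 1), INF)
    for i in range(T):
        for j in range(i + 1, T + 1):
            seg[i, j] = cost[i, i:j].sum()
    best = np.full((k + 1, T + 1), INF)       # best[m, j]: m labeled frames cover 0..j-1
    choice = np.zeros((k + 1, T + 1), dtype=int)
    best[0, 0] = 0.0
    for m in range(1, k + 1):
        for j in range(1, T + 1):
            for i in range(j):                # i = position of the m-th labeled frame
                val = best[m - 1, i] + seg[i, j]
                if val < best[m, j]:
                    best[m, j], choice[m, j] = val, i
    # Backtrack the selected frame indices.
    frames, j = [], T
    for m in range(k, 0, -1):
        i = choice[m, j]
        frames.append(i)
        j = i
    return sorted(frames), best[k, T]

# Toy error model: propagation error grows with the temporal gap.
T = 30
cost = np.abs(np.subtract.outer(np.arange(T), np.arange(T))).astype(float)
print(select_frames(cost, k=3))
```
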
Non-causal Temporal Prior for Video Deblocking

Real-world video sequences coded at low bit rates suffer from compression artifacts, which are visually disruptive and can cause problems for computer vision algorithms. Unlike the denoising problem, where the high-frequency components of the signal are present in the noisy observation, most high-frequency details are lost during compression and artificial discontinuities arise across the coding block boundaries. In addition to sparse spatial priors that can reduce the blocking artifacts for a single frame, temporal information is needed to recover the lost spatial details. However, establishing accurate temporal correspondences from compressed videos is challenging because of the loss of high-frequency details and the increase in false blocking artifacts. In this paper, we propose a non-causal temporal prior model to reduce video compression artifacts by propagating information from adjacent frames and iterating between image reconstruction and motion estimation. Experimental results on real-world sequences demonstrate that the videos deblocked by the proposed system have marginal statistics of high-frequency components closer to those of the original videos, and are better input for standard edge and corner detectors than the coded ones.

Deqing Sun, Ce Liu
Text Image Deblurring Using Text-Specific Properties

State-of-the-art blind image deconvolution approaches have difficulties when dealing with text images, since they rely on natural image statistics which do not respect the special properties of text images. On the other hand, previous document image restoring systems and the recently proposed black-and-white document image deblurring method [1] are limited, and cannot handle large motion blurs and complex background. We propose a novel text image deblurring method which takes into account the specific properties of text images. Our method extends the commonly used optimization framework for image deblurring to allow domain-specific properties to be incorporated in the optimization process. Experimental results show that our method can generate higher quality deblurring results on text images than previous approaches.

Hojin Cho, Jue Wang, Seungyong Lee
Sequential Spectral Learning to Hash with Multiple Representations

Learning to hash involves learning hash functions from a set of images for embedding high-dimensional visual descriptors into a similarity-preserving low-dimensional Hamming space. Most existing methods resort to a single representation of images, that is, only one type of visual descriptor is used to learn a hash function to assign binary codes to images. However, images are often described by multiple different visual descriptors (such as SIFT, GIST, and HOG), so it is desirable to incorporate these multiple representations into learning a hash function, leading to multi-view hashing. In this paper we present a sequential spectral learning approach to multi-view hashing, where a hash function is sequentially determined by solving the successive maximization of local variances subject to decorrelation constraints. We compute multi-view local variances by α-averaging view-specific distance matrices, such that the best averaged distance matrix is determined by minimizing its α-divergence from the view-specific distance matrices. We also present a scalable implementation, exploiting a fast approximate k-NN graph construction method, in which α-averaged distances computed in small partitions determined by recursive spectral bisection are gradually merged in conquer steps until all examples are used. Numerical experiments on the Caltech-256, CIFAR-20, and NUS-WIDE datasets confirm the high performance of our method in comparison to single-view spectral hashing as well as existing multi-view hashing methods.

Saehoon Kim, Yoonseop Kang, Seungjin Choi
Two-Granularity Tracking: Mediating Trajectory and Detection Graphs for Tracking under Occlusions

We propose a tracking framework that mediates grouping cues from two levels of tracking granularity, detection tracklets and point trajectories, for segmenting objects in crowded scenes. Detection tracklets capture objects when they are mostly visible. They may be sparse in time, may miss partially occluded or deformed objects, or contain false positives. Point trajectories are dense in space and time. Their affinities integrate long-range motion and 3D disparity information, useful for segmentation. Affinities may leak across similarly moving objects, though, since they lack model knowledge. We establish one trajectory graph and one detection tracklet graph, encoding grouping affinities within each space and associations across them. Two-granularity tracking is cast as simultaneous detection tracklet classification and clustering (cl²) in the joint space of tracklets and trajectories. We solve cl² by explicitly mediating contradictory affinities in the two graphs: detection tracklet classification modifies trajectory affinities to reflect object-specific dis-associations, and non-accidental grouping alignment between detection tracklets and trajectory clusters boosts or rejects corresponding detection tracklets, changing their classification accordingly. We show our model can track objects through sparse, inaccurate detections and persistent partial occlusions. It adapts to the changing visibility masks of the targets, in contrast to detection-based bounding box trackers, by effectively switching between the two granularities according to object occlusions, deformations, and background clutter.

Katerina Fragkiadaki, Weiyu Zhang, Geng Zhang, Jianbo Shi
Taking Mobile Multi-object Tracking to the Next Level: People, Unknown Objects, and Carried Items

In this paper, we aim to take mobile multi-object tracking to the next level. Current approaches work in a tracking-by-detection framework, which limits them to object categories for which pre-trained detector models are available. In contrast, we propose a novel tracking-before-detection approach that can track both known and unknown object categories in very challenging street scenes. Our approach relies on noisy stereo depth data in order to segment and track objects in 3D. At its core is a novel, compact 3D representation that allows us to robustly track a large variety of objects while building up models of their 3D shape online. In addition to improving tracking performance, this representation allows us to detect anomalous shapes, such as carried items on a person's body. We evaluate our approach on several challenging video sequences of busy pedestrian zones and show that it outperforms state-of-the-art approaches.

Dennis Mitzel, Bastian Leibe
Dynamic Context for Tracking behind Occlusions

Tracking objects in the presence of clutter and occlusion remains a challenging problem. Current approaches often rely on a priori target dynamics and/or use nearly rigid image context to determine the target position. In this paper, a novel algorithm is proposed to estimate the location of a target while it is hidden due to occlusion. The main idea behind the algorithm is to use contextual dynamical cues from multiple supporter features, which may move with the target, move independently of the target, or remain stationary. These dynamical cues are learned directly from the data without making prior assumptions about the motions of the target and/or the supporter features. As illustrated through several experiments, the proposed algorithm outperforms state-of-the-art approaches under long occlusions and severe camera motion.

Fei Xiong, Octavia I. Camps, Mario Sznaier
To Track or To Detect? An Ensemble Framework for Optimal Selection

This paper presents a novel approach for multi-target tracking using an ensemble framework that optimally chooses target tracking results from that of independent trackers and a detector at each time step. The ensemble model is designed to select the best candidate scored by a function integrating detection confidence, appearance affinity, and smoothness constraints imposed using geometry and motion information. Parameters of our association score function are discriminatively trained with a max-margin framework. Optimal selection is achieved through a hierarchical data association step that progressively associates candidates to targets. By introducing a second target classifier and using the ranking score from the pre-trained classifier as the detection confidence measure, we add additional robustness against unreliable detections. The proposed algorithm robustly tracks a large number of moving objects in complex scenes with occlusions. We evaluate our approach on a variety of public datasets and show promising improvements over state-of-the-art methods.

Xu Yan, Xuqing Wu, Ioannis A. Kakadiaris, Shishir K. Shah
Spatial and Angular Variational Super-Resolution of 4D Light Fields

We present a variational framework to generate super-resolved novel views from 4D light field data sampled at low resolution, for example by a plenoptic camera. In contrast to previous work, we formulate the problem of view synthesis as a continuous inverse problem, which allows us to correctly take into account foreshortening effects caused by scene geometry transformations. High-accuracy depth maps for the input views are locally estimated using epipolar plane image analysis, which yields floating point depth precision without the need for expensive matching cost minimization. The disparity maps are further improved by increasing angular resolution with synthesized intermediate views. Minimization of the super-resolution model energy is performed with state of the art convex optimization algorithms within seconds.

Sven Wanner, Bastian Goldluecke
Blur-Kernel Estimation from Spectral Irregularities

We describe a new method for recovering the blur kernel in motion-blurred images based on the statistical irregularities that their power spectrum exhibits. This is achieved by a power-law model that refines the one traditionally used for describing natural images. The new model better accounts for biases arising from the presence of large and strong edges in the image. We use this model together with an accurate spectral whitening formula to estimate the power spectrum of the blur. The blur kernel is then recovered using a phase retrieval algorithm with improved convergence and disambiguation capabilities. Unlike many existing methods, the new approach does not perform a maximum a posteriori estimation, which would involve repeated reconstructions of the latent image, and hence offers attractive running times.

We compare the new method with state-of-the-art methods and report various advantages, both in terms of efficiency and accuracy.

Amit Goldstein, Raanan Fattal
Deconvolving PSFs for a Better Motion Deblurring Using Multiple Images

Blind deconvolution of motion blur is hard, but it can be made easier if multiple images are available. This observation, and an algorithm using two differently-blurred images of a scene are the subject of this paper. While this idea is not new, existing methods have so far not delivered very practical results. In practice, the PSFs corresponding to the two given images are estimated and assumed to be close to the latent motion blurs. But in actual fact, these estimated blurs are often far from the truth, for a simple reason: They often share a common, and unidentified PSF that goes unaccounted for. That is, the estimated PSFs are themselves “blurry”. While this can be due to any number of other blur sources including shallow depth of field, out of focus, lens aberrations, diffraction effects, and the like, it is also a mathematical artifact of the ill-posedness of the deconvolution problem. In this paper, instead of estimating the PSFs directly and only once from the observed images, we first generate a rough estimate of the PSFs using a robust multichannel deconvolution algorithm, and then “deconvolve the PSFs” to refine the outputs. Simulated and real data experiments show that this strategy works quite well for motion blurred images, producing state of the art results.

Xiang Zhu, Filip Šroubek, Peyman Milanfar
Depth and Deblurring from a Spectrally-Varying Depth-of-Field

We propose modifying the aperture of a conventional color camera so that the effective aperture size for one color channel is smaller than that for the other two. This produces an image where different color channels have different depths-of-field, and from this we can computationally recover scene depth, reconstruct an all-focus image and achieve synthetic re-focusing, all from a single shot. These capabilities are enabled by a spatio-spectral image model that encodes the statistical relationship between gradient profiles across color channels. This approach substantially improves depth accuracy over alternative single-shot coded-aperture designs, and since it avoids introducing additional spatial distortions and is light efficient, it allows high-quality deblurring and lower exposure times. We demonstrate these benefits with comparisons on synthetic data, as well as results on images captured with a prototype lens.

Ayan Chakrabarti, Todd Zickler
Segmentation over Detection by Coupled Global and Local Sparse Representations

Motivated by the rising performance of object detection algorithms, we investigate how to further precisely segment out objects within the output bounding boxes. The task is formulated as a unified optimization problem, pursuing a unique latent object mask in a non-parametric manner. For a given test image, the objects are first detected by detectors. Then, for each detected bounding box, the objects of the same category along with their object masks are extracted from the training set. The latent mask of the object within the bounding box is inferred based on three objectives: 1) the latent mask should be coherent, subject to sparse errors caused by within-category diversity, with the global bounding-box-level mask inferred by sparse representation over the bounding boxes of the same category within the training set; 2) the latent mask should be coherent with the local patch-level mask inferred by sparse representation of each individual patch over all spatially nearby (handling local deformations) patches of the same category in the training set; and 3) the mask within each sufficiently small super-pixel should be consistent. All three objectives are integrated into a unified optimization problem, and the sparse representation coefficients and the latent mask are alternately optimized using Lasso optimization and smooth approximation followed by the Accelerated Proximal Gradient method, respectively. Extensive experiments on the Pascal VOC object segmentation datasets, VOC2007 and VOC2010, show that our proposed algorithm achieves competitive results with state-of-the-art learning-based algorithms, and is superior to other detection-based object segmentation algorithms.

Wei Xia, Zheng Song, Jiashi Feng, Loong-Fah Cheong, Shuicheng Yan
Moving Object Segmentation Using Motor Signals

Moving object segmentation from an image sequence is essential for a robot to interact with its environment. Traditional vision approaches rely on pure motion analysis of the video without exploiting the source of the background motion. We observe, however, that the background motion (from the robot’s egocentric view) is more strongly correlated with the robot’s motor signals than the foreground motion. We propose a novel approach to detecting moving objects by clustering features into background and foreground according to their motion consistency with the motor signals. Specifically, our approach learns homography and fundamental matrices as functions of the motor signals, and predicts sparse feature locations from the learned matrices. The errors between these predictions and the actual tracked locations are used to label features as background or foreground, and the labels of the sparse features are then propagated to all pixels. Our approach does not require building a dense mosaic background or searching for affine, homography, or fundamental matrix parameters for foreground separation. In addition, it does not need to explicitly model the intrinsic and extrinsic calibration parameters, and hence requires much less prior geometric knowledge. It works entirely in 2D image space and does not involve any complex analysis or computation in 3D space.
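
A simplified sketch of the labelling rule described above, under assumptions of ours: predict_H is a regressor, learnt beforehand from motor readings, that maps a motor signal to a homography, and features whose tracked positions deviate from the homography prediction by more than a threshold are labelled foreground. The fundamental-matrix branch and the propagation of labels to all pixels are omitted.

    import numpy as np

    def label_features(prev_pts, curr_pts, motor_signal, predict_H, tau=3.0):
        """Label tracked features as background (True) or foreground (False).

        prev_pts, curr_pts : (n, 2) tracked feature positions in consecutive frames
        predict_H          : callable mapping a motor signal to a 3x3 homography
        tau                : reprojection-error threshold in pixels (assumed value)
        """
        H = predict_H(motor_signal)                    # 3x3, learnt from motor data
        ones = np.ones((prev_pts.shape[0], 1))
        proj = np.hstack([prev_pts, ones]) @ H.T       # predicted homogeneous coordinates
        proj = proj[:, :2] / proj[:, 2:3]              # back to pixel coordinates
        err = np.linalg.norm(proj - curr_pts, axis=1)  # prediction error per feature
        return err < tau                               # consistent with egomotion -> background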

Changhai Xu, Jingen Liu, Benjamin Kuipers
Block-Sparse RPCA for Consistent Foreground Detection

A recent evaluation of representative background subtraction techniques demonstrated their drawbacks: hardly any approach reaches more than 50% precision at recall levels above 90%. Challenges in realistic environments include illumination changes causing complex intensity variation, background motion (trees, waves, etc.) whose magnitude can exceed that of the foreground, poor image quality under low light, camouflage, and so on. Existing methods often handle only some of these challenges; we address all of them in a unified framework that makes few specific assumptions about the background. We regard the observed image sequence as the sum of a low-rank background matrix and a sparse outlier matrix, and solve the decomposition using the Robust Principal Component Analysis method. We dynamically estimate the support of the foreground regions via a motion saliency estimation step, so as to impose spatial coherence on these regions. Unlike smoothness constraints such as MRFs, our method obtains crisply defined foreground regions and, in general, handles large dynamic background motion much better. Extensive experiments on benchmark and additional challenging datasets demonstrate that our method significantly outperforms state-of-the-art approaches and works effectively on a wide range of complex scenarios.
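
For context, a compact sketch of the plain low-rank plus sparse decomposition the abstract builds on (standard principal component pursuit solved by iterative shrinkage; this is not the authors' block-sparse, motion-saliency-weighted variant).

    import numpy as np

    def rpca(D, n_iter=100, tol=1e-7):
        """Decompose D into low-rank background L and sparse foreground S (inexact ALM)."""
        m, n = D.shape
        lam = 1.0 / np.sqrt(max(m, n))          # standard sparsity weight
        norm_D = np.linalg.norm(D, 'fro')
        mu = 1.25 / np.linalg.norm(D, 2)        # penalty parameter (common heuristic)
        Y = np.zeros_like(D)                    # Lagrange multipliers
        S = np.zeros_like(D)
        for _ in range(n_iter):
            # Low-rank update: singular value thresholding.
            U, sig, Vt = np.linalg.svd(D - S + Y / mu, full_matrices=False)
            sig = np.maximum(sig - 1.0 / mu, 0.0)
            L = (U * sig) @ Vt
            # Sparse update: entrywise soft thresholding (foreground outliers).
            R = D - L + Y / mu
            S = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
            # Dual update and convergence check.
            Z = D - L - S
            Y = Y + mu * Z
            mu = min(mu * 1.5, 1e7)
            if np.linalg.norm(Z, 'fro') < tol * norm_D:
                break
        return L, S

Each column of D holds one vectorised frame; pixels with large entries in S are foreground candidates, to which the paper then adds block-sparsity and motion-saliency constraints rather than using S directly.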

Zhi Gao, Loong-Fah Cheong, Mo Shan
A Generative Model for Simultaneous Estimation of Human Body Shape and Pixel-Level Segmentation

This paper addresses pixel-level segmentation of a human body from a single image. The problem is formulated as a multi-region segmentation in which the human body is constrained to be a collection of geometrically linked regions and the background is split into a small number of distinct zones. We solve this problem in a Bayesian framework, jointly estimating the articulated body pose and the pixel-level segmentation of each body part. Using an image likelihood function that simultaneously generates and evaluates the image segmentation corresponding to a given pose, we robustly explore the posterior body shape distribution with a coarse-to-fine Metropolis-Hastings sampling scheme that includes a strongly data-driven proposal term.
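
A bare-bones sketch of the Metropolis-Hastings loop implied above, with hypothetical helpers of ours: propose(pose) draws a candidate (data-driven or random-walk), and log_posterior(pose, image) generates the segmentation induced by the pose and evaluates its likelihood times the shape prior. The coarse-to-fine schedule and the part-wise segmentation model are omitted.

    import numpy as np

    def mh_sample(image, init_pose, propose, log_posterior, n_steps=1000, rng=None):
        """Explore the posterior over body pose/shape with Metropolis-Hastings."""
        rng = rng or np.random.default_rng()
        pose, logp = init_pose, log_posterior(init_pose, image)
        samples = []
        for _ in range(n_steps):
            cand = propose(pose)                          # candidate pose hypothesis
            logp_cand = log_posterior(cand, image)        # generate + evaluate segmentation
            if np.log(rng.uniform()) < logp_cand - logp:  # accept with prob min(1, ratio)
                pose, logp = cand, logp_cand
            samples.append(pose)
        return samples

Note that an asymmetric, data-driven proposal additionally requires the proposal-density correction in the acceptance ratio; it is omitted here for brevity.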

Ingmar Rauschert, Robert T. Collins
Visual Dictionary Learning for Joint Object Categorization and Segmentation

Representing objects using elements from a visual dictionary is widely used in object detection and categorization. Prior work on dictionary learning has shown improvements in the accuracy of object detection and categorization by learning discriminative dictionaries. However, none of these dictionaries is learnt for joint object categorization and segmentation. Moreover, dictionary learning is often done separately from classifier training, which reduces the discriminative power of the model. In this paper, we formulate semantic segmentation as a joint categorization, segmentation, and dictionary learning problem. To that end, we propose a latent conditional random field (CRF) model in which the observed variables are pixel category labels and the latent variables are visual word assignments. The CRF energy consists of a bottom-up segmentation cost, a top-down bag-of-(latent)-words categorization cost, and a dictionary learning cost. Together, these costs capture relationships between image features and visual words, between visual words and object categories, and spatial relationships among visual words. The segmentation, categorization, and dictionary learning parameters are learnt jointly using latent structural SVMs, and the segmentation and visual word assignments are inferred jointly using energy minimization techniques. Experiments on the Graz02 and CamVid datasets demonstrate the performance of our approach.
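
Schematically (our notation, not the paper's), with $y$ the observed pixel category labels, $z$ the latent visual-word assignments, and $x$ the image features, the energy described above decomposes as

$$ E(y, z; x) \;=\; E_{\mathrm{seg}}(y, z; x) + E_{\mathrm{cat}}(y, z) + E_{\mathrm{dict}}(z; x), $$

the three terms being the bottom-up segmentation cost, the top-down bag-of-latent-words categorization cost, and the dictionary-learning cost.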

Aastha Jain, Luca Zappella, Patrick McClure, René Vidal

Oral Session 6: Geometry and Recognition

People Watching: Human Actions as a Cue for Single View Geometry

We present an approach which exploits the coupling between human actions and scene geometry. We investigate the use of human pose as a cue for single-view 3D scene understanding. Our method builds upon recent advances in still-image pose estimation to extract functional and geometric constraints about the scene. These constraints are then used to improve state-of-the-art single-view 3D scene understanding approaches. The proposed method is validated on a collection of monocular time-lapse sequences collected from YouTube and a dataset of still images of indoor scenes. We demonstrate that observing people performing different actions can significantly improve estimates of 3D scene geometry.

David F. Fouhey, Vincent Delaitre, Abhinav Gupta, Alexei A. Efros, Ivan Laptev, Josef Sivic
Indoor Segmentation and Support Inference from RGBD Images

We present an approach to interpret the major surfaces, objects, and support relations of an indoor scene from an RGBD image. Most existing work ignores physical interactions or is applied only to tidy rooms and hallways. Our goal is to parse typical, often messy, indoor scenes into floor, walls, supporting surfaces, and object regions, and to recover support relationships. One of our main interests is to better understand how 3D cues can best inform a structured 3D interpretation. We also contribute a novel integer programming formulation to infer physical support relations. We offer a new dataset of 1449 RGBD images, capturing 464 diverse indoor scenes, with detailed annotations. Our experiments demonstrate our ability to infer support relations in complex scenes and verify that our 3D scene cues and inferred support lead to better object segmentation.

Nathan Silberman, Derek Hoiem, Pushmeet Kohli, Rob Fergus
Beyond the Line of Sight: Labeling the Underlying Surfaces

Scene understanding requires reasoning both about what we can see and about what is occluded. We offer a simple and general approach to infer the labels of occluded background regions. Our approach incorporates estimates of the visible surrounding background, detected objects, and shape priors from transferred training regions. We demonstrate the ability to infer the labels of occluded background regions in both the outdoor StreetScenes dataset and an indoor scene dataset using the same approach. Our experiments show that our method outperforms competitive baselines.

Ruiqi Guo, Derek Hoiem
Depth Extraction from Video Using Non-parametric Sampling

We describe a technique that automatically generates plausible depth maps from videos using non-parametric depth sampling. We demonstrate our technique in cases where past methods fail (non-translating cameras and dynamic scenes). Our technique is applicable to single images as well as videos. For videos, we use local motion cues to improve the inferred depth maps, while optical flow is used to ensure temporal depth consistency. For training and evaluation, we use a Kinect-based system to collect a large dataset containing stereoscopic videos with known depths. We show that our depth estimation technique outperforms the state-of-the-art on benchmark databases. Our technique can be used to automatically convert a monoscopic video into stereo for 3D visualization, and we demonstrate this through a variety of visually pleasing results for indoor and outdoor scenes, including results from the feature film Charade.
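
A highly simplified sketch of the non-parametric sampling idea (names and the fusion rule are ours; the paper additionally warps candidates, adds motion cues, and enforces temporal consistency through an optimisation): retrieve the most similar RGB-D exemplars with a global descriptor and fuse their depth maps.

    import numpy as np

    def transfer_depth(query_desc, train_descs, train_depths, k=7):
        """Non-parametric depth estimate by k-nearest-neighbour transfer.

        query_desc   : (d,) global descriptor of the query frame (e.g. GIST)
        train_descs  : (n, d) descriptors of RGB-D training frames
        train_depths : (n, h, w) their registered depth maps
        """
        dists = np.linalg.norm(train_descs - query_desc, axis=1)
        idx = np.argsort(dists)[:k]            # k most similar training frames
        candidates = train_depths[idx]         # (k, h, w) candidate depth maps
        # Fuse candidates; the paper instead optimises a warped, regularised objective.
        return np.median(candidates, axis=0)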

Kevin Karsch, Ce Liu, Sing Bing Kang
Multiple View Object Cosegmentation Using Appearance and Stereo Cues

We present an automatic approach to segment an object in calibrated images acquired from multiple viewpoints. Our system starts with a new piecewise planar layer-based stereo algorithm that estimates a dense depth map that consists of a set of 3D planar surfaces. The algorithm is formulated using an energy minimization framework that combines stereo and appearance cues, where for each surface, an appearance model is learnt using an unsupervised approach. By treating the planar surfaces as structural elements of the scene and reasoning about their visibility in multiple views, we segment the object in each image independently. Finally, these segmentations are refined by probabilistically fusing information across multiple views. We demonstrate that our approach can segment challenging objects with complex shapes and topologies, which may have thin structures and non-Lambertian surfaces. It can also handle scenarios where the object and background color distributions overlap significantly.

Adarsh Kowdle, Sudipta N. Sinha, Richard Szeliski

Poster Session 7

Elastic Shape Matching of Parameterized Surfaces Using Square Root Normal Fields

In this paper we define a new methodology for shape analysis of parameterized surfaces, where the main issues are: (1) choice of metric for shape comparisons and (2) invariance to reparameterization. We begin by defining a general elastic metric on the space of parameterized surfaces. The main advantages of this metric are twofold. First, it provides a natural interpretation of elastic shape deformations that are being quantified. Second, this metric is invariant under the action of the reparameterization group. We also introduce a novel representation of surfaces termed square root normal fields or SRNFs. This representation is convenient for shape analysis because, under this representation, a reduced version of the general elastic metric becomes the simple $\mathbb{L}^2$ metric. Thus, this transformation greatly simplifies the implementation of our framework. We validate our approach using multiple shape analysis examples for quadrilateral and spherical surfaces. We also compare the current results with those of Kurtek et al. [1]. We show that the proposed method results in more natural shape matchings, and furthermore, has some theoretical advantages over previous methods.
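
For reference, the square root normal field takes the following standard form in this line of work (notation ours): for a parameterized surface $f: D \to \mathbb{R}^3$ with unnormalized normal field $n(s) = f_u(s) \times f_v(s)$,

$$ q(s) \;=\; \frac{n(s)}{\sqrt{\lvert n(s) \rvert}}, \qquad d(f_1, f_2) \;=\; \Big( \int_D \lvert q_1(s) - q_2(s) \rvert^2 \, ds \Big)^{1/2}, $$

so that the reduced elastic metric is the plain $\mathbb{L}^2$ distance between the two fields, to be optimized over rotations and reparameterizations.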

Ian H. Jermyn, Sebastian Kurtek, Eric Klassen, Anuj Srivastava
N-tuple Color Segmentation for Multi-view Silhouette Extraction

We present a new method to extract multiple segmentations of an object viewed by multiple cameras, given only the camera calibration. We introduce the n-tuple color model to express inter-view consistency when inferring, in each view, the foreground and background color models that permit the final segmentation. A color n-tuple is a set of pixel colors associated with the n projections of a 3D point. The first goal is to find the MAP estimate of the background/foreground color models based on an arbitrary sample set of such n-tuples, such that samples are consistently classified, in a soft way, as “empty” if they project onto the background in at least one view, or “occupied” if they project to foreground pixels in all views. An Expectation-Maximization framework is then used to alternate between color models and soft classifications. In a final step, all views are segmented based on their attached color models. The approach is significantly simpler and faster than previous multi-view segmentation methods, while providing results of equivalent or better quality.
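
A toy, hard-decision version of the classification rule stated above, assuming per-view foreground/background colour likelihood functions p_fg[v] and p_bg[v] are already available (the paper keeps the classification soft and re-estimates the colour models inside an EM loop):

    def classify_tuple(colors, p_fg, p_bg):
        """Classify one n-tuple of pixel colours, one colour per view."""
        # Occupied only if the sample looks like foreground in every view;
        # background in at least one view makes it empty.
        per_view_fg = [p_fg[v](c) > p_bg[v](c) for v, c in enumerate(colors)]
        return "occupied" if all(per_view_fg) else "empty"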

Abdelaziz Djelouah, Jean-Sébastien Franco, Edmond Boyer, François Le Clerc, Patrick Pérez
Motion-Aware Structured Light Using Spatio-Temporal Decodable Patterns

Single-shot structured light methods allow 3D reconstruction of dynamic scenes. However, such methods lose spatial resolution and perform poorly around depth discontinuities. Previous single-shot methods project the same pattern repeatedly; thereby spatial resolution is reduced even if the scene is static or has slowly moving parts. We present a structured light system using a sequence of shifted stripe patterns that is decodable both spatially and temporally. By default, our method allows single-shot 3D reconstruction with any of our projected patterns by using spatial windows. Moreover, the sequence is designed so as to progressively improve the reconstruction quality around depth discontinuities by using temporal windows.

Our method enables motion-aware reconstruction for each pixel: the best spatio-temporal window is automatically selected depending on the scene structure, motion, and the number of available images. This significantly reduces the number of pixels around discontinuities where depth cannot be recovered in traditional approaches. Our decoding scheme extends the adaptive window matching commonly used in stereo by incorporating temporal windows with 1D spatial windows. We demonstrate the advantages of our approach for a variety of scenarios including thin structures, dynamic scenes, and scenes containing both static and dynamic regions.

Yuichi Taguchi, Amit Agrawal, Oncel Tuzel
Refractive Calibration of Underwater Cameras

In underwater computer vision, images are influenced by the water in two ways. First, while traveling through the water, light is absorbed and scattered, both in a wavelength-dependent manner, which creates the typical green or blue hue of underwater images. Second, when entering the underwater housing, the rays are refracted, affecting image formation geometrically. When using underwater images in, for example, Structure-from-Motion applications, both effects need to be taken into account. We therefore present a novel method for calibrating the parameters of an underwater camera housing. An evolutionary optimization algorithm is coupled with an analysis-by-synthesis approach, which makes it possible to calibrate the parameters of a light propagation model for the local water body. This leads to a highly accurate calibration of the camera-glass distance and the glass normal with respect to the optical axis. In addition, a model for the distance-dependent effect of water on light propagation is parametrized and can be used for color correction.

Anne Jordt-Sedlazeck, Reinhard Koch
Detection of Independently Moving Objects in Non-planar Scenes via Multi-Frame Monocular Epipolar Constraint

In this paper we present a novel approach for detecting independently moving foreground objects in non-planar scenes captured by a moving camera. We avoid the traditional assumptions that the stationary background of the scene is planar, that it can be approximated by one or several dominant planes, or that the camera used to capture the video is orthographic. Instead we utilize a multi-frame monocular epipolar constraint on camera motion, defined by an evolving epipolar plane between the moving camera center and the 3D scene points. This constraint is parameterized as a polynomial function of time and, unlike repeated computations of inter-frame fundamental matrices, requires the estimation of fewer unknowns and provides a more consistent separation between moving and static objects across noise levels. It allows us to segment out moving objects in a general 3D scene where other approaches fail because their initial assumptions do not hold, and it provides a natural way of fusing temporal information across multiple frames. We use a combination of optical flow and particle advection to capture all motion in the video across a number of frames, in the form of particle trajectories. We then apply the derived multi-frame epipolar constraint to these trajectories to determine which ones violate it, thereby segmenting out the independently moving objects. We show superior results on a number of moving-camera sequences observing non-planar scenes, where other methods fail.
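
One plausible, simplified way to write such a constraint (a schematic of ours, not the paper's exact derivation): a trajectory $\{x_{i,t}\}$ on the static background should satisfy an epipolar relation whose matrix varies polynomially with time,

$$ r_i(t) \;=\; \tilde{x}_{i,t}^{\top} F(t)\, \tilde{x}_{i,0}, \qquad F(t) \;=\; \sum_{k=0}^{K} A_k\, t^{k}, $$

with the coefficient matrices $A_k$ estimated once over a temporal window rather than a separate fundamental matrix per frame pair; trajectories whose residuals $r_i(t)$ remain large are flagged as independently moving.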

Soumyabrata Dey, Vladimir Reilly, Imran Saleemi, Mubarak Shah
Backmatter
Metadata
Title
Computer Vision – ECCV 2012
Edited by
Andrew Fitzgibbon
Svetlana Lazebnik
Pietro Perona
Yoichi Sato
Cordelia Schmid
Copyright year
2012
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-33715-4
Print ISBN
978-3-642-33714-7
DOI
https://doi.org/10.1007/978-3-642-33715-4
