
2012 | Book

Computer Vision – ECCV 2012

12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part VI

Edited by: Andrew Fitzgibbon, Svetlana Lazebnik, Pietro Perona, Yoichi Sato, Cordelia Schmid

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

The seven-volume set comprising LNCS volumes 7572-7578 constitutes the refereed proceedings of the 12th European Conference on Computer Vision, ECCV 2012, held in Florence, Italy, in October 2012. The 408 revised papers presented were carefully reviewed and selected from 1437 submissions. The papers are organized in topical sections on geometry, 2D and 3D shapes, 3D reconstruction, visual recognition and classification, visual features and image matching, visual monitoring: action and activities, models, optimisation, learning, visual tracking and image registration, photometry: lighting and colour, and image segmentation.

Table of Contents

Frontmatter

Poster Session 7

Shape from Angle Regularity

This paper deals with automatic Single View Reconstruction (SVR) of multi-planar scenes characterized by a profusion of straight lines and mutually orthogonal line-pairs. We provide a new shape-from-X constraint based on this regularity of angles between line-pairs in man-made scenes. First, we show how the presence of such regular angles can be used for 2D rectification of an image of a plane. Further, we propose an automatic SVR method assuming there are enough orthogonal line-pairs available on each plane. This angle regularity is only imposed on physically intersecting line-pairs, making it a local constraint. Unlike earlier literature, our approach does not make restrictive assumptions about the orientation of the planes or the camera and works for both indoor and outdoor scenes. Results are shown on challenging images which would be difficult to reconstruct for existing automatic SVR algorithms.

Aamer Zaheer, Maheen Rashid, Sohaib Khan
Pose Invariant Approach for Face Recognition at Distance

We propose an automatic pose invariant approach for Face Recognition At a Distance (FRAD). Since face alignment is a crucial step in face recognition systems, we propose a novel facial features extraction model, which guides an extended ASM to accurately align the face. Our main concern is to recognize human faces accurately and quickly in uncontrolled environments at far distances. To achieve this goal, we perform an offline stage where 3D faces are reconstructed from stereo pair images. These 3D shapes are used to synthesize virtual 2D views in novel poses. To obtain good synthesized images from the 3D shape, we propose an accurate 3D reconstruction framework, which carefully handles illumination variance, occlusion, and disparity discontinuities. The online phase is fast: a 2D image with unknown pose is matched with the closest virtual images in sampled poses. Experiments show that our approach outperforms state-of-the-art approaches.

Eslam Mostafa, Asem Ali, Naif Alajlan, Aly Farag
Minimal Correlation Classification

When the description of the visual data is rich and consists of many features, a classification based on a single model can often be enhanced using an ensemble of models. We suggest a new ensemble learning method that encourages the base classifiers to learn different aspects of the data. Initially, a binary classification algorithm such as Support Vector Machine is applied and its confidence values on the training set are considered. Following the idea that ensemble methods work best when the classification errors of the base classifiers are not related, we serially learn additional classifiers whose output confidences on the training examples are minimally correlated. Finally, these uncorrelated classifiers are assembled using the GentleBoost algorithm. Presented experiments in various visual recognition domains demonstrate the effectiveness of the method.
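
To make the correlation criterion concrete, here is a minimal Python sketch (assuming scikit-learn and NumPy): a pool of linear SVMs trained on random feature subsets stands in for the paper's serially trained base classifiers, and members are greedily kept so that their training-set confidences are minimally correlated with those already selected. The pool construction and selection rule are illustrative, not the authors' exact formulation, and the GentleBoost assembly step is omitted.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
y = (X[:, :5].sum(axis=1) > 0).astype(int)  # toy binary labels

pool, confidences = [], []
for _ in range(20):  # candidate classifiers on random feature subsets
    feats = rng.choice(50, size=15, replace=False)
    clf = LinearSVC(dual=False).fit(X[:, feats], y)
    pool.append((feats, clf))
    confidences.append(clf.decision_function(X[:, feats]))

# Serially keep classifiers whose training confidences are minimally
# correlated with those already selected (the paper's core criterion).
selected = [0]
for _ in range(4):
    corr = [max(abs(np.corrcoef(confidences[i], confidences[j])[0, 1])
                for j in selected)
            for i in range(len(pool))]
    for j in selected:
        corr[j] = np.inf  # never re-pick a member
    selected.append(int(np.argmin(corr)))
print("selected base classifiers:", selected)
```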

Noga Levy, Lior Wolf
Contextual Object Detection Using Set-Based Classification

We propose a new model for object detection that is based on set representations of the contextual elements. In this formulation, relative spatial locations and relative scores between pairs of detections are considered as sets of unordered items. Directly training classification models on sets of unordered items, where each set can have varying cardinality, can be difficult. In order to overcome this problem, we propose SetBoost, a discriminative learning algorithm for building set classifiers. The SetBoost classifiers are trained to rescore detected objects based on object-object and object-scene context. Our method is able to discover composite relationships, as well as intra-class and inter-class spatial relationships between objects. The experimental evidence shows that our set-based formulation performs comparably to or better than existing contextual methods on the SUN and the VOC 2007 benchmark datasets.

Ramazan Gokberk Cinbis, Stan Sclaroff
Age Invariant Face Verification with Relative Craniofacial Growth Model

Age-separated facial images usually have significant changes in both shape and texture. Although many face recognition algorithms have been proposed in the last two decades, the problem of recognizing facial images across aging remains open. In this paper, we propose a relative craniofacial growth model which is based on the science of craniofacial anthropometry. Compared to the traditional craniofacial growth model, the proposed method introduces a set of linear equations on the relative growth parameters which can be easily applied to facial image verification across aging. We then integrate the relative growth model with the Grassmann manifold and the SVM classifier. We also demonstrate how knowing the age could improve shape-based face recognition algorithms. Experiments show that the proposed method is able to mitigate the variations caused by the aging process and thus effectively improve the performance of open-set face verification across aging.

Tao Wu, Rama Chellappa
Inferring Gene Interaction Networks from ISH Images via Kernelized Graphical Models

New bio-technologies are being developed that allow high-throughput imaging of gene expressions, where each image captures the spatial gene expression pattern of a single gene in the tissue of interest. This paper addresses the problem of automatically inferring a gene interaction network from such images. We propose a novel kernel-based graphical model learning algorithm that is both convex and consistent. The algorithm uses multi-instance kernels to compute similarity between the expression patterns of different genes, and then minimizes the L1-regularized Bregman divergence to estimate a sparse gene interaction network. We apply our algorithm to a large, publicly available data set of gene expression images of Drosophila embryos, where our algorithm makes novel and interesting predictions.
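
As a rough illustration of the pipeline the abstract describes, the following Python sketch (assuming scikit-learn and NumPy) computes a multi-instance kernel between toy "gene" bags and then estimates a sparse precision matrix with the graphical lasso, one standard instance of L1-regularized divergence minimization; the data, kernel bandwidth, and regularization weight are all illustrative.

```python
import numpy as np
from sklearn.covariance import graphical_lasso

rng = np.random.default_rng(1)
# Each "gene" is a bag of local expression descriptors (toy stand-in data).
bags = [rng.normal(loc=i % 3, size=(rng.integers(5, 10), 4)) for i in range(8)]

def mi_kernel(A, B, gamma=0.5):
    """Multi-instance kernel: mean RBF similarity over all instance pairs."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2).mean()

K = np.array([[mi_kernel(a, b) for b in bags] for a in bags])
K += 1e-3 * np.eye(len(bags))          # small ridge for numerical stability

# Treat the kernel matrix as a covariance and estimate an L1-sparse
# precision matrix; nonzero off-diagonals are predicted interactions.
_, precision = graphical_lasso(K, alpha=0.05)
print((np.abs(precision) > 1e-3).astype(int))
```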

Kriti Puniyani, Eric P. Xing
Random Forest for Image Annotation

In this paper, we present a novel method for image annotation and make three contributions. Firstly, we propose to use the tags contained in the training images as the supervising information to guide the generation of random trees, ensuring that the retrieved nearest-neighbor images are not only visually alike but also semantically related. Secondly, different from conventional decision tree methods, which fuse the information contained at each leaf node individually, our method treats the random forest as a whole, and introduces the new concepts of semantic nearest neighbors (SNN) and semantic similarity measure (SSM). Thirdly, we annotate an image from the tags of its SNN based on SSM and have developed a novel learning-to-rank algorithm to systematically assign the optimal tags to the image. The new technique is intrinsically scalable, and we present experimental results to demonstrate that it is competitive with state-of-the-art methods.

Hao Fu, Qian Zhang, Guoping Qiu
(MP)²T: Multiple People Multiple Parts Tracker

We present a method for multi-target tracking that exploits the persistence in detection of object parts. While the implicit representation and detection of body parts have recently been leveraged for improved human detection, ours is the first method that attempts to temporally constrain the location of human body parts with the express purpose of improving pedestrian tracking. We pose the problem of simultaneous tracking of multiple targets and their parts in a network flow optimization framework and show that parts of this network need to be optimized separately and iteratively, due to inter-dependencies of node and edge costs. Given potential detections of humans and their parts separately, an initial set of pedestrian tracklets is first obtained, followed by explicit tracking of human parts as constrained by initial human tracking. A merging step is then performed whereby we attempt to include part-only detections for which the entire human is not observable. This step employs a selective appearance model, which allows us to skip occluded parts in description of positive training samples. The result is high confidence, robust trajectories of pedestrians as well as their parts, which essentially constrain each other’s locations and associations, thus improving human tracking and parts detection. We test our algorithm on multiple real datasets and show that the proposed algorithm is an improvement over the state-of-the-art.

Hamid Izadinia, Imran Saleemi, Wenhui Li, Mubarak Shah
Mixture Component Identification and Learning for Visual Recognition

The non-linear decision boundary between object and background classes - due to large intra-class variations - needs to be modelled by any classifier wishing to achieve good results. While a mixture of linear classifiers is capable of modelling this non-linearity, learning this mixture from weakly annotated data is non-trivial and is the paper’s focus. Our approach is to identify the modes in the distribution of our positive examples by clustering, and to utilize this clustering in a latent SVM formulation to learn the mixture model. The clustering relies on a robust measure of visual similarity which suppresses uninformative clutter by using a novel representation based on the exemplar SVM. This subtle clustering of the data leads to learning better mixture models, as is demonstrated via extensive evaluations on Pascal VOC 2007. The final classifier, using a HOG representation of the global image patch, achieves performance comparable to the state-of-the-art while being more efficient at detection time.

Omid Aghazadeh, Hossein Azizpour, Josephine Sullivan, Stefan Carlsson
Image Retrieval with Structured Object Queries Using Latent Ranking SVM

We consider image retrieval with structured object queries – queries that specify the objects that should be present in the scene, and their spatial relations. An example of such a query is “car on the road”. Existing image retrieval systems typically consider queries consisting of object classes (i.e. keywords). They train a separate classifier for each object class and combine the outputs heuristically. In contrast, we develop a learning framework to jointly consider object classes and their relations. Our method considers not only the objects in the query (“car” and “road” in the above example), but also related object categories that can be useful for retrieval. Since we do not have ground-truth labeling of object bounding boxes on the test image, we represent them as latent variables in our model. Our learning method is an extension of the ranking SVM with latent variables, which we call latent ranking SVM. We demonstrate image retrieval and ranking results on a dataset with more than a hundred object classes.

Tian Lan, Weilong Yang, Yang Wang, Greg Mori
A Probabilistic Derivative Measure Based on the Distribution of Intensity Difference

In this paper, we propose a novel derivative measure based on the probability of intensity difference, defined in terms of observed intensities and their true intensities. Because the true intensity cannot be estimated accurately from only two observed intensities, the probability is marginalized to consider the entire set of possible true values. The proposed measure not only handles intensity-dependent noise effectively using a distribution of intensity difference, but also computes the relevant difference of two corresponding pixels that have different true intensities, extending the same-intensity assumption of previous works. Using the proposed measure, the estimated image derivative for synthetic noisy signals is closer to the ground truth than with most previous measures. We apply the proposed measure to block matching and corner detection, which compute intensity similarity in the temporal domain and image derivatives in the spatial domain, respectively.
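
A hedged numerical sketch of the marginalization idea, in Python with NumPy: the probability of a given intensity difference is obtained by summing over a discretized grid of possible true intensity pairs under an assumed intensity-dependent Gaussian noise model. The noise model and binning here are illustrative, not the paper's calibrated distribution.

```python
import numpy as np

t = np.arange(256.0)  # discretized grid of possible true intensities

def sigma(i):
    """Toy intensity-dependent noise level (Poisson-like growth)."""
    return 1.0 + 0.05 * np.sqrt(i)

def diff_probability(o1, o2, d):
    """P(true difference ~ d | observations o1, o2), marginalized over
    all pairs of true intensities on the grid (Gaussian noise assumed)."""
    p1 = np.exp(-0.5 * ((o1 - t) / sigma(t)) ** 2) / sigma(t)
    p2 = np.exp(-0.5 * ((o2 - t) / sigma(t)) ** 2) / sigma(t)
    joint = np.outer(p1, p2)                         # independent pixels
    mask = np.abs(t[None, :] - t[:, None] - d) < 0.5  # bin where t2-t1 == d
    return joint[mask].sum() / joint.sum()

print(diff_probability(100.0, 110.0, d=10))  # likely difference
print(diff_probability(100.0, 110.0, d=0))   # much less likely
```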

Youngbae Hwang, In-So Kweon
Pairwise Rotation Invariant Co-occurrence Local Binary Pattern

In this work, we introduce a novel pairwise rotation invariant co-occurrence local binary pattern (PRI-CoLBP) feature which incorporates two types of context – spatial co-occurrence and orientation co-occurrence. Different from traditional rotation invariant local features, pairwise rotation invariant co-occurrence features preserve the relative angle between the orientations of individual features. The relative angle depicts the local curvature information, which is discriminative and rotation invariant. Experimental results on the CUReT, Brodatz, and KTH-TIPS texture datasets, the Flickr Material dataset, and the Oxford 102 Flower dataset further demonstrate the superior performance of the proposed feature on texture classification, material recognition, and flower recognition tasks.

Xianbiao Qi, Rong Xiao, Jun Guo, Lei Zhang
Per-patch Descriptor Selection Using Surface and Scene Properties

Local image descriptors are generally designed for describing all possible image patches. Such patches may be subject to complex variations in appearance due to incidental object, scene, and recording conditions. Because of this, a single best descriptor for accurate image representation under all conditions does not exist. Therefore, we propose to automatically select from a pool of descriptors the one best suited to the object surface and scene properties. These properties are measured on the fly from a single image patch through a set of attributes. The attributes are input to a classifier which selects the best descriptor. Our experiments on a large dataset of colored object patches show that the proposed selection method outperforms the best single descriptor and a priori combinations of the descriptor pool.

Ivo Everts, Jan C. van Gemert, Theo Gevers
Mixed-Resolution Patch-Matching

Matching patches of a source image with patches of itself or a target image is a first step for many operations. Finding the optimum nearest neighbors of each patch using a global search of the image is expensive. Optimality is often sacrificed for speed as a result. We present the Mixed-Resolution Patch-Matching (MRPM) algorithm that uses a pyramid representation to perform fast global search. We compare mixed-resolution patches at coarser pyramid levels to alleviate the effects of smoothing. We store more matches at coarser resolutions to ensure wider search ranges and better accuracy at finer levels. Our method achieves near optimality in terms of average error compared to exhaustive search. Our approach is simple compared to the complex trees or hash tables used by others. This enables fast parallel implementations on the GPU, yielding up to 70× speedup compared to other iterative approaches. Our approach is best suited when multiple, global matches are needed.
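
The coarse-to-fine search strategy can be sketched in a few lines of Python (NumPy only): an exhaustive match is performed at the coarsest pyramid level, where it is cheap, and the result is refined within a small window at each finer level. This illustrates the pyramid idea only; MRPM's mixed-resolution patch comparison and multiple-match bookkeeping are not reproduced.

```python
import numpy as np

def downsample(img):
    """Halve resolution by 2x2 block averaging (one pyramid step)."""
    h, w = (img.shape[0] // 2) * 2, (img.shape[1] // 2) * 2
    return img[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def patch_cost(a, b, y, x, py, px, p):
    return ((a[y:y+p, x:x+p] - b[py:py+p, px:px+p]) ** 2).sum()

def coarse_to_fine_match(src, tgt, y, x, p=8, levels=3, radius=2):
    """Match src's pxp patch at (y, x) against tgt: exhaustive search
    at the coarsest pyramid level, then refine in a small window."""
    pyr_s, pyr_t = [src], [tgt]
    for _ in range(levels - 1):
        pyr_s.append(downsample(pyr_s[-1]))
        pyr_t.append(downsample(pyr_t[-1]))
    cy, cx = y >> (levels - 1), x >> (levels - 1)
    s, t_, pc = pyr_s[-1], pyr_t[-1], max(p >> (levels - 1), 2)
    # Exhaustive search only at the cheap, coarse level.
    cands = [(patch_cost(s, t_, cy, cx, j, i, pc), j, i)
             for j in range(t_.shape[0] - pc)
             for i in range(t_.shape[1] - pc)]
    _, by, bx = min(cands)
    for lvl in range(levels - 2, -1, -1):      # refine down the pyramid
        by, bx = by * 2, bx * 2
        s, t_ = pyr_s[lvl], pyr_t[lvl]
        sy, sx, pl = y >> lvl, x >> lvl, max(p >> lvl, 2)
        window = [(patch_cost(s, t_, sy, sx, j, i, pl), j, i)
                  for j in range(max(by - radius, 0),
                                 min(by + radius + 1, t_.shape[0] - pl))
                  for i in range(max(bx - radius, 0),
                                 min(bx + radius + 1, t_.shape[1] - pl))]
        _, by, bx = min(window)
    return by, bx

rng = np.random.default_rng(2)
img = rng.normal(size=(64, 64))
print(coarse_to_fine_match(img, img, 10, 12))  # self-match: lands at (10, 12)
```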

Harshit Sureka, P. J. Narayanan
Exploiting Sparse Representations for Robust Analysis of Noisy Complex Video Scenes

Recent works have shown that, even with simple low-level visual cues, complex behaviors can be extracted automatically from crowded scenes, e.g. those depicting public spaces recorded from video surveillance cameras. However, low-level features such as optical flow or foreground pixels are inherently noisy. In this paper we propose a novel unsupervised learning approach for the analysis of complex scenes which is specifically tailored to cope directly with the features' noise and uncertainty. We formalize the task of extracting activity patterns as a matrix factorization problem, considering as reconstruction function the robust Earth Mover's Distance. A constraint of sparsity on the computed basis matrix is imposed, filtering out noise and leading to the identification of the most relevant elementary activities in a typical high-level behavior. We further derive an alternate optimization approach to solve the proposed problem efficiently and show that it reduces to a sequence of linear programs. Finally, we propose to use short trajectory snippets to account for object motion information, as an alternative to the noisy optical flow vectors used in previous works. Experimental results demonstrate that our method yields similar or superior performance to state-of-the-art approaches.

Gloria Zen, Elisa Ricci, Nicu Sebe
KAZE Features

In this paper, we introduce KAZE features, a novel multiscale 2D feature detection and description algorithm in nonlinear scale spaces. Previous approaches detect and describe features at different scale levels by building or approximating the Gaussian scale space of an image. However, Gaussian blurring does not respect the natural boundaries of objects and smoothes details and noise to the same degree, reducing localization accuracy and distinctiveness. In contrast, we detect and describe 2D features in a nonlinear scale space by means of nonlinear diffusion filtering. In this way, we can make blurring locally adaptive to the image data, reducing noise but retaining object boundaries, and obtain superior localization accuracy and distinctiveness. The nonlinear scale space is built using efficient Additive Operator Splitting (AOS) techniques and variable conductance diffusion. We present an extensive evaluation on benchmark datasets and a practical matching application on deformable surfaces. Even though our features are somewhat more expensive to compute than SURF due to the construction of the nonlinear scale space (though comparable to SIFT), our results reveal a step forward in performance, both in detection and description, over previous state-of-the-art methods.
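
For intuition about nonlinear diffusion filtering, here is a minimal Python sketch of one explicit Perona-Malik step, where a conductance term suppresses smoothing across strong gradients; KAZE builds its scale space with the implicit AOS scheme, which allows much larger time steps, so this is only a simplified stand-in.

```python
import numpy as np

def diffusion_step(img, dt=0.2, k=0.05):
    """One explicit Perona-Malik diffusion step: smooths where gradients
    are small, preserves strong edges. (KAZE uses the implicit AOS
    scheme instead; this explicit version is a sketch.)"""
    gy, gx = np.gradient(img)
    c = 1.0 / (1.0 + (gx**2 + gy**2) / k**2)     # conductance (PM g2)
    fy, fx = c * gy, c * gx                      # diffusion flux
    div = np.gradient(fy, axis=0) + np.gradient(fx, axis=1)
    return img + dt * div

rng = np.random.default_rng(3)
step = np.zeros((32, 32)); step[:, 16:] = 1.0    # an ideal edge
noisy = step + 0.05 * rng.normal(size=step.shape)
u = noisy
for _ in range(20):
    u = diffusion_step(u)
# Noise is reduced while the edge survives.
print(float(np.abs(u - step).mean()), "<", float(np.abs(noisy - step).mean()))
```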

Pablo Fernández Alcantarilla, Adrien Bartoli, Andrew J. Davison
Online Moving Camera Background Subtraction

Recently, several methods for background subtraction from a moving camera were proposed. They use bottom-up cues to segment video frames into foreground and background regions. Because they lack explicit models, they can easily fail to detect a foreground object when such cues are ambiguous in certain parts of the video. This becomes even more challenging when videos need to be processed online. We present a method which enables learning of pixel-based models for foreground and background regions and, in addition, segments each frame in an online framework. The method uses long-term trajectories along with a Bayesian filtering framework to estimate motion and appearance models. We compare our method to previous approaches and show results on challenging video sequences.

Ali Elqursh, Ahmed Elgammal
Coregistration: Simultaneous Alignment and Modeling of Articulated 3D Shape

Three-dimensional (3D) shape models are powerful because they enable the inference of object shape from incomplete, noisy, or ambiguous 2D or 3D data. For example, realistic parameterized 3D human body models have been used to infer the shape and pose of people from images. To train such models, a corpus of 3D body scans is typically brought into registration by aligning a common 3D human-shaped template to each scan. This is an ill-posed problem that typically involves solving an optimization problem with regularization terms that penalize implausible deformations of the template. When aligning a corpus, however, we can do better than generic regularization. If we have a model of how the template can deform, then alignments can be regularized by this model. Constructing a model of deformations, however, requires having a corpus that is already registered. We address this chicken-and-egg problem by approaching modeling and registration together. By minimizing a single objective function, we reliably obtain high quality registration of noisy, incomplete laser scans, while simultaneously learning a highly realistic articulated body model. The model greatly improves robustness to noise and missing data. Since the model explains a corpus of body scans, it captures how body shape varies across people and poses.

David A. Hirshberg, Matthew Loper, Eric Rachlin, Michael J. Black
Motion Interchange Patterns for Action Recognition in Unconstrained Videos

Action recognition in videos is an active research field that is fueled by an acute need spanning several application domains. Still, existing systems fall short of the applications' needs in real-world scenarios, where the quality of the video is less than optimal and the viewpoint is uncontrolled and often not static. In this paper, we consider the key elements of motion encoding and focus on capturing local changes in motion directions. In addition, we decouple image edges from motion edges using a suppression mechanism, and compensate for global camera motion by using an especially fitted registration scheme. Combined with a standard bag-of-words technique, our method achieves state-of-the-art performance in the most recent and challenging benchmarks.

Orit Kliper-Gross, Yaron Gurovich, Tal Hassner, Lior Wolf
A Non-parametric Hierarchical Model to Discover Behavior Dynamics from Tracks

We present a novel non-parametric Bayesian model to jointly discover the dynamics of low-level actions and high-level behaviors of tracked people in open environments. Our model represents behaviors as Markov chains of actions which capture high-level temporal dynamics. Actions may be shared by various behaviors and represent spatially localized occurrences of a person’s low-level motion dynamics using Switching Linear Dynamics Systems. Since the model handles real-valued features directly, we do not lose information by quantizing measurements to ‘visual words’ and can thus discover variations in standing, walking and running without discrete thresholds. We describe inference using Gibbs sampling and validate our approach on several artificial and real-world tracking datasets. We show that our model can distinguish relevant behavior patterns that an existing state-of-the-art method for hierarchical clustering cannot.

Julian F. P. Kooij, Gwenn Englebienne, Dariu M. Gavrila
Scene Semantics from Long-Term Observation of People

Our everyday objects support various tasks and can be used by people for different purposes. While object classification is a widely studied topic in computer vision, recognition of object function, i.e., what people can do with an object and how they do it, is rarely addressed. In this paper we construct a functional object description with the aim to recognize objects by the way people interact with them. We describe scene objects (sofas, tables, chairs) by associated human poses and object appearance. Our model is learned discriminatively from automatically estimated body poses in many realistic scenes. In particular, we make use of time-lapse videos from YouTube providing a rich source of common human-object interactions and minimizing the effort of manual object annotation. We show how the models learned from human observations significantly improve object recognition and enable prediction of characteristic human poses in new scenes. Results are shown on a dataset of more than 400,000 frames obtained from 146 time-lapse videos of challenging and realistic indoor scenes.

Vincent Delaitre, David F. Fouhey, Ivan Laptev, Josef Sivic, Abhinav Gupta, Alexei A. Efros
Efficient Exact Inference for 3D Indoor Scene Understanding

In this paper we propose the first exact solution to the problem of estimating the 3D room layout from a single image. This problem is typically formulated as inference in a Markov random field, where potentials count image features (e.g., geometric context, orientation maps, lines in accordance with vanishing points) in each face of the layout. We present a novel branch and bound approach which splits the label space in terms of candidate sets of 3D layouts, and efficiently bounds the potentials in these sets by restricting the contribution of each individual face. We employ integral geometry in order to evaluate these bounds in constant time, and as a consequence, we not only obtain the exact solution, but also in less time than approximate inference tools such as message-passing. We demonstrate the effectiveness of our approach in two benchmarks and show that our bounds are tight, and only a few evaluations are necessary.
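
The constant-time evaluation rests on integral images; the sketch below (Python, NumPy) shows the standard construction, with rectangular regions standing in for the more general layout faces bounded in the paper.

```python
import numpy as np

def integral_image(f):
    """Summed-area table with a zero row/column pad."""
    s = np.zeros((f.shape[0] + 1, f.shape[1] + 1))
    s[1:, 1:] = f.cumsum(0).cumsum(1)
    return s

def box_sum(s, y0, x0, y1, x1):
    """Sum of f over the rectangle [y0, y1) x [x0, x1), in O(1)."""
    return s[y1, x1] - s[y0, x1] - s[y1, x0] + s[y0, x0]

feat = np.random.default_rng(4).random((480, 640))  # e.g., per-pixel scores
S = integral_image(feat)
print(np.isclose(box_sum(S, 100, 200, 300, 400),
                 feat[100:300, 200:400].sum()))
```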

Alexander G. Schwing, Raquel Urtasun
Seam Segment Carving: Retargeting Images to Irregularly-Shaped Image Domains

Image retargeting algorithms aim to adapt the image to the display screen with the goal of preserving the image content as much as possible. However, existing methods and research efforts have mostly been directed towards retargeting algorithms that retarget images to rectangular domains. This significantly hampers their application to broader classes of display devices and platforms for which the display area can have an arbitrary shape. For example, seam carving-based methods retarget images by carving out seams that run from the top to the bottom of the images, and this results in changing the width and therefore aspect ratio of the image without changing the shape of the image boundary in any essential way. However, by carving out appropriately-chosen seam segments, seams that are not required to cut across the entire image, it is possible to retarget images to a broader array of image domains with non-rectangular boundaries. Based on this simple idea of carving out seam segments, the main contribution of this paper is a novel image retargeting algorithm that is capable of retargeting images to non-rectangular domains. We evaluate the effectiveness of the proposed method on a number of challenging indoor and outdoor scene images, and the results demonstrate that the proposed algorithm is both efficient and effective, capable of providing good-quality retargeted images for a variety of interesting boundary shapes.

Shaoyu Qi, Jeffrey Ho
Estimation of Intrinsic Image Sequences from Image+Depth Video

We present a technique for estimating intrinsic images from image+depth video, such as that acquired from a Kinect camera. Intrinsic image decomposition in this context has importance in applications like object modeling, in which surface colors need to be recovered without illumination effects. The proposed method is based on two new types of decomposition constraints derived from the multiple viewpoints and reconstructed 3D scene geometry of the video data. The first type provides shading constraints that enforce relationships among the shading components of different surface points according to their similarity in surface orientation. The second type imposes temporal constraints that favor consistency in the intrinsic color of a surface point seen in different video frames, which improves decomposition in cases of view-dependent non-Lambertian reflections. Local and non-local variants of the two constraints are employed in a manner complementary to the local and non-local reflectance constraints used in previous works. Together they are formulated within a linear system that allows for efficient optimization. Experimental results demonstrate that each of the new constraints appreciably elevates the quality of intrinsic image estimation, and that they jointly yield decompositions that compare favorably to current techniques.

Kyong Joon Lee, Qi Zhao, Xin Tong, Minmin Gong, Shahram Izadi, Sang Uk Lee, Ping Tan, Stephen Lin
Bayesian Blind Deconvolution with General Sparse Image Priors

We present a general method for blind image deconvolution using Bayesian inference with super-Gaussian sparse image priors. We consider a large family of priors suitable for modeling natural images, and develop the general procedure for estimating the unknown image and the blur. Our formulation includes a number of existing modeling and inference methods as special cases while providing additional flexibility in image modeling and algorithm design. We also present an analysis of the proposed inference compared to other methods and discuss its advantages. Theoretical and experimental results demonstrate that the proposed formulation is very effective, efficient, and flexible.

S. Derin Babacan, Rafael Molina, Minh N. Do, Aggelos K. Katsaggelos
3D²PM – 3D Deformable Part Models

As objects are inherently 3-dimensional, they were modeled in 3D in the early days of computer vision. Due to the ambiguities arising from mapping 2D features to 3D models, 2D feature-based models are the predominant paradigm in object recognition today. While such models have shown competitive bounding box (BB) detection performance, they are clearly limited in their capability for fine-grained reasoning in 3D or continuous viewpoint estimation as required for advanced tasks such as 3D scene understanding. This work extends the deformable part model [1] to a 3D object model. It consists of multiple parts modeled in 3D and a continuous appearance model. As a result, the model generalizes beyond BB-oriented object detection and can be jointly optimized in a discriminative fashion for object detection and viewpoint estimation. Our 3D Deformable Part Model (3D²PM) leverages CAD data of the object class as a 3D geometry proxy.

Bojan Pepik, Peter Gehler, Michael Stark, Bernt Schiele
Efficient Similarity Derived from Kernel-Based Transition Probability

Semi-supervised learning effectively integrates labeled and unlabeled samples for classification, and most of the methods are founded on the pair-wise similarities between the samples. In this paper, we propose methods to construct similarities from the probabilistic viewpoint, whereas similarities have so far been formulated in a heuristic manner, such as by k-NN. We first propose the kernel-based formulation of transition probabilities via considering kernel least squares in the probabilistic framework. The similarities are consequently derived from the kernel-based transition probabilities which are efficiently computed, and the similarities are inherently sparse without applying k-NN. In the case of multiple types of kernel functions, the multiple transition probabilities are also obtained correspondingly. From the probabilistic viewpoint, they can be integrated with prior probabilities, i.e., linear weights, and we propose a computationally efficient method to optimize the weights in a discriminative manner, as in multiple kernel learning. The novel similarity is thereby constructed by the composite transition probability, and it benefits semi-supervised learning methods as well. In various experiments on semi-supervised learning problems, the proposed methods demonstrate favorable performance, compared to other methods, in terms of classification accuracy and computation time.

Takumi Kobayashi, Nobuyuki Otsu
A Convex Discrete-Continuous Approach for Markov Random Fields

We propose an extension of the well-known LP relaxation for Markov random fields to explicitly allow continuous label spaces. Unlike conventional continuous formulations of labelling problems which assume that the unary and pairwise potentials are convex, our formulation allows them to be general piecewise convex functions with continuous domains. Furthermore, we present the extension of the widely used efficient scheme for handling L1 smoothness priors over discrete ordered label sets to continuous label spaces. We provide a theoretical analysis of the proposed model, and empirically demonstrate that labelling problems with huge or continuous label spaces can benefit from our discrete-continuous representation.

Christopher Zach, Pushmeet Kohli
Generalized Roof Duality for Multi-Label Optimization: Optimal Lower Bounds and Persistency

We extend the concept of generalized roof duality from pseudo-boolean functions to real-valued functions over multi-label variables. In particular, we prove that an analogue of the persistency property holds for energies of any order with any number of linearly ordered labels. Moreover, we show how the optimal submodular relaxation can be constructed in the first-order case.

Thomas Windheuser, Hiroshi Ishikawa, Daniel Cremers
Sparse Embedding: A Framework for Sparsity Promoting Dimensionality Reduction

We introduce a novel framework, called sparse embedding (SE), for simultaneous dimensionality reduction and dictionary learning. We formulate an optimization problem for learning a transformation from the original signal domain to a lower-dimensional one in a way that preserves the sparse structure of data. We propose an efficient optimization algorithm and present its non-linear extension based on kernel methods. One of the key features of our method is that it is computationally efficient, as the learning is done in the lower-dimensional space, and it discards the irrelevant part of the signal that derails the dictionary learning process. Various experiments show that our method is able to capture the meaningful structure of data and can perform significantly better than many competitive algorithms on signal recovery and object classification tasks.

Hien V. Nguyen, Vishal M. Patel, Nasser M. Nasrabadi, Rama Chellappa
Automatic Localization of Balloon Markers and Guidewire in Rotational Fluoroscopy with Application to 3D Stent Reconstruction

A fully automatic framework is proposed to identify consistent landmarks and wire structures in a rotational X-ray scan. In our application, we localize the balloon marker pair and the guidewire in between the marker pair on each projection angle from a rotational fluoroscopic sequence. We present an effective offline balloon marker tracking algorithm that leverages learning based detectors and employs the Viterbi algorithm to track the balloon markers in a globally optimal manner. Localizing the guidewire in between the tracked markers is formulated as tracking the middle control point of the spline fitting the guidewire. The experimental studies demonstrate that our methods achieve a marker tracking accuracy of 96.33% and a mean guidewire localization error of 0.46 mm, suggesting a great potential of our methods for clinical applications. The proposed offline marker tracking method is also successfully applied to the problem of automatic self-initialization of generic online marker trackers for 2D live fluoroscopy stream, demonstrating a success rate of 95.9% on 318 sequences. Its potential applications also include localization of landmarks in a generic rotational scan.
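
The Viterbi stage can be illustrated with a generic dynamic program over per-frame candidate detections; this Python sketch (NumPy) uses a squared-displacement motion penalty, which is an illustrative choice rather than the paper's learned costs.

```python
import numpy as np

def viterbi_track(candidates, scores, motion_weight=0.1):
    """Pick one candidate (y, x) per frame maximizing detector score
    minus a squared-displacement motion penalty, via dynamic programming."""
    T = len(candidates)
    cost = [np.asarray(s, float) for s in scores]     # per-frame log-scores
    back = [None] * T
    for t in range(1, T):
        prev = np.asarray(candidates[t - 1], float)   # (Np, 2)
        cur = np.asarray(candidates[t], float)        # (Nc, 2)
        d2 = ((cur[:, None, :] - prev[None, :, :]) ** 2).sum(-1)
        trans = cost[t - 1][None, :] - motion_weight * d2
        back[t] = trans.argmax(1)                     # best predecessor
        cost[t] = cost[t] + trans.max(1)
    path = [int(cost[-1].argmax())]
    for t in range(T - 1, 0, -1):                     # backtrack
        path.append(int(back[t][path[-1]]))
    return [candidates[t][i] for t, i in enumerate(reversed(path))]

# Two noisy detections per frame; the temporally consistent track wins
# even though the jumpy candidate has a higher per-frame score.
cands = [[(10 + t, 10 + t), (40 - 7 * (t % 2), 5)] for t in range(5)]
scores = [[1.0, 1.2]] * 5
print(viterbi_track(cands, scores))
```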

Yu Wang, Terrence Chen, Peng Wang, Christopher Rohkohl, Dorin Comaniciu
Improving NCC-Based Direct Visual Tracking

Direct visual tracking can be impaired by changes in illumination if the right choice of similarity function and photometric model is not made. Tracking using the sum of squared differences, for instance, often needs to be coupled with a photometric model to mitigate illumination changes. More sophisticated similarities, e.g. mutual information and cross cumulative residual entropy, can cope with complex illumination variations, but at the cost of a reduced convergence radius and increased computational effort. In this context, the normalized cross correlation (NCC) represents an interesting alternative. The NCC is intrinsically invariant to affine illumination changes and also has low computational cost. This article proposes a new direct visual tracking method based on the NCC. Two techniques have been developed to improve robustness to complex illumination variations and partial occlusions. These techniques are based on subregion clusterization and on weighting by a residue invariant to affine illumination changes. The last contribution is an efficient Newton-style optimization procedure that does not require the explicit computation of the Hessian. The proposed method is compared against the state of the art using a benchmark database with ground truth, as well as real-world sequences.
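
The property that motivates the method, NCC's invariance to affine illumination changes, is easy to verify; a minimal NumPy sketch:

```python
import numpy as np

def ncc(a, b, eps=1e-9):
    """Normalized cross correlation: invariant to affine intensity
    changes b -> alpha * b + beta, the property the tracker relies on."""
    a = (a - a.mean()) / (a.std() + eps)
    b = (b - b.mean()) / (b.std() + eps)
    return float((a * b).mean())

rng = np.random.default_rng(5)
patch = rng.random((21, 21))
print(ncc(patch, 3.0 * patch + 7.0))     # ~1.0 despite gain and bias
print(ncc(patch, rng.random((21, 21))))  # ~0.0 for an unrelated patch
```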

Glauco Garcia Scandaroli, Maxime Meilland, Rogério Richa
Simultaneous Compaction and Factorization of Sparse Image Motion Matrices

Matrices that collect the image coordinates of point features tracked through video – one column per feature – often have low rank, either exactly or approximately. This observation has led to many matrix factorization methods for 3D reconstruction, motion segmentation, or regularization of feature trajectories. However, temporary occlusions, image noise, and variations in lighting, pose, or object geometry often confound trackers. A feature that reappears after a temporary tracking failure – whatever the cause – is regarded as a new feature by typical tracking systems, resulting in very sparse matrices with many columns and rendering factorization problematic. We propose a method to simultaneously factor and compact such a matrix by merging groups of columns that correspond to the same feature into single columns. This combination of compaction and factorization makes trackers more resilient to changes in appearance and short occlusions. Preliminary experiments show that imputation of missing matrix entries – and therefore matrix factorization – becomes significantly more reliable as a result. Clean column merging also required us to develop a history-sensitive feature reinitialization method we call feature snapping, which aligns merged feature trajectory segments precisely to each other.

Susanna Ricco, Carlo Tomasi
Low-Rank Sparse Learning for Robust Visual Tracking

In this paper, we propose a new particle-filter based tracking algorithm that exploits the relationship between particles (candidate targets). By representing particles as sparse linear combinations of dictionary templates, this algorithm capitalizes on the inherent low-rank structure of particle representations that are learned jointly. As such, it casts the tracking problem as a low-rank matrix learning problem. This low-rank sparse tracker (LRST) has a number of attractive properties. (1) Since LRST adaptively updates dictionary templates, it can handle significant changes in appearance due to variations in illumination, pose, scale, etc. (2) The linear representation in LRST explicitly incorporates background templates in the dictionary and a sparse error term, which enables LRST to address the tracking drift problem and to be robust to occlusion, respectively. (3) LRST is computationally attractive, since the low-rank learning problem can be efficiently solved as a sequence of closed-form update operations, which yield a time complexity that is linear in the number of particles and the template size. We evaluate the performance of LRST by applying it to a set of challenging video sequences and comparing it to 6 popular tracking methods. Our experiments show that by representing particles jointly, LRST not only outperforms the state-of-the-art in tracking accuracy but also significantly improves the time complexity of methods that use a similar sparse linear representation model for particles [1].
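
The closed-form updates mentioned in (3) typically involve a singular-value thresholding step, the proximal operator of the nuclear norm; this NumPy sketch shows that operator on a toy low-rank matrix, as an illustration of the kind of update used rather than the tracker's full algorithm.

```python
import numpy as np

def svt(Z, tau):
    """Singular-value thresholding: the closed-form proximal operator
    of the nuclear norm, used to impose low rank on a matrix of jointly
    learned representations."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(6)
low_rank = rng.random((50, 3)) @ rng.random((3, 40))   # rank-3 ground truth
noisy = low_rank + 0.05 * rng.normal(size=low_rank.shape)
denoised = svt(noisy, tau=1.0)
# The rank-3 structure survives while the noise components are removed.
print(np.linalg.matrix_rank(denoised, tol=1e-6))       # approximately 3
```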

Tianzhu Zhang, Bernard Ghanem, Si Liu, Narendra Ahuja
Towards Optimal Design of Time and Color Multiplexing Codes

Multiplexed illumination has proved valuable and beneficial, in terms of noise reduction, in wide applications of computer vision and graphics, provided that the limitations of photon noise and saturation are properly tackled. Existing optimal multiplexing codes, in the sense of maximum signal-to-noise ratio (SNR), are primarily designed for time multiplexing, but they only apply to a multiplexing system requiring the number of measurements (M) to equal the number of illumination sources (N). In this paper, we formulate a general code design problem, where M ≥ N, for time and color multiplexing, and develop a sequential semi-definite programming approach to deal with the formulated optimization problem. The proposed formulation and method can be readily specialized to time multiplexing, thereby giving such optimized codes a much broader application. Computer simulations reveal the main merit of the method: a significant boost in SNR as M increases. Experiments are also presented to demonstrate the effectiveness and superiority of the method in object illumination.

Tsung-Han Chan, Kui Jia, Eliot Wycoff, Chong-Yung Chi, Yi Ma
Dating Historical Color Images

We introduce the task of automatically estimating the age of historical color photographs. We suggest features which attempt to capture temporally discriminative information based on the evolution of color imaging processes over time and evaluate the performance of both these novel features and existing features commonly utilized in other problem domains on a novel historical image data set. For the challenging classification task of sorting historical color images into the decade during which they were photographed, we demonstrate significantly greater accuracy than that shown by untrained humans on the same data set. Additionally, we apply the concept of data-driven camera response function estimation to historical color imagery, demonstrating its relevance to both the age estimation task and the popular application of imitating the appearance of vintage color photography.

Frank Palermo, James Hays, Alexei A. Efros
Rainbow Flash Camera: Depth Edge Extraction Using Complementary Colors

We present a novel color multiplexing method for extracting depth edges in a scene. It has been shown that casting shadows from different light positions provides a simple yet robust cue for extracting depth edges. Instead of flashing a single light source at a time as in conventional methods, our method flashes all light sources simultaneously to reduce the number of captured images. We use a ring light source around a camera and arrange colors on the ring such that the colors form a hue circle. Because complementary colors are arranged at any position and its antipole on the ring, shadow regions where half of the hue circle is occluded are colorized according to the orientations of depth edges, while non-shadow regions where all the hues are mixed have a neutral color in the captured image. In an ideal situation, the colored shadows in a single image directly provide depth edges and their orientations. In practice, we present a robust depth edge extraction algorithm using an additional image captured by rotating the hue circle by 180°. We demonstrate the advantages of our approach using a camera prototype consisting of a standard camera and 8 color LEDs.

Yuichi Taguchi
Stixels Motion Estimation without Optical Flow Computation

This paper presents a new approach to estimate the motion of objects seen from a stereo rig mounted on a ground mobile robot. We exploit prior knowledge of ground plane presence and the rough shape of objects to extract a simplified world model, named the stixel world. The contribution of this paper is to show that stixel motion can be estimated directly by solving a single dynamic programming problem instead of an image-wide optical flow computation. We compare this new method with baseline methods and show competitive results quality-wise and a significant gain speed-wise.

Bertan Günyel, Rodrigo Benenson, Radu Timofte, Luc Van Gool
Video Matting Using Multi-frame Nonlocal Matting Laplacian

We present an algorithm for extracting high quality temporally coherent alpha mattes of objects from a video. Our approach extends the conventional image matting approach, i.e. closed-form matting, to video by using a multi-frame nonlocal matting Laplacian. Our multi-frame nonlocal matting Laplacian is defined over a nonlocal neighborhood in the spatio-temporal domain, and it solves the alpha mattes of several video frames all together simultaneously. To speed up computation and to reduce the memory requirement for solving the multi-frame nonlocal matting Laplacian, we use approximate nearest neighbor (ANN) search to find the nonlocal neighborhood and a k-d tree implementation to divide the nonlocal matting Laplacian into several smaller linear systems. Finally, we adopt nonlocal mean regularization to enhance temporal coherence of the estimated alpha mattes and to correct alpha matte errors at low contrast regions. We demonstrate the effectiveness of our approach on various examples with qualitative comparisons to the results from previous matting algorithms.

Inchang Choi, Minhaeng Lee, Yu-Wing Tai
Super-Resolution-Based Inpainting

This paper introduces a new exemplar-based inpainting framework. A coarse version of the input image is first inpainted by non-parametric patch sampling. Compared to existing approaches, several improvements have been made (e.g. filling order computation, combination of K nearest neighbours). Inpainting a coarse version of the input image reduces the computational complexity, makes the method less sensitive to noise, and allows it to work with the dominant orientations of image structures. From the low-resolution inpainted image, single-image super-resolution is applied to recover the details of missing areas. Experimental results on natural images and texture synthesis demonstrate the effectiveness of the proposed method.

Olivier Le Meur, Christine Guillemot
Fast Planar Correlation Clustering for Image Segmentation

We describe a new optimization scheme for finding high-quality clusterings in planar graphs that uses weighted perfect matching as a subroutine. Our method provides lower-bounds on the energy of the optimal correlation clustering that are typically fast to compute and tight in practice. We demonstrate our algorithm on the problem of image segmentation where this approach outperforms existing global optimization techniques in minimizing the objective and is competitive with the state of the art in producing high-quality segmentations.

Julian Yarkony, Alexander Ihler, Charless C. Fowlkes

Oral Session 7: Lights, Action!

Reflectance and Natural Illumination from a Single Image

Estimating reflectance and natural illumination from a single image of an object of known shape is a challenging task due to the ambiguities between reflectance and illumination. Although there is an inherent limitation in what can be recovered as the reflectance band-limits the illumination, explicitly estimating both is desirable for many computer vision applications. Achieving this estimation requires that we derive and impose strong constraints on both variables. We introduce a probabilistic formulation that seamlessly incorporates such constraints as priors to arrive at the maximum a posteriori estimates of reflectance and natural illumination. We begin by showing that reflectance modulates the natural illumination in a way that increases its entropy. Based on this observation, we impose a prior on the illumination that favors lower entropy while conforming to natural image statistics. We also impose a prior on the reflectance based on the directional statistics BRDF model that constrains the estimate to lie within the bounds and variability of real-world materials. Experimental results on a number of synthetic and real images show that the method is able to achieve accurate joint estimation for different combinations of materials and lighting.

Stephen Lombardi, Ko Nishino
Frequency-Space Decomposition and Acquisition of Light Transport under Spatially Varying Illumination

We show that, under spatially varying illumination, the light transport of diffuse scenes can be decomposed into direct, near-range (subsurface scattering and local inter-reflections) and far-range transports (diffuse inter-reflections). We show that these three component transports are redundant either in the spatial or the frequency domain and can be separated using appropriate illumination patterns. We propose a novel, efficient method to sequentially separate and acquire the component transports. First, we acquire the direct transport by extending the direct-global separation technique from floodlit images to full transport matrices. Next, we separate and acquire the near-range transport by illuminating patterns sampled uniformly in the frequency domain. Finally, we acquire the far-range transport by illuminating low-frequency patterns. We show that theoretically, our acquisition method achieves the lower bound our model places on the required number of patterns. We quantify the savings in number of patterns over the brute force approach. We validate our observations and acquisition method with rendered and real examples throughout.

Dikpal Reddy, Ravi Ramamoorthi, Brian Curless
A Naturalistic Open Source Movie for Optical Flow Evaluation

Ground truth optical flow is difficult to measure in real scenes with natural motion. As a result, optical flow data sets are restricted in terms of size, complexity, and diversity, making optical flow algorithms difficult to train and test on realistic data. We introduce a new optical flow data set derived from the open source 3D animated short film Sintel. This data set has important features not present in the popular Middlebury flow evaluation: long sequences, large motions, specular reflections, motion blur, defocus blur, and atmospheric effects. Because the graphics data that generated the movie is open source, we are able to render scenes under conditions of varying complexity to evaluate where existing flow algorithms fail. We evaluate several recent optical flow algorithms and find that current highly-ranked methods on the Middlebury evaluation have difficulty with this more complex data set, suggesting that further research on optical flow estimation is needed. To validate the use of synthetic data, we compare the image- and flow-statistics of Sintel to those of real films and videos and show that they are similar. The data set, metrics, and evaluation website are publicly available.
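
Both Middlebury and Sintel rank methods primarily by average endpoint error; a minimal NumPy implementation of that metric:

```python
import numpy as np

def average_endpoint_error(flow, gt):
    """Average endpoint error (EPE): mean Euclidean distance between
    estimated and ground-truth flow vectors (H x W x 2 arrays)."""
    return float(np.sqrt(((flow - gt) ** 2).sum(axis=-1)).mean())

gt = np.zeros((4, 4, 2)); gt[..., 0] = 3.0      # 3 px horizontal motion
est = gt.copy(); est[..., 1] = 4.0              # 4 px vertical error
print(average_endpoint_error(est, gt))          # 4.0
```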

Daniel J. Butler, Jonas Wulff, Garrett B. Stanley, Michael J. Black
Streaming Hierarchical Video Segmentation

The use of video segmentation as an early processing step in video analysis lags behind the use of image segmentation for image analysis, despite many available video segmentation methods. A major reason for this lag is simply that videos are an order of magnitude bigger than images; yet most methods require all voxels in the video to be loaded into memory, which is clearly prohibitive for even medium length videos. We address this limitation by proposing an approximation framework for streaming hierarchical video segmentation motivated by data stream algorithms: each video frame is processed only once and does not change the segmentation of previous frames. We implement the graph-based hierarchical segmentation method within our streaming framework; our method is the first streaming hierarchical video segmentation method proposed. We perform thorough experimental analysis on a benchmark video data set and longer videos. Our results indicate the graph-based streaming hierarchical method outperforms other streaming video segmentation methods and performs nearly as well as the full-video hierarchical graph-based method.

Chenliang Xu, Caiming Xiong, Jason J. Corso
Motion Capture of Hands in Action Using Discriminative Salient Points

Capturing the motion of two hands interacting with an object is a very challenging task due to the large number of degrees of freedom, self-occlusions, and similarity between the fingers, even in the case of multiple cameras observing the scene. In this paper we propose to use discriminatively learned salient points on the fingers and to estimate the finger-salient point associations simultaneously with the estimation of the hand pose. We introduce a differentiable objective function that also takes edges, optical flow and collisions into account. Our qualitative and quantitative evaluations show that the proposed approach achieves very accurate results for several challenging sequences containing hands and objects in action.

Luca Ballan, Aparna Taneja, Jürgen Gall, Luc Van Gool, Marc Pollefeys
Photo Sequencing

Dynamic events such as family gatherings, concerts or sports events are often captured by a group of people. The set of still images obtained this way is rich in dynamic content but lacks accurate temporal information. We propose a method for photo-sequencing – temporally ordering a set of still images taken asynchronously by a set of uncalibrated cameras. Photo-sequencing is an essential tool in analyzing (or visualizing) a dynamic scene captured by still images. The first step of the method detects sets of corresponding static and dynamic feature points across images. The static features are used to determine the epipolar geometry between pairs of images, and each dynamic feature votes for the temporal order of the images in which it appears. The partial orders provided by the dynamic features are not necessarily consistent, and we use rank aggregation to combine them into a globally consistent temporal order of images. We demonstrate successful photo sequencing on several challenging collections of images taken using a number of mobile phones.
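
The rank aggregation step can be approximated by a simple Borda-style voting scheme; the following Python sketch is one illustrative aggregator, not necessarily the one used in the paper.

```python
from collections import defaultdict

def aggregate_orders(partial_orders, n_images):
    """Combine (possibly inconsistent) partial orders from individual
    dynamic features into one global order via Borda-style scoring:
    each vote 'a before b' credits a and debits b."""
    score = defaultdict(float)
    for order in partial_orders:
        for i, a in enumerate(order):
            for b in order[i + 1:]:
                score[a] += 1.0
                score[b] -= 1.0
    return sorted(range(n_images), key=lambda i: -score[i])

votes = [[0, 1, 2], [1, 2, 3], [0, 2, 1]]   # one inconsistent vote
print(aggregate_orders(votes, 4))           # -> [0, 1, 2, 3]
```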

Tali Basha, Yael Moses, Shai Avidan

Poster Session 8

Co-inference for Multi-modal Scene Analysis

We address the problem of understanding scenes from multiple sources of sensor data (e.g., a camera and a laser scanner) in the case where there is no one-to-one correspondence across modalities (e.g., pixels and 3-D points). This is an important scenario that frequently arises in practice not only when two different types of sensors are used, but also when the sensors are not co-located and have different sampling rates. Previous work has addressed this problem by restricting interpretation to a single representation in one of the domains, with augmented features that attempt to encode the information from the other modalities. Instead, we propose to analyze all modalities simultaneously while propagating information across domains during the inference procedure. In addition to the immediate benefit of generating a complete interpretation in all of the modalities, we demonstrate that this co-inference approach also improves performance over the canonical approach.

Daniel Munoz, James Andrew Bagnell, Martial Hebert
A Unified View on Deformable Shape Factorizations

Multiple-view geometry and structure-from-motion are well established techniques to compute the structure of a moving rigid object. These techniques are all based on strong algebraic constraints imposed by the rigidity of the object. Unfortunately, many scenes of interest, e.g. faces or cloth, are dynamic and the rigidity constraint no longer holds. Hence, there is a need for non-rigid structure-from-motion (NRSfM) methods which can deal with dynamic scenes. A prominent framework to model deforming and moving non-rigid objects is the factorization technique where the measurements are assumed to lie in a low-dimensional subspace. Many different formulations and variations of factorization-based NRSfM have been proposed in recent years. However, due to the complex interactions between several subspaces, the distinguishing properties between two seemingly related approaches are often unclear. For example, do two approaches merely vary in the optimization method used, or is there really a different model beneath?

In this paper, we show that these NRSfM factorization approaches are most naturally modeled with tensor algebra. This results in a clear presentation which subsumes many previous techniques. In this regard, this paper brings several strands of research together and provides a unified point of view. Moreover, the tensor formulation can be extended to the case of a camera network where multiple static affine cameras observe the same deforming and moving non-rigid object. Thanks to the insights gained through this tensor notation, a closed-form and an efficient iterative algorithm can be derived which provide a reconstruction even if there are no feature point correspondences at all between different cameras. An evaluation of the theory and algorithms on motion capture data shows promising results.

Roland Angst, Marc Pollefeys
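
The subspace assumption at the heart of factorization-based NRSfM fits in a few lines: with K deformation basis shapes, the 2F x P matrix of tracked image points has rank at most 3K, so a truncated SVD splits it into motion and shape factors. The sketch below shows only this rank step; the corrective transformations and the paper's tensor machinery are not reproduced.

```python
import numpy as np

def lowrank_factorize(W, K):
    """Factor a 2F x P measurement matrix W into rank-3K motion and
    shape factors (K = number of deformation basis shapes).
    Recovering metric structure requires further corrective
    constraints that are not shown here.
    """
    r = 3 * K
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    M = U[:, :r] * np.sqrt(s[:r])            # 2F x r motion factor
    S = np.sqrt(s[:r])[:, None] * Vt[:r]     # r x P shape factor
    return M, S

# Toy check: rank-6 (K = 2) synthetic measurements are reproduced.
W = np.random.randn(20, 6) @ np.random.randn(6, 50)
M, S = lowrank_factorize(W, K=2)
print(np.allclose(M @ S, W))  # -> True
```
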
Finding the Exact Rotation between Two Images Independently of the Translation

In this paper, we present a new epipolar constraint for computing the rotation between two images independently of the translation. Contrary to the common belief in the field of geometric vision that it is not possible to find one independently of the other, we show how this can be achieved with relatively simple two-view constraints. We use the fact that translation and rotation cause fundamentally different flow fields on the unit sphere centered around the camera. This allows us to establish independent constraints on translation and rotation, and the latter is solved using the Gröbner basis method. The rotation computation is completed by a solution to the cheirality problem that depends neither on translation nor on feature triangulation. Notably, we show for the first time how the constraint on the rotation has the advantage of remaining exact even as the translation converges to zero. We use this fact to remove the error caused by model selection via a non-linear optimization of rotation hypotheses. We show that our method operates in real-time and compare it to a standard existing approach in terms of both speed and accuracy.

Laurent Kneip, Roland Siegwart, Marc Pollefeys
A New Set of Quartic Trivariate Polynomial Equations for Stratified Camera Self-calibration under Zero-Skew and Constant Parameters Assumptions

This paper deals with the problem of self-calibrating a moving camera with constant parameters. We propose a new set of quartic trivariate polynomial equations in the unknown coordinates of the plane at infinity, derived under the no-skew assumption. Our new equations make it possible to further enforce the constancy of the principal point across all images while retrieving the plane at infinity. Six such polynomials, four of which are independent, are obtained for each triplet of images. The proposed equations can be solved along with the so-called modulus constraints and improve the performance of existing methods.

Adlane Habed, Kassem Al Ismaeil, David Fofi
A Minimal Solution for Camera Calibration Using Independent Pairwise Correspondences

We propose a minimal algorithm for fully calibrating a camera from 11 independent pairwise point correspondences with two other calibrated cameras. Unlike previous approaches, our method requires neither triple correspondences nor prior knowledge about the viewed scene. The algorithm can be used to insert or re-calibrate a new camera in an existing network without interrupting its operation. Its main strength comes from the fact that triple correspondences are often difficult to find in a camera network, whereas pairwise correspondences are far more common. This makes our algorithm, for the specified use cases, probably the most suitable calibration solution that does not require a calibration target, and hence can be performed without human interaction.

Francisco Vasconcelos, João Pedro Barreto, Edmond Boyer
Real-Time Human Pose Tracking from Range Data

Tracking human pose in real-time is a difficult problem with many interesting applications. Existing solutions suffer from a variety of problems, especially when confronted with unusual human poses. In this paper, we derive an algorithm for tracking human pose in real-time from depth sequences based on MAP inference in a probabilistic temporal model. The key idea is to extend the iterative closest points (ICP) objective by modeling the constraint that the observed subject cannot enter free space, the area of space in front of the true range measurements. Our primary contribution is an extension to the articulated ICP algorithm that can efficiently enforce this constraint. The resulting filter runs at 125 frames per second using a single desktop CPU core. We provide extensive experimental results on challenging real-world data, which show that the algorithm outperforms the previous state-of-the-art trackers in both computational efficiency and accuracy.

Varun Ganapathi, Christian Plagemann, Daphne Koller, Sebastian Thrun
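
A simplified version of the free-space-constrained energy might look as follows: the usual ICP data term plus a penalty for model points that land closer to the camera than the measured depth along their ray (i.e., inside observed free space). The `project` callback, the quadratic penalty, and the brute-force nearest-neighbor search are assumptions for illustration; the paper enforces the constraint efficiently inside an articulated, real-time filter.

```python
import numpy as np

def tracking_energy(model_pts, data_pts, depth_map, project, lam=1.0):
    """Illustrative ICP-plus-free-space energy (not the actual method).

    model_pts: (N, 3) current body-model surface points
    data_pts:  (M, 3) points from the range sensor
    depth_map: (H, W) measured depths
    project:   callback p -> (row, col, z), giving a 3-D point's
               pixel and its depth along the camera ray (assumed)
    """
    # standard ICP term: each data point pulled to its closest model point
    d2 = ((data_pts[:, None, :] - model_pts[None, :, :]) ** 2).sum(-1)
    icp_term = d2.min(axis=1).sum()

    # free-space term: a model point closer than the measured depth
    # along its ray would sit in observed free space -> penalize it
    fs_term = 0.0
    for p in model_pts:
        r, c, z = project(p)
        if 0 <= r < depth_map.shape[0] and 0 <= c < depth_map.shape[1]:
            fs_term += max(0.0, depth_map[r, c] - z) ** 2
    return icp_term + lam * fs_term
```
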
Large-Lexicon Attribute-Consistent Text Recognition in Natural Images

This paper proposes a new model for the task of word recognition in natural images that simultaneously models visual and lexicon consistency of words in a single probabilistic model. Our approach combines local likelihood and pairwise positional consistency priors with higher order priors that enforce consistency of characters (lexicon) and their attributes (font and colour). Unlike traditional stage-based methods, word recognition in our framework is performed by estimating the maximum a posteriori (MAP) solution under the joint posterior distribution of the model. MAP inference in our model is performed through the use of weighted finite-state transducers (WFSTs). We show how efficient operations on WFSTs can be exploited to find the most likely word under the model. We evaluate our method on a range of challenging datasets (ICDAR’03, SVT, ICDAR’11). Experimental results demonstrate that our method outperforms state-of-the-art methods for cropped word recognition.

Tatiana Novikova, Olga Barinova, Pushmeet Kohli, Victor Lempitsky
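
As a brute-force illustration of lexicon-consistent decoding (far less scalable than the paper's WFST-based MAP inference, and ignoring the font/colour attribute priors), one can score every lexicon word by its visual log-likelihood under per-position character probabilities; all names below are assumptions.

```python
import numpy as np

def lexicon_decode(char_probs, lexicon, alphabet):
    """Pick the lexicon word with the highest visual likelihood.

    char_probs: (L, |alphabet|) per-position character probabilities
    from local detectors; lexicon words are assumed to use only
    characters present in `alphabet`. Illustrative stand-in only.
    """
    idx = {c: i for i, c in enumerate(alphabet)}
    best, best_lp = None, -np.inf
    for word in lexicon:
        if len(word) != len(char_probs):
            continue  # only words matching the detected length
        lp = sum(np.log(char_probs[t, idx[c]] + 1e-12)
                 for t, c in enumerate(word))
        if lp > best_lp:
            best, best_lp = word, lp
    return best
```
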
Dictionary-Based Face Recognition from Video

The main challenge in recognizing faces in video is effectively exploiting the multiple frames of a face and the accompanying dynamic signature. One prominent method is based on extracting joint appearance and behavioral features. A second method models a person by temporal correlations of features in a video. Our approach introduces the concept of video-dictionaries for face recognition, which generalizes the work in sparse representation and dictionaries for faces in still images. Video-dictionaries are designed to implicitly encode temporal, pose, and illumination information. We demonstrate our method on the Face and Ocular Challenge Series (FOCS) Video Challenge, which consists of unconstrained video sequences. We show that our method is efficient and performs significantly better than many competitive video-based face recognition algorithms.

Yi-Chen Chen, Vishal M. Patel, P. Jonathon Phillips, Rama Chellappa
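
The still-image sparse-representation scheme that the paper generalizes can be sketched as residual-based classification against per-class dictionaries; here ordinary least squares stands in for a true sparse solver, and the construction of video-dictionaries themselves is not shown.

```python
import numpy as np

def residual_classify(y, dictionaries):
    """Classify probe feature y by reconstruction residual.

    dictionaries: {label: (d, n_atoms) matrix of training atoms}.
    The label whose dictionary reconstructs y best wins. A sparse
    coder (e.g. OMP) would replace lstsq in a faithful version.
    """
    best, best_res = None, np.inf
    for label, D in dictionaries.items():
        a, *_ = np.linalg.lstsq(D, y, rcond=None)
        res = np.linalg.norm(y - D @ a)
        if res < best_res:
            best, best_res = label, res
    return best
```
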
Relaxed Pairwise Learned Metric for Person Re-identification

Matching persons across non-overlapping cameras is a rather challenging task. Thus, successful methods often build on complex feature representations or sophisticated learners. A recent trend for tackling this problem is to use metric learning to find a suitable space for matching samples from different cameras. However, most of these approaches ignore the transition from one camera to the other. In this paper, we propose to learn a metric from pairs of samples taken from different cameras. In this way, even less sophisticated features describing color and texture information are sufficient to obtain state-of-the-art classification results. Moreover, once the metric has been learned, only linear projections are necessary at search time, where a simple nearest-neighbor classification is performed. The approach is demonstrated on three publicly available datasets of different complexity, showing that state-of-the-art results can be obtained at much lower computational cost.

Martin Hirzer, Peter M. Roth, Martin Köstinger, Horst Bischof
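
To make the pairwise metric-learning idea concrete, the sketch below estimates a Mahalanobis matrix from cross-camera sample pairs in the style of KISSME (a related method, not this paper's relaxed formulation), then projects it onto the PSD cone so that it defines a valid metric.

```python
import numpy as np

def learn_pairwise_metric(X_a, X_b, same):
    """Mahalanobis matrix from cross-camera pairs (KISSME-style).

    X_a, X_b: (N, d) features of paired samples from cameras A and B
    same: (N,) bool, True if a pair shows the same person. A small
    ridge keeps the covariances invertible on toy data.
    """
    diff = X_a - X_b
    eps = 1e-6 * np.eye(X_a.shape[1])
    cov_s = diff[same].T @ diff[same] / max(same.sum(), 1) + eps
    cov_d = diff[~same].T @ diff[~same] / max((~same).sum(), 1) + eps
    M = np.linalg.inv(cov_s) - np.linalg.inv(cov_d)
    w, V = np.linalg.eigh(M)                 # project onto PSD cone
    return (V * np.clip(w, 0, None)) @ V.T

def metric_distance(M, x, y):
    d = x - y
    return float(d @ M @ d)
```
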
Connecting Missing Links: Object Discovery from Sparse Observations Using 5 Million Product Images

Object discovery algorithms group together image regions that originate from the same object. This process is effective when the input collection of images contains a large number of densely sampled views of each object, thereby creating strong connections between nearby views. However, existing approaches are less effective when the input data only provide sparse coverage of object views.

We propose an approach for object discovery that addresses this problem. We collect a database of about 5 million product images that capture 1.2 million objects from multiple views. We represent each region in the input image by a “bag” of database object regions, and group input regions together if they share similar “bags of regions.” Our approach can correctly discover links between regions of the same object even if they are captured from dramatically different viewpoints. With the help of these added links, our approach can robustly discover object instances even with sparse coverage of the viewpoints.

Hongwen Kang, Martial Hebert, Alexei A. Efros, Takeo Kanade
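
The “bag of database regions” representation can be sketched in a few lines; cosine similarity and Jaccard overlap are used below as illustrative stand-ins for the paper's actual matching and grouping criteria.

```python
import numpy as np

def bag_of_regions(region_feat, db_feats, k=50):
    """Represent an image region by the set of its k most similar
    database product-image regions (cosine similarity assumed)."""
    sims = db_feats @ region_feat / (
        np.linalg.norm(db_feats, axis=1) * np.linalg.norm(region_feat) + 1e-12)
    return set(np.argsort(-sims)[:k])

def linked(bag_a, bag_b, thresh=0.2):
    """Link two input regions if their bags overlap enough; shared
    database regions connect views that never co-occur directly."""
    return len(bag_a & bag_b) / len(bag_a | bag_b) >= thresh
```
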
Disentangling Factors of Variation for Facial Expression Recognition

We propose a semi-supervised approach to the task of emotion recognition in 2D face images, using recent ideas from deep learning to handle the factors of variation present in the data. An emotion classification algorithm should be robust both to (1) the variations that remain, after centering and alignment, due to the pose of the face in the image, and to (2) the identity or morphology of the face. In order to achieve this invariance, we propose to learn a hierarchy of features in which we gradually filter out the factors of variation arising from both (1) and (2). We address (1) by using a multi-scale contractive convolutional network (CCNET) in order to obtain invariance to translations of the facial traits in the image. Using the feature representation produced by the CCNET, we train a Contractive Discriminative Analysis (CDA) feature extractor, a novel variant of the Contractive Auto-Encoder (CAE), designed to learn a representation that separates the emotion-related factors from the others (which mostly capture the subject identity, and what is left of pose after the CCNET). This system beats the state of the art on a recently proposed dataset for facial expression recognition, the Toronto Face Database, moving the state-of-the-art accuracy from 82.4% to 85.0%, while the CCNET and CDA improve the accuracy of a standard CAE by 8%.

Salah Rifai, Yoshua Bengio, Aaron Courville, Pascal Vincent, Mehdi Mirza
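
For reference, the Contractive Auto-Encoder objective that CDA builds on adds the squared Frobenius norm of the encoder Jacobian to the reconstruction error; for a sigmoid encoder this penalty has a simple closed form. A minimal single-example sketch follows (the discriminative CDA variant itself is not reproduced here).

```python
import numpy as np

def cae_loss(x, W, b, W_dec, b_dec, lam=0.1):
    """CAE objective for one input x: reconstruction error plus a
    contractive penalty on the encoder Jacobian, which discourages
    sensitivity of the hidden code to input perturbations."""
    h = 1.0 / (1.0 + np.exp(-(W @ x + b)))   # sigmoid encoder
    x_rec = W_dec @ h + b_dec                # linear decoder
    recon = ((x - x_rec) ** 2).sum()
    # for sigmoid units: ||J||_F^2 = sum_j (h_j(1-h_j))^2 ||W_j||^2
    contractive = (((h * (1 - h)) ** 2) * (W ** 2).sum(axis=1)).sum()
    return recon + lam * contractive
```
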
Simultaneous Image Classification and Annotation via Biased Random Walk on Tri-relational Graph

Image annotation and image classification are both critical and challenging problems in computer vision research. Given the rapidly growing number of images and the inevitable bias introduced when humans annotate or classify them, automatic methods are highly desirable. Many methods have recently been proposed for image classification or image annotation, but the two tasks are usually treated independently and tackled separately. In fact, image class labels and annotation terms are related: an image with the sport class label rowing is more likely to be annotated with the terms water, boat and oar than with the terms wall, net and floor, which describe indoor sports.

In this paper, we propose a new method for joint class recognition and term annotation. We present a novel Tri-Relational Graph (TG) model that comprises the data graph, the annotation-term graph, and the class-label graph, and connects them through two additional graphs induced from the class-label and annotation assignments. On top of the TG model, we introduce a Biased Random Walk (BRW) method that jointly recognizes classes and annotates terms by exploiting the interrelations between the two tasks. We evaluate the proposed method on two benchmark data sets, and the experimental results demonstrate that our joint learning method achieves better prediction results on both tasks than the state-of-the-art methods.

Xiao Cai, Hua Wang, Heng Huang, Chris Ding
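
The walk itself is a standard random walk with restart over the stacked node set (images, annotation terms, class labels); the generic iteration is sketched below, while the paper's contribution lies in how the tri-relational transition matrix is constructed and biased. P is assumed column-stochastic.

```python
import numpy as np

def biased_random_walk(P, restart, alpha=0.85, tol=1e-8):
    """Stationary distribution of a random walk with restart.

    P: (n, n) column-stochastic transitions over all graph nodes
    restart: (n,) non-negative bias, e.g. mass on a query image.
    """
    r = restart / restart.sum()
    pi = r.copy()
    for _ in range(1000):                    # iteration cap for safety
        nxt = alpha * (P @ pi) + (1 - alpha) * r
        if np.abs(nxt - pi).sum() < tol:
            break
        pi = nxt
    return nxt
```
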
Spring Lattice Counting Grids: Scene Recognition Using Deformable Positional Constraints

Adopting the Counting Grid (CG) representation [1], the Spring Lattice Counting Grid (SLCG) model uses a grid of feature counts to capture the spatial layout that a variety of images tend to follow. The images are mapped to the counting grid with their features rearranged so as to strike a balance between the mapping quality and the extent of the necessary rearrangement. In particular, the feature sets originating from different image sectors are mapped to different sub-windows of the counting grid, in a configuration that is close to, but not exactly the same as, the configuration of the source sectors. The distribution over deformations of the sector configuration is learnable using a new spring lattice model, while the rearrangement of features within a sector is unconstrained. As a result, the CG model gains a more appropriate level of invariance to realistic image transformations such as viewpoint changes, rotations, or scale changes. We tested SLCG on standard scene recognition datasets and on a dataset collected with a wearable camera that recorded the wearer’s visual input over three weeks. Our algorithm correctly classifies the visited locations more than 80% of the time, outperforming previous approaches to visual location recognition. At this level of performance, a variety of real-world applications of wearable cameras become feasible.

Alessandro Perina, Nebojsa Jojic
Hand Pose Estimation and Hand Shape Classification Using Multi-layered Randomized Decision Forests

Vision-based articulated hand pose estimation and hand shape classification are challenging problems. This paper proposes novel algorithms to perform these tasks using depth sensors. In particular, we introduce a novel randomized decision forest (RDF) based hand shape classifier, and use it in a novel multi-layered RDF framework for articulated hand pose estimation. This classifier assigns the input depth pixels to hand shape classes, and directs them to the corresponding hand pose estimators trained specifically for that hand shape. We introduce two novel types of multi-layered RDFs: Global Expert Network (GEN) and Local Expert Network (LEN), which achieve significantly better hand pose estimates than a single-layered skeleton estimator and generalize better to previously unseen hand poses. The novel hand shape classifier is also shown to be accurate and fast. The methods run in real-time on the CPU, and can be ported to the GPU for a further increase in speed.

Cem Keskin, Furkan Kıraç, Yunus Emre Kara, Lale Akarun
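
Structurally, the multi-layered idea is a shape classifier that routes each sample to a pose expert trained for that shape. The sketch below mimics the routing with generic scikit-learn forests over whole-sample feature vectors, a coarse simplification of the paper's per-pixel RDFs and expert networks; every shape class is assumed to appear in the training data.

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

class TwoLayerHandEstimator:
    """Shape classifier routing samples to per-shape pose experts
    (illustrative simplification; expects NumPy arrays)."""

    def __init__(self, n_shapes):
        self.shape_clf = RandomForestClassifier(n_estimators=50)
        self.experts = {s: RandomForestRegressor(n_estimators=50)
                        for s in range(n_shapes)}

    def fit(self, X, shape_labels, poses):
        self.shape_clf.fit(X, shape_labels)
        for s, expert in self.experts.items():
            mask = shape_labels == s         # samples of this hand shape
            expert.fit(X[mask], poses[mask])
        return self

    def predict(self, x):
        s = int(self.shape_clf.predict(x.reshape(1, -1))[0])
        return self.experts[s].predict(x.reshape(1, -1))[0]
```
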
Information Theoretic Learning for Pixel-Based Visual Agents

In this paper we promote the idea of using pixel-based models not only for low-level vision, but also to extract high-level symbolic representations. We use a deep architecture which has the distinctive property of relying on computational units that incorporate classic computer vision invariances and, especially, scale invariance. The proposed learning algorithm, which is based on information-theoretic principles, develops the parameters of the computational units and, at the same time, makes it possible to detect the optimal scale for each pixel. We give experimental evidence of the mechanism of feature extraction at the first level of the hierarchy, which is closely related to SIFT-like features. The comparison clearly shows that, whenever we can rely on the massive availability of training data, the proposed model leads to better performance than SIFT.

Marco Gori, Stefano Melacci, Marco Lippi, Marco Maggini
Attribute Discovery via Predictable Discriminative Binary Codes

We represent images with binary codes in a way that balances discrimination and learnability of the codes. In our method, each image claims its own code in a way that maintains discrimination while being predictable from visual data. Category memberships are usually good proxies for visual similarity but should not be enforced as a hard constraint. Our method learns codes that maximize the separability of categories unless there is strong visual evidence against it. Simple linear SVMs achieve state-of-the-art results with our short codes: our method produces state-of-the-art results on Caltech256 with only 128-dimensional bit vectors and outperforms the state of the art when using longer codes. We also evaluate our method on ImageNet and show that it outperforms state-of-the-art binary code methods on this large-scale dataset. Lastly, our codes can discover a discriminative set of attributes.

Mohammad Rastegari, Ali Farhadi, David Forsyth
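
At test time such a pipeline reduces to hashing with learned hyperplanes and cheap Hamming-space comparisons, as sketched below; how the hyperplanes are learned, trading discrimination against predictability, is the paper's contribution and is not reproduced here.

```python
import numpy as np

def encode(X, W):
    """Map features X (N, d) to k-bit codes with k hyperplanes W
    (k, d): bit_j = [w_j . x > 0]. W is assumed already learned."""
    return (X @ W.T > 0).astype(np.uint8)

def hamming_nn(query_code, db_codes):
    """Index of the database code closest in Hamming distance."""
    return int(np.argmin((db_codes != query_code).sum(axis=1)))
```
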
Backmatter

Metadata
Title: Computer Vision – ECCV 2012
Edited by: Andrew Fitzgibbon, Svetlana Lazebnik, Pietro Perona, Yoichi Sato, Cordelia Schmid
Copyright year: 2012
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-642-33783-3
Print ISBN: 978-3-642-33782-6
DOI: https://doi.org/10.1007/978-3-642-33783-3