2012 | Book

Computer Vision – ECCV 2012

12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part III

Editors: Andrew Fitzgibbon, Svetlana Lazebnik, Pietro Perona, Yoichi Sato, Cordelia Schmid

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science

About this book

The seven-volume set comprising LNCS volumes 7572-7578 constitutes the refereed proceedings of the 12th European Conference on Computer Vision, ECCV 2012, held in Florence, Italy, in October 2012. The 408 revised papers presented were carefully reviewed and selected from 1437 submissions. The papers are organized in topical sections on geometry, 2D and 3D shapes, 3D reconstruction, visual recognition and classification, visual features and image matching, visual monitoring: action and activities, models, optimisation, learning, visual tracking and image registration, photometry: lighting and colour, and image segmentation.

Table of Contents

Frontmatter

Poster Session 3

Polynomial Regression on Riemannian Manifolds

In this paper we develop the theory of parametric polynomial regression in Riemannian manifolds. The theory enables parametric analysis in a wide range of applications, including rigid and non-rigid kinematics as well as shape change of organs due to growth and aging. We show application of Riemannian polynomial regression to shape analysis in Kendall shape space. Results are presented, showing the power of polynomial regression on the classic rat skull growth data of Bookstein and the analysis of the shape changes associated with aging of the corpus callosum from the OASIS Alzheimer’s study.
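
As a point of reference (not taken from the abstract itself), one standard way to define such polynomials is through vanishing higher covariant derivatives; the following hedged sketch uses our own notation:

```latex
% A curve $\gamma(t)$ on a Riemannian manifold is an order-$k$ polynomial if
% the $k$-th covariant derivative of its velocity vanishes; geodesics are
% recovered as the first-order case. Notation ours, not the paper's.
\[
  \big(\nabla_{\dot\gamma(t)}\big)^{k}\,\dot\gamma(t) = 0 ,
  \qquad
  k = 1 :\ \nabla_{\dot\gamma}\dot\gamma = 0 \quad \text{(geodesic)} .
\]
```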

Jacob Hinkle, Prasanna Muralidharan, P. Thomas Fletcher, Sarang Joshi
Approximate Gaussian Mixtures for Large Scale Vocabularies

We introduce a clustering method that combines the flexibility of Gaussian mixtures with the scaling properties needed to construct visual vocabularies for image retrieval. It is a variant of expectation-maximization that can converge rapidly while dynamically estimating the number of components. We employ approximate nearest neighbor search to speed up the E-step and exploit its iterative nature to make search incremental, boosting both speed and precision. We achieve superior performance in large scale retrieval, being as fast as the best known approximate k-means.
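
To make the idea concrete, here is a minimal sketch of an approximate E-step in which each point computes responsibilities only for its nearest components, found with a k-d tree over the component means. The function name and the diagonal-covariance simplification are ours, not the authors':

```python
# Sketch of an approximate E-step: each point computes responsibilities only
# for its m nearest mixture components, found with a k-d tree over the means.
# Illustrative of the idea in the abstract, not the authors' code.
import numpy as np
from scipy.spatial import cKDTree

def approximate_e_step(X, means, covs_diag, weights, m=8):
    """Responsibilities restricted to the m nearest components per point."""
    tree = cKDTree(means)            # index the current component means
    _, idx = tree.query(X, k=m)      # (n, m) candidate components per point
    n = X.shape[0]
    resp = np.zeros((n, means.shape[0]))
    for i in range(n):
        for j in idx[i]:
            diff = X[i] - means[j]
            # diagonal-covariance Gaussian log-density up to a constant
            log_p = -0.5 * np.sum(diff**2 / covs_diag[j]) \
                    - 0.5 * np.sum(np.log(covs_diag[j]))
            resp[i, j] = weights[j] * np.exp(log_p)
        s = resp[i].sum()
        if s > 0:
            resp[i] /= s             # normalize over the candidate set only
    return resp
```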

Yannis Avrithis, Yannis Kalantidis
Geodesic Saliency Using Background Priors

Generic object level saliency detection is important for many vision tasks. Previous approaches are mostly built on the prior that “appearance contrast between objects and backgrounds is high”. Although various computational models have been developed, the problem remains challenging, and huge behavioral discrepancies between previous approaches can be observed. This suggests that the problem may still be highly ill-posed when using this prior only.

In this work, we tackle the problem from a different viewpoint: we focus more on the background instead of the object. We exploit two common priors about backgrounds in natural images, namely boundary and connectivity priors, to provide more clues for the problem. Accordingly, we propose a novel saliency measure called geodesic saliency. It is intuitive, easy to interpret and allows fast implementation. Furthermore, it is complementary to previous approaches, because it benefits from background priors that previous approaches do not exploit.

Evaluation on two databases validates that geodesic saliency achieves superior results and outperforms previous approaches by a large margin, in both accuracy and speed (2 ms per image). This illustrates that appropriate prior exploitation is helpful for the ill-posed saliency detection problem.
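
A minimal reading of this measure is sketched below under our own assumptions (patches on a regular grid, mean color as the appearance feature): saliency is the shortest-path distance to the image boundary, so background regions connected to the boundary through low-contrast paths score near zero.

```python
# Illustrative sketch of the geodesic-saliency idea: the saliency of a patch
# is its shortest-path (geodesic) distance to the image boundary, which acts
# as a virtual background node. Grid graph + Dijkstra; a minimal reading of
# the abstract, not the authors' implementation.
import heapq
import numpy as np

def geodesic_saliency(patch_means):
    """patch_means: (H, W, C) mean color per patch."""
    H, W, _ = patch_means.shape
    dist = np.full((H, W), np.inf)
    heap = []
    for y in range(H):                 # boundary patches start at cost 0,
        for x in range(W):             # encoding the boundary prior
            if y in (0, H - 1) or x in (0, W - 1):
                dist[y, x] = 0.0
                heapq.heappush(heap, (0.0, y, x))
    while heap:
        d, y, x = heapq.heappop(heap)
        if d > dist[y, x]:
            continue
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < H and 0 <= nx < W:
                # edge weight = appearance difference (connectivity prior)
                w = np.linalg.norm(patch_means[y, x] - patch_means[ny, nx])
                if d + w < dist[ny, nx]:
                    dist[ny, nx] = d + w
                    heapq.heappush(heap, (d + w, ny, nx))
    return dist                        # larger distance = more salient
```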

Yichen Wei, Fang Wen, Wangjiang Zhu, Jian Sun
Joint Face Alignment with Non-parametric Shape Models

We present a joint face alignment technique that takes a set of images as input and produces a set of shape- and appearance-consistent face alignments as output. Our method is an extension of the recent localization method of Belhumeur et al. [1], which combines the output of local detectors with a non-parametric set of face shape models. We are inspired by the recent joint alignment method of Zhao et al. [20], which employs a modified Active Appearance Model (AAM) approach to align a batch of images. We introduce an approach for simultaneously optimizing both a local appearance constraint, which couples the local estimates between multiple images, and a global shape constraint, which couples landmarks and images across the image set. In video sequences, our method greatly improves the temporal stability of landmark estimates without compromising accuracy relative to ground truth.

Brandon M. Smith, Li Zhang
Discriminative Bayesian Active Shape Models

This work presents a simple and very efficient solution to align facial parts in unseen images. Our solution relies on a Point Distribution Model (PDM) face model and a set of discriminant local detectors, one for each facial landmark. The patch responses can be embedded into a Bayesian inference problem, where the posterior distribution of the global warp is inferred in a maximum a posteriori (MAP) sense. However, previous formulations do not model explicitly the covariance of the latent variables, which represents the confidence in the current solution. In our Discriminative Bayesian Active Shape Model (DBASM) formulation, the MAP global alignment is inferred by a Linear Dynamical System (LDS) that takes this information into account. The Bayesian paradigm provides an effective fitting strategy, since it combines in the same framework both the shape prior and multiple sets of patch alignment classifiers to further improve the accuracy. Extensive evaluations were performed on several datasets including the challenging Labeled Faces in the Wild (LFW). Face parts descriptors were also evaluated, including the recently proposed Minimum Output Sum of Squared Error (MOSSE) filter. The proposed Bayesian optimization strategy improves on the state-of-the-art while using the same local detectors. We also show that MOSSE filters further improve on these results.

Pedro Martins, Rui Caseiro, João F. Henriques, Jorge Batista
Patch Based Synthesis for Single Depth Image Super-Resolution

We present an algorithm to synthetically increase the resolution of a solitary depth image using only a generic database of local patches. Modern range sensors measure depths with non-Gaussian noise and at lower starting resolutions than typical visible-light cameras. While patch based approaches for upsampling intensity images continue to improve, this is the first exploration of patching for depth images.

We match against the height field of each low resolution input depth patch, and search our database for a list of appropriate high resolution candidate patches. Selecting the right candidate at each location in the depth image is then posed as a Markov random field labeling problem. Our experiments also show how further depth-specific processing, such as noise removal and correct patch normalization, dramatically improves our results. Perhaps surprisingly, even better results are achieved on a variety of real test scenes by providing our algorithm with only synthetic training depth data.
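
The labeling step can be illustrated with a small sketch. The energy terms and the ICM optimizer below are stand-ins of our own choosing; the paper poses the same unary-plus-pairwise structure but solves it with a proper MRF inference method:

```python
# Minimal sketch of candidate selection as MRF labeling: unary terms score
# how well each high-res candidate matches the low-res input patch, pairwise
# terms penalize disagreement between neighboring candidates in their
# overlap. Iterated conditional modes (ICM) is a simple stand-in optimizer.
import numpy as np

def icm_label(unary, pairwise, iters=10):
    """unary: (H, W, K) costs; pairwise(l1, l2, axis) -> float cost."""
    H, W, K = unary.shape
    labels = unary.argmin(axis=2)          # start from the unary minimum
    for _ in range(iters):
        for y in range(H):
            for x in range(W):
                costs = unary[y, x].copy()
                for k in range(K):         # add neighbor agreement costs
                    if y > 0:     costs[k] += pairwise(k, labels[y-1, x], 0)
                    if y < H - 1: costs[k] += pairwise(k, labels[y+1, x], 0)
                    if x > 0:     costs[k] += pairwise(k, labels[y, x-1], 1)
                    if x < W - 1: costs[k] += pairwise(k, labels[y, x+1], 1)
                labels[y, x] = int(np.argmin(costs))
    return labels
```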

Oisin Mac Aodha, Neill D. F. Campbell, Arun Nair, Gabriel J. Brostow
Annotation Propagation in Large Image Databases via Dense Image Correspondence

Our goal is to automatically annotate many images with a set of word tags and a pixel-wise map showing where each word tag occurs. Most previous approaches rely on a corpus of training images where each pixel is labeled. However, for large image databases, pixel labels are expensive to obtain and are often unavailable. Furthermore, when classifying multiple images, each image is typically solved for independently, which often results in inconsistent annotations across similar images. In this work, we incorporate dense image correspondence into the annotation model, allowing us to make do with significantly less labeled data and to resolve ambiguities by propagating inferred annotations from images with strong local visual evidence to images with weaker local evidence. We establish a large graphical model spanning all labeled and unlabeled images, then solve it to infer annotations, enforcing consistent annotations over similar visual patterns. Our model is optimized by efficient belief propagation algorithms embedded in an expectation-maximization (EM) scheme. Extensive experiments are conducted to evaluate the performance on several standard large-scale image datasets, showing that the proposed framework outperforms state-of-the-art methods.

Michael Rubinstein, Ce Liu, William T. Freeman
Numerically Stable Optimization of Polynomial Solvers for Minimal Problems

Numerous geometric problems in computer vision involve the solution of systems of polynomial equations. This is particularly true for so-called minimal problems, but also for finding stationary points of overdetermined problems. The state of the art is based on the use of numerical linear algebra on the large but sparse coefficient matrix that represents the original equations multiplied with a set of monomials. The key observation in this paper is that the speed and numerical stability of the solver depend heavily on (i) what multiplication monomials are used and (ii) the set of so-called permissible monomials from which numerical linear algebra routines choose the basis of a certain quotient ring. In the paper we show that optimizing with respect to these two factors can give both significant improvements to numerical stability as compared to the state of the art, as well as highly compact solvers, while still retaining numerical stability. The methods are validated on several minimal problems that have previously been shown to be challenging, with improvements over the current state of the art.

Yubin Kuang, Kalle Åström
Has My Algorithm Succeeded? An Evaluator for Human Pose Estimators

Most current vision algorithms deliver their output ‘as is’, without indicating whether it is correct or not. In this paper we propose evaluator algorithms that predict if a vision algorithm has succeeded. We illustrate this idea for the case of Human Pose Estimation (HPE).

We describe the stages required to learn and test an evaluator, including the use of an annotated ground truth dataset for training and testing the evaluator (and we provide a new dataset for the HPE case), and the development of auxiliary features that have not been used by the (HPE) algorithm, but can be learnt by the evaluator to predict if the output is correct or not.

Then an evaluator is built for each of four recently developed HPE algorithms using their publicly available implementations: Eichner and Ferrari [5], Sapp et al. [16], Andriluka et al. [2] and Yang and Ramanan [22]. We demonstrate that in each case the evaluator is able to predict if the algorithm has correctly estimated the pose or not.

Nataraj Jammalamadaka, Andrew Zisserman, Marcin Eichner, Vittorio Ferrari, C. V. Jawahar
Group Tracking: Exploring Mutual Relations for Multiple Object Tracking

In this paper, we propose to track multiple previously unseen objects in unconstrained scenes. Instead of considering objects individually, we model objects in mutual context with each other to enable robust and accurate tracking. We introduce a unified framework to combine both Individual Object Models (IOMs) and Mutual Relation Models (MRMs). The MRMs consist of three components: the relational graph to indicate related objects, the mutual relation vectors calculated within related objects to capture the interactions, and the relational weights to balance all interactions and IOMs. As MRMs vary along temporal sequences, we propose online algorithms to make MRMs adapt to the current situation. We update relational graphs through analyzing object trajectories and cast the relational weight learning task as an online latent SVM problem. Extensive experiments on challenging real world video sequences demonstrate the efficiency and effectiveness of our framework.

Genquan Duan, Haizhou Ai, Song Cao, Shihong Lao
A Discrete Chain Graph Model for 3d+t Cell Tracking with High Misdetection Robustness

Tracking by assignment is well suited for tracking a varying number of divisible cells, but suffers from false positive detections. We reformulate tracking by assignment as a chain graph (a mixed directed-undirected probabilistic graphical model) and obtain a tracking over all time steps simultaneously from the maximum a posteriori configuration. The model is evaluated on two challenging four-dimensional data sets from developmental biology. Compared to previous work, we obtain improved tracks due to an increased robustness against false positive detections and the incorporation of temporal domain knowledge.

Bernhard X. Kausler, Martin Schiegg, Bjoern Andres, Martin Lindner, Ullrich Koethe, Heike Leitte, Jochen Wittbrodt, Lars Hufnagel, Fred A. Hamprecht
Robust Tracking with Weighted Online Structured Learning

Robust visual tracking requires constant update of the target appearance model, but without losing track of previous appearance information. One of the difficulties with the online learning approach to this problem has been a lack of flexibility in the modelling of the inevitable variations in target and scene appearance over time. The traditional online learning approach treats each example equally, which leads to previous appearances being forgotten too quickly and a lack of emphasis on the most current observations. Through analysis of the visual tracking problem, we instead develop a novel weighted form of online risk which allows more subtlety in its representation. However, the traditional online learning framework does not accommodate this weighted form. We thus also propose a principled approach to weighted online learning using weighted reservoir sampling and provide a weighted regret bound as a theoretical guarantee of performance. The proposed online learning framework can handle examples with different importance weights for binary, multiclass, and even structured output labels, with both linear and non-linear kernels. Applying the method to tracking yields an algorithm that is both efficient and accurate even in the presence of severe appearance changes. Experimental results show that the proposed tracker outperforms the current state of the art.
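
Weighted reservoir sampling itself is a standard primitive; a minimal sketch of one common scheme (the Efraimidis-Spirakis keys) is given below. This is a generic illustration, not the authors' tracker code:

```python
# One standard weighted reservoir sampling scheme (Efraimidis & Spirakis):
# each streamed example gets key u**(1/w) with u ~ Uniform(0, 1), and the
# reservoir keeps the k largest keys, so inclusion probability grows with
# the example's weight.
import heapq
import random

def weighted_reservoir(stream, k):
    """stream yields (item, weight > 0); returns k items sampled by weight."""
    heap = []                                   # min-heap of (key, i, item)
    for i, (item, w) in enumerate(stream):
        key = random.random() ** (1.0 / w)      # larger weight -> larger key
        entry = (key, i, item)                  # i breaks ties deterministically
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif key > heap[0][0]:
            heapq.heapreplace(heap, entry)      # evict the smallest key
    return [item for _, _, item in heap]
```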

Rui Yao, Qinfeng Shi, Chunhua Shen, Yanning Zhang, Anton van den Hengel
Fast Regularization of Matrix-Valued Images

Regularization of images with matrix-valued data is important in medical imaging, motion analysis and scene understanding. We propose a novel method for fast regularization of matrix group-valued images.

Using the augmented Lagrangian framework, we separate total-variation regularization of matrix-valued images into a regularization step and a projection step. Both steps are computationally efficient and easily parallelizable, allowing real-time regularization of matrix-valued images on a graphics processing unit.
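
A hedged schematic of such a splitting, in our own notation (the paper's exact functional may differ): an auxiliary variable decouples the Euclidean TV step from the group constraint, which is then enforced by projection.

```latex
% Auxiliary variable $v$ decouples the Euclidean TV step from the
% matrix-group constraint $u \in G$; $\eta$ is the scaled multiplier.
% Notation ours, reconstructed from the abstract.
\[
  \min_{u \in G} \; \mathrm{TV}(u) + \frac{\lambda}{2}\|u - f\|^2
  \;\;\longrightarrow\;\;
  \min_{u \in G,\; v} \; \mathrm{TV}(v) + \frac{\lambda}{2}\|v - f\|^2
    + \frac{r}{2}\|u - v + \eta\|^2 ,
\]
% alternating a standard Euclidean TV-regularization step in $v$ with a
% pointwise projection $u = \Pi_G(v - \eta)$ back onto the group $G$.
```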

We demonstrate the effectiveness of our method for smoothing several group-valued image types, with applications in directions diffusion, motion analysis from depth sensors, and DT-MRI denoising.

Guy Rosman, Yu Wang, Xue-Cheng Tai, Ron Kimmel, Alfred M. Bruckstein
Blind Correction of Optical Aberrations

Camera lenses are a critical component of optical imaging systems, and lens imperfections compromise image quality. While traditionally, sophisticated lens design and quality control aim at limiting optical aberrations, recent works [1,2,3] promote the correction of optical flaws by computational means. These approaches rely on elaborate measurement procedures to characterize an optical system, and perform image correction by non-blind deconvolution.

In this paper, we present a method that utilizes physically plausible assumptions to estimate non-stationary lens aberrations blindly, and thus can correct images without knowledge of specifics of camera and lens. The blur estimation features a novel preconditioning step that enables fast deconvolution. We obtain results that are competitive with state-of-the-art non-blind approaches.

Christian J. Schuler, Michael Hirsch, Stefan Harmeling, Bernhard Schölkopf
Inverse Rendering of Faces on a Cloudy Day

In this paper we consider the problem of inverse rendering faces under unknown environment illumination using a morphable model. In contrast to previous approaches, we account for global illumination effects by incorporating statistical models for ambient occlusion and bent normals into our image formation model. We show that solving for ambient occlusion and bent normal parameters as part of the fitting process improves the accuracy of the estimated texture map and illumination environment. We present results on challenging data, rendered under complex natural illumination with both specular reflectance and occlusion of the illumination environment.

Oswald Aldrian, William A. P. Smith
On Tensor-Based PDEs and Their Corresponding Variational Formulations with Application to Color Image Denoising

We investigate the case when a partial differential equation (PDE) can be considered as the Euler-Lagrange (E-L) equation of an energy functional consisting of a data term and a smoothness term. We show the necessary conditions for a PDE to be the E-L equation of a corresponding functional. This energy functional is applied to a color image denoising problem, and it is shown that the method compares favorably to current state-of-the-art color image denoising techniques.

Freddie Åström, George Baravdish, Michael Felsberg
Kernelized Temporal Cut for Online Temporal Segmentation and Recognition

We address the problem of unsupervised online segmentation of human motion sequences into different actions. Kernelized Temporal Cut (KTC) is proposed to sequentially cut structured sequential data into different regimes. KTC extends previous work on online change-point detection by incorporating Hilbert space embedding of distributions to handle the nonparametric and high-dimensionality issues. Based on KTC, a realtime online algorithm and a hierarchical extension are proposed for detecting both action transitions and cyclic motions at the same time. We evaluate and compare the approach to state-of-the-art methods on motion capture data, depth sensor data and videos. Experimental results demonstrate the effectiveness of our approach, which yields realtime segmentation and produces higher action segmentation accuracy. Furthermore, by combining with sequence matching algorithms, we can online recognize actions of an arbitrary person from an arbitrary viewpoint, given realtime depth sensor input.

Dian Gong, Gérard Medioni, Sikai Zhu, Xuemei Zhao
Grain Segmentation of 3D Superalloy Images Using Multichannel EWCVT under Human Annotation Constraints

Grain segmentation of 3D superalloy images provides the superalloy's micro-structures, based on which many physical and mechanical properties can be evaluated. This is a challenging problem because (1) the number of grains in a superalloy sample could be thousands or even more; (2) the intensity within a grain may not be homogeneous; and (3) superalloy images usually contain carbides and noise. Recently, the Multichannel Edge-Weighted Centroidal Voronoi Tessellation (MCEWCVT) algorithm [1] was developed for grain segmentation and showed better performance than many widely used image segmentation algorithms. However, as a general-purpose clustering algorithm, MCEWCVT does not consider possible prior knowledge from material scientists in the process of grain segmentation. In this paper, we address this issue by defining an energy minimization problem subject to certain constraints. We then develop a Constrained Multichannel Edge-Weighted Centroidal Voronoi Tessellation (CMEWCVT) algorithm to effectively solve this constrained minimization problem. In particular, manually annotated segmentations on a very small set of 2D slices are taken as constraints and incorporated into the whole clustering process. Experimental results demonstrate that the proposed CMEWCVT algorithm significantly improves the previous grain-segmentation performance.

Yu Cao, Lili Ju, Song Wang
Hough Regions for Joining Instance Localization and Segmentation

Object detection and segmentation are two challenging tasks in computer vision, which are usually considered as independent steps. In this paper, we propose a framework which jointly optimizes for both tasks and implicitly provides detection hypotheses and corresponding segmentations. Our novel approach is attachable to any of the available generalized Hough voting methods. We introduce Hough Regions by formulating the problem of Hough space analysis as Bayesian labeling of a random field. This exploits provided classifier responses, object center votes and low-level cues like color consistency, which are combined into a global energy term. We further propose a greedy approach to solve this energy minimization problem providing a pixel-wise assignment to background or to a specific category instance. This way we bypass the parameter sensitive non-maximum suppression that is required in related methods. The experimental evaluation demonstrates that state-of-the-art detection and segmentation results are achieved and that our method is inherently able to handle overlapping instances and an increased range of articulations, aspect ratios and scales.

Hayko Riemenschneider, Sabine Sternig, Michael Donoser, Peter M. Roth, Horst Bischof
Learning to Segment a Video to Clips Based on Scene and Camera Motion

In this paper, we present a novel learning-based algorithm for temporal segmentation of a video into clips based on both camera and scene motion, in particular, based on combinations of static vs. dynamic camera and static vs. dynamic scene. Given a video, we first perform shot boundary detection to segment the video into shots. We enforce temporal continuity by constructing a Markov Random Field (MRF) over the frames of each video shot, with edges between consecutive frames, and cast the segmentation problem as a frame-level discrete labeling problem. Using manually labeled data, we learn classifiers exploiting cues from optical flow to provide evidence for the different labels, and infer the best labeling over the frames. We show the effectiveness of the approach using user videos and full-length movies. Using sixty full-length movies spanning 50 years, we show that the proposed algorithm of grouping frames purely based on motion cues can aid computational applications such as recovering depth from a video, and can also reveal interesting trends in movies, suggesting novel applications in video analysis (time-stamping archive movies) and film studies.

Adarsh Kowdle, Tsuhan Chen
Evaluation of Image Segmentation Quality by Adaptive Ground Truth Composition

Segmenting an image is an important step in many computer vision applications. However, image segmentation evaluation is far from well-studied, in contrast to the extensive studies on image segmentation algorithms. In this paper, we propose a framework to quantitatively evaluate the quality of a given segmentation with multiple ground truth segmentations. Instead of comparing the given segmentation directly to the ground truths, we assume that if a segmentation is “good”, it can be constructed from pieces of the ground truth segmentations. Then, for a given segmentation, we adaptively construct a new ground truth which can be locally matched to the segmentation as much as possible while preserving the structural consistency of the ground truths. The quality of the segmentation can then be evaluated by measuring its distance to the adaptively composed ground truth. To the best of our knowledge, this is the first work that provides a framework to adaptively combine multiple ground truths for quantitative segmentation evaluation. Experiments are conducted on the benchmark Berkeley segmentation database, and the results show that the proposed method can faithfully reflect the perceptual qualities of segmentations.

Bo Peng, Lei Zhang

Oral Session 3: Detection and Attributes

Exact Acceleration of Linear Object Detectors

We describe a general and exact method to considerably speed up linear object detection systems operating in a sliding, multi-scale window fashion, such as the individual part detectors of part-based models. The main bottleneck of many of those systems is the computational cost of the convolutions between the multiple rescalings of the image to process, and the linear filters. We make use of properties of the Fourier transform and of clever implementation strategies to obtain a speedup factor proportional to the filters’ sizes. The gain in performance is demonstrated on the well known Pascal VOC benchmark, where we accelerate the speed of said convolutions by an order of magnitude.
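
The underlying trick is the convolution theorem: once the image spectrum is computed, each filter costs essentially one pointwise multiply. A minimal sketch using SciPy (our simplification; the paper adds exact multi-scale bookkeeping and reuses the image FFT across filters, which this one-liner does not):

```python
# FFT-based scoring of an image against many linear detector filters.
# fftconvolve performs convolution, so each filter is flipped to obtain
# correlation, which is what sliding-window detectors compute.
import numpy as np
from scipy.signal import fftconvolve

def score_maps(image, filters):
    """image: (H, W) array; filters: list of (h, w) linear detector weights."""
    # correlation = convolution with the doubly-flipped filter
    return [fftconvolve(image, f[::-1, ::-1], mode='valid') for f in filters]
```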

Charles Dubout, François Fleuret
Latent Hough Transform for Object Detection

Hough transform based methods for object detection work by allowing image features to vote for the location of the object. While this representation allows for parts observed in different training instances to support a single object hypothesis, it also produces false positives by accumulating votes that are consistent in location but inconsistent in other properties like pose, color, shape or type. In this work, we propose to augment the Hough transform with latent variables in order to enforce consistency among votes. To this end, only votes that agree on the assignment of the latent variable are allowed to support a single hypothesis. For training a Latent Hough Transform (LHT) model, we propose a learning scheme that exploits the linearity of the Hough transform based methods. Our experiments on two datasets including the challenging PASCAL VOC 2007 benchmark show that our method outperforms traditional Hough transform based methods leading to state-of-the-art performance on some categories.

Nima Razavi, Juergen Gall, Pushmeet Kohli, Luc van Gool
Using Linking Features in Learning Non-parametric Part Models

We present an approach to the detection of parts of highly deformable objects, such as the human body. Instead of the kinematic constraints on relative angles used by most existing approaches for modeling part-to-part relations, we learn and use special observed ‘linking’ features that support particular pairwise part configurations. In addition to modeling the appearance of individual parts, the current approach adds modeling of the appearance of part-linking, which is shown to provide useful information. For example, configurations of the lower and upper arms are supported by observing corresponding appearances of the elbow or other relevant features. The proposed model combines the support from all the linking features observed in a test image to infer the most likely joint configuration of all the parts of interest. The approach is trained using images with annotated parts, but no a priori known part connections or connection parameters are assumed, and the linking features are discovered automatically during training. We evaluate the performance of the proposed approach on two challenging human body part detection datasets, and obtain performance comparable, and in some cases superior, to the state of the art. In addition, the generality of the approach is shown by applying it without modification to part detection on datasets of animal parts and of facial fiducial points.

Leonid Karlinsky, Shimon Ullman
Diagnosing Error in Object Detectors

This paper shows how to analyze the influences of object characteristics on detection performance and the frequency and impact of different types of false positives. In particular, we examine effects of occlusion, size, aspect ratio, visibility of parts, viewpoint, localization error, and confusion with semantically similar objects, other labeled objects, and background. We analyze two classes of detectors: the Vedaldi et al. multiple kernel learning detector and different versions of the Felzenszwalb et al. detector. Our study shows that sensitivity to size, localization error, and confusion with similar objects are the most impactful forms of error. Our analysis also reveals that many different kinds of improvement are necessary to achieve large gains, making more detailed analysis essential for the progress of recognition research. By making our software and annotations available, we make it effortless for future researchers to perform similar analysis.

Derek Hoiem, Yodsawalai Chodpathumwan, Qieyun Dai
Attributes for Classifier Feedback

Traditional active learning allows a (machine) learner to query the (human) teacher for labels on examples it finds confusing. The teacher then provides a label for only that instance. This is quite restrictive. In this paper, we propose a learning paradigm in which the learner communicates its belief (i.e. predicted label) about the actively chosen example to the teacher. The teacher then confirms or rejects the predicted label. More importantly, if rejected, the teacher communicates an explanation for why the learner’s belief was wrong. This explanation allows the learner to propagate the feedback provided by the teacher to many unlabeled images. This allows a classifier to better learn from its mistakes, leading to accelerated discriminative learning of visual concepts even with few labeled images. In order for such communication to be feasible, it is crucial to have a language that both the human supervisor and the machine learner understand. Attributes provide precisely this channel. They are human-interpretable mid-level visual concepts shareable across categories, e.g. “furry”, “spacious”, etc. We advocate the use of attributes for a supervisor to provide feedback to a classifier and directly communicate his knowledge of the world. We employ a straightforward approach to incorporate this feedback in the classifier, and demonstrate its power on a variety of visual recognition scenarios such as image classification and annotation. This application of attributes for providing feedback to classifiers is very powerful, and has not been explored in the community. It introduces a new mode of supervision, and opens up several avenues for future research.

Amar Parkash, Devi Parikh
Constrained Semi-Supervised Learning Using Attributes and Comparative Attributes

We consider the problem of semi-supervised bootstrap learning for scene categorization. Existing semi-supervised approaches are typically unreliable and face semantic drift because the learning task is under-constrained. This is primarily because they ignore the strong interactions that often exist between scene categories, such as the common attributes shared across categories as well as the attributes which make one scene different from another. The goal of this paper is to exploit these relationships and constrain the semi-supervised learning problem. For example, the knowledge that an image is an auditorium can improve the labeling of amphitheaters by enforcing the constraint that an amphitheater image should have more circular structures than an auditorium image. We propose constraints based on mutual exclusion, binary attributes and comparative attributes, and show that they help us to constrain the learning problem and avoid semantic drift. We demonstrate the effectiveness of our approach through extensive experiments, including results on a very large dataset of one million images.

Abhinav Shrivastava, Saurabh Singh, Abhinav Gupta

Poster Session 4

Renormalization Returns: Hyper-renormalization and Its Applications

The technique of “renormalization” for geometric estimation attracted much attention when it was proposed in the early 1990s for having higher accuracy than any other then known method. Later, it was replaced by minimization of the reprojection error. This paper points out that renormalization can be modified so that it outperforms reprojection error minimization. The key fact is that renormalization directly specifies equations to solve, just as the “estimating equation” approach in statistics, rather than minimizing some cost. Exploiting this fact, we modify the problem so that the solution has zero bias up to high-order error terms; we call the resulting scheme hyper-renormalization. We apply it to ellipse fitting to demonstrate that it indeed surpasses reprojection error minimization. We conclude that it is the best method available today.
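
Schematically, and in our own notation (the paper's exact matrices differ), renormalization-type estimators solve a generalized eigenvalue problem rather than minimizing a cost:

```latex
% Generalized-eigenproblem form of renormalization-type estimators (our
% hedged reconstruction from the abstract). For data constraints
% $\xi_\alpha^{\top}\theta = 0$:
\[
  M\,\theta = \lambda\, N\,\theta ,
  \qquad
  M = \frac{1}{n}\sum_{\alpha=1}^{n} W_\alpha\, \xi_\alpha \xi_\alpha^{\top} ,
\]
% iterating on the weights $W_\alpha$; the correction matrix $N$ is chosen
% so that the bias of $\theta$ cancels up to high-order error terms, and
% that choice is what distinguishes hyper-renormalization from the
% classical scheme.
```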

Kenichi Kanatani, Ali Al-Sharadqah, Nikolai Chernov, Yasuyuki Sugaya
Scale Robust Multi View Stereo

We present a Multi View Stereo approach for huge unstructured image datasets that can deal with large variations in the surface sampling rates of individual images. Our method always reconstructs surface parts at the best available resolution. It considers scaling not only for large scale differences, but also between arbitrarily small ones, for a weighted merging of the best partial reconstructions. We create depth maps with our GPU-based depth map algorithm, which also performs normal optimization. It matches several images, found with a heuristic image selection method, to a reference image. We remove outliers by comparing depth maps against each other with a fast but reliable GPU approach. Then we merge the different reconstructions from depth maps in 3D space by selecting the best points and optimizing them with the points not selected. Finally, we create the surface using a Delaunay graph cut.

Christian Bailer, Manuel Finckh, Hendrik P. A. Lensch
Laplacian Meshes for Monocular 3D Shape Recovery

We show that by extending the Laplacian formalism, which was first introduced in the Graphics community to regularize 3D meshes, we can turn the monocular 3D shape reconstruction of a deformable surface given correspondences with a reference image into a well-posed problem. Furthermore, this does not require any training data and eliminates the need to pre-align the reference shape with the one to be reconstructed, as was done in earlier methods.

Jonas Östlund, Aydin Varol, Dat Tien Ngo, Pascal Fua
Soft Inextensibility Constraints for Template-Free Non-rigid Reconstruction

In this paper, we exploit an inextensibility prior as a way to better constrain the highly ambiguous problem of non-rigid reconstruction from monocular views. While this widely applicable prior has been used before, combined with the strong assumption of a known 3D template, our work achieves template-free reconstruction using only inextensibility constraints. We show how to formulate an energy function that includes soft inextensibility constraints and rely on existing discrete optimisation methods to minimise it. Our method has all of the following advantages: (i) it can be applied to two tasks that have so far been considered independently – template-based reconstruction and non-rigid structure from motion – producing results comparable to or better than those of state-of-the-art methods; (ii) it can perform template-free reconstruction from as few as two images; and (iii) it does not require post-processing stitching or surface smoothing.

Sara Vicente, Lourdes Agapito
Spatiotemporal Descriptor for Wide-Baseline Stereo Reconstruction of Non-rigid and Ambiguous Scenes

This paper studies the use of temporal consistency to match appearance descriptors and handle complex ambiguities when computing dynamic depth maps from stereo. Previous attempts have designed 3D descriptors over the space-time volume, which have mostly been used for monocular action recognition, as they cannot deal with perspective changes. Our approach is based on a state-of-the-art 2D dense appearance descriptor which we extend in time by means of optical flow priors, and can be applied to wide-baseline stereo setups. The basic idea behind our approach is to capture the changes around a feature point in time instead of trying to describe the spatiotemporal volume. We demonstrate its effectiveness on very ambiguous synthetic video sequences with ground truth data, as well as real sequences.

Eduard Trulls, Alberto Sanfeliu, Francesc Moreno-Noguer
Elevation Angle from Reflectance Monotonicity: Photometric Stereo for General Isotropic Reflectances

This paper exploits the monotonicity of general isotropic reflectances for estimating the elevation angles of surface normals given their azimuth angles. Assuming that the reflectance includes at least one lobe that is a monotonic function of the angle between the surface normal and the half-vector (the bisector of the lighting and viewing directions), we prove that elevation angles can be uniquely determined when the surface is observed under varying directional lights densely and uniformly distributed over the hemisphere. We evaluate our method in experiments using synthetic and real data to show its wide applicability, even when the assumption does not strictly hold. By combining with an existing method for azimuth angle estimation, our method derives complete surface normal estimates for general isotropic reflectances.

Boxin Shi, Ping Tan, Yasuyuki Matsushita, Katsushi Ikeuchi
Local Log-Euclidean Covariance Matrix (L2ECM) for Image Representation and Its Applications

This paper presents Local Log-Euclidean Covariance Matrix (L2ECM) to represent neighboring image properties by capturing the correlation of various image cues. Our work is inspired by the structure tensor, which computes the second-order moment of image gradients to represent local image properties, and by Diffusion Tensor Imaging, which produces a tensor-valued image characterizing the local tissue structure. Our approach begins with the extraction of raw features consisting of multiple image cues. For each pixel we compute a covariance matrix in its neighboring region, producing a tensor-valued image. The covariance matrices are symmetric and positive-definite (SPD), and form a Riemannian manifold. In the Log-Euclidean framework, the SPD matrices form a Lie group equipped with a Euclidean space structure, which enables common Euclidean operations in the logarithm domain. Hence, we compute the covariance matrix logarithm, obtaining a pixel-wise symmetric matrix. After half-vectorization we obtain the vector-valued L2ECM image, which can be flexibly handled with Euclidean operations while preserving the geometric structure of SPD matrices. The L2ECM features can be used in diverse image or vision tasks. We demonstrate some applications of its statistical modeling by a simple second-order central moment and achieve promising performance.
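
A minimal sketch of the pipeline as described, with our own regularization and scaling choices:

```python
# Per-pixel L2ECM-style descriptor: neighborhood covariance (an SPD matrix)
# -> matrix logarithm via eigendecomposition -> half-vectorization into an
# ordinary Euclidean vector. Illustrative only; feature extraction and
# windowing follow the paper.
import numpy as np

def l2ecm_descriptor(features):
    """features: (N, d) raw feature vectors from one pixel's neighborhood."""
    # small ridge keeps the matrix strictly positive-definite
    cov = np.cov(features, rowvar=False) + 1e-6 * np.eye(features.shape[1])
    w, V = np.linalg.eigh(cov)               # SPD -> eigendecomposition
    log_cov = (V * np.log(w)) @ V.T           # matrix logarithm
    d = log_cov.shape[0]
    iu = np.triu_indices(d)
    vec = log_cov[iu]                         # half-vectorization
    # off-diagonal entries appear once in vec but twice in the matrix;
    # scale by sqrt(2) so Euclidean distance matches the Frobenius norm
    off = iu[0] != iu[1]
    vec[off] *= np.sqrt(2.0)
    return vec
```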

Peihua Li, Qilong Wang
Ensemble Partitioning for Unsupervised Image Categorization

While the quality of object recognition systems can strongly benefit from more data, human annotation and labeling can hardly keep pace. This motivates the use of autonomous and unsupervised learning methods. In this paper, we present a simple, yet effective method for unsupervised image categorization, which relies on discriminative learners. Since automatically obtaining error-free labeled training data for the learners is infeasible, we propose the concept of the weak training (WT) set. WT sets have various deficiencies, but still carry useful information. Training on a single WT set cannot result in good performance, so we design a random walk sampling scheme to create a series of diverse WT sets. This naturally allows our categorization learning to leverage ensemble learning techniques. In particular, for each WT set, we train a max-margin classifier to further partition the whole dataset to be categorized. By doing so, each WT set leads to a base partitioning of the dataset, and all the base partitionings are combined into an ensemble proximity matrix. The final categorization is completed by feeding this proximity matrix into a spectral clustering algorithm. Experiments on a variety of challenging datasets show that our method outperforms competing methods by a considerable margin.

Dengxin Dai, Mukta Prasad, Christian Leistner, Luc Van Gool
Set Based Discriminative Ranking for Recognition

Recently, both face recognition and body-based person re-identification have been extended from single-image based scenarios to video-based or, even more generally, image-set based problems. Set-based recognition brings new research and application opportunities while at the same time raising great modeling and optimization challenges. How to make the best use of the available multiple samples for each individual, while not being disturbed by the great within-set variations, is in our view the major issue. Due to the difficulty of designing a globally optimal learning model, most existing solutions are still based on unsupervised matching, which can be further categorized into three groups: a) set-based signature generation, b) direct set-to-set matching, and c) between-set distance finding. The first two count on good feature representation, while the third explores data set structure and set-based distance measurement. Their main shortcoming is the lack of learning-based discriminative ability. In this paper, we propose a set-based discriminative ranking model (SBDR), which iterates between set-to-set distance finding and discriminative feature space projection to achieve simultaneous optimization of the two. Extensive experiments on widely-used face recognition and person re-identification datasets not only demonstrate the superiority of our approach, but also shed some light on its properties and application domain.

Yang Wu, Michihiko Minoh, Masayuki Mukunoki, Shihong Lao
A Global Hypotheses Verification Method for 3D Object Recognition

We propose a novel approach for verifying model hypotheses in cluttered and heavily occluded 3D scenes. Instead of verifying one hypothesis at a time, as done by most state-of-the-art 3D object recognition methods, we determine object and pose instances according to a global optimization stage based on a cost function which encompasses geometrical cues. Peculiar to our approach is the inherent ability to detect significantly occluded objects without increasing the amount of false positives, so that the operating point of the object recognition algorithm can nicely move toward a higher recall without sacrificing precision. Our approach outperforms the state of the art on a challenging dataset including 35 household models obtained with the Kinect sensor, as well as on the standard 3D object recognition benchmark dataset.

Aitor Aldoma, Federico Tombari, Luigi Di Stefano, Markus Vincze
Are You Really Smiling at Me? Spontaneous versus Posed Enjoyment Smiles

Smiling is an indispensable element of nonverbal social interaction, and automatic distinction between spontaneous and posed expressions is important for visual analysis of social signals. Therefore, in this paper, we propose a method to distinguish between spontaneous and posed enjoyment smiles by using the dynamics of eyelid, cheek, and lip corner movements. The discriminative power of these movements, and the effect of different fusion levels, are investigated on multiple databases. Our results improve on the state of the art. We also introduce the largest spontaneous/posed enjoyment smile database collected to date, and report new empirical and conceptual findings on smile dynamics. The collected database consists of 1240 samples of 400 subjects. Moreover, it has the unique property of covering an age range from 8 to 76 years. Large-scale experiments on the new database indicate that eyelid dynamics are highly relevant for smile classification, and that there are age-related differences in smile dynamics.

Hamdi Dibeklioğlu, Albert Ali Salah, Theo Gevers
Efficient Monte Carlo Sampler for Detecting Parametric Objects in Large Scenes

Point processes have demonstrated efficiency and competitiveness when addressing object recognition problems in vision. However, simulating these mathematical models is a difficult task, especially for large scenes. Existing samplers suffer from mediocre performance in terms of computation time and stability. We propose a new sampling procedure based on a Monte Carlo formalism. Our algorithm exploits the Markovian properties of point processes to perform the sampling in parallel. This procedure is embedded into a data-driven mechanism such that the points are non-uniformly distributed in the scene. The performance of the sampler is analyzed through a set of experiments on various object recognition problems from large scenes, and through comparisons to existing algorithms.

Yannick Verdié, Florent Lafarge
Supervised Geodesic Propagation for Semantic Label Transfer

In this paper we propose a novel semantic label transfer method using supervised geodesic propagation (SGP). We use supervised learning to guide the seed selection and the label propagation. Given an input image, we first retrieve a set of similar images from annotated databases. A JointBoost model is learned on the similar image set of the input image. The recognition proposal map of the input image is then inferred by this learned model. The initial distance map is defined by the proposal map: the higher the probability, the smaller the distance. In each iteration of the geodesic propagation, the seed is selected as the undetermined superpixel with the smallest distance. We learn a classifier as an indicator of whether to propagate labels between two neighboring superpixels. The training samples of the indicator are annotated neighboring pairs from the similar image set. The geodesic distances of the seed’s neighbors are updated according to the combination of the texture and boundary features and the indicator value. Experiments on three datasets show that our method outperforms traditional learning-based methods and a previous label transfer method for semantic segmentation.

Xiaowu Chen, Qing Li, Yafei Song, Xin Jin, Qinping Zhao
Bayesian Face Revisited: A Joint Formulation

In this paper, we revisit the classical Bayesian face recognition method by Baback Moghaddam et al. and propose a new joint formulation. The classical Bayesian method models the appearance difference between two faces. We observe that this “difference” formulation may reduce the separability between classes. Instead, we model two faces jointly with an appropriate prior on the face representation. Our joint formulation leads to an EM-like model learning at training time and an efficient, closed-form computation at test time. In extensive experimental evaluations, our method is superior to the classical Bayesian face method and many other supervised approaches. Our method achieves 92.4% test accuracy on the challenging Labeled Faces in the Wild (LFW) dataset. Compared with the current best commercial system, we reduce the error rate by 10%.
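
In hedged form, the joint model described here can be written as identity plus intra-personal variation, with verification by a likelihood ratio over the joint distribution of a pair (notation ours):

```latex
% Identity-plus-variation model and the pairwise likelihood-ratio test
% (standard form of the joint formulation described above; notation ours).
\[
  x = \mu + \varepsilon, \qquad
  \mu \sim \mathcal{N}(0, S_\mu), \quad
  \varepsilon \sim \mathcal{N}(0, S_\varepsilon),
\]
\[
  r(x_1, x_2) = \log
    \frac{P(x_1, x_2 \mid \text{same})}{P(x_1, x_2 \mid \text{different})},
  \qquad
  \operatorname{cov}(x_1, x_2) =
    \begin{cases}
      S_\mu & \text{same person,} \\
      0     & \text{different persons,}
    \end{cases}
\]
% so both hypotheses give zero-mean Gaussians differing only in their
% cross-covariance, and $r$ admits the closed form evaluated at test time.
```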

Dong Chen, Xudong Cao, Liwei Wang, Fang Wen, Jian Sun
Beyond Bounding-Boxes: Learning Object Shape by Model-Driven Grouping

Visual recognition requires learning object models from training data. Commonly, training samples are annotated by marking only the bounding-box of objects, since this appears to be the best trade-off between labeling information and effectiveness. However, objects are typically not box-shaped. Thus, the usual parametrization of object hypotheses by only their location, scale and aspect ratio seems inappropriate, since the box contains a significant amount of background clutter. Most important, however, is that object shape becomes explicit only once objects are segregated from the background. Segmentation is an ill-posed problem, and so we propose an approach for learning object models for detection while, simultaneously, learning to segregate objects from clutter and extracting their overall shape. For this purpose, we exclusively use bounding-box annotated training data. The approach groups fragmented object regions using the Multiple Instance Learning (MIL) framework to obtain a meaningful representation of object shape which, at the same time, crops away distracting background clutter to improve the appearance representation.

Antonio Monroy, Björn Ommer
In Defence of Negative Mining for Annotating Weakly Labelled Data

We propose a novel approach to annotating weakly labelled data. In contrast to many existing approaches that perform annotation by seeking clusters of self-similar exemplars (minimising intra-class variance), we perform image annotation by selecting exemplars that have never occurred before in the much larger, and strongly annotated, negative training set (maximising inter-class variance). Compared to existing methods, our approach is fast, robust, and obtains state-of-the-art results on two challenging datasets – VOC 2007 (all poses) and the MSR2 action dataset, where we obtain a 10% increase. Moreover, this use of negative mining complements existing methods that seek to minimise the intra-class variance, and can be readily integrated with many of them.

Parthipan Siva, Chris Russell, Tao Xiang
Describing Clothing by Semantic Attributes

Describing clothing appearance with semantic attributes is an appealing technique for many important applications. In this paper, we propose a fully automated system that is capable of generating a list of nameable attributes for clothes on the human body in unconstrained images. We extract low-level features in a pose-adaptive manner, and combine complementary features for learning attribute classifiers. Mutual dependencies between the attributes are then explored by a Conditional Random Field to further improve the predictions from independent classifiers. We validate the performance of our system on a challenging clothing attribute dataset, and introduce a novel application of dressing style analysis that utilizes the semantic attributes produced by our system.

Huizhong Chen, Andrew Gallagher, Bernd Girod
Graph Matching via Sequential Monte Carlo

Graph matching is a powerful tool for computer vision and machine learning. In this paper, a novel approach to graph matching is developed based on the sequential Monte Carlo framework. By constructing a sequence of intermediate target distributions, the proposed algorithm sequentially performs sampling and importance resampling to maximize the graph matching objective. Through the sequential sampling procedure, the algorithm effectively collects potential matches under one-to-one matching constraints to avoid the adverse effect of outliers and deformation. Experimental evaluations on synthetic graphs and real images demonstrate its higher robustness to deformation and outliers.
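
A generic sampling/importance-resampling skeleton of the kind the abstract describes is sketched below; the proposal, the matching objective, and the one-to-one constraint handling are problem-specific and only stubbed here:

```python
# Generic sequential Monte Carlo maximization: propose new particles, weight
# them under an annealed (tempered) target, resample, and track the best
# particle seen. The callbacks init/propose/score are problem-specific
# stubs supplied by the caller.
import numpy as np

def smc_maximize(init, propose, score, n_particles=100, n_steps=20):
    """score(x) >= 0; higher is better. Returns the best particle found."""
    particles = [init() for _ in range(n_particles)]
    best = max(particles, key=score)
    for t in range(1, n_steps + 1):
        beta = t / n_steps                        # annealing schedule
        particles = [propose(p) for p in particles]
        w = np.array([score(p) ** beta for p in particles], dtype=float)
        if w.sum() == 0:
            continue                              # degenerate weights
        w /= w.sum()
        idx = np.random.choice(n_particles, size=n_particles, p=w)
        particles = [particles[i] for i in idx]   # importance resampling
        cand = max(particles, key=score)
        if score(cand) > score(best):
            best = cand
    return best
```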

Yumin Suh, Minsu Cho, Kyoung Mu Lee
Jet-Based Local Image Descriptors

We present a general novel image descriptor based on higher-order differential geometry and investigate the effect of common descriptor choices. Our investigation is twofold in that we develop a jet-based descriptor and perform a comparative evaluation with current state-of-the-art descriptors on the recently released DTU Robot dataset. We demonstrate how the use of higher-order image structures enables us to reduce the descriptor dimensionality while still achieving very good performance. The descriptors are tested in a variety of scenarios including large changes in scale, viewing angle and lighting. We show that the proposed jet-based descriptor is superior to the state of the art for DoG interest points and shows competitive performance for the other tested interest points.

Anders Boesen Lindbo Larsen, Sune Darkner, Anders Lindbjerg Dahl, Kim Steenstrup Pedersen
Abnormal Object Detection by Canonical Scene-Based Contextual Model

Contextual modeling is a critical issue in scene understanding. Object detection accuracy can be improved by exploiting tendencies that are common among object configurations. However, conventional contextual models only exploit the tendencies of normal objects; abnormal objects that do not follow the same tendencies are hard to detect through such contextual models. This paper proposes a novel generative model that detects abnormal objects by meeting four proposed criteria of success. This model generates normal as well as abnormal objects, each following their respective tendencies. Moreover, this generation is controlled by a latent scene variable. All latent variables of the proposed model are predicted through optimization via population-based Markov Chain Monte Carlo, which has a relatively short convergence time. We present a new abnormal dataset classified into three categories to thoroughly measure the accuracy of the proposed model for each category; the results demonstrate the superiority of our proposed approach over existing methods.

Sangdon Park, Wonsik Kim, Kyoung Mu Lee
Shapecollage: Occlusion-Aware, Example-Based Shape Interpretation

This paper presents an example-based method to interpret a 3D shape from a single image depicting that shape. A major difficulty in applying an example-based approach to shape interpretation is the combinatorial explosion of shape possibilities that occurs at occluding contours. Our key technical contribution is a new shape patch representation and corresponding pairwise compatibility terms that allow for flexible matching of overlapping patches, avoiding the combinatorial explosion by allowing patches to explain only the parts of the image they best fit. We infer the best set of localized shape patches over a graph of keypoints at multiple scales to produce a discontinuous shape representation we term a shape collage. To reconstruct a smooth result, we fit a surface to the collage using the predicted confidence of each shape patch. We demonstrate the method on shapes depicted in line drawing, diffuse and glossy shading, and textured styles.

Forrester Cole, Phillip Isola, William T. Freeman, Frédo Durand, Edward H. Adelson
Interactive Facial Feature Localization

We address the problem of interactive facial feature localization from a single image. Our goal is to obtain an accurate segmentation of facial features in high-resolution images under a variety of pose, expression, and lighting conditions. Although there has been significant work in facial feature localization, we are addressing a new application area, namely facilitating intelligent high-quality editing of portraits, which brings requirements not met by existing methods. We propose an improvement to the Active Shape Model that allows for greater independence among the facial components and improves the appearance fitting step by introducing a Viterbi optimization process that operates along the facial contours. Despite the improvements, we do not expect perfect results in all cases. We therefore introduce an interaction model whereby a user can efficiently guide the algorithm towards a precise solution. We introduce the Helen Facial Feature Dataset, consisting of annotated portrait images gathered from Flickr that are more diverse and challenging than existing datasets. We present experiments that compare our automatic method to published results, as well as a quantitative evaluation of the effectiveness of our interactive method.

Vuong Le, Jonathan Brandt, Zhe Lin, Lubomir Bourdev, Thomas S. Huang
Propagative Hough Voting for Human Activity Recognition

Hough-transform based voting has been successfully applied to both object and activity detection. However, most current Hough voting methods suffer when insufficient training data is provided. To address this problem, we propose propagative Hough voting for activity analysis. Instead of letting local features vote individually, we perform feature voting using random projection trees (RPT), which leverage the low-dimensional manifold structure to match feature points in the high-dimensional feature space. Our RPT can index the unlabeled feature points in an unsupervised way. After the trees are constructed, the label and spatial-temporal configuration information are propagated from the training samples to the testing data via the RPT. The proposed activity recognition method does not rely on human detection and tracking, and handles the scale and intra-class variations of activity patterns well. The superior performance on two benchmark activity datasets validates that our method outperforms the state-of-the-art techniques, not only when there is sufficient training data, as in activity recognition, but also when there is limited training data, as in activity search with one query example.

Gang Yu, Junsong Yuan, Zicheng Liu
Spatio-Temporal Phrases for Activity Recognition

Local-feature-based approaches have become popular for activity recognition. A local feature captures the movement and appearance of a local region in a video, and thus can be ambiguous; e.g., it cannot tell whether a movement comes from a person’s hand or foot when the camera is far away from the person. To better distinguish different types of activities, combinations of local features have been proposed to encode the relationships between local movements. Due to computational limits, previous work only creates combinations from neighboring features in space and/or time. In this paper, we propose an approach that efficiently identifies both local and long-range motion interactions; taking the “push” activity as an example, our approach can capture the combination of the hand movement of one person and the foot response of another person, local features that are both spatially and temporally far from each other. Our computational complexity is linear in the number of local features in a video. Extensive experiments show that our approach is generically effective for recognizing a wide variety of activities, including activities spanning a long term, compared to a number of state-of-the-art methods.

Yimeng Zhang, Xiaoming Liu, Ming-Ching Chang, Weina Ge, Tsuhan Chen
Complex Events Detection Using Data-Driven Concepts

Automatic event detection in a large collection of unconstrained videos is a challenging and important task. The key issue is to describe long, complex videos with high-level semantic descriptors, which should capture the regularity of events in the same category while distinguishing events from different categories. This paper proposes a novel unsupervised approach to discover data-driven concepts from multi-modality signals (audio, scene and motion) to describe the high-level semantics of videos. Our method consists of three main components: we first learn low-level features separately for the three modalities; second, we discover data-driven concepts based on the statistics of the learned features mapped to a low-dimensional space using deep belief nets (DBNs); finally, a compact and robust sparse representation is learned to jointly model the concepts from all three modalities. Extensive experimental results on a large in-the-wild dataset show that our proposed method significantly outperforms state-of-the-art methods.

Yang Yang, Mubarak Shah
Directional Space-Time Oriented Gradients for 3D Visual Pattern Analysis

Various visual tasks, such as the recognition of human actions, gestures, and facial expressions, and the classification of dynamic textures, require the modeling and representation of spatio-temporal information. In this paper, we propose representing space-time patterns using directional spatio-temporal oriented gradients. In the proposed approach, a 3D video patch is represented by a histogram of oriented gradients over nine symmetric spatio-temporal planes. Video comparison is achieved through a positive definite similarity kernel that is learnt by multiple kernel learning. A rich spatio-temporal descriptor with a simple trade-off between discriminative power and invariance properties is thereby obtained. To evaluate the proposed approach, we consider three challenging visual recognition tasks, namely the classification of dynamic textures, human gestures and human actions. Our evaluations indicate that the proposed approach attains significant improvements in recognition accuracy in comparison to state-of-the-art methods such as LBP-TOP, 3D-SIFT, HOG3D, tensor canonical correlation analysis, and dynamical fractal analysis.

Ehsan Norouznezhad, Mehrtash T. Harandi, Abbas Bigdeli, Mahsa Baktash, Adam Postula, Brian C. Lovell
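
As a rough illustration of the descriptor above, the sketch below histograms gradient orientations within individual spatio-temporal planes of a 3D video patch. For brevity it uses only the three canonical planes (XY, XT, YT) rather than the paper's nine symmetric planes, and it omits the learnt similarity kernel; the bin count and normalization are assumptions.

import numpy as np

def plane_hog(patch, axes, n_bins=8):
    """Orientation histogram of gradients within one spatio-temporal plane.

    patch: 3D array indexed (t, y, x); axes: the two axes spanning the
    plane, e.g. (1, 2) for XY, (0, 2) for XT, (0, 1) for YT.
    """
    g0 = np.gradient(patch, axis=axes[0])
    g1 = np.gradient(patch, axis=axes[1])
    mag = np.hypot(g0, g1)
    ang = np.arctan2(g0, g1) % np.pi                     # unsigned orientation
    bins = np.clip((ang / np.pi * n_bins).astype(int), 0, n_bins - 1)
    hist = np.bincount(bins.ravel(), weights=mag.ravel(), minlength=n_bins)
    return hist / (hist.sum() + 1e-8)

def patch_descriptor(patch):
    # three canonical planes only; the paper's version uses nine
    planes = [(1, 2), (0, 2), (0, 1)]
    return np.concatenate([plane_hog(patch, a) for a in planes])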
Learning to Recognize Unsuccessful Activities Using a Two-Layer Latent Structural Model

In this paper, we propose to recognize unsuccessful activities (e.g., one tries to dress himself but fails), which have much more complex temporal structures than successful ones, since we do not know when the activity performer fails (called the point of failure in this paper). We develop a two-layer latent structural SVM model to tackle this problem: the first layer specifies the point of failure, and the second layer specifies the temporal positions of a number of key stages accordingly. The stages before the point of failure are successful stages, while the stages after it are background stages. Given weakly labeled training data, our training algorithm alternates between inferring the two-layer latent structure and updating the structural SVM parameters. At recognition time, our method not only recognizes unsuccessful activities but also infers the latent structure. We demonstrate the effectiveness of the proposed method on several newly collected datasets.

Qiang Zhou, Gang Wang
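
The alternating training procedure above can be sketched generically as a latent SVM loop: fix the weights and infer each sample's latent structure, then fix the structures and take subgradient steps on a hinge loss. The interfaces feat(x, h) and infer_h(w, x) are hypothetical stand-ins for the paper's joint feature map and two-layer inference (point of failure plus stage positions).

import numpy as np

def train_latent_svm(samples, labels, feat, infer_h, dim,
                     epochs=5, lr=1e-2, C=0.1):
    """Alternating training loop for a latent SVM, a minimal sketch.

    labels are +1 / -1; feat(x, h) returns a joint feature vector (dim,);
    infer_h(w, x) returns the best-scoring latent structure for x under w.
    """
    w = np.zeros(dim)
    for _ in range(epochs):
        # step 1: with w fixed, complete the latent structure of each sample
        latents = [infer_h(w, x) for x in samples]
        # step 2: with structures fixed, stochastic subgradient on hinge loss
        for x, y, h in zip(samples, labels, latents):
            phi = feat(x, h)
            grad = C * w
            if y * (w @ phi) < 1:          # margin violated
                grad = grad - y * phi
            w -= lr * grad
    return w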
Action Recognition Using Subtensor Constraint

Human action recognition from videos has drawn tremendous interest over the past years. In this work, we first show that, if the first two views are already matched and the fundamental matrix between them is known, the trifocal tensor resides in a twelve-dimensional subspace of the original space; we refer to this as the subtensor. We then use the subtensor to perform action recognition under three views. We find that treating the two template views separately, or ignoring the known correspondences between the first two views, discards a lot of useful information. Experiments and datasets are designed to demonstrate the effectiveness and improved performance of the proposed approach.

Qiguang Liu, Xiaochun Cao
Globally Optimal Closed-Surface Segmentation for Connectomics

We address the problem of partitioning a volume image into a previously unknown number of segments, based on a likelihood of merging adjacent supervoxels. Towards this goal, we adapt a higher-order probabilistic graphical model that makes the duality between supervoxels and their joint faces explicit and ensures that merging decisions are consistent and surfaces of final segments are closed. First, we propose a practical cutting-plane approach to solve the MAP inference problem to global optimality despite its NP-hardness. Second, we apply this approach to challenging large-scale 3D segmentation problems for neural circuit reconstruction (Connectomics), demonstrating the advantage of this higher-order model over independent decisions and finite-order approximations.

Bjoern Andres, Thorben Kroeger, Kevin L. Briggman, Winfried Denk, Natalya Korogod, Graham Knott, Ullrich Koethe, Fred A. Hamprecht
Reduced Analytical Dependency Modeling for Classifier Fusion

This paper addresses the independence assumption issue in the classifier fusion process. In the last decade, dependency modeling techniques were developed under specific assumptions that may not be valid in practical applications. In this paper, using analytical functions of the posterior probabilities of each feature, we propose a new framework to model dependency without those assumptions. With the analytical dependency model (ADM), we give a condition on the marginal distributions that is equivalent to the independence assumption, and show that the proposed ADM can model dependency. Since an ADM may contain an infinite number of undetermined coefficients, we further propose a reduced form of ADM based on the convergence properties of analytical functions. Finally, under the regularized least squares criterion, an optimal Reduced Analytical Dependency Model (RADM) is learned by approximating posterior probabilities such that all training samples are correctly classified. Experimental results show that the proposed RADM outperforms existing classifier fusion methods on Digit, Flower, Face and Human Action databases.

Andy Jinhua Ma, Pong Chi Yuen
Learning to Match Appearances by Correlations in a Covariance Metric Space

This paper addresses the problem of appearance matching across disjoint camera views. Significant appearance changes, caused by variations in view angle, illumination and object pose, make the problem challenging. We formulate the appearance matching problem as the task of learning a model that selects the most descriptive features for a specific class of objects. Learning is performed in a covariance metric space using an entropy-driven criterion. Our main idea is that different regions of the object appearance ought to be matched using different strategies to obtain a distinctive representation. The proposed technique has been successfully applied to the person re-identification problem, in which a human appearance has to be matched across non-overlapping cameras. We demonstrate that our approach improves state-of-the-art performance in the context of pedestrian recognition.

Sławomir Bąk, Guillaume Charpiat, Etienne Corvée, François Brémond, Monique Thonnat
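
A covariance metric space of the kind used above can be sketched as follows: a region is described by the covariance of its per-pixel feature vectors, and two descriptors are compared with the affine-invariant distance on symmetric positive definite matrices. The feature choice and regularization constant are assumptions, and the entropy-driven learning itself is not shown.

import numpy as np
from scipy.linalg import eigvalsh

def covariance_descriptor(pixel_feats):
    """Covariance of per-pixel feature vectors (n_pixels, d) for one region.
    The small ridge keeps the matrix strictly positive definite."""
    C = np.cov(pixel_feats, rowvar=False)
    return C + 1e-6 * np.eye(C.shape[0])

def spd_geodesic_distance(A, B):
    """Affine-invariant distance on SPD matrices:
    sqrt(sum_i log^2 lambda_i), with lambda_i the generalized
    eigenvalues of the pencil (B, A), i.e. of A^{-1} B."""
    lam = eigvalsh(B, A)        # solves B v = lambda A v
    return float(np.sqrt(np.sum(np.log(lam) ** 2)))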
On the Convergence of Graph Matching: Graduated Assignment Revisited

We focus on the problem of graph matching, which is fundamental in computer vision and machine learning. Many state-of-the-art methods formulate it as integer quadratic programming, which incorporates both unary and second-order terms. This formulation is in general NP-hard, so obtaining an exact solution is computationally intractable; most algorithms therefore seek an approximate optimum via relaxation techniques. This paper commences with the finding of the “circular” character of the solution chain obtained by iterative Gradient Assignment (via the Hungarian method) in the discrete domain, and proposes a method for guiding the solver to converge to a fixed point, resulting in a convergent algorithm for graph matching in the discrete domain. Furthermore, we extend the algorithms to their counterparts in the continuous domain, proving that the classical graduated assignment algorithm converges to a double-circular solution chain, and that the proposed Soft Constrained Graduated Assignment (SCGA) method converges to a fixed (discrete) point, both under mild conditions. Competitive performance is reported in both synthetic and real experiments.

Yu Tian, Junchi Yan, Hequan Zhang, Ya Zhang, Xiaokang Yang, Hongyuan Zha
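
For reference, the classical graduated assignment algorithm analyzed above can be sketched as follows: a soft assignment matrix is repeatedly pushed up the gradient of the matching objective, renormalized to be doubly stochastic by Sinkhorn iterations, and annealed by increasing beta. This minimal version omits the slack row and column for unmatched nodes, and the schedule constants are assumptions.

import numpy as np

def graduated_assignment(W, n, beta=0.5, beta_max=10.0, rate=1.5,
                         inner=4, sinkhorn=10):
    """Classical graduated assignment for graph matching, a minimal sketch.

    W is the (n*n, n*n) affinity matrix between candidate assignments;
    returns a soft (n, n) assignment matrix annealed towards a permutation.
    """
    M = np.full((n, n), 1.0 / n)
    while beta < beta_max:
        for _ in range(inner):
            Q = (W @ M.ravel()).reshape(n, n)   # gradient of 0.5 m^T W m
            M = np.exp(beta * (Q - Q.max()))    # shift for numerical stability
            for _ in range(sinkhorn):           # project to doubly stochastic
                M /= M.sum(axis=1, keepdims=True)
                M /= M.sum(axis=0, keepdims=True)
        beta *= rate                            # anneal towards a hard assignment
    return M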
Image Annotation Using Metric Learning in Semantic Neighbourhoods

Automatic image annotation aims at predicting a set of textual labels for an image that describe its semantics. These are usually taken from an annotation vocabulary of a few hundred labels. Because of the large vocabulary, there is high variance in the number of images corresponding to different labels (“class-imbalance”). Additionally, due to the limitations of manual annotation, a significant number of available images are not annotated with all the relevant labels (“weak-labelling”). These two issues badly affect the performance of most existing image annotation models. In this work, we propose 2PKNN, a two-step variant of the classical K-nearest neighbour algorithm, that addresses these two issues in the image annotation task. The first step of 2PKNN uses “image-to-label” similarities, while the second step uses “image-to-image” similarities, thus combining the benefits of both. Since the performance of nearest-neighbour based methods greatly depends on how features are compared, we also propose a metric learning framework over 2PKNN that jointly learns weights for multiple features as well as distances. This is done in a large-margin set-up by generalizing a well-known (single-label) classification metric learning algorithm for multi-label prediction. For scalability, we implement it by alternating between stochastic sub-gradient descent and projection steps.

Extensive experiments demonstrate that, though conceptually simple, 2PKNN alone performs comparably to the current state-of-the-art on three challenging image annotation datasets, and shows significant improvements after metric learning.

Yashaswi Verma, C. V. Jawahar
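
A minimal sketch of the two-pass idea behind 2PKNN, under simplifying assumptions (a single feature, Euclidean distances, exponential vote weights, and no learnt metric): pass 1 builds a semantic neighbourhood by keeping the K nearest training images per label, which counters class imbalance; pass 2 scores labels by distance-weighted votes from that pool.

import numpy as np

def two_pass_knn(train_feats, train_labels, query, K=5):
    """2PKNN-style label transfer, a minimal sketch.

    train_feats: (N, d) array; train_labels: list of N label sets;
    query: (d,) feature vector. Returns labels ranked by relevance.
    """
    dist = np.linalg.norm(train_feats - query, axis=1)
    vocab = sorted({l for ls in train_labels for l in ls})
    # pass 1: semantic neighbourhood -- K nearest images *per label*,
    # so rare labels stay represented despite class imbalance
    pool = set()
    for label in vocab:
        idx = [i for i, ls in enumerate(train_labels) if label in ls]
        pool.update(sorted(idx, key=lambda i: dist[i])[:K])
    # pass 2: distance-weighted voting from the pooled neighbours
    scores = {l: 0.0 for l in vocab}
    for i in pool:
        w = np.exp(-dist[i])
        for l in train_labels[i]:
            scores[l] += w
    return sorted(vocab, key=scores.get, reverse=True)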
Dynamic Programming for Approximate Expansion Algorithm

The expansion algorithm is a popular optimization method for labeling problems. For many common energies, each expansion step can be solved optimally with a min-cut/max-flow algorithm. While the observed performance of max-flow within the expansion algorithm is fast, its theoretical time complexity is worse than linear in the number of pixels. Recently, Dynamic Programming (DP) was shown to be useful for 2D labeling problems via a “tiered labeling” algorithm, although the structure of the allowed (tiered) labelings is quite restrictive. We show another use of DP in the 2D labeling case. Namely, we use DP for an approximate expansion step. Our expansion-like moves are more limited in structure than the max-flow expansion moves; in fact, they are more restrictive than the tiered labeling structure, but their complexity is linear in the number of pixels, making them extremely efficient in practice. We illustrate the performance of our DP-expansion on the Potts energy, but our algorithm can be used with any pairwise energies. We achieve better efficiency with almost the same energy compared to the max-flow expansion moves.

Olga Veksler
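
To show the flavor of a DP-based expansion step, the sketch below solves one alpha-expansion move exactly along a single image row: each pixel makes a binary keep-or-switch decision, with a Potts penalty when neighbouring decisions disagree. This 1D simplification ignores disagreements among kept labels and inter-row coupling, which the paper's moves do handle.

import numpy as np

def row_expansion_move(unary_keep, unary_alpha, lam):
    """Exact binary expansion move along one image row via DP.

    unary_keep[i]:  data cost of keeping pixel i's current label
    unary_alpha[i]: data cost of switching pixel i to label alpha
    lam:            Potts penalty between disagreeing neighbour decisions
    Returns a boolean array, True where the pixel takes label alpha.
    """
    n = len(unary_keep)
    cost = np.array([unary_keep[0], unary_alpha[0]], dtype=float)
    back = np.zeros((n, 2), dtype=int)
    for i in range(1, n):
        u = (unary_keep[i], unary_alpha[i])
        new_cost = np.empty(2)
        for s in (0, 1):
            trans = cost + lam * (np.array([0, 1]) != s)
            back[i, s] = trans.argmin()
            new_cost[s] = trans[back[i, s]] + u[s]
        cost = new_cost
    labels = np.zeros(n, dtype=bool)
    labels[-1] = bool(cost.argmin())
    for i in range(n - 1, 0, -1):   # backtrack the optimal decisions
        labels[i - 1] = bool(back[i, int(labels[i])])
    return labels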
Real-Time Compressive Tracking

It is a challenging task to develop effective and efficient appearance models for robust object tracking due to factors such as pose variation, illumination change, occlusion, and motion blur. Existing online tracking algorithms often update their models with samples from observations in recent frames. While much success has been demonstrated, numerous issues remain to be addressed. First, although these adaptive appearance models are data-dependent, a sufficient amount of data does not exist for online algorithms to learn from at the outset. Second, online tracking algorithms often encounter drift problems: as a result of self-taught learning, mis-aligned samples are likely to be added and to degrade the appearance models. In this paper, we propose a simple yet effective and efficient tracking algorithm with an appearance model based on features extracted from a multi-scale image feature space with a data-independent basis. Our appearance model employs non-adaptive random projections that preserve the structure of the image feature space of objects. A very sparse measurement matrix is adopted to efficiently extract the features for the appearance model. We compress samples of foreground targets and the background using the same sparse measurement matrix. The tracking task is formulated as binary classification via a naive Bayes classifier with online update in the compressed domain. The proposed compressive tracking algorithm runs in real time and performs favorably against state-of-the-art algorithms on challenging sequences in terms of efficiency, accuracy and robustness.

Kaihua Zhang, Lei Zhang, Ming-Hsuan Yang
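
The very sparse measurement matrix at the heart of the tracker above can be sketched with an Achlioptas/Li-style random projection: most entries are zero, so compressing a high-dimensional feature vector touches only a handful of coordinates per output dimension. The sparsity parameter below is an assumption, and the online naive Bayes classifier is indicated only in comments.

import numpy as np

def sparse_measurement_matrix(m, n, s=None, rng=None):
    """Very sparse random projection matrix (Achlioptas/Li style), a sketch.

    Entries are sqrt(s) * {+1 w.p. 1/(2s), 0 w.p. 1 - 1/s, -1 w.p. 1/(2s)},
    so each of the m output features depends on only ~n/s of the n inputs.
    The choice s = n/4 below is an assumption for illustration.
    """
    rng = rng or np.random.default_rng()
    s = s or max(3, n // 4)
    p = 1.0 / (2 * s)
    R = rng.choice([1.0, 0.0, -1.0], size=(m, n), p=[p, 1 - 2 * p, p])
    return np.sqrt(s) * R

# Usage: v = R @ x compresses a high-dimensional feature vector x (n,)
# into v (m,) with m << n. Foreground and background samples are
# compressed with the same R; a naive Bayes classifier with
# online-updated Gaussian parameters per compressed feature then
# separates target from background.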
Backmatter
Metadata
Title: Computer Vision – ECCV 2012
Editors: Andrew Fitzgibbon, Svetlana Lazebnik, Pietro Perona, Yoichi Sato, Cordelia Schmid
Copyright Year: 2012
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-642-33712-3
Print ISBN: 978-3-642-33711-6
DOI: https://doi.org/10.1007/978-3-642-33712-3
