About this book

The 2010 edition of the European Conference on Computer Vision was held in Heraklion, Crete. The call for papers attracted an absolute record of 1,174 submissions. We describe here the selection of the accepted papers: Thirty-eight Area Chairs were selected, coming from Europe (18), USA and Canada (16), and Asia (4). Their selection was based on the following criteria: (1) Researchers who had served at least two times as Area Chairs within the past two years at major vision conferences were excluded; (2) Researchers who served as Area Chairs at the 2010 Computer Vision and Pattern Recognition conference were also excluded (exception: ECCV 2012 Program Chairs); (3) Minimization of overlap introduced by Area Chairs being former students and advisors; (4) 20% of the Area Chairs had never served before in a major conference; (5) The Area Chair selection process made all possible efforts to achieve a reasonable geographic distribution between countries, thematic areas and trends in computer vision. Each Area Chair was assigned between 28 and 32 papers by the Program Chairs. Based on paper content, the Area Chair recommended up to seven potential reviewers per paper. Such assignment was made using all reviewers in the database including the conflicting ones. The Program Chairs manually entered the missing conflict domains of approximately 300 reviewers. Based on the recommendation of the Area Chairs, three reviewers were selected per paper (with at least one being of the top three suggestions), with 99.

Table of Contents


Spotlights and Posters T1

Learning a Fine Vocabulary

A novel similarity measure for bag-of-words type large scale image retrieval is presented. The similarity function is learned in an unsupervised manner, requires no extra space over the standard bag-of-words method and is more discriminative than both L2-based soft assignment and Hamming embedding.

We show experimentally that the novel similarity function achieves mean average precision that is superior to any result published in the literature on a number of standard datasets. At the same time, retrieval with the proposed similarity function is faster than the reference method.

Andrej Mikulík, Michal Perdoch, Ondřej Chum, Jiří Matas

Video Synchronization Using Temporal Signals from Epipolar Lines

Time synchronization of video sequences in a multi-camera system is necessary for successfully analyzing the acquired visual information. Even if synchronization is established, its quality may deteriorate over time due to a variety of reasons, most notably frame dropping. Consequently, synchronization must be actively maintained. This paper presents a method for online synchronization that relies only on the video sequences. We introduce a novel definition of low level temporal signals computed from epipolar lines. The spatial matching of two such temporal signals is given by the fundamental matrix. Thus, no pixel correspondence is required, bypassing the problem of correspondence changes in the presence of motion. The synchronization is determined from registration of the temporal signals. We consider general video data with substantial movement in the scene, for which high level information may be hard to extract from each individual camera (e.g., computing trajectories in crowded scenes). Furthermore, a trivial correspondence between the sequences is not assumed to exist. The method is online and can be used to resynchronize video sequences every few seconds, with only a small delay. Experiments on indoor and outdoor sequences demonstrate the effectiveness of the method.

Dmitry Pundik, Yael Moses

The Generalized PatchMatch Correspondence Algorithm

PatchMatch is a fast algorithm for computing dense approximate nearest neighbor correspondences between patches of two image regions [1]. This paper generalizes PatchMatch in three ways: (1) to find k nearest neighbors, as opposed to just one, (2) to search across scales and rotations, in addition to just translations, and (3) to match using arbitrary descriptors and distances, not just sum-of-squared-differences on patch colors. In addition, we offer new search and parallelization strategies that further accelerate the method, and we show performance improvements over standard kd-tree techniques across a variety of inputs. In contrast to many previous matching algorithms, which for efficiency reasons have restricted matching to sparse interest points, or spatially proximate matches, our algorithm can efficiently find global, dense matches, even while matching across all scales and rotations. This is especially useful for computer vision applications, where our algorithm can be used as an efficient general-purpose component. We explore a variety of vision applications: denoising, finding forgeries by detecting cloned regions, symmetry detection, and object detection.

Connelly Barnes, Eli Shechtman, Dan B. Goldman, Adam Finkelstein

Automated 3D Reconstruction and Segmentation from Optical Coherence Tomography

Ultra-High Resolution Optical Coherence Tomography is a novel imaging technology that allows non-invasive, high speed, cellular resolution imaging of anatomical structures in the human eye, including the retina and the cornea.

A three-dimensional study of the cornea, for example, requires the segmentation and mutual alignment of a large number of two-dimensional images. Such segmentation has, until now, only been undertaken by hand for individual two-dimensional images; this paper presents a method for automated segmentation, opening substantial opportunities for 3D corneal imaging and analysis, using many hundreds of 2D slices.

Justin A. Eichel, Kostadinka K. Bizheva, David A. Clausi, Paul W. Fieguth

Combining Geometric and Appearance Priors for Robust Homography Estimation

The homography between a pair of images is typically computed from correspondences between keypoints, which are established by using image descriptors. When these descriptors are not reliable, either because of repetitive patterns or large amounts of clutter, additional priors need to be considered. The Blind PnP algorithm makes use of geometric priors to guide the search for matches while computing camera pose. Inspired by this, we propose a novel approach for homography estimation that combines geometric priors with appearance priors of ambiguous descriptors. More specifically, for each point we retain its best candidates according to appearance. We then prune the set of potential matches by iteratively shrinking the regions of the image that are consistent with the geometric prior. We can then successfully compute homographies between pairs of images containing highly repetitive patterns and even under oblique viewing conditions.

Eduard Serradell, Mustafa Özuysal, Vincent Lepetit, Pascal Fua, Francesc Moreno-Noguer

Real-Time Spherical Mosaicing Using Whole Image Alignment

When a purely rotating camera observes a general scene, overlapping views are related by a parallax-free warp which can be estimated by direct image alignment methods that iterate to optimise photo-consistency. However, building globally consistent mosaics from video has usually been tackled as an off-line task, while sequential methods suitable for real-time implementation have often suffered from long-term drift. In this paper we present a high performance real-time video mosaicing algorithm based on parallel image alignment via ESM (Efficient Second-order Minimisation) and global optimisation of a map of keyframes over the whole viewsphere. We present real-time results for drift-free camera rotation tracking and globally consistent spherical mosaicing from a variety of cameras in real scenes, demonstrating high global accuracy and the ability to track very rapid rotation while maintaining solid 30Hz operation. We also show that automatic camera calibration refinement can be straightforwardly built into our framework.

Steven Lovegrove, Andrew J. Davison


Adaptive Metric Registration of 3D Models to Non-rigid Image Trajectories

This paper addresses the problem of registering a 3D model, represented as a cloud of points lying over a surface, to a set of 2D deforming image trajectories in the image plane. The proposed approach can adapt to a scenario where the 3D model to register is not an exact description of the measured image data. This results in finding the best 2D–3D registration, given the complexity of having both 2D deforming data and a coarse description of the image observations. The method acts in two distinct phases. First, an affine step computes a factorization for both the 2D image data and the 3D model using a joint subspace decomposition. This initial solution is then upgraded by finding the best projection to the image plane complying with the metric constraints given by a scaled orthographic camera. Both steps are computed efficiently in closed-form with the additional feature of being robust to degenerate motions which may possibly affect the 2D image data (i.e. lack of relevant rigid motion). Moreover, we present an extension of the approach for the case of missing image data. Synthetic and real experiments show the robustness of the method in registration tasks such as pose estimation of a talking face using a single 3D model.

Alessio Del Bue

Local Occlusion Detection under Deformations Using Topological Invariants

Occlusions provide critical cues about the 3D structure of man-made and natural scenes. We present a mathematical framework and algorithm to detect and localize occlusions in image sequences of scenes that include deforming objects. Our occlusion detector works under far weaker assumptions than other detectors. We prove that occlusions in deforming scenes occur when certain well-defined local topological invariants are not preserved. Our framework employs these invariants to detect occlusions with a zero false positive rate under assumptions of bounded deformations and color variation. The novelty and strength of this methodology is that it does not rely on spatio-temporal derivatives or matching, which can be problematic in scenes including deforming objects, but is instead based on a mathematical representation of the underlying cause of occlusions in a deforming 3D scene. We demonstrate the effectiveness of the occlusion detector using image sequences of natural scenes, including deforming cloth and hand motions.

Edgar Lobaton, Ram Vasudevan, Ruzena Bajcsy, Ron Alterovitz

2.5D Dual Contouring: A Robust Approach to Creating Building Models from Aerial LiDAR Point Clouds

We present a robust approach to creating 2.5D building models from aerial LiDAR point clouds. The method is guaranteed to produce crack-free models composed of complex roofs and vertical walls connecting them. By extending classic dual contouring into a 2.5D method, we achieve a simultaneous optimization over the three dimensional surfaces and the two dimensional boundaries of roof layers. Thus, our method can generate building models with arbitrarily shaped roofs while keeping the verticality of connecting walls. An adaptive grid is introduced to simplify model geometry in an accurate manner. Sharp features are detected and preserved by a novel and efficient algorithm.

Qian-Yi Zhou, Ulrich Neumann

Analytical Forward Projection for Axial Non-central Dioptric and Catadioptric Cameras

We present a technique for modeling non-central catadioptric cameras consisting of a perspective camera and a rotationally symmetric conic reflector. While previous approaches use a central approximation and/or iterative methods for forward projection, we present an analytical solution. This allows computation of the optical path from a given 3D point to the given viewpoint by solving a 6th degree forward projection equation for general conic mirrors. For a spherical mirror, the forward projection reduces to a 4th degree equation, resulting in a closed form solution. We also derive the forward projection equation for imaging through a refractive sphere (non-central dioptric camera) and show that it is a 10th degree equation. While central catadioptric cameras lead to conic epipolar curves, we show the existence of a quartic epipolar curve for catadioptric systems using a spherical mirror. The analytical forward projection leads to accurate and fast 3D reconstruction via bundle adjustment. Simulations and real results on single image sparse 3D reconstruction are presented. We demonstrate a speed-up of about 100 times using the analytical solution over iterative forward projection for 3D reconstruction using spherical mirrors.

Amit Agrawal, Yuichi Taguchi, Srikumar Ramalingam

5D Motion Subspaces for Planar Motions

In practice, rigid objects often move on a plane. The object then rotates around a fixed axis and translates in a plane orthogonal to this axis. For a concrete example, think of a car moving on a street. Given multiple static affine cameras which observe such a rigidly moving object and track feature points located on this object, what can be said about the resulting feature point trajectories in the camera views? Are there any useful algebraic constraints hidden in the data? Is a 3D reconstruction of the scene possible even if there are no feature point correspondences between the different cameras? And if so, how many points are sufficient? Does a closed-form solution to this shape from motion reconstruction problem exist?

This paper addresses these questions and thereby introduces the concept of 5 dimensional planar motion subspaces: the trajectory of a feature point seen by any camera is restricted to lie in a 5D subspace. The constraints provided by these motion subspaces enable a closed-form solution for the reconstruction. The solution is based on multilinear analysis, matrix and tensor factorizations. As a key insight, the paper shows that already two points are sufficient to derive a closed-form solution. Hence, even two cameras where each of them is just tracking one single point can be handled. Promising results of a real data sequence act as a proof of concept of the presented insights.

Roland Angst, Marc Pollefeys
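
One way to see where the number five in the abstract above comes from is the short derivation below. It is a sketch under assumed notation, not the paper's own formulation: the rotation axis is taken as the z-axis, $\theta_f$ is the rotation angle at frame $f$, $(a_f, b_f, 0)$ is the in-plane translation, $X = (X_1, X_2, X_3)^\top$ is a point on the object, and $m = (m_1, m_2, m_3)^\top$ with offset $c$ is one row of an affine camera.

```latex
\begin{align*}
u_f &= m^\top \bigl( R_z(\theta_f)\, X + (a_f,\, b_f,\, 0)^\top \bigr) + c \\
    &= (m_1 X_1 + m_2 X_2)\cos\theta_f
     + (m_2 X_1 - m_1 X_2)\sin\theta_f
     + m_1 a_f + m_2 b_f + (m_3 X_3 + c).
\end{align*}
% Stacking u_f over all F frames gives a vector in
%   span{ (cos θ_f)_f, (sin θ_f)_f, (a_f)_f, (b_f)_f, (1)_f },
% i.e. a subspace of dimension at most five that depends only on the motion.
```

The five spanning vectors depend on the motion alone, while the point and the camera only enter through the coefficients, which is consistent with the statement that every trajectory seen by any camera lies in the same 5D subspace.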

3D Reconstruction of a Moving Point from a Series of 2D Projections

This paper presents a linear solution for reconstructing the 3D trajectory of a moving point from its correspondence in a collection of 2D perspective images, given the 3D spatial pose and time of capture of the cameras that produced each image. Triangulation-based solutions do not apply, as multiple views of the point may not exist at each instant in time. A geometric analysis of the problem is presented and a criterion, called reconstructibility, is defined to precisely characterize the cases when reconstruction is possible, and how accurate it can be. We apply the linear reconstruction algorithm to reconstruct the time evolving 3D structure of several real-world scenes, given a collection of non-coincidental 2D images.

Hyun Soo Park, Takaaki Shiratori, Iain Matthews, Yaser Sheikh

Spotlights and Posters T2

Manifold Learning for Object Tracking with Multiple Motion Dynamics

This paper presents a novel manifold learning approach for high dimensional data, with emphasis on the problem of motion tracking in video sequences. In this problem, the samples are time-ordered, providing additional information that most current methods do not take advantage of. Additionally, most methods assume that the manifold topology admits a single chart, which is overly restrictive. Instead, the algorithm can deal with arbitrary manifold topology by decomposing the manifold into multiple local models that are combined in a probabilistic fashion using Gaussian process regression. Thus, the algorithm is termed herein as Gaussian Process Multiple Local Models (GP–MLM). Additionally, the paper describes a multiple filter architecture where standard filtering techniques, e.g. particle and Kalman filtering, are combined with the output of GP–MLM in a principled way. The performance of this approach is illustrated with experimental results using real video sequences. A comparison with GP–LVM [29] is also provided. Our algorithm achieves competitive state-of-the-art results on a public database concerning the left ventricle (LV) ultrasound (US) and lips images.

Jacinto C. Nascimento, Jorge G. Silva

Detection and Tracking of Large Number of Targets in Wide Area Surveillance

In this paper, we tackle the problem of object detection and tracking in a new and challenging domain of wide area surveillance. This problem poses several challenges: large camera motion, strong parallax, large number of moving objects, small number of pixels on target, single channel data and low framerate of video. We propose a method that overcomes these challenges and evaluate it on the CLIF dataset. We use median background modeling, which requires few frames to obtain a workable model. We remove false detections due to parallax and registration errors using gradient information of the background image. In order to keep the complexity of the tracking problem manageable, we divide the scene into grid cells, solve the tracking problem optimally within each cell using bipartite graph matching and then link tracks across cells. Besides tractability, grid cells allow us to define a set of local scene constraints such as road orientation and object context. We use these constraints as part of the cost function to solve the tracking problem, which allows us to track fast-moving objects in low framerate videos. In addition, we manually generated ground truth for four sequences and performed a quantitative evaluation of the proposed algorithm.

Vladimir Reilly, Haroon Idrees, Mubarak Shah
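
Two of the building blocks named in the abstract above, median background modeling and optimal matching inside a grid cell, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the threshold, the squared-distance cost, and the use of exhaustive search over permutations as a stand-in for a proper bipartite matching solver. The abstract's scene constraints (road orientation, object context) and the linking of tracks across cells are not modeled.

```python
from itertools import permutations
from statistics import median


def median_background(frames):
    """Per-pixel median over a short stack of registered grayscale frames."""
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[median(f[r][c] for f in frames) for c in range(cols)]
            for r in range(rows)]


def moving_pixels(frame, background, threshold=30):
    """Boolean mask of pixels that differ strongly from the background."""
    return [[abs(p - b) > threshold for p, b in zip(frow, brow)]
            for frow, brow in zip(frame, background)]


def match_in_cell(tracks, detections):
    """Minimum-cost track-to-detection assignment inside one grid cell.

    Exhaustive search over permutations, so only sensible for a handful of
    objects per cell. Assumes len(detections) >= len(tracks).
    """
    def cost(t, d):
        return (t[0] - d[0]) ** 2 + (t[1] - d[1]) ** 2

    best = min(
        permutations(range(len(detections)), len(tracks)),
        key=lambda perm: sum(cost(t, detections[j])
                             for t, j in zip(tracks, perm)),
    )
    return list(best)


# Three tiny frames with one bright object moving left to right.
frames = [
    [[200, 10, 10], [10, 10, 10]],
    [[10, 200, 10], [10, 10, 10]],
    [[10, 10, 200], [10, 10, 10]],
]
background = median_background(frames)
mask = moving_pixels(frames[1], background)

# Two existing tracks, three candidate detections in the same cell.
assignment = match_in_cell([(0, 0), (10, 10)], [(9, 11), (1, 0), (50, 50)])
```

Here `assignment[i]` is the index of the detection assigned to track `i`. With only three frames the median already suppresses the moving object, which mirrors the abstract's remark that few frames are needed for a workable model.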

Discriminative Tracking by Metric Learning

We present a discriminative model that casts appearance modeling and visual matching into a single objective for visual tracking. Most previous discriminative models for visual tracking are formulated as supervised learning of binary classifiers. The continuous output of the classification function is then utilized as the cost function for visual tracking. This may be less desirable since the function is optimized for making binary decisions. Such a learning objective may prevent the model from capturing the manifold structure of the discriminative appearances well. In contrast, our unified formulation is based on a principled metric learning framework, which seeks a discriminative embedding for appearance modeling. In our formulation, both appearance modeling and visual matching are performed online by efficient gradient based optimization. Our formulation is also able to deal with multiple targets, where the exclusive principle is naturally reinforced to handle occlusions. Its efficacy is validated in a wide variety of challenging videos. It is shown that our algorithm achieves more persistent results when compared with previous appearance model based tracking algorithms.

Xiaoyu Wang, Gang Hua, Tony X. Han

Memory-Based Particle Filter for Tracking Objects with Large Variation in Pose and Appearance

A novel memory-based particle filter is proposed to achieve robust visual tracking of a target’s pose even with large variations in target’s position and rotation, i.e. large appearance changes. The memory-based particle filter (M-PF) is a recent extension of the particle filter, and incorporates a memory-based mechanism to predict prior distribution using past memory of target state sequence; it offers robust target tracking against complex motion. This paper extends the M-PF to a unified probabilistic framework for joint estimation of the target’s pose and appearance based on memory-based joint prior prediction using stored past pose and appearance sequences. We call it the Memory-based Particle Filter with Appearance Prediction (M-PFAP). A memory-based approach enables generating the joint prior distribution of pose and appearance without explicit modeling of the complex relationship between them. M-PFAP can robustly handle the large changes in appearance caused by large pose variation, in addition to abrupt changes in moving direction; it allows robust tracking under self and mutual occlusion. Experiments confirm that M-PFAP successfully tracks human faces from frontal view to profile view; it greatly eases the limitations of M-PF.

Dan Mikami, Kazuhiro Otsuka, Junji Yamato

3D Deformable Face Tracking with a Commodity Depth Camera

Recently, there has been an increasing number of depth cameras available at commodity prices. These cameras can usually capture both color and depth images in real-time, with limited resolution and accuracy. In this paper, we study the problem of 3D deformable face tracking with such commodity depth cameras. A regularized maximum likelihood deformable model fitting (DMF) algorithm is developed, with special emphasis on handling the noisy input depth data. In particular, we present a maximum likelihood solution that can accommodate sensor noise represented by an arbitrary covariance matrix, which allows more elaborate modeling of the sensor’s accuracy. Furthermore, an ℓ₁ regularization scheme is proposed based on the semantics of the deformable face model, which is shown to be very effective in improving the tracking results. To track facial movement in subsequent frames, feature points in the texture images are matched across frames and integrated into the DMF framework seamlessly. The effectiveness of the proposed method is demonstrated with multiple sequences with ground truth information.

Qin Cai, David Gallup, Cha Zhang, Zhengyou Zhang

Human Attributes from 3D Pose Tracking

We show that, from the output of a simple 3D human pose tracker, one can infer physical attributes (e.g., gender and weight) and aspects of mental state (e.g., happiness or sadness). This task is useful for man-machine communication, and it provides a natural benchmark for evaluating the performance of 3D pose tracking methods (vs. conventional Euclidean joint error metrics). Based on an extensive corpus of motion capture data, with physical and perceptual ground truth, we analyze the inference of subtle biologically-inspired attributes from cyclic gait data. It is shown that inference is also possible with partial observations of the body, and with motions as short as a single gait cycle. Learning models from small amounts of noisy video pose data is, however, prone to over-fitting. To mitigate this we formulate learning in terms of domain adaptation, for which mocap data is used to regularize models for inference from video-based data.

Leonid Sigal, David J. Fleet, Nikolaus F. Troje, Micha Livne

Discriminative Nonorthogonal Binary Subspace Tracking

Visual tracking is one of the central problems in computer vision. A crucial problem of tracking is how to represent the object. Traditional appearance-based trackers are using increasingly more complex features in order to be robust. However, complex representations typically will not only require more computation for feature extraction, but also make the state inference complicated. In this paper, we show that with a careful feature selection scheme, extremely simple yet discriminative features can be used for robust object tracking. The central component of the proposed method is a succinct and discriminative representation of image template using discriminative non-orthogonal binary subspace spanned by Haar-like features. These Haar-like bases are selected from the over-complete dictionary using a variation of the OOMP (optimized orthogonal matching pursuit). Such a representation inherits the merits of original NBS in that it can be used to efficiently describe the object. It also incorporates the discriminative information to distinguish the foreground and background. We apply the discriminative NBS to object tracking through SSD-based template matching. An update scheme of the discriminative NBS is devised in order to accommodate object appearance changes. We validate the effectiveness of our method through extensive experiments on challenging videos and demonstrate its capability to track objects in clutter and moving background.

Ang Li, Feng Tang, Yanwen Guo, Hai Tao

TriangleFlow: Optical Flow with Triangulation-Based Higher-Order Likelihoods

We use a simple yet powerful higher-order conditional random field (CRF) to model optical flow. It consists of a standard photo-consistency cost and a prior on affine motions both modeled in terms of higher-order potential functions. Reasoning jointly over a large set of unknown variables provides more reliable motion estimates and a robust matching criterion. One of the main contributions is that unlike previous region-based methods, we omit the assumption of constant flow. Instead, we consider local affine warps whose likelihood energy can be computed exactly without approximations. This results in a tractable, so-called, higher-order likelihood function. We realize this idea by employing triangulation meshes which immensely reduce the complexity of the problem. Optimization is performed by hierarchical fusion moves and an adaptive mesh refinement strategy. Experiments show that we achieve high-quality motion fields on several data sets including the Middlebury optical flow database.

Ben Glocker, T. Hauke Heibel, Nassir Navab, Pushmeet Kohli, Carsten Rother

Articulation-Invariant Representation of Non-planar Shapes

Given a set of points corresponding to a 2D projection of a non-planar shape, we would like to obtain a representation invariant to articulations (under no self-occlusions). It is a challenging problem since we need to account for the changes in 2D shape due to 3D articulations, viewpoint variations, as well as the varying effects of imaging process on different regions of the shape due to its non-planarity. By modeling an articulating shape as a combination of approximate convex parts connected by non-convex junctions, we propose to preserve distances between a pair of points by (i) estimating the parts of the shape through approximate convex decomposition, by introducing a robust measure of convexity and (ii) performing part-wise affine normalization by assuming a weak perspective camera model, and then relating the points using the inner distance which is insensitive to planar articulations. We demonstrate the effectiveness of our representation on a dataset with non-planar articulations, and on standard shape retrieval datasets like MPEG-7.

Raghuraman Gopalan, Pavan Turaga, Rama Chellappa

Inferring 3D Shapes and Deformations from Single Views

In this paper we propose a probabilistic framework that models shape variations and infers dense and detailed 3D shapes from a single silhouette. We model two types of shape variations, the object phenotype variation and its pose variation using two independent Gaussian Process Latent Variable Models (GPLVMs) respectively. The proposed shape variation models are learnt from 3D samples without prior knowledge about object class, e.g. object parts and skeletons, and are combined to fully span the 3D shape space. A novel probabilistic inference algorithm for 3D shape estimation is proposed by maximum likelihood estimates of the GPLVM latent variables and the camera parameters that best fit generated 3D shapes to given silhouettes. The proposed inference involves a small number of latent variables and it is computationally efficient. Experiments on both human body and shark data demonstrate the efficacy of our new approach.

Yu Chen, Tae-Kyun Kim, Roberto Cipolla

Efficient Inference with Multiple Heterogeneous Part Detectors for Human Pose Estimation

We address the problem of estimating human pose in a single image using a part based approach. Pose accuracy is directly affected by the accuracy of the part detectors, but more accurate detectors are likely to be also more computationally expensive. We propose to use multiple, heterogeneous part detectors with varying accuracy and computation requirements, ordered in a hierarchy, to achieve more accurate and efficient pose estimation. For inference, we propose an algorithm to localize articulated objects by exploiting an ordered hierarchy of detectors with increasing accuracy. The inference uses a branch and bound method to search for each part and uses kinematics from neighboring parts to guide the branching behavior and compute bounds on the best part estimate. We demonstrate our approach on a publicly available People dataset and outperform the state-of-the-art methods. Our inference is 3 times faster than one based on using a single, highly accurate detector.

Vivek Kumar Singh, Ram Nevatia, Chang Huang

Co-transduction for Shape Retrieval

In this paper, we propose a new shape/object retrieval algorithm, co-transduction. The performance of a retrieval system is critically decided by the accuracy of adopted similarity measures (distances or metrics). Different types of measures may focus on different aspects of the objects: e.g. measures computed based on contours and skeletons are often complementary to each other. Our goal is to develop an algorithm to fuse different similarity measures for robust shape retrieval through a semi-supervised learning framework. We name our method co-transduction which is inspired by the co-training algorithm [1]. Given two similarity measures and a query shape, the algorithm iteratively retrieves the most similar shapes using one measure and assigns them to a pool for the other measure to do a re-ranking, and vice-versa. Using co-transduction, we achieved a significantly improved result of 97.72% on the MPEG-7 dataset [2] over the state-of-the-art performances (91% in [3], 93.4% in [4]). Our algorithm is general and it works directly on any given similarity measures/metrics; it is not limited to object shape retrieval and can be applied to other tasks for ranking/retrieval.

Xiang Bai, Bo Wang, Xinggang Wang, Wenyu Liu, Zhuowen Tu
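
The alternating structure described in the abstract above (retrieve with one measure, hand the results to the other measure's pool, and vice versa) can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' algorithm: the abstract does not specify the re-ranking rule, so the sketch scores each shape by its similarity to the closest member of the pool, and the function names, `rounds`, `top_p`, and the final sum of the two pooled scores are all hypothetical choices.

```python
def pooled_score(sim, pool, i):
    """Similarity of shape i to its closest pool member (excluding itself)."""
    return max(sim[s][i] for s in pool if s != i)


def top_new(sim, pool, n_shapes, top_p):
    """The top_p shapes outside the pool, ranked by pooled similarity."""
    candidates = [i for i in range(n_shapes) if i not in pool]
    candidates.sort(key=lambda i: -pooled_score(sim, pool, i))
    return candidates[:top_p]


def co_transduction_sketch(sim_a, sim_b, query, rounds=2, top_p=1):
    """Alternate between two similarity matrices, each feeding the other's pool."""
    n = len(sim_a)
    pool_a, pool_b = {query}, {query}
    for _ in range(rounds):
        # Measure A retrieves; its results seed the pool used by measure B.
        pool_b.update(top_new(sim_a, pool_a, n, top_p))
        # Measure B re-ranks with the enlarged pool and seeds measure A.
        pool_a.update(top_new(sim_b, pool_b, n, top_p))
    others = [i for i in range(n) if i != query]
    return sorted(
        others,
        key=lambda i: -(pooled_score(sim_a, pool_a, i)
                        + pooled_score(sim_b, pool_b, i)),
    )


# Shape 1 resembles the query (0) only under A; shape 2 resembles shape 1
# only under B; shape 3 is unrelated to everything.
sim_a = [[1.0, 0.9, 0.1, 0.2],
         [0.9, 1.0, 0.1, 0.2],
         [0.1, 0.1, 1.0, 0.2],
         [0.2, 0.2, 0.2, 1.0]]
sim_b = [[1.0, 0.1, 0.1, 0.2],
         [0.1, 1.0, 0.9, 0.2],
         [0.1, 0.9, 1.0, 0.2],
         [0.2, 0.2, 0.2, 1.0]]

baseline = sorted([1, 2, 3], key=lambda i: -sim_a[0][i])  # measure A alone
ranking = co_transduction_sketch(sim_a, sim_b, query=0)
```

In this toy data, measure A alone places the unrelated shape 3 ahead of shape 2, while the alternating scheme reaches shape 2 through shape 1 and ranks it above shape 3, which is the kind of complementarity between measures that the abstract appeals to.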

Learning Shape Detector by Quantizing Curve Segments with Multiple Distance Metrics

In this paper, we propose a very efficient method to learn shape models using local curve segments with multiple types of distance metrics. Our learning approach includes two key steps: feature generation and model pursuit. In the first step, for each category, we first extract a massive number of local “prototype” curve segments from a few roughly aligned shape instances. Then we quantize these curve segments with three types of distance metrics corresponding to different shape deformations. In each metric space, the quantized curve segments are further grown (spanned) into a large number of ball-like manifolds, each of which represents an equivalence class of shape variance. In the second step of shape model pursuit, using these manifolds as features, we propose a fast greedy learning algorithm based on the information projection principle. The algorithm is guided by a generative model, and stepwise selects the features that have maximum information gain. The advantages of the proposed method are demonstrated on several public datasets and summarized as follows. (1) Our models consisting of local curve segments with multiple distance metrics are robust to the various shape deformations, and thus enable us to perform robust shape classification and detect shapes against background clutter. (2) The auto-generated curve-based features are very general and convenient, avoiding the need to design specific features for each category.

Ping Luo, Liang Lin, Hongyang Chao

Unique Signatures of Histograms for Local Surface Description

This paper deals with local 3D descriptors for surface matching. First, we categorize existing methods into two classes: Signatures and Histograms. Then, by discussion and experiments alike, we point out the key issues of uniqueness and repeatability of the local reference frame. Based on these observations, we formulate a novel comprehensive proposal for surface representation, which encompasses a new unique and repeatable local reference frame as well as a new 3D descriptor. The latter lies at the intersection between Signatures and Histograms, so as to possibly achieve a better balance between descriptiveness and robustness. Experiments on publicly available datasets as well as on range scans obtained with Spacetime Stereo provide a thorough validation of our proposal.

Federico Tombari, Samuele Salti, Luigi Di Stefano

Exploring Ambiguities for Monocular Non-rigid Shape Estimation

Recovering the 3D shape of deformable surfaces from single images is difficult because many different shapes have very similar projections. This is commonly addressed by restricting the set of possible shapes to linear combinations of deformation modes and by imposing additional geometric constraints. Unfortunately, because image measurements are noisy, such constraints do not always guarantee that the correct shape will be recovered. To overcome this limitation, we introduce an efficient approach to exploring the set of solutions of an objective function based on point-correspondences and to proposing a small set of candidate 3D shapes. This allows the use of additional image information to choose the best one. As a proof of concept, we use either motion or shading cues to this end and show that we can handle a complex objective function without having to solve a difficult non-linear minimization problem.

Francesc Moreno-Noguer, Josep M. Porta, Pascal Fua

Efficient Computation of Scale-Space Features for Deformable Shape Correspondences

With the rapid development of fast data acquisition techniques, 3D scans that record the geometric and photometric information of deformable objects are routinely acquired nowadays. To track surfaces in temporal domain or stitch partially-overlapping scans to form a complete model in spatial domain, robust and efficient feature detection for deformable shape correspondences, as an enabling method, becomes fundamentally critical with pressing needs. In this paper, we propose an efficient method to extract local features in scale spaces of both texture and geometry for deformable shape correspondences. We first build a hierarchical scale space on surface geometry based on geodesic metric, and the pyramid representation of surface geometry naturally engenders the rapid computation of scale-space features. Analogous to the SIFT, our features are found as local extrema in the scale space. We then propose a new feature descriptor for deformable surfaces, which is a gradient histogram within a local region computed by a local parameterization. Both the detector and the descriptor are invariant to isometric deformation, which makes our method a powerful tool for deformable shape correspondences. The performance of the proposed method is evaluated by feature matching on a sequence of deforming surfaces with ground truth correspondences.

Tingbo Hou, Hong Qin

Intrinsic Regularity Detection in 3D Geometry

Automatic detection of symmetries, regularity, and repetitive structures in 3D geometry is a fundamental problem in shape analysis and pattern recognition with applications in computer vision and graphics. Especially challenging is to detect intrinsic regularity, where the repetitions are on an intrinsic grid, without any apparent Euclidean pattern to describe the shape, but arising out of (near) isometric deformation of the underlying surface. In this paper, we employ multidimensional scaling to reduce the problem of intrinsic structure detection to a simpler problem of 2D grid detection. Potential 2D grids are then identified using an autocorrelation analysis, refined using local fitting, validated, and finally projected back to the spatial domain. We test the detection algorithm on a variety of scanned plaster models in the presence of imperfections like missing data, noise and outliers. We also present a range of applications including scan completion, shape editing, super-resolution, and structural correspondence.

Niloy J. Mitra, Alex Bronstein, Michael Bronstein

Balancing Deformability and Discriminability for Shape Matching

We propose a novel framework, aspect space, to balance deformability and discriminability, which are often two competing factors in shape and image representations. In this framework, an object is embedded as a surface in a higher dimensional space with a parameter named aspect weight, which controls the importance of intensity in the embedding. We show that this framework naturally unifies existing important shape and image representations by adjusting the aspect weight and the embedding. More importantly, we find that the aspect weight implicitly controls the degree to which a representation handles deformation. Based on this idea, we present the aspect shape context, which extends shape context-based descriptors and adaptively selects the “best” aspect weight for shape comparison. We also observe that the proposed descriptor nicely fits context-sensitive shape retrieval. The proposed methods are evaluated on two public datasets, MPEG7-CE-Shape-1 and Tari 1000, in comparison to state-of-the-art solutions. In the standard shape retrieval experiment using the MPEG7 CE-Shape-1 database, the new descriptor with context information achieves a bull’s eye score of 95.96%, which surpasses all previous results. On the Tari 1000 dataset, our methods significantly outperform previously tested methods as well.

Haibin Ling, Xingwei Yang, Longin Jan Latecki

2D Action Recognition Serves 3D Human Pose Estimation

3D human pose estimation in multi-view settings benefits from embeddings of human actions in low-dimensional manifolds, but the complexity of the embeddings increases with the number of actions. Creating separate, action-specific manifolds seems to be a more practical solution. Using multiple manifolds for pose estimation, however, requires a joint optimization over the set of manifolds and the human pose embedded in the manifolds. In order to solve this problem, we propose a particle-based optimization algorithm that can efficiently estimate human pose even in challenging in-house scenarios. In addition, the algorithm can directly integrate the results of a 2D action recognition system as prior distribution for optimization. In our experiments, we demonstrate that the optimization handles an 84D search space and provides already competitive results on HumanEva with as few as 25 particles.

Juergen Gall, Angela Yao, Luc Van Gool

A Streakline Representation of Flow in Crowded Scenes

Based on the Lagrangian framework for fluid dynamics, a streakline representation of flow is presented to solve computer vision problems involving crowd and traffic flow. Streaklines are traced in a fluid flow by injecting color material, such as smoke or dye, which is transported with the flow and used for visualization. In the context of computer vision, streaklines may be used in a similar way to transport information about a scene, and they are obtained by repeatedly initializing a fixed grid of particles at each frame, then moving both current and past particles using optical flow. Streaklines are the locus of points that connect particles which originated from the same initial position. In this paper, a streakline technique is developed to compute several important aspects of a scene, such as flow and potential functions using the Helmholtz decomposition theorem. This leads to a representation of the flow that more accurately recognizes spatial and temporal changes in the scene, compared with other commonly used flow representations. Applications of the technique to segmentation and behavior analysis provide comparison to previously employed techniques, showing that the streakline method outperforms the state-of-the-art in segmentation, and opening a new domain of application for crowd analysis based on potentials.
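
The particle bookkeeping described above (release a fixed grid of particles at every frame, then move current and past particles with the flow) can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the `flow(frame, point)` callback, the seed list and the toy `uniform_flow` are assumptions standing in for a real optical-flow field.

```python
from typing import Callable, Dict, List, Tuple

Point = Tuple[float, float]
# Hypothetical flow interface: flow(frame_index, point) -> displacement (dx, dy).
Flow = Callable[[int, Point], Point]


def streaklines(seeds: List[Point], flow: Flow, num_frames: int) -> Dict[Point, List[Point]]:
    """Return, for each seed, the current positions of all particles released from it."""
    lines: Dict[Point, List[Point]] = {seed: [] for seed in seeds}
    for t in range(num_frames):
        for seed in seeds:
            # Re-initialize the fixed grid: release a fresh particle at the seed.
            lines[seed].append(seed)
            # Advect both the new and all past particles with the flow at frame t.
            moved = []
            for (x, y) in lines[seed]:
                dx, dy = flow(t, (x, y))
                moved.append((x + dx, y + dy))
            lines[seed] = moved
    return lines


def uniform_flow(t: int, p: Point) -> Point:
    # Toy stand-in for optical flow: everything drifts one pixel to the right.
    return (1.0, 0.0)


result = streaklines([(0.0, 0.0), (0.0, 2.0)], uniform_flow, num_frames=3)
```

Each list in `result` holds the points of one streakline, oldest particle first. In a steady flow such as `uniform_flow` the streakline coincides with the path of a single particle; the two differ once the flow changes over time.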

Ramin Mehran, Brian E. Moore, Mubarak Shah

Fast Multi-aspect 2D Human Detection

We address the problem of detecting human figures in images, taking into account that the image of the human figure may be taken from a range of viewpoints. We capture the geometric deformations of the 2D human figure using an extension of the Common Factor Model (CFM) of Lan and Huttenlocher. The key contribution of the paper is an improved iterative message passing inference algorithm that runs faster than the original CFM algorithm. This is based on the insight that messages created using the distance transform are shift invariant and therefore messages can be created once and then shifted for subsequent iterations. Since shifting (O(1) complexity) is faster than computing a distance transform (O(n) complexity), a significant speedup is observed in the experiments. We demonstrate the effectiveness of the new model for the human parsing problem using the Iterative Parsing data set and results are competitive with the state of the art detection algorithm of Andriluka, et al.
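
The shift-invariance property mentioned above can be checked on a toy 1D example. The sketch below assumes a quadratic deformation cost and uses a brute-force, quadratic-time distance transform purely for readability. It is not the linear-time transform or the inference algorithm of the paper, and the cost values are made up.

```python
import math
from typing import List

INF = math.inf


def distance_transform(cost: List[float]) -> List[float]:
    """Brute-force 1D generalized distance transform: D[p] = min_q cost[q] + (p - q)^2."""
    n = len(cost)
    return [min(cost[q] + (p - q) ** 2 for q in range(n)) for p in range(n)]


# Toy unary cost; +inf marks cells that carry no evidence.
cost = [INF, INF, 4.0, 0.0, 3.0, INF, INF, INF, INF, INF]
s = 2  # shift applied to the input, in cells

# Message computed once and cached.
cached = distance_transform(cost)

# Expensive route: shift the input and recompute the transform from scratch.
shifted_cost = [INF] * s + cost[:-s]
recomputed = distance_transform(shifted_cost)

# Cheap route: read the cached message with an index offset.
matches = all(recomputed[p] == cached[p - s] for p in range(s, len(cost)))
```

Here `matches` is true because shifting the input by `s` cells shifts the transform by the same amount, so the cached message can be reused with an offset instead of being recomputed.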

Tai-Peng Tian, Stan Sclaroff

Deterministic 3D Human Pose Estimation Using Rigid Structure

This paper explores a method, first proposed by Wei and Chai [1], for estimating 3D human pose from several frames of uncalibrated 2D point correspondences containing projected body joint locations. In their work Wei and Chai boldly claimed that, through the introduction of rigid constraints to the torso and hip, camera scales, bone lengths and absolute depths could be estimated from a finite number of frames (i.e. ≥ 5). In this paper we show this claim to be false, demonstrating in principle one can never estimate these parameters in a finite number of frames. Further, we demonstrate their approach is only valid for rigid sub-structures of the body (e.g. torso). Based on this analysis we propose a novel approach using deterministic structure from motion based on assumptions of rigidity in the body’s torso. Our approach provides notably more accurate estimates and is substantially faster than Wei and Chai’s approach, and unlike the original, can be solved as a deterministic least-squares problem.

Jack Valmadre, Simon Lucey

Robust Fusion: Extreme Value Theory for Recognition Score Normalization

Recognition problems in computer vision often benefit from a fusion of different algorithms and/or sensors, with score level fusion being among the most widely used fusion approaches. Choosing an appropriate score normalization technique before fusion is a fundamentally difficult problem because of the disparate nature of the underlying distributions of scores for different sources of data. Further complications are introduced when one or more fusion inputs outright fail or have adversarial inputs, which we find in the fields of biometrics and forgery detection. Ideally a score normalization should be robust to model assumptions, modeling errors, and parameter estimation errors, as well as robust to algorithm failure. In this paper, we introduce the w-score, a new technique for robust recognition score normalization. We do not assume a match or non-match distribution, but instead suggest that the top scores of a recognition system’s non-match scores follow the statistical Extreme Value Theory, and show how to use that to provide consistent robust normalization with a strong statistical basis.

Walter Scheirer, Anderson Rocha, Ross Micheals, Terrance Boult

Recognizing Partially Occluded Faces from a Single Sample Per Class Using String-Based Matching

Automatically recognizing human faces with partial occlusions is one of the most challenging problems in the face analysis community. This paper presents a novel string-based face recognition approach to address the partial occlusion problem in face recognition. In this approach, a new face representation, Stringface, is constructed to integrate the relational organization of intermediate-level features (line segments) into a high-level global structure (string). The matching of two faces is done by matching two Stringfaces through a matching scheme, which is able to efficiently find the most discriminative local parts (substrings) for recognition without making any assumption on the distributions of the deformed facial regions. The proposed approach is compared against the state-of-the-art algorithms using both the AR database and FRGC (Face Recognition Grand Challenge) ver2.0 database. Very encouraging experimental results demonstrate, for the first time, the feasibility and effectiveness of a high-level syntactic method in face recognition, showing a new strategy for face representation and recognition.

Weiping Chen, Yongsheng Gao

Real-Time Spatiotemporal Stereo Matching Using the Dual-Cross-Bilateral Grid

We introduce a real-time stereo matching technique based on a reformulation of Yoon and Kweon’s adaptive support weights algorithm [1]. Our implementation uses the bilateral grid to achieve a speedup of 200× compared to a straightforward full-kernel GPU implementation, making it the fastest technique on the Middlebury website. We introduce a colour component into our greyscale approach to recover precision and increase discriminability. Using our implementation, we speed up spatial-depth superresolution 100×. We further present a spatiotemporal stereo matching approach based on our technique that incorporates temporal evidence in real time (> 14 fps). Our technique visibly reduces flickering and outperforms per-frame approaches in the presence of image noise. We have created five synthetic stereo videos, with ground truth disparity maps, to quantitatively evaluate depth estimation from stereo video. Source code and datasets are available on our project website.

Christian Richardt, Douglas Orr, Ian Davies, Antonio Criminisi, Neil A. Dodgson

Fast Multi-labelling for Stereo Matching

We describe a new fast algorithm for multi-labelling problems. In general, a multi-labelling problem is NP-hard. Widely used algorithms like α-expansion can reach a suboptimal result in a time linear in the number of the labels. In this paper, we propose an algorithm which can obtain results of comparable quality polynomially faster. We use the Divide and Conquer paradigm to separate the complexities induced by the label set and the variable set, and deal with each of them respectively. Such a mechanism improves the solution speed without depleting the memory resource, hence it is particularly valuable for applications where the variable set and the label set are both huge. Another merit of the proposed method is that the trade-off between quality and time efficiency can be varied through using different parameters. The advantage of our method is validated by experiments.

Yuhang Zhang, Richard Hartley, Lei Wang

Anisotropic Minimal Surfaces Integrating Photoconsistency and Normal Information for Multiview Stereo

In this work the weighted minimal surface model traditionally used in multiview stereo is revisited. We propose to generalize the classical photoconsistency-weighted minimal surface approach by means of an anisotropic metric which allows a specified surface orientation to be integrated into the optimization process. In contrast to the conventional isotropic case, where all spatial directions are treated equally, the anisotropic metric adaptively weights the regularization along different directions so as to favor certain surface orientations over others. We show that the proposed generalization preserves all properties and globality guarantees of continuous convex relaxation methods. We make use of a recently introduced efficient primal-dual algorithm to solve the arising saddle point problem. In multiple experiments on real image sequences we demonstrate that the proposed anisotropic generalization makes it possible to overcome oversmoothing of small-scale surface details, giving rise to more precise reconstructions.

Kalin Kolev, Thomas Pock, Daniel Cremers

An Efficient Graph Cut Algorithm for Computer Vision Problems

Graph cuts has emerged as a preferred method to solve a class of energy minimization problems in computer vision. It has been shown that graph cut algorithms designed keeping the structure of vision based flow graphs in mind are more efficient than known strongly polynomial time max-flow algorithms based on preflow push or shortest augmenting path paradigms [1]. We present here a new algorithm for graph cuts which not only exploits the structural properties inherent in image based grid graphs but also combines the basic paradigms of max-flow theory in a novel way. The algorithm has a strongly polynomial time bound. It has been bench-marked using samples from Middlebury [2] and UWO [3] database. It runs faster on all 2D samples and is at least two to three times faster on 70% of 2D and 3D samples in comparison to the algorithm reported in [1].

Chetan Arora, Subhashis Banerjee, Prem Kalra, S. N. Maheshwari

Non-Local Kernel Regression for Image and Video Restoration

This paper presents a non-local kernel regression (NL-KR) method for image and video restoration tasks, which exploits both the non-local self-similarity and local structural regularity in natural images. The non-local self-similarity is based on the observation that image patches tend to repeat themselves in natural images and videos; and the local structural regularity reveals that image patches have regular structures where accurate estimation of pixel values via regression is possible. Explicitly unifying both properties, the proposed non-local kernel regression framework is robust and applicable to various image and video restoration tasks. In this work, we are specifically interested in applying the NL-KR model to image and video super-resolution (SR) reconstruction. Extensive experimental results on both single images and realistic video sequences demonstrate the superiority of the proposed framework for SR tasks over previous works both qualitatively and quantitatively.

Haichao Zhang, Jianchao Yang, Yanning Zhang, Thomas S. Huang

A Spherical Harmonics Shape Model for Level Set Segmentation

We introduce a segmentation framework which combines and shares advantages of both an implicit surface representation and a parametric shape model based on spherical harmonics. Besides the elegant surface representation it also inherits the power and flexibility of variational level set methods with respect to the modeling of data terms. At the same time it provides all advantages of parametric shape models such as a sparse and multiscale shape representation. Additionally, we introduce a regularizer that helps to ensure a unique decomposition into spherical harmonics and thus the comparability of parameter values of multiple segmentations. We demonstrate the benefits of our method on medical and photometric data and present two possible extensions.

Maximilian Baust, Nassir Navab

A Model of Volumetric Shape for the Analysis of Longitudinal Alzheimer’s Disease Data

We develop a multi-scale model of shape based on a volumetric representation of solids in 3D space. A signed energy function (SEF) derived from the model is designed to quantify the magnitude of regional shape changes that correlate well with local shrinkage and expansion. The methodology is applied to the analysis of longitudinal morphological data representing hippocampal volumes extracted from one-year repeat magnetic resonance scans of the brain of 381 subjects collected by the Alzheimer’s Disease Neuroimaging Initiative. We first establish a strong correlation between the SEFs and hippocampal volume loss over a one-year period and then use SEFs to characterize specific regions where hippocampal atrophy over the one-year period differ significantly among groups of normal controls and subjects with mild cognitive impairment and Alzheimer’s disease.

Xinyang Liu, Xiuwen Liu, Yonggang Shi, Paul Thompson, Washington Mio

Fast Optimization for Mixture Prior Models

We consider the minimization of a smooth convex function regularized by a mixture of prior models. This problem is generally difficult to solve even when each simpler regularization problem is easy. In this paper, we present two algorithms to effectively solve it. First, the original problem is decomposed into multiple simpler subproblems. Then, these subproblems are efficiently solved by existing techniques in parallel. Finally, the result of the original problem is obtained from the weighted average of solutions of subproblems in an iterative framework. We successfully applied the proposed algorithms to compressed MR image reconstruction and low-rank tensor completion. Numerous experiments demonstrate the superior performance of the proposed algorithms in terms of both accuracy and computational complexity.

Junzhou Huang, Shaoting Zhang, Dimitris Metaxas

3D Point Correspondence by Minimum Description Length in Feature Space

Finding point correspondences plays an important role in automatically building statistical shape models from a training set of 3D surfaces. For the point correspondence problem, Davies et al. [1] proposed a minimum-description-length-based objective function to balance the training errors and generalization ability. A recent evaluation study [2] that compares several well-known 3D point correspondence methods for modeling purposes shows that the MDL-based approach [1] is the best method.

We adapt the MDL-based objective function for a feature space that can exploit nonlinear properties in point correspondences, and propose an efficient optimization method to minimize the objective function directly in the feature space, given that the inner product of any vector pair can be computed in the feature space. We further employ a Mercer kernel [3] to define the feature space implicitly. A key aspect of our proposed framework is the generalization of the MDL-based objective function to kernel principal component analysis (KPCA) [4] spaces and the design of a gradient-descent approach to minimize such an objective function. We compare the generalized MDL objective function on KPCA spaces with the original one and evaluate their abilities in terms of reconstruction errors and specificity. From our experimental results on different sets of 3D shapes of human body organs, the proposed method performs significantly better than the original method.

Jiun-Hung Chen, Ke Colin Zheng, Linda G. Shapiro

Making Action Recognition Robust to Occlusions and Viewpoint Changes

Most state-of-the-art approaches to action recognition rely on global representations either by concatenating local information in a long descriptor vector or by computing a single location independent histogram. This limits their performance in presence of occlusions and when running on multiple viewpoints. We propose a novel approach to providing robustness to both occlusions and viewpoint changes that yields significant improvements over existing techniques. At its heart is a local partitioning and hierarchical classification of the 3D Histogram of Oriented Gradients (HOG) descriptor to represent sequences of images that have been concatenated into a data volume. We achieve robustness to occlusions and viewpoint changes by combining training data from all viewpoints to train classifiers that estimate action labels independently over sets of HOG blocks. A top level classifier combines these local labels into a global action class decision.

Daniel Weinland, Mustafa Özuysal, Pascal Fua

Structured Output Ordinal Regression for Dynamic Facial Emotion Intensity Prediction

We consider the task of labeling facial emotion intensities in videos, where the emotion intensities to be predicted have ordinal scales (e.g., low, medium, and high) that change in time. A significant challenge is that the rates of increase and decrease differ substantially across subjects. Moreover, the actual absolute differences of intensity values carry little information, with their relative order being more important. To solve the intensity prediction problem we propose a new dynamic ranking model that models the signal intensity at each time as a label on an ordinal scale and links the temporally proximal labels using dynamic smoothness constraints. This new model extends the successful static ordinal regression to a structured (dynamic) setting by using an analogy with Conditional Random Field (CRF) models in structured classification. We show that, although non-convex, the new model can be accurately learned using efficient gradient search. The predictions resulting from this dynamic ranking model show significant improvements over the regular CRFs, which fail to consider ordinal relationships between predicted labels. We also observe substantial improvements over static ranking models that do not exploit temporal dependencies of ordinal predictions. We demonstrate the benefits of our algorithm on the Cohn-Kanade dataset for the dynamic facial emotion intensity prediction problem and illustrate its performance in a controlled synthetic setting.

Minyoung Kim, Vladimir Pavlovic

Image Features and Motion

Critical Nets and Beta-Stable Features for Image Matching

We propose new ideas and efficient algorithms towards bridging the gap between bag-of-features and constellation descriptors for image matching. Specifically, we show how to compute connections between local image features in the form of a critical net whose construction is repeatable across changes of viewing conditions or scene configuration. Arcs of the net provide a more reliable frame of reference than individual features do for the purpose of invariance. In addition, regions associated with either small stars or loops in the critical net can be used as parts for recognition or retrieval, and subgraphs of the critical net that are matched across images exhibit common structures shared by different images. We also introduce the notion of beta-stable features, a variation on the notion of feature lifetime from the literature of scale space. Our experiments show that arc-based SIFT-like descriptors of beta-stable features are more repeatable and more accurate than competing descriptors. We also provide anecdotal evidence of the usefulness of image parts and of the structures that are found to be common across images.

Steve Gu, Ying Zheng, Carlo Tomasi

Descriptor Learning for Efficient Retrieval

Many visual search and matching systems represent images using sparse sets of “visual words”: descriptors that have been quantized by assignment to the best-matching symbol in a discrete vocabulary. Errors in this quantization procedure propagate throughout the rest of the system, either harming performance or requiring correction using additional storage or processing. This paper aims to reduce these quantization errors at source, by learning a projection from descriptor space to a new Euclidean space in which standard clustering techniques are more likely to assign matching descriptors to the same cluster, and non-matching descriptors to different clusters.

To achieve this, we learn a non-linear transformation model by minimizing a novel margin-based cost function, which aims to separate matching descriptors from classes of non-matching descriptors. Training data is generated automatically by leveraging geometric consistency. Scalable, stochastic gradient methods are used for the optimization.

For the case of particular object retrieval, we demonstrate impressive gains in performance on a ground truth dataset: our learnt 32-D descriptor without spatial re-ranking outperforms a baseline method using 128-D SIFT descriptors with spatial re-ranking.
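
The quantization error that motivates this work can be seen in a toy example of hard assignment to visual words. The sketch below is only an illustration of that failure mode, not the learned projection proposed in the paper. The two-word vocabulary and the 2-D descriptors are invented values, and `nearest_word` is a hypothetical helper.

```python
from typing import List, Sequence


def nearest_word(descriptor: Sequence[float], vocabulary: List[Sequence[float]]) -> int:
    """Hard-assign a descriptor to the index of its closest vocabulary centre."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(vocabulary)), key=lambda i: sq_dist(descriptor, vocabulary[i]))


# Toy 2-D vocabulary with two visual words (illustrative values only).
vocabulary = [(0.0, 0.0), (1.0, 0.0)]

# Two descriptors of the same scene point that differ only by a little noise.
view_a = (0.49, 0.10)
view_b = (0.51, 0.10)

word_a = nearest_word(view_a, vocabulary)
word_b = nearest_word(view_b, vocabulary)
```

Although `view_a` and `view_b` are almost identical, they fall on opposite sides of the cell boundary and receive different words, so a bag-of-words system would treat them as non-matching.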

James Philbin, Michael Isard, Josef Sivic, Andrew Zisserman

Texture Regimes for Entropy-Based Multiscale Image Analysis

We present an approach to multiscale image analysis. It hinges on an operative definition of texture that involves a “small region”, where some (unknown) statistic is aggregated, and a “large region” within which it is stationary. At each point, multiple small and large regions co-exist at multiple scales, as image structures are pooled by the scaling and quantization process to form “textures” and then transitions between textures define again “structures.” We present a technique to learn and agglomerate sparse bases at multiple scales. To do so efficiently, we propose an analysis of cluster statistics after a clustering step is performed, and a new clustering method with linear-time performance. In both cases, we can infer all the “small” and “large” regions at multiple scales in one shot.

Sylvain Boltz, Frank Nielsen, Stefano Soatto

A High-Quality Video Denoising Algorithm Based on Reliable Motion Estimation

Although recent advances in sparse representations of images have achieved outstanding denoising results, removing real, structured noise in digital videos remains a challenging problem. We show the utility of reliable motion estimation to establish temporal correspondence across frames in order to achieve high-quality video denoising. In this paper, we propose an adaptive video denoising framework that integrates robust optical flow into a non-local means (NLM) framework with noise level estimation. The spatial regularization in optical flow is the key to ensuring temporal coherence in removing structured noise. Furthermore, we introduce approximate K-nearest neighbor matching to significantly reduce the complexity of classical NLM methods. Experimental results show that our system is comparable with the state of the art in removing AWGN, and significantly outperforms the state of the art in removing real, structured noise.

Ce Liu, William T. Freeman

An Oriented Flux Symmetry Based Active Contour Model for Three Dimensional Vessel Segmentation

This paper proposes a novel approach to segment three dimensional curvilinear structures, particularly vessels in angiography, by inspecting the symmetry of image gradients. The proposed method stresses the importance of simultaneously considering both the gradient symmetry with respect to the curvilinear structure center, and the gradient antisymmetry with respect to the object boundary. Measuring the image gradient symmetry remarkably suppresses the disturbance introduced by rapid intensity changes along curvilinear structures. Meanwhile, considering the image gradient antisymmetry helps locate the structure boundary. The gradient symmetry and the gradient antisymmetry are evaluated based on the notion of oriented flux. By utilizing the aforementioned gradient symmetry information, an active contour model is tailored to perform segmentation. On the one hand, by exploiting the symmetric image gradient pattern observed at structure centers, the contours expand along curvilinear structures even though there are intensity fluctuations along the structures. On the other hand, measuring the antisymmetry of the image gradient conveys strong detection responses to precisely drive contours to the structure boundaries, as well as avoiding contour leakages. The proposed method is capable of delivering promising segmentation results. This is validated in the experiments using synthetic data and real vascular images of different modalities, and through the comparison to two well founded and published methods for curvilinear structure segmentation.

Max W. K. Law, Albert C. S. Chung

Spotlights and Posters W1

MRF Inference by k-Fan Decomposition and Tight Lagrangian Relaxation

We present a novel dual decomposition approach to MAP inference with highly connected discrete graphical models. Decompositions into cyclic k-fan structured subproblems are shown to significantly tighten the Lagrangian relaxation relative to the standard local polytope relaxation, while enabling efficient integer programming for solving the subproblems. Additionally, we introduce modified update rules for maximizing the dual function that avoid oscillations and converge faster to an optimum of the relaxed problem, and never get stuck in non-optimal fixed points.

Jörg Hendrik Kappes, Stefan Schmidt, Christoph Schnörr

Randomized Locality Sensitive Vocabularies for Bag-of-Features Model

Visual vocabulary construction is an integral part of the popular Bag-of-Features (BOF) model. When visual data scale up (in terms of the dimensionality of features and/or the number of samples), most existing algorithms (e.g. k-means) become unfavorable due to the prohibitive time and space requirements. In this paper we propose the random locality sensitive vocabulary (RLSV) scheme towards efficient visual vocabulary construction in such scenarios. Integrating ideas from Locality Sensitive Hashing (LSH) and Random Forests (RF), RLSV generates and aggregates multiple visual vocabularies based on random projections, without any clustering or training effort. This simple scheme demonstrates superior time and space efficiency over prior methods, in both theory and practice, while often achieving comparable or even better performance. In addition, extensions to supervised and kernelized vocabulary constructions are also discussed and evaluated experimentally.

Yadong Mu, Ju Sun, Tony X. Han, Loong-Fah Cheong, Shuicheng Yan
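
As a rough illustration of the idea in the abstract above (several vocabularies built from random projections, with no clustering or training), the sketch below hashes each feature to a visual word by the signs of a few random projections and concatenates the resulting histograms. It is a minimal sketch under stated assumptions, not the authors' RLSV scheme: the sign-based hashing, the function names (`make_vocabulary`, `word_index`, `bag_of_features`), and all parameter values are hypothetical choices made for illustration.

```python
import random


def make_vocabulary(dim, n_bits, rng):
    """One hypothetical vocabulary: n_bits random Gaussian projection directions."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def word_index(feature, vocabulary):
    """Map a feature to a word index, one bit per projection (set when the dot product is >= 0)."""
    index = 0
    for direction in vocabulary:
        dot = sum(f * d for f, d in zip(feature, direction))
        index = (index << 1) | (1 if dot >= 0.0 else 0)
    return index


def bag_of_features(features, vocabularies):
    """Concatenate the per-vocabulary word histograms into one descriptor."""
    histogram = []
    for vocabulary in vocabularies:
        counts = [0] * (1 << len(vocabulary))
        for feature in features:
            counts[word_index(feature, vocabulary)] += 1
        histogram.extend(counts)
    return histogram


# Toy data (assumed): 10 random 4-dimensional features, 2 vocabularies of 2**3 words each.
rng = random.Random(0)
vocabularies = [make_vocabulary(dim=4, n_bits=3, rng=rng) for _ in range(2)]
features = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(10)]
descriptor = bag_of_features(features, vocabularies)
```

Building a vocabulary here costs only the generation of random directions, which is the property the abstract emphasizes; how the paper actually generates and aggregates its vocabularies is not specified in this listing.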

Image Categorization Using Directed Graphs

Most existing graph-based semi-supervised classification methods use pairwise similarities as edge weights of an undirected graph with images as the nodes of the graph. Several recent graph construction methods, however, produce directed graphs (asymmetric similarity between nodes). A simple symmetrization is often used to convert a directed graph to an undirected one. This, however, loses important structural information conveyed by asymmetric similarities. In this paper, we propose a novel symmetric co-linkage similarity which captures the essential relationship among the nodes in the directed graph. We apply this new co-linkage similarity in two important computer vision tasks for image categorization: object recognition and image annotation. Extensive empirical studies demonstrate the effectiveness of our method.

Hua Wang, Heng Huang, Chris Ding
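
To make the abstract's point about symmetrization concrete, the sketch below compares the simple symmetrization (A + Aᵀ)/2 with a symmetric score that counts shared out-links and shared in-links, A·Aᵀ + Aᵀ·A. The second score is only an illustrative stand-in in the spirit of bibliographic coupling and co-citation; it is an assumption, not the co-linkage similarity defined in the paper, and the toy graph and function names are hypothetical.

```python
def transpose(m):
    """Transpose a square matrix given as a list of rows."""
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def symmetrize(a):
    """Simple symmetrization: average each edge weight with its reverse."""
    at = transpose(a)
    n = len(a)
    return [[(a[i][j] + at[i][j]) / 2 for j in range(n)] for i in range(n)]


def co_link(a):
    """Assumed stand-in score: shared out-links (A A^T) plus shared in-links (A^T A)."""
    at = transpose(a)
    out_shared = matmul(a, at)
    in_shared = matmul(at, a)
    n = len(a)
    return [[out_shared[i][j] + in_shared[i][j] for j in range(n)] for i in range(n)]


# Toy directed graph: nodes 0 and 1 both point to node 2, with no edge between 0 and 1.
adjacency = [
    [0, 0, 1],
    [0, 0, 1],
    [0, 0, 0],
]
simple = symmetrize(adjacency)
linked = co_link(adjacency)
```

In this toy graph `simple[0][1]` is zero because nodes 0 and 1 share no direct edge, while `linked[0][1]` is positive because both point to node 2. That is one kind of structural information a plain symmetrization discards.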

Robust Multi-View Boosting with Priors

Many learning tasks for computer vision problems can be described by multiple views or multiple features. These views can be exploited in order to learn from unlabeled data, a.k.a. “multi-view learning”. In these methods, the classifiers usually label a subset of the unlabeled data for each other iteratively and ignore the rest. In this work, we propose a new multi-view boosting algorithm that, unlike other approaches, specifically encodes the uncertainties over the unlabeled samples in terms of given priors. Instead of ignoring the unlabeled samples during the training phase of each view, we use the different views to provide an aggregated prior which is then used as a regularization term inside a semi-supervised boosting method. Since we target multi-class applications, we first introduce a multi-class boosting algorithm based on maximizing the multi-class classification margin. Then, we propose our multi-class semi-supervised boosting algorithm which is able to use priors as a regularization component over the unlabeled data. Since the priors may contain a significant amount of noise, we introduce a new loss function for the unlabeled regularization which is robust to noisy priors. Experimentally, we show that the multi-class boosting algorithm achieves state-of-the-art results in machine learning benchmarks. We also show that the newly proposed loss function is more robust than other alternatives. Finally, we demonstrate the advantages of our multi-view boosting approach for object category recognition and visual object tracking tasks, compared to other multi-view learning methods.

Amir Saffari, Christian Leistner, Martin Godec, Horst Bischof

Optimum Subspace Learning and Error Correction for Tensors

Confronted with high-dimensional tensor-like visual data, we derive a method for the decomposition of an observed tensor into a low-dimensional structure plus unbounded but sparse irregular patterns. The optimal rank-$(R_1, R_2, \ldots, R_n)$ tensor decomposition model that we propose in this paper automatically explores the low-dimensional structure of the tensor data, seeking the optimal dimension and basis for each mode and separating the irregular patterns. Consequently, our method accounts for the implicit multi-factor structure of tensor-like visual data in an explicit and concise manner. In addition, the optimal tensor decomposition is formulated as a convex optimization through a relaxation technique. We then develop a block coordinate descent (BCD) based algorithm to efficiently solve the problem. In experiments, we show several applications of our method in computer vision and the results are promising.

Yin Li, Junchi Yan, Yue Zhou, Jie Yang

