
About this book

The seven-volume set comprising LNCS volumes 8689-8695 constitutes the refereed proceedings of the 13th European Conference on Computer Vision, ECCV 2014, held in Zurich, Switzerland, in September 2014.

The 363 revised papers presented were carefully reviewed and selected from 1444 submissions. The papers are organized in topical sections on tracking and activity recognition; recognition; learning and inference; structure from motion and feature matching; computational photography and low-level vision; vision; segmentation and saliency; context and 3D scenes; motion and 3D scene analysis; and poster sessions.

Table of Contents


Tracking and Activity Recognition

Visual Tracking by Sampling Tree-Structured Graphical Models

Probabilistic tracking algorithms typically rely on graphical models based on the first-order Markov assumption. Although such linear structure models are simple and reasonable, they are not appropriate for persistent tracking, since temporal failures caused by short-term occlusion, shot changes, and appearance changes may impair the remaining frames significantly. More general graphical models may be useful to exploit the intrinsic structure of input video and improve tracking performance. Hence, we propose a novel offline tracking algorithm by identifying a tree-structured graphical model, where we formulate a unified framework to optimize tree structure and track a target in a principled way, based on MCMC sampling. To reduce computational cost, we also introduce a technique to find the optimal tree for a small number of key frames first and employ a semi-supervised manifold alignment technique of tree construction for all frames. We evaluated our algorithm in many challenging videos and obtained outstanding results compared to the state-of-the-art techniques quantitatively and qualitatively.

Seunghoon Hong, Bohyung Han

Tracking Interacting Objects Optimally Using Integer Programming

In this paper, we show that tracking different kinds of interacting objects can be formulated as a network-flow Mixed Integer Program. This is made possible by tracking all objects simultaneously and expressing the fact that one object can appear or disappear at locations where another is in terms of linear flow constraints. We demonstrate the power of our approach on scenes involving cars and pedestrians, bags being carried and dropped by people, and balls being passed from one player to the next in a basketball game. In particular, we show that by estimating jointly and globally the trajectories of different types of objects, the presence of the ones which were not initially detected based solely on image evidence can be inferred from the detections of the others.

Xinchao Wang, Engin Türetken, François Fleuret, Pascal Fua

Learning Latent Constituents for Recognition of Group Activities in Video

The collective activity of a group of persons is more than a mere sum of individual person actions, since interactions and the context of the overall group behavior have crucial influence. Consequently, the current standard paradigm for group activity recognition is to model the spatiotemporal pattern of individual person bounding boxes and their interactions. Despite this trend towards increasingly global representations, activities are often defined by semi-local characteristics and their interrelation between different persons. For capturing the large visual variability with small semi-local parts, a large number of them are required, thus rendering manual annotation infeasible. To automatically learn activity constituents that are meaningful for the collective activity, we sample local parts and group related ones not merely based on visual similarity but based on the function they fulfill on a set of validation images. Then max-margin multiple instance learning is employed to jointly i) remove clutter from these groups and focus on only the relevant samples, ii) learn the activity constituents, and iii) train the multi-class activity classifier. Experiments on standard activity benchmark sets show the advantage of this joint procedure and demonstrate the benefit of functionally grouped latent activity constituents for group activity recognition.

Borislav Antic, Björn Ommer


Large-Scale Object Classification Using Label Relation Graphs

In this paper we study how to perform object classification in a principled way that exploits the rich structure of real world labels. We develop a new model that allows encoding of flexible relations between labels. We introduce Hierarchy and Exclusion (HEX) graphs, a new formalism that captures semantic relations between any two labels applied to the same object: mutual exclusion, overlap and subsumption. We then provide rigorous theoretical analysis that illustrates properties of HEX graphs such as consistency, equivalence, and computational implications of the graph structure. Next, we propose a probabilistic classification model based on HEX graphs and show that it enjoys a number of desirable properties. Finally, we evaluate our method using a large-scale benchmark. Empirical results demonstrate that our model can significantly improve object classification by exploiting the label relations.

Jia Deng, Nan Ding, Yangqing Jia, Andrea Frome, Kevin Murphy, Samy Bengio, Yuan Li, Hartmut Neven, Hartwig Adam
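
As a rough illustration of the label relations named in the abstract above, the sketch below enumerates the binary label assignments that respect a set of subsumption and exclusion constraints. It is a minimal sketch rather than the authors' model: the function name, the three example labels, and the reading of subsumption as "child implies parent" and exclusion as "not both" are assumptions made for illustration. Labels with no edge between them are treated as free to overlap.

```python
from itertools import product


def legal_states(labels, hierarchy, exclusion):
    """Enumerate binary label assignments consistent with the relations.

    hierarchy: (parent, child) pairs; child = 1 requires parent = 1.
    exclusion: (a, b) pairs; a and b cannot both be 1.
    Both argument formats are hypothetical choices for this sketch.
    """
    states = []
    for bits in product((0, 1), repeat=len(labels)):
        y = dict(zip(labels, bits))
        # Subsumption: a child label may only be on if its parent is on.
        ok_hierarchy = all(y[p] >= y[c] for p, c in hierarchy)
        # Mutual exclusion: the two labels may not be on together.
        ok_exclusion = all(not (y[a] and y[b]) for a, b in exclusion)
        if ok_hierarchy and ok_exclusion:
            states.append(y)
    return states


# Hypothetical toy label set: "dog" and "cat" are both animals and exclude each other.
labels = ["animal", "dog", "cat"]
hierarchy = {("animal", "dog"), ("animal", "cat")}
exclusion = {("dog", "cat")}
states = legal_states(labels, hierarchy, exclusion)
for s in states:
    print(s)
```

Of the eight possible assignments for three labels, only four survive the constraints in this toy example, which hints at why exploiting label relations can constrain a classifier's output space.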

30Hz Object Detection with DPM V5

We describe an implementation of the Deformable Parts Model [1] that operates in a user-defined time-frame. Our implementation uses a variety of mechanisms to trade off speed against accuracy. Our implementation can detect all 20 PASCAL 2007 objects simultaneously at 30Hz with an mAP of 0.26. At 15Hz, its mAP is 0.30; and at 100Hz, its mAP is 0.16. By comparison, the reference implementation of [1] runs at 0.07Hz with an mAP of 0.33, and a fast GPU implementation runs at 1Hz. Our technique is over an order of magnitude faster than the previous fastest DPM implementation. Our implementation exploits a series of important speedup mechanisms. We use the cascade framework of [3] and the vector quantization technique of [2]. To speed up feature computation, we compute HOG features at few scales, and apply many interpolated templates. A hierarchical vector quantization method is used to compress HOG features for fast template evaluation. An object proposal step uses hash-table methods to identify locations where evaluating templates would be most useful; these locations are inserted into a priority queue, and processed in a detection phase. Both proposal and detection phases have an any-time property. Our method applies to legacy templates, and no retraining is required.

Mohammad Amin Sadeghi, David Forsyth
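
The speed and accuracy figures quoted in the abstract above can be put side by side with a little arithmetic. The sketch below only restates those published numbers; the helper name `speedup` and the choice of the 1Hz GPU implementation as a comparison point are assumptions made for illustration.

```python
# Frame rate (Hz) -> mAP for the described implementation, as quoted in the abstract.
OURS = {15: 0.30, 30: 0.26, 100: 0.16}
REFERENCE_HZ, REFERENCE_MAP = 0.07, 0.33  # reference implementation of [1]
GPU_HZ = 1.0                              # fast GPU implementation


def speedup(hz, baseline_hz):
    """Ratio of two frame rates (hypothetical helper for this sketch)."""
    return hz / baseline_hz


for hz in sorted(OURS):
    print(
        f"{hz:>3} Hz: {speedup(hz, REFERENCE_HZ):7.1f}x vs reference, "
        f"{speedup(hz, GPU_HZ):5.1f}x vs GPU, "
        f"mAP lower than reference by {REFERENCE_MAP - OURS[hz]:.2f}"
    )
```

At the 30Hz setting this works out to a 30x ratio against the 1Hz GPU figure, which is in line with the abstract's "over an order of magnitude" statement under the assumption stated above.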

Knowing a Good HOG Filter When You See It: Efficient Selection of Filters for Detection

Collections of filters based on histograms of oriented gradients (HOG) are common for several detection methods, notably, poselets and exemplar SVMs. The main bottleneck in training such systems is the selection of a subset of good filters from a large number of possible choices. We show that one can learn a universal model of part “goodness” based on properties that can be computed from the filter itself. The intuition is that good filters across categories exhibit common traits such as low clutter and gradients that are spatially correlated. This allows us to quickly discard filters that are not promising, thereby speeding up the training procedure. Applied to training the poselet model, our automated selection procedure allows us to improve its detection performance on the PASCAL VOC data sets, while speeding up training by an order of magnitude. Similar results are reported for exemplar SVMs.

Ejaz Ahmed, Gregory Shakhnarovich, Subhransu Maji

Linking People in Videos with “Their” Names Using Coreference Resolution

Natural language descriptions of videos provide a potentially rich and vast source of supervision. However, the highly-varied nature of language presents a major barrier to its effective use. What is needed are models that can reason over uncertainty over both videos and text. In this paper, we tackle the core task of person naming: assigning names of people in the cast to human tracks in TV videos. Screenplay scripts accompanying the video provide some crude supervision about who’s in the video. However, even the basic problem of knowing who is in the script is often difficult, since language often refers to people using pronouns (e.g., “he”) and nominals (e.g., “man”) rather than actual names (e.g., “Susan”). Resolving the identity of these mentions is the task of coreference resolution, which is an active area of research in natural language processing. We develop a joint model for person naming and coreference resolution, and in the process, infer a latent alignment between tracks and mentions. We evaluate our model on both vision and NLP tasks on a new dataset of 19 TV episodes. On both tasks, we significantly outperform the independent baselines.

Vignesh Ramanathan, Armand Joulin, Percy Liang, Li Fei-Fei

Poster Session 1

Optimal Essential Matrix Estimation via Inlier-Set Maximization

In this paper, we extend the globally optimal “rotation space search” method [11] to essential matrix estimation in the presence of feature mismatches or outliers. The problem is formulated as inlier-set cardinality maximization, and solved via branch-and-bound global optimization which searches the entire essential manifold formed by all essential matrices. Our main contributions include an explicit, geometrically meaningful essential manifold parametrization using a 5D direct product space of a solid 2D disk and a solid 3D ball, as well as efficient closed-form bounding functions. Experiments on both synthetic data and real images have confirmed the efficacy of our method. The method is mostly suitable for applications where robustness and accuracy are paramount. It can also be used as a benchmark for method evaluation.

Jiaolong Yang, Hongdong Li, Yunde Jia
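
To make the "inlier-set cardinality" objective in the abstract above concrete, the sketch below counts how many point matches agree with one candidate essential matrix. It is a minimal sketch under stated assumptions: the inlier test uses the algebraic epipolar residual |x2ᵀ E x1| with a hypothetical threshold `eps`, the function names are illustrative, and the branch-and-bound search over the essential manifold described in the abstract is not shown.

```python
def epipolar_residual(E, x1, x2):
    """Absolute algebraic epipolar residual |x2^T E x1|.

    E is a 3x3 matrix as nested lists; x1 and x2 are homogeneous
    3-vectors in normalized image coordinates.
    """
    Ex1 = [sum(E[i][j] * x1[j] for j in range(3)) for i in range(3)]
    return abs(sum(x2[i] * Ex1[i] for i in range(3)))


def count_inliers(E, matches, eps):
    """Number of matches whose residual is within eps (the quantity to maximize)."""
    return sum(1 for x1, x2 in matches if epipolar_residual(E, x1, x2) <= eps)


# Candidate E for a pure translation along the x-axis with identity rotation:
# E = [t]_x with t = (1, 0, 0), so inlier matches keep the same y coordinate.
E = [[0.0, 0.0, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.0, 0.0]]

matches = [
    ((0.1, 0.2, 1.0), (0.3, 0.2, 1.0)),   # consistent with E
    ((0.0, 0.5, 1.0), (-0.2, 0.5, 1.0)),  # consistent with E
    ((0.0, 0.0, 1.0), (0.0, 0.4, 1.0)),   # mismatch (outlier)
]

n_inliers = count_inliers(E, matches, eps=1e-3)
print(n_inliers)
```

A global method in the spirit of the abstract would search over candidate matrices for the one that maximizes this count, rather than scoring a single candidate as done here.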

UPnP: An Optimal O(n) Solution to the Absolute Pose Problem with Universal Applicability

A large number of absolute pose algorithms have been presented in the literature. Common performance criteria are computational complexity, geometric optimality, global optimality, structural degeneracies, and the number of solutions. The ability to handle minimal sets of correspondences, resulting solution multiplicity, and generalized cameras are further desirable properties. This paper presents the first PnP solution that unifies all the above desirable properties within a single algorithm. We compare our result to state-of-the-art minimal, non-minimal, central, and non-central PnP algorithms, and demonstrate universal applicability, competitive noise resilience, and superior computational efficiency. Our algorithm is called Unified PnP.


Laurent Kneip, Hongdong Li, Yongduek Seo

3D Reconstruction of Dynamic Textures in Crowd Sourced Data

We propose a framework to automatically build 3D models for scenes containing structures not amenable for photo-consistency based reconstruction due to having dynamic appearance. We analyze the dynamic appearance elements of a given scene by leveraging the imagery contained in Internet image photo-collections and online video sharing websites. Our approach combines large scale crowd sourced SfM techniques with image content segmentation and shape from silhouette techniques to build an iterative framework for 3D shape estimation. The developed system not only enables more complete and robust 3D modeling, but it also enables more realistic visualizations through the identification of dynamic scene elements amenable to dynamic texture mapping. Experiments on crowd sourced image and video datasets illustrate the effectiveness of our automated data-driven approach.

Dinghuang Ji, Enrique Dunn, Jan-Michael Frahm

3D Interest Point Detection via Discriminative Learning

The task of detecting the interest points in 3D meshes has typically been handled by geometric methods. These methods, while designed according to human preference, can be ill-equipped for handling the variety and subjectivity in human responses. Different tasks have different requirements for interest point detection; some tasks may necessitate high precision while other tasks may require high recall. Sometimes points with high curvature may be desirable, while in other cases high curvature may be an indication of noise. Geometric methods lack the required flexibility to adapt to such changes. As a consequence, interest point detection seems to be well suited for machine learning methods that can be trained to match the criteria applied on the annotated training data. In this paper, we formulate interest point detection as a supervised binary classification problem using a random forest as our classifier. We validate the accuracy of our method and compare our results to those of five state of the art methods on a new, standard benchmark.

Leizer Teran, Philippos Mordohai

Pose Locality Constrained Representation for 3D Human Pose Reconstruction

Reconstructing 3D human poses from a single 2D image is an ill-posed problem without considering the human body model. Explicitly enforcing physiological constraints is known to be non-convex and usually leads to difficulty in finding an optimal solution. An attractive alternative is to learn a prior model of the human body from a set of human pose data. In this paper, we develop a new approach, namely pose locality constrained representation (PLCR), to model the 3D human body and use it to improve 3D human pose reconstruction. In this approach, the human pose space is first hierarchically divided into lower-dimensional pose subspaces by subspace clustering. After that, a block-structural pose dictionary is constructed by concatenating the basis poses from all the pose subspaces. Finally, PLCR utilizes the block-structural pose dictionary to explicitly encourage pose locality in human-body modeling – nonzero coefficients are only assigned to the basis poses from a small number of pose subspaces that are close to each other in the pose-subspace hierarchy. We combine PLCR into the matching-pursuit based 3D human-pose reconstruction algorithm and show that the proposed PLCR-based algorithm outperforms the state-of-the-art algorithm that uses the standard sparse representation and physiological regularity in reconstructing a variety of human poses from both synthetic data and real images.

Xiaochuan Fan, Kang Zheng, Youjie Zhou, Song Wang

Synchronization of Two Independently Moving Cameras without Feature Correspondences

In this work, a method that synchronizes two video sequences is proposed. Unlike previous methods, which require the existence of correspondences between features tracked in the two sequences, and/or that the cameras are static or jointly moving, the proposed approach does not impose any of these constraints. It works when the cameras move independently, even if different features are tracked in the two sequences. The assumptions underlying the proposed strategy are that the intrinsic parameters of the cameras are known and that two rigid objects, with independent motions on the scene, are visible in both sequences. The relative motion between these objects is used as a clue for the synchronization. The extrinsic parameters of the cameras are assumed to be unknown. A new synchronization algorithm for static or jointly moving cameras that see (possibly) different parts of a common rigidly moving object is also proposed. Proof-of-concept experiments that illustrate the performance of these methods are presented, as well as a comparison with a state-of-the-art approach.

Tiago Gaspar, Paulo Oliveira, Paolo Favaro

Multi Focus Structured Light for Recovering Scene Shape and Global Illumination

Illumination defocus and global illumination effects are major challenges for active illumination scene recovery algorithms. Illumination defocus limits the working volume of projector-camera systems and global illumination can induce large errors in shape estimates. In this paper, we develop an algorithm for scene recovery in the presence of both defocus and global light transport effects such as interreflections and sub-surface scattering. Our method extends the working volume by using structured light patterns at multiple projector focus settings. A careful characterization of projector blur allows us to decode even partially out-of-focus patterns. This enables our algorithm to recover scene shape and the direct and global illumination components over a large depth of field while still using a relatively small number of images (typically 25-30). We demonstrate the effectiveness of our approach by recovering high quality depth maps of scenes containing objects made of optically challenging materials such as wax, marble, soap, colored glass and translucent plastic.

Supreeth Achar, Srinivasa G. Narasimhan

Coplanar Common Points in Non-centric Cameras

Discovering and extracting new image features pertaining to scene geometry is important to 3D reconstruction and scene understanding. Examples include the classical vanishing points observed in a centric camera and the recent coplanar common points (CCPs) in a crossed-slit camera [21,17]. A CCP is a point in the image plane corresponding to the intersection of the projections of all lines lying on a common 3D plane. In this paper, we address the problem of determining CCP existence in general non-centric cameras. We first conduct a ray-space analysis to show that finding the CCP of a 3D plane is equivalent to solving an array of ray constraint equations. We then derive the necessary and sufficient conditions for CCP to exist in an arbitrary non-centric camera such as non-centric catadioptric mirrors. Finally, we present robust algorithms for extracting the CCPs from a single image and validate our theories and algorithms through experiments.

Wei Yang, Yu Ji, Jinwei Ye, S. Susan Young, Jingyi Yu

SRA: Fast Removal of General Multipath for ToF Sensors

A major issue with Time of Flight sensors is the presence of multipath interference. We present Sparse Reflections Analysis (SRA), an algorithm for removing this interference which has two main advantages. First, it allows for very general forms of multipath, including interference with three or more paths, diffuse multipath resulting from Lambertian surfaces, and combinations thereof. SRA removes this general multipath with robust techniques based on optimization. Second, due to a novel dimension reduction, we are able to produce a very fast version of SRA, which is able to run at frame rate. Experimental results on both synthetic data with ground truth, as well as real images of challenging scenes, validate the approach.

Daniel Freedman, Yoni Smolin, Eyal Krupka, Ido Leichter, Mirko Schmidt

Sub-pixel Layout for Super-Resolution with Images in the Octic Group

This paper presents a novel super-resolution framework by exploring the properties of non-conventional pixel layouts and shapes. We show that recording multiple images, transformed in the octic group, with a sensor of asymmetric sub-pixel layout increases the spatial sampling compared to a conventional sensor with a rectilinear grid of pixels and hence increases the image resolution. We further prove a theoretical bound for achieving well-posed super-resolution with a designated magnification factor w.r.t. the number and distribution of sub-pixels. We also propose strategies for selecting good sub-pixel layouts and effective super-resolution algorithms for our setup. The experimental results validate the proposed theory and solution, which have the potential to guide the future CCD layout design with super-resolution functionality.

Boxin Shi, Hang Zhao, Moshe Ben-Ezra, Sai-Kit Yeung, Christy Fernandez-Cull, R. Hamilton Shepard, Christopher Barsi, Ramesh Raskar
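
The octic group mentioned in the abstract above is the symmetry group of the square: four rotations by multiples of 90 degrees, each with or without a mirror flip, for eight transforms in total. The sketch below generates these eight versions of a small grid. The 2x2 grid is only a hypothetical stand-in for an asymmetric sub-pixel layout, and the function names are illustrative rather than taken from the paper.

```python
def rot90(img):
    """Rotate a grid (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip(img):
    """Mirror a grid left to right."""
    return [row[::-1] for row in img]


def octic_orbit(img):
    """Return the 8 images obtained by applying the octic group to img."""
    out = []
    cur = [list(row) for row in img]
    for _ in range(4):
        out.append(cur)        # pure rotation
        out.append(flip(cur))  # rotation followed by a mirror flip
        cur = rot90(cur)
    return out


def distinct(images):
    """Count distinct grids in a list (hypothetical helper for this sketch)."""
    return len({tuple(map(tuple, im)) for im in images})


asymmetric = [[1, 2], [3, 4]]  # every cell differs, so no transform maps it to itself
symmetric = [[1, 1], [1, 1]]   # fully symmetric layout

print(distinct(octic_orbit(asymmetric)))
print(distinct(octic_orbit(symmetric)))
```

With the asymmetric grid all eight transformed versions differ, whereas the fully symmetric grid yields the same layout every time; this is one way to see why an asymmetric layout can contribute additional samples under these transforms.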

Simultaneous Feature and Dictionary Learning for Image Set Based Face Recognition

In this paper, we propose a simultaneous feature and dictionary learning (SFDL) method for image set based face recognition, where each training and testing example contains a face image set captured from different poses, illuminations, expressions and resolutions. While several feature learning and dictionary learning methods have been proposed for image set based face recognition in recent years, most of them learn the features and dictionaries separately, which may not be powerful enough because some discriminative information for dictionary learning may be compromised in the feature learning stage if they are applied sequentially, and vice versa. To address this, we propose an SFDL method to learn discriminative features and dictionaries simultaneously from raw face images so that discriminative information can be jointly exploited. Extensive experimental results on four widely used face datasets show that our method achieves better performance than state-of-the-art image set based face recognition methods.

Jiwen Lu, Gang Wang, Weihong Deng, Pierre Moulin

Read My Lips: Continuous Signer Independent Weakly Supervised Viseme Recognition

This work presents a framework to recognise signer independent mouthings in continuous sign language, with no manual annotations needed. Mouthings represent lip-movements that correspond to pronunciations of words or parts of them during signing. Research on sign language recognition has focused extensively on the hands as features. But sign language is multi-modal, and a full understanding, particularly with respect to its lexical variety, language idioms and grammatical structures, is not possible without further exploring the remaining information channels. To our knowledge no previous work has explored dedicated viseme recognition in the context of sign language recognition. The approach is trained on over 180,000 unlabelled frames and reaches 47.1% precision on the frame level. Generalisation across individuals and the influence of context-dependent visemes are analysed.

Oscar Koller, Hermann Ney, Richard Bowden

Multilinear Wavelets: A Statistical Shape Space for Human Faces

We present a statistical model for 3D human faces in varying expression, which decomposes the surface of the face using a wavelet transform, and learns many localized, decorrelated multilinear models on the resulting coefficients. Using this model we are able to reconstruct faces from noisy and occluded 3D face scans, and facial motion sequences. Accurate reconstruction of face shape is important for applications such as tele-presence and gaming. The localized and multi-scale nature of our model allows for recovery of fine-scale detail while retaining robustness to severe noise and occlusion, and is computationally efficient and scalable. We validate these properties experimentally on challenging data in the form of static scans and motion sequences. We show that in comparison to a global multilinear model, our model better preserves fine detail and is computationally faster, while in comparison to a localized PCA model, our model better handles variation in expression, is faster, and allows us to fix identity parameters for a given subject.

Alan Brunton, Timo Bolkart, Stefanie Wuhrer

Distance Estimation of an Unknown Person from a Portrait

We propose the first automated method for estimating distance from frontal pictures of unknown faces. Camera calibration is not necessary, nor is the reconstruction of a 3D representation of the shape of the head. Our method is based on estimating automatically the position of face and head landmarks in the image, and then using a regressor to estimate distance from such measurements. We collected and annotated a dataset of frontal portraits of 53 individuals spanning a number of attributes (sex, age, race, hair), each photographed from seven distances. We find that our proposed method outperforms humans performing the same task. We observe that different physiognomies will bias systematically the estimate of distance, i.e. some people look closer than others. We explore which landmarks are more important for this task.

Xavier P. Burgos-Artizzu, Matteo Ruggero Ronchi, Pietro Perona

Probabilistic Temporal Head Pose Estimation Using a Hierarchical Graphical Model

We present a hierarchical graphical model to probabilistically estimate head pose angles from real-world videos that leverages the temporal pose information over video frames. The proposed model employs a number of complementary facial features, and performs feature level, probabilistic classifier level and temporal level fusion. Extensive experiments are performed to analyze the pose estimation performance for different combinations of features, different levels of the proposed hierarchical model and for different face databases. Experiments show that the proposed head pose model improves on the current state-of-the-art for the unconstrained McGillFaces [10] and the constrained CMU Multi-PIE [14] databases, increasing the pose classification accuracy compared to the current top performing method by 19.38% and 19.89%, respectively.

Meltem Demirkus, Doina Precup, James J. Clark, Tal Arbel

Description-Discrimination Collaborative Tracking

The appearance model is one of the most important components for online visual tracking. An effective appearance model needs to strike the right balance between being adaptive, to account for appearance change, and being conservative, to re-track the object after it loses tracking (e.g., due to occlusion). Most conventional appearance models focus on one aspect out of the two, and hence are not able to achieve the right balance. In this paper, we approach this problem by a max-margin learning framework collaborating a descriptive component and a discriminative component. Particularly, the two components are for different purposes and with different lifespans. One forms a robust object model, and the other tries to distinguish the object from the current background. Taking advantage of their complementary roles, the components improve each other and collaboratively contribute to a shared score function. Besides, for realtime implementation, we also propose a series of optimization and sample-management strategies. Experiments over 30 challenging videos demonstrate the effectiveness and robustness of the proposed tracker. Our method generally outperforms the existing state-of-the-art methods.

Dapeng Chen, Zejian Yuan, Gang Hua, Yang Wu, Nanning Zheng

Online, Real-Time Tracking Using a Category-to-Individual Detector

A method for online, real-time tracking of objects is presented. Tracking is treated as a repeated detection problem where potential target objects are identified with a pre-trained category detector and object identity across frames is established by individual-specific detectors. The individual detectors are (re-)trained online from a single positive example whenever there is a coincident category detection. This ensures that the tracker is robust to drift. Real-time operation is possible since an individual-object detector is obtained through elementary manipulations of the thresholds of the category detector and therefore only minimal additional computations are required. Our tracking algorithm is benchmarked against nine state-of-the-art trackers on two large, publicly available and challenging video datasets. We find that our algorithm is 10% more accurate and nearly as fast as the fastest of the competing algorithms, and it is as accurate but 20 times faster than the most accurate of the competing algorithms.

David Hall, Pietro Perona

Robust Visual Tracking with Double Bounding Box Model

A novel tracking algorithm that can track a highly non-rigid target robustly is proposed using a new bounding box representation called the Double Bounding Box (DBB). In the DBB, a target is described by the combination of the Inner Bounding Box (IBB) and the Outer Bounding Box (OBB). Then our objective of visual tracking is changed to find the IBB and OBB instead of a single bounding box, where the IBB and OBB can be easily obtained by the Dempster-Shafer (DS) theory. If the target is highly non-rigid, any single bounding box cannot include all foreground regions while excluding all background regions. Using the DBB, our method does not directly handle the ambiguous regions, which include both the foreground and background regions. Hence, it can solve the inherent ambiguity of the single bounding box representation and thus can track highly non-rigid targets robustly. Our method finally finds the best state of the target using a new Constrained Markov Chain Monte Carlo (CMCMC)-based sampling method with the constraint that the OBB should include the IBB. Experimental results show that our method tracks non-rigid targets accurately and robustly, and outperforms state-of-the-art methods.

Junseok Kwon, Junha Roh, Kyoung Mu Lee, Luc Van Gool

Tractable and Reliable Registration of 2D Point Sets

This paper introduces two new methods of registering 2D point sets over rigid transformations when the registration error is based on a robust loss function. In contrast to previous work, our methods are guaranteed to compute the optimal transformation, and at the same time, the worst-case running times are bounded by a low-degree polynomial in the number of correspondences. In practical terms, this means that there is no need to resort to ad-hoc procedures such as random sampling or local descent methods that cannot guarantee the quality of their solutions.

We have tested the methods in several different settings; in particular, a thorough evaluation on two benchmarks of microscopic images used for histologic analysis of prostate cancer has been performed. Compared to the state-of-the-art, our results show that the methods are both tractable and reliable despite the presence of a significant amount of outliers.

Erik Ask, Olof Enqvist, Linus Svärm, Fredrik Kahl, Giuseppe Lippolis

Graduated Consistency-Regularized Optimization for Multi-graph Matching

Graph matching has a wide spectrum of computer vision applications such as finding feature point correspondences across images. The problem of graph matching is generally NP-hard, so most existing work pursues suboptimal solutions between two graphs. This paper investigates a more general problem of matching multiple attributed graphs to each other, i.e. labeling their common node correspondences such that a certain compatibility/affinity objective is optimized. This multi-graph matching problem involves two key ingredients affecting the overall accuracy: a) the pairwise affinity matching score between two local graphs, and b) global matching consistency that measures the uniqueness and consistency of the pairwise matching results by different sequential matching orders. Previous work typically either enforces the matching consistency constraints in the beginning of iterative optimization, which may propagate matching error both over iterations and across different graph pairs; or separates score optimizing and consistency synchronization in two steps. This paper is motivated by the observation that affinity score and consistency are mutually affected and shall be tackled jointly to capture their correlation behavior. As such, we propose a novel multi-graph matching algorithm to incorporate the two aspects by iteratively approximating the global-optimal affinity score, meanwhile gradually infusing the consistency as a regularizer, which improves the performance of the initial solutions obtained by existing pairwise graph matching solvers. The proposed algorithm with a theoretically proven convergence shows notable efficacy on both synthetic and public image datasets.

Junchi Yan, Yin Li, Wei Liu, Hongyuan Zha, Xiaokang Yang, Stephen Mingyu Chu

Optical Flow Estimation with Channel Constancy

Large motions remain a challenge for current optical flow algorithms. Traditionally, large motions are addressed using multi-resolution representations like Gaussian pyramids. To deal with large displacements, many pyramid levels are needed and, if an object is small, it may be invisible at the highest levels. To address this we decompose images using a channel representation (CR) and replace the standard brightness constancy assumption with a descriptor constancy assumption. CRs can be seen as an over-segmentation of the scene into layers based on some image feature. If the appearance of a foreground object differs from the background then its descriptor will be different and they will be represented in different layers. We create a pyramid by smoothing these layers, without mixing foreground and background or losing small objects. Our method estimates more accurate flow than the baseline on the MPI-Sintel benchmark, especially for fast motions and near motion boundaries.

Laura Sevilla-Lara, Deqing Sun, Erik G. Learned-Miller, Michael J. Black
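
To make the idea of a channel representation concrete, the sketch below softly assigns each pixel intensity to a small set of evenly spaced channels, producing one layer per channel. It is only an illustration: the triangular kernel, the intensity range of [0, 1], and the function names are assumptions, and the paper may use a different feature and kernel.

```python
def encode_channels(value, num_channels, lo=0.0, hi=1.0):
    """Softly assign a scalar in [lo, hi] to evenly spaced channels.

    A triangular kernel is assumed, so at most two neighbouring channels
    are non-zero and the weights sum to 1.
    """
    if num_channels < 2:
        raise ValueError("need at least two channels")
    spacing = (hi - lo) / (num_channels - 1)
    position = (value - lo) / spacing  # fractional channel index
    return [max(0.0, 1.0 - abs(position - c)) for c in range(num_channels)]


def image_to_channels(image, num_channels):
    """Turn a 2D list of intensities into `num_channels` layers of the same size."""
    layers = [[[0.0] * len(row) for row in image] for _ in range(num_channels)]
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            for c, weight in enumerate(encode_channels(value, num_channels)):
                layers[c][y][x] = weight
    return layers
```

Because a dark object and a bright background land in different layers, smoothing each layer on its own blurs them separately, which matches the abstract's point about building a pyramid without mixing foreground and background.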

Non-local Total Generalized Variation for Optical Flow Estimation

In this paper we introduce a novel higher-order regularization term. The proposed regularizer is a non-local extension of the popular second-order Total Generalized Variation, which favors piecewise affine solutions and allows soft-segmentation cues to be incorporated into the regularization term. These properties make this regularizer especially appealing for optical flow estimation, where it offers accurately localized motion boundaries and helps resolve ambiguities in the matching term. We additionally propose a novel matching term which is robust to illumination and scale changes, two major sources of errors in optical flow estimation algorithms. We extensively evaluate the proposed regularizer and data term on two challenging benchmarks, where we are able to obtain state-of-the-art results. Our method is currently ranked first among classical two-frame optical flow methods on the KITTI optical flow benchmark.

René Ranftl, Kristian Bredies, Thomas Pock

Learning Brightness Transfer Functions for the Joint Recovery of Illumination Changes and Optical Flow

The increasing importance of outdoor applications such as driver assistance systems or video surveillance tasks has recently triggered the development of optical flow methods that aim at performing robustly under uncontrolled illumination. Most of these methods are based on patch-based features such as the normalized cross correlation, the census transform or the rank transform. They achieve their robustness by locally discarding both absolute brightness and contrast. In this paper, we follow an alternative strategy: Instead of discarding potentially important image information, we propose a novel variational model that jointly estimates illumination changes and optical flow. The key idea is to parametrize the illumination changes in terms of basis functions that are learned from training data. While such basis functions allow for a meaningful representation of illumination effects, they also help to distinguish real illumination changes from motion-induced brightness variations if supplemented by additional smoothness constraints. Experiments on the KITTI benchmark show the clear benefits of our approach. They not only demonstrate that it is possible to obtain meaningful basis functions, but also show state-of-the-art results for robust optical flow estimation.

Oliver Demetz, Michael Stoll, Sebastian Volz, Joachim Weickert, Andrés Bruhn

Hipster Wars: Discovering Elements of Fashion Styles

The clothing we wear and our identities are closely tied, revealing to the world clues about our wealth, occupation, and socio-identity. In this paper we examine questions related to what our clothing reveals about our personal style. We first design an online competitive Style Rating Game called Hipster Wars to crowd source reliable human judgments of style. We use this game to collect a new dataset of clothing outfits with associated style ratings for 5 style categories: hipster, bohemian, pinup, preppy, and goth. Next, we train models for between-class and within-class classification of styles. Finally, we explore methods to identify clothing elements that are generally discriminative for a style, and methods for identifying items in a particular outfit that may indicate a style.

M. Hadi Kiapour, Kota Yamaguchi, Alexander C. Berg, Tamara L. Berg

From Low-Cost Depth Sensors to CAD: Cross-Domain 3D Shape Retrieval via Regression Tree Fields

The recent advances of low-cost and mobile depth sensors dramatically extend the potential of 3D shape retrieval and analysis. While the traditional research of 3D retrieval mainly focused on searching by a rough 2D sketch or with a high-quality CAD model, we tackle a novel and challenging problem of cross-domain 3D shape retrieval, in which users can use 3D scans from low-cost depth sensors like Kinect as queries to search CAD models in the database. To cope with the imperfection of user-captured models such as model noise and occlusion, we propose a cross-domain shape retrieval framework, which minimizes the potential function of a Conditional Random Field to efficiently generate the retrieval scores. In particular, the potential function consists of two critical components: one unary potential term provides robust cross-domain partial matching and the other pairwise potential term embeds spatial structures to alleviate the instability from model noise. Both potential components are efficiently estimated using random forests with 3D local features, forming a Regression Tree Field framework. We conduct extensive experiments on two recently released user-captured 3D shape datasets and compare with several state-of-the-art approaches on the cross-domain shape retrieval task. The experimental results demonstrate that our proposed method outperforms the competing methods with a significant performance gain.

Yan Wang, Jie Feng, Zhixiang Wu, Jun Wang, Shih-Fu Chang

Fast and Accurate Texture Recognition with Multilayer Convolution and Multifractal Analysis

A fast and accurate texture recognition system is presented. The new approach consists in extracting locally and globally invariant representations. The locally invariant representation is built on a multi-resolution convolutional network with a local pooling operator to improve robustness to local orientation and scale changes. This representation is mapped into a globally invariant descriptor using multifractal analysis. We propose a new multifractal descriptor that captures rich texture information and is mathematically invariant to various complex transformations. In addition, two more techniques are presented to further improve the robustness of our system. The first technique consists in combining the generative PCA classifier with multiclass SVMs. The second technique consists of two simple strategies to boost classification results by synthetically augmenting the training set. Experiments show that the proposed solution outperforms existing methods on three challenging public benchmark datasets, while being computationally efficient.

Hicham Badri, Hussein Yahia, Khalid Daoudi

Learning to Rank 3D Features

Representation of three dimensional objects using a set of oriented point pair features has been shown to be effective for object recognition and pose estimation. Combined with an efficient voting scheme on a generalized Hough space, existing approaches achieve good recognition accuracy and fast operation. However, the performance of these approaches degrades when the objects are (self-)similar or exhibit degeneracies, such as large planar surfaces which are very common in both man-made and natural shapes, or when there is heavy object and background clutter. We propose a max-margin learning framework to identify discriminative features on the surface of three dimensional objects. Our algorithm selects and ranks features according to their importance for the specified task, which leads to improved accuracy and reduced computational cost. In addition, we analyze various grouping and optimization strategies to learn the discriminative pair features. We present extensive synthetic and real experiments demonstrating the improved results.

Oncel Tuzel, Ming-Yu Liu, Yuichi Taguchi, Arvind Raghunathan

Salient Color Names for Person Re-identification

Color naming, which relates colors with color names, can help people with a semantic analysis of images in many computer vision applications. In this paper, we propose a novel salient color names based color descriptor (SCNCD) to describe colors. SCNCD utilizes salient color names to guarantee that a higher probability will be assigned to the color name which is nearer to the color. Based on SCNCD, color distributions over color names in different color spaces are then obtained and fused to generate a feature representation. Moreover, the effect of background information is employed and analyzed for person re-identification. With a simple metric learning method, the proposed approach outperforms the state-of-the-art performance (without user’s feedback optimization) on two challenging datasets (VIPeR and PRID 450S). More importantly, the proposed feature can be obtained very fast if we compute SCNCD of each color in advance.

Yang Yang, Jimei Yang, Junjie Yan, Shengcai Liao, Dong Yi, Stan Z. Li

Learning Discriminative and Shareable Features for Scene Classification

In this paper, we propose to learn a discriminative and shareable feature transformation filter bank to transform local image patches (represented as raw pixel values) into features for scene image classification. The learned filters are expected to: (1) encode common visual patterns of a flexible number of categories; (2) encode discriminative and class-specific information. For each category, a subset of the filters is activated in a data-adaptive manner, while sharing of filters among different categories is also allowed. The discriminative power of the filter bank is further enhanced by enforcing features from the same category to be close to each other in the feature space, and features from different categories to be far away from each other. The experimental results on three challenging scene image classification datasets indicate that our features can achieve very promising performance. Furthermore, our features also show great complementary effect to the state-of-the-art ConvNets feature.

Zhen Zuo, Gang Wang, Bing Shuai, Lifan Zhao, Qingxiong Yang, Xudong Jiang

Image Retrieval and Ranking via Consistently Reconstructing Multi-attribute Queries

Image retrieval and ranking based on multi-attribute queries is beneficial to various real world applications. Traditional methods for this problem often utilize intermediate representations generated by attribute classifiers to describe the images, and then the images in the database are sorted according to their similarities to the query. However, such a scheme has two main challenges: 1) how to exploit the correlation between query attributes and non-query attributes, and 2) how to handle noisy representations since the pre-defined attribute classifiers are probably unreliable. To overcome these challenges, we discover the correlation among attributes via expanding the query representation, and impose group sparsity on representations to reduce the disturbance of noisy data. Specifically, given a multi-attribute query matrix with each row corresponding to a query attribute and each column to a pre-defined attribute, we first expand the query based on the correlation of the attributes learned from the training data. Then, the expanded query matrix is reconstructed by the images in the dataset with a group-sparsity ℓ-norm regularization. Furthermore, we introduce the ranking SVM into the objective function to guarantee the ranking consistency. Finally, we adopt a graph regularization to preserve the local visual similarity among images. Extensive experiments on LFW, CUB-200-2011, and Shoes datasets are conducted to demonstrate the effectiveness of our proposed method.

Xiaochun Cao, Hua Zhang, Xiaojie Guo, Si Liu, Xiaowu Chen

Neural Codes for Image Retrieval

It has been shown that the activations invoked by an image within the top layers of a large convolutional neural network provide a high-level descriptor of the visual content of the image. In this paper, we investigate the use of such descriptors (neural codes) within the image retrieval application. In the experiments with several standard retrieval benchmarks, we establish that neural codes perform competitively even when the convolutional neural network has been trained for an unrelated classification task (e.g. Image-Net). We also evaluate the improvement in the retrieval performance of neural codes, when the network is retrained on a dataset of images that are similar to images encountered at test time.

We further evaluate the performance of the compressed neural codes and show that a simple PCA compression provides very good short codes that give state-of-the-art accuracy on a number of datasets. In general, neural codes turn out to be much more resilient to such compression in comparison to other state-of-the-art descriptors. Finally, we show that discriminative dimensionality reduction trained on a dataset of pairs of matched photographs improves the performance of PCA-compressed neural codes even further. Overall, our quantitative experiments demonstrate the promise of neural codes as visual descriptors for image retrieval.

Artem Babenko, Anton Slesarev, Alexandr Chigorin, Victor Lempitsky

Architectural Style Classification Using Multinomial Latent Logistic Regression

Architectural style classification differs from standard classification tasks due to the rich inter-class relationships between different styles, such as re-interpretation, revival, and territoriality. In this paper, we adopt Deformable Part-based Models (DPM) to capture the morphological characteristics of basic architectural components and propose Multinomial Latent Logistic Regression (MLLR) that introduces the probabilistic analysis and tackles the multi-class problem in latent variable models. Due to the lack of publicly available datasets, we release a new large-scale architectural style dataset containing twenty-five classes. Experimentation on this dataset shows that MLLR, in combination with standard global image features, obtains the best classification results. We also present interpretable probabilistic explanations for the results, such as the styles of individual buildings and a style relationship network, to illustrate inter-class relationships.

Zhe Xu, Dacheng Tao, Ya Zhang, Junjie Wu, Ah Chung Tsoi

Instance Segmentation of Indoor Scenes Using a Coverage Loss

A major limitation of existing models for semantic segmentation is the inability to identify individual instances of the same class: when labeling pixels with only semantic classes, a set of pixels with the same label could represent a single object or ten. In this work, we introduce a model to perform both semantic and instance segmentation simultaneously. We introduce a new higher-order loss function that directly minimizes the coverage metric and evaluate a variety of region features, including those from a convolutional network. We apply our model to the NYU Depth V2 dataset, obtaining state of the art results.

Nathan Silberman, David Sontag, Rob Fergus

Superpixel Graph Label Transfer with Learned Distance Metric

We present a fast approximate nearest neighbor algorithm for semantic segmentation. Our algorithm builds a graph over superpixels from an annotated set of training images. Edges in the graph represent approximate nearest neighbors in feature space. At test time we match superpixels from a novel image to the training images by adding the novel image to the graph. A move-making search algorithm allows us to leverage the graph and image structure for finding matches. We then transfer labels from the training images to the image under test. To promote good matches between superpixels we propose to learn a distance metric that weights the edges in our graph. Our approach is evaluated on four standard semantic segmentation datasets and achieves results comparable with the state-of-the-art.

Stephen Gould, Jiecheng Zhao, Xuming He, Yuhang Zhang
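
The label-transfer step described in the abstract above can be sketched as a nearest-neighbour vote under a weighted distance. This sketch simplifies heavily: it uses exact brute-force search instead of the approximate graph search with move-making, and the per-dimension weights stand in for a learned metric but are simply given as input. All names are hypothetical.

```python
from collections import Counter


def weighted_distance(a, b, weights):
    """Squared distance with per-dimension weights (a diagonal metric)."""
    return sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights))


def transfer_label(query, train_features, train_labels, weights, k=3):
    """Label a query superpixel by majority vote among its k nearest training superpixels."""
    order = sorted(
        range(len(train_features)),
        key=lambda i: weighted_distance(query, train_features[i], weights),
    )
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

Changing the weights changes which training superpixels count as neighbours, which is why the choice of distance metric affects the transferred labels.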

Precision-Recall-Classification Evaluation Framework: Application to Depth Estimation on Single Images

Many computer vision applications involve algorithms that can be decomposed into two main steps. In a first step, events or objects are detected and, in a second step, detections are assigned to classes. Examples of such “detection plus classification” problems can be found in human pose classification, object recognition or action classification among others. In this paper, we focus on a special case: depth ordering on single images. In this problem, the detection step consists of the image segmentation, and the classification step assigns a depth gradient to each contour or a depth order to each region. We discuss the limitations of the classical Precision-Recall evaluation framework for these kinds of problems and define an extended framework called “Precision-Recall-Classification” (PRC). Then, we apply this framework to depth ordering problems and design two specific PRC measures to evaluate both the local and the global depth consistencies. We use these measures to evaluate precisely state-of-the-art depth ordering systems for monocular images. We also propose an extension to the method of [2] applying an optimal graph cut on a hierarchical segmentation structure. The resulting system is shown to provide better results than state-of-the-art algorithms.

Guillem Palou Visa, Philippe Salembier
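
A rough illustration of why a “detection plus classification” problem needs more than precision and recall: the sketch below reports both, plus the classification accuracy over the detections that match the ground truth. It is not the PRC measure defined in the paper. Matching items by identical id and the function name are assumptions made for the example.

```python
def precision_recall_classification(detections, ground_truth):
    """Evaluate a detect-then-classify system.

    `detections` and `ground_truth` map an item id (e.g. a contour id) to a
    class label. An item counts as correctly detected when its id appears in both.
    Returns (precision, recall, accuracy over matched items).
    """
    matched = set(detections) & set(ground_truth)
    precision = len(matched) / len(detections) if detections else 0.0
    recall = len(matched) / len(ground_truth) if ground_truth else 0.0
    correct = sum(1 for key in matched if detections[key] == ground_truth[key])
    accuracy = correct / len(matched) if matched else 0.0
    return precision, recall, accuracy
```

Two systems with identical precision and recall can still differ in the third number, which is the gap an extended framework has to account for.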

A Multi-stage Approach to Curve Extraction

We propose a multi-stage approach to curve extraction where the curve fragment search space is iteratively reduced by removing unlikely candidates using geometric constraints, but without affecting recall, to a point where the application of an objective functional becomes appropriate. The motivation in using multiple stages is to avoid the drawback of using a global functional directly on edges, which can result in non-salient but high scoring curve fragments, which arise from non-uniformly distributed edge evidence. The process progresses in stages from local to global: (i) edges, (ii) curvelets, (iii) unambiguous curve fragments, (iv) resolving ambiguities to generate a full set of curve fragment candidates, (v) merging curve fragments based on learned photometric and geometric cues as well as a novel lateral edge sparsity cue, and (vi) the application of a learned objective functional to get a final selection of curve fragments. The resulting curve fragments are typically visually salient and have been evaluated in two ways. First, we measure the stability of curve fragments when images undergo visual transformations such as change in viewpoints, illumination, and noise, a critical factor for curve fragments to be useful to later visual processes but one often ignored in evaluation. Second, we use a more traditional comparison against human annotation, but using the CFGD dataset and CFGD evaluation strategy rather than the standard BSDS counterpart, which is shown to be not appropriate for evaluating curve fragments. Under both evaluation schemes our results are significantly better than those of state-of-the-art algorithms whose implementations are publicly available.

Yuliang Guo, Naman Kumar, Maruthi Narayanan, Benjamin Kimia

Geometry Driven Semantic Labeling of Indoor Scenes

We present a discriminative graphical model which integrates geometrical information from RGBD images in its unary, pairwise and higher order components. We propose an improved geometry estimation scheme which is robust to erroneous sensor inputs. At the unary level, we combine appearance based beliefs defined on pixels and planes using a hybrid decision fusion scheme. Our proposed location potential gives an improved representation of the planar classes. At the pairwise level, we learn a balanced combination of various boundaries to consider the spatial discontinuity. Finally, we treat planar regions as higher order cliques and use graphcuts to make efficient inference. In our model based formulation, we use structured learning to fine tune the model parameters. We test our approach on two RGBD datasets and demonstrate significant improvements over the state-of-the-art scene labeling techniques.

Salman Hameed Khan, Mohammed Bennamoun, Ferdous Sohel, Roberto Togneri

A Novel Topic-Level Random Walk Framework for Scene Image Co-segmentation

Image co-segmentation is popular for its ability to avoid the need for supervisory data by exploiting the common information in multiple images. In this paper, we aim at a more challenging branch called scene image co-segmentation, which jointly segments multiple images captured from the same scene into regions corresponding to their respective classes. We first put forward a novel representation named Visual Relation Network (VRN) to organize multiple segments, and then search for meaningful segments for every image through voting on the network. Scalable topic-level random walk is then used to solve the voting problem. Experiments on the benchmark MSRC-v2 and on the more difficult LabelMe and SUN datasets show the superiority over the state-of-the-art methods.

Zehuan Yuan, Tong Lu, Palaiahnakote Shivakumara

Surface Matching and Registration by Landmark Curve-Driven Canonical Quasiconformal Mapping

This work presents a novel surface matching and registration method based on the landmark curve-driven canonical surface quasiconformal mapping, where an open genus zero surface decorated with landmark curves is mapped to a canonical domain with horizontal or vertical straight segments and the local shapes are preserved as much as possible. The key idea of the canonical mapping is to minimize the harmonic energy with the landmark curve straightening constraints and generate a quasi-holomorphic 1-form which is zero in one parameter along landmark and results in a quasiconformal mapping. The mapping exists and is unique and intrinsic to surface and landmark geometry. The novel shape representation provides a conformal invariant shape signature. We use it as Teichmüller coordinates to construct a subspace of the conventional Teichmüller space which considers geometry feature details and therefore increases the discriminative ability for matching. We also present a novel and efficient registration method for surfaces with landmark curve constraints by computing an optimal mapping over the canonical domains with straight segments, where the curve constraints become linear forms. Due to the linearity of 1-form and harmonic map, the algorithms are easy to compute, efficient and practical. Experiments on human face and brain surfaces demonstrate the efficiency and efficacy and the potential for broader shape analysis applications.

Wei Zeng, Yi-Jun Yang

Motion Words for Videos

In the task of activity recognition in videos, computing the video representation often involves pooling feature vectors over spatially local neighborhoods. The pooling is done over the entire video, over coarse spatio-temporal pyramids, or over pre-determined rigid cuboids. Similarly to pooling image features over superpixels in images, it is natural to consider pooling spatio-temporal features over video segments, e.g., supervoxels. However, since the number of segments is variable, this produces a video representation of variable size. We propose Motion Words - a new, fixed size video representation, where we pool features over supervoxels. To segment the video into supervoxels, we explore two recent video segmentation algorithms. The proposed representation enables localization of common regions across videos in both space and time. Importantly, since the video segments are meaningful regions, we can interpret the proposed features and obtain a better understanding of why two videos are similar. Evaluation on classification and retrieval tasks on two datasets further shows that Motion Words achieves state-of-the-art performance.

Ekaterina H. Taralova, Fernando De la Torre, Martial Hebert

Activity Group Localization by Modeling the Relations among Participants

Beyond recognizing the actions of individuals, activity group localization aims to determine “who participates in each group” and “what activity the group performs”. In this paper, we propose a latent graphical model to group participants while inferring each group’s activity by exploring the relations among them, thus simultaneously addressing the problems of group localization and activity recognition. Our key insight is to exploit the relational graph among the participants. Specifically, each group is represented as a tree with an activity label while relations among groups are modeled as a fully connected graph. Inference of such a graph is reduced to an extended minimum spanning forest problem, which is cast into a max-margin framework. It therefore avoids the limitation of high-order hierarchical models and can be solved efficiently. Our model is able to provide strong and discriminative contextual cues for activity recognition and to better interpret scene information for localization. Experiments on three datasets demonstrate that our model achieves significant improvements in activity group localization and state-of-the-art performance on activity recognition.

Lei Sun, Haizhou Ai, Shihong Lao
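
The abstract above represents each group as a tree and reduces inference to a spanning forest problem. As a simplified illustration of how a forest partitions participants into groups, the sketch below runs Kruskal's algorithm and refuses edges above a cost threshold. The threshold rule, the edge costs, and the function name are assumptions; the paper's extended formulation and its max-margin learning are not reproduced here.

```python
def spanning_forest(num_nodes, edges, max_cost):
    """Kruskal-style minimum spanning forest.

    `edges` is a list of (cost, u, v). Edges costing more than `max_cost`
    are never used, so the result may contain several trees, one per group.
    Returns the chosen edges and a group id for every node.
    """
    parent = list(range(num_nodes))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    for cost, u, v in sorted(edges):
        if cost > max_cost:
            break  # all remaining edges are at least as expensive
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            chosen.append((cost, u, v))
    groups = [find(x) for x in range(num_nodes)]
    return chosen, groups
```

Participants that end up with the same group id belong to the same tree. In this toy version the costs would come from pairwise cues such as distance between people, which is an assumption for illustration only.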

Finding Coherent Motions and Semantic Regions in Crowd Scenes: A Diffusion and Clustering Approach

This paper addresses the problem of detecting coherent motions in crowd scenes and subsequently constructing semantic regions for activity recognition. We first introduce a coarse-to-fine thermal-diffusion-based approach. It processes input motion fields (e.g., optical flow fields) and produces a coherent motion field, named the thermal energy field. The thermal energy field is able to capture both the motion correlation among particles and the motion trends of individual particles, which are helpful to discover coherency among them. We further introduce a two-step clustering process to construct stable semantic regions from the extracted time-varying coherent motions. Finally, these semantic regions are used to recognize activities in crowded scenes. Experiments on various videos demonstrate the effectiveness of our approach.

Weiyue Wang, Weiyao Lin, Yuanzhe Chen, Jianxin Wu, Jingdong Wang, Bin Sheng

Semantic Aware Video Transcription Using Random Forest Classifiers

This paper focuses on transcription generation in the form of subject, verb, object (SVO) triplets for videos in the wild, given off-the-shelf visual concept detectors. This problem is challenging due to the availability of sentence-only annotations, the unreliability of concept detectors, and the lack of training samples for many words. Facing these challenges, we propose a Semantic Aware Transcription (SAT) framework based on Random Forest classifiers. It takes concept detection results as input, and outputs a distribution of English words. SAT uses (video, sentence) pairs for training. It hierarchically learns node splits by grouping semantically similar words, measured by a continuous skip-gram language model. This not only addresses the sparsity of training samples per word, but also yields semantically reasonable errors during transcription. SAT provides a systematic way to measure the relatedness of a concept detector to real words, which helps us understand the relationship between current visual detectors and words in a semantic space. Experiments on a large video dataset with 1,970 clips and 85,550 sentences are used to demonstrate our idea.

Chen Sun, Ram Nevatia

Ranking Domain-Specific Highlights by Analyzing Edited Videos

We present a fully automatic system for ranking domain-specific highlights in unconstrained personal videos by analyzing online edited videos. A novel latent linear ranking model is proposed to handle noisy training data harvested online. Specifically, given a search query (domain) such as “surfing”, our system mines the YouTube database to find pairs of raw and corresponding edited videos. Leveraging the assumption that edited video is more likely to contain highlights than the trimmed parts of the raw video, we obtain pair-wise ranking constraints to train our model. The learning task is challenging due to the amount of noise and variation in the mined data. Hence, a latent loss function is incorporated to robustly deal with the noise. We efficiently learn the latent model on a large number of videos (about 700 minutes in all) using a novel EM-like self-paced model selection procedure. Our latent ranking model outperforms its classification counterpart, a motion analysis baseline [15], and a fully-supervised ranking system that requires labels from Amazon Mechanical Turk. Finally, we show that impressive highlights can be retrieved without additional human supervision for domains like skating, surfing, skiing, gymnastics, parkour, and dog activity in unconstrained personal videos.

Min Sun, Ali Farhadi, Steve Seitz
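
The pair-wise ranking constraints mentioned in the abstract above can be sketched with a plain linear ranker trained on a hinge loss: each pair says that a segment kept in the edited video should score higher than a trimmed segment. This is an illustrative simplification that omits the latent variables and the self-paced procedure of the paper. The feature vectors, function names, learning rate, and margin are all assumptions.

```python
def score(weights, features):
    """Linear highlight score of one video segment."""
    return sum(w * f for w, f in zip(weights, features))


def pairwise_hinge_loss(weights, pairs, margin=1.0):
    """Average hinge loss over (kept_segment, trimmed_segment) feature pairs."""
    total = 0.0
    for better, worse in pairs:
        gap = score(weights, better) - score(weights, worse)
        total += max(0.0, margin - gap)
    return total / len(pairs)


def sgd_step(weights, pairs, lr=0.1, margin=1.0):
    """One subgradient step using every pair that violates the margin."""
    updated = list(weights)
    for better, worse in pairs:
        if score(weights, better) - score(weights, worse) < margin:
            for d in range(len(updated)):
                updated[d] += lr * (better[d] - worse[d])
    return updated
```

Repeating `sgd_step` pushes kept segments above trimmed ones by at least the margin when the pairs are linearly separable. Noisy pairs, which the abstract says are common in mined data, would keep violating the margin, which is the situation a latent loss is meant to handle.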

A Multi-transformational Model for Background Subtraction with Moving Cameras

We introduce a new approach to perform background subtraction in moving camera scenarios. Unlike previous treatments of the problem, we do not restrict the camera motion or the scene geometry. The proposed approach relies on Bayesian selection of the transformation that best describes the geometric relation between consecutive frames. Based on the selected transformation, we propagate a set of learned background and foreground appearance models using a single homography transform or a series of them. The propagated models are subjected to a MAP-MRF optimization framework that combines motion, appearance, spatial, and temporal cues; the optimization process provides the final background/foreground labels. Extensive experimental evaluation with challenging videos shows that the proposed method outperforms the baseline and state-of-the-art methods in most cases.

Daniya Zamalieva, Alper Yilmaz, James W. Davis

Learning and Inference

Visualizing and Understanding Convolutional Networks

Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark Krizhevsky et al. [18]. However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we explore both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark. We also perform an ablation study to discover the performance contribution from different model layers. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.

Matthew D. Zeiler, Rob Fergus

Part-Based R-CNNs for Fine-Grained Category Detection

Semantic part localization can facilitate fine-grained categorization by explicitly isolating subtle appearance differences associated with specific object parts. Methods for pose-normalized representations have been proposed, but generally presume bounding box annotations at test time due to the difficulty of object detection. We propose a model for fine-grained categorization that overcomes these limitations by leveraging deep convolutional features computed on bottom-up region proposals. Our method learns whole-object and part detectors, enforces learned geometric constraints between them, and predicts a fine-grained category from a pose-normalized representation. Experiments on the Caltech-UCSD bird dataset confirm that our method outperforms state-of-the-art fine-grained categorization methods in an end-to-end evaluation without requiring a bounding box at test time.

Ning Zhang, Jeff Donahue, Ross Girshick, Trevor Darrell

