
About this book

The eight-volume set comprising LNCS volumes 9905-9912 constitutes the refereed proceedings of the 14th European Conference on Computer Vision, ECCV 2016, held in Amsterdam, The Netherlands, in October 2016.
The 415 revised papers presented were carefully reviewed and selected from 1480 submissions. The papers cover all aspects of computer vision and pattern recognition such as 3D computer vision; computational photography, sensing and display; face and gesture; low-level vision and image processing; motion and tracking; optimization methods; physics-based vision, photometry and shape-from-X; recognition: detection, categorization, indexing, matching; segmentation, grouping and shape representation; statistical methods and learning; video: events, activities and surveillance; applications. They are organized in topical sections on detection, recognition and retrieval; scene understanding; optimization; image and video processing; learning; action, activity and tracking; 3D; and 9 poster sessions.

Table of Contents


Poster Session 7 (Continued)


Angry Crowds: Detecting Violent Events in Videos

Approaches inspired by Newtonian mechanics have been successfully applied to detecting abnormal behaviors in crowd scenarios, the most notable example being the Social Force Model (SFM). This class of approaches describes the movements and local interactions among individuals in crowds by means of repulsive and attractive forces. Despite their promising performance, recent socio-psychology studies have shown that current SFM-based methods may not be capable of explaining behaviors in complex crowd scenarios. An alternative approach consists in describing the cognitive processes that give rise to the behavioral patterns observed in crowds using heuristics. Inspired by these studies, we propose a new hybrid framework to detect violent events in crowd videos. More specifically, (i) we define a set of simple behavioral heuristics to describe people's behaviors in crowds, and (ii) we implement these heuristics as physical equations, making it possible to model and classify such behaviors in the videos. The resulting heuristic maps are used to extract video features to distinguish violence from normal events. Our violence detection results set the new state of the art on several standard benchmarks and demonstrate the superiority of our method compared to standard motion descriptors, previous physics-inspired models used for crowd analysis, and a pre-trained ConvNet for crowd behavior analysis.

Sadegh Mohammadi, Alessandro Perina, Hamed Kiani, Vittorio Murino

Sparse Recovery of Hyperspectral Signal from Natural RGB Images

Hyperspectral imaging is an important visual modality with growing interest and range of applications. The latter, however, is hindered by the fact that existing devices are limited in spatial, spectral, and/or temporal resolution, while still being both complicated and expensive. We present a low cost and fast method to recover high quality hyperspectral images directly from RGB. Our approach first leverages hyperspectral prior in order to create a sparse dictionary of hyperspectral signatures and their corresponding RGB projections. Describing novel RGB images via the latter then facilitates reconstruction of the hyperspectral image via the former. A novel, larger-than-ever database of hyperspectral images serves as a hyperspectral prior. This database further allows for evaluation of our methodology at an unprecedented scale, and is provided for the benefit of the research community. Our approach is fast, accurate, and provides high resolution hyperspectral cubes despite using RGB-only input.

Boaz Arad, Ohad Ben-Shahar

Light Field Segmentation Using a Ray-Based Graph Structure

In this paper, we introduce a novel graph representation for interactive light field segmentation using Markov Random Field (MRF). The greatest barrier to the adoption of MRF for light field processing is the large volume of input data. The proposed graph structure exploits the redundancy in the ray space in order to reduce the graph size, decreasing the running time of MRF-based optimisation tasks. Concepts of free rays and ray bundles with corresponding neighbourhood relationships are defined to construct the simplified graph-based light field representation. We then propose a light field interactive segmentation algorithm using graph-cuts based on such ray space graph structure, that guarantees the segmentation consistency across all views. Our experiments with several datasets show results that are very close to the ground truth, competing with state of the art light field segmentation methods in terms of accuracy and with a significantly lower complexity. They also show that our method performs well on both densely and sparsely sampled light fields.

Matthieu Hog, Neus Sabater, Christine Guillemot

Design of Kernels in Convolutional Neural Networks for Image Classification

Despite the effectiveness of convolutional neural networks (CNNs) for image classification, our understanding of the effect of the shape of convolution kernels on learned representations is limited. In this work, we explore and employ the relationship between the shape of kernels which define receptive fields (RFs) in CNNs for learning of feature representations and image classification. For this purpose, we present a feature visualization method for visualization of pixel-wise classification score maps of learned features. Motivated by our experimental results, and observations reported in the literature for modeling of visual systems, we propose a novel design of the shape of kernels for learning of representations in CNNs. In the experimental results, the proposed models also outperform the state-of-the-art methods employed on the CIFAR-10/100 datasets [1] for image classification. We also achieved an outstanding performance in the classification task, compared to a base CNN model that introduces more parameters and computational time, using the ILSVRC-2012 dataset [2]. Additionally, we examined the region of interest (ROI) of different models in the classification task and analyzed the robustness of the proposed method to occluded images. Our results indicate the effectiveness of the proposed approach.

Zhun Sun, Mete Ozay, Takayuki Okatani

Learning Visual Features from Large Weakly Supervised Data

Convolutional networks trained on large supervised datasets produce visual features which form the basis for the state-of-the-art in many computer-vision problems. Further improvements of these visual features will likely require even larger manually labeled data sets, which severely limits the pace at which progress can be made. In this paper, we explore the potential of leveraging massive, weakly-labeled image collections for learning good visual features. We train convolutional networks on a dataset of 100 million Flickr photos and comments, and show that these networks produce features that perform well in a range of vision problems. We also show that the networks appropriately capture word similarity and learn correspondences between different languages.

Armand Joulin, Laurens van der Maaten, Allan Jabri, Nicolas Vasilache

3D Mask Face Anti-spoofing with Remote Photoplethysmography

3D mask spoofing attacks have been one of the main challenges in face recognition. Among existing methods, texture-based approaches show powerful abilities and achieve encouraging results on 3D mask face anti-spoofing. However, these approaches may not be robust enough in application scenarios and could fail to detect imposters with hyper-real masks. In this paper, we propose a novel approach to 3D mask face anti-spoofing from a new perspective, by analysing the heartbeat signal through remote Photoplethysmography (rPPG). We develop a novel local rPPG correlation model to extract discriminative local heartbeat signal patterns so that an imposter can better be detected regardless of the material and quality of the mask. To further exploit the characteristic of rPPG distribution on real faces, we learn a confidence map through heartbeat signal strength to weight the local rPPG correlation pattern for classification. Experiments on both public and self-collected datasets validate that the proposed method achieves promising results under intra- and cross-dataset scenarios.

Siqi Liu, Pong C. Yuen, Shengping Zhang, Guoying Zhao

Guided Matching Based on Statistical Optical Flow for Fast and Robust Correspondence Analysis

In this paper, we present a novel algorithm for reliable and fast feature matching. Inspired by recent efforts in optimizing the matching process using geometric and statistical properties, we developed an approach which constrains the search space by utilizing spatial statistics from a small subset of matched and filtered correspondences. We call this method Guided Matching based on Statistical Optical Flow (GMbSOF). To ensure broad applicability, our approach works on high dimensional descriptors like SIFT but also on binary descriptors like FREAK. To evaluate our algorithm, we developed a novel method for determining ground truth matches, including true negatives, using spatial ground truth information of well known datasets. Therefore, we evaluate not only with precision and recall but also with accuracy and fall-out. We compare our approach in detail to several relevant state-of-the-art algorithms using these metrics. Our experiments show that our method outperforms all other tested solutions in terms of processing time while retaining a comparable level of matching quality.

Josef Maier, Martin Humenberger, Markus Murschitz, Oliver Zendel, Markus Vincze
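
The four evaluation metrics named in the abstract above (precision, recall, accuracy and fall-out) all follow from the counts of true and false positives and negatives. The following is a minimal sketch of those standard definitions, not the authors' evaluation code; the function name `matching_metrics` and the example counts are hypothetical.

```python
def matching_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard confusion-matrix metrics for a set of candidate matches.

    tp/fp: accepted matches that are correct/incorrect.
    tn/fn: rejected matches that are incorrect/correct.
    """
    def ratio(num: int, den: int) -> float:
        # Guard against empty denominators (e.g. no accepted matches).
        return num / den if den else 0.0

    return {
        "precision": ratio(tp, tp + fp),               # correct among accepted
        "recall": ratio(tp, tp + fn),                  # accepted among correct
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "fall_out": ratio(fp, fp + tn),                # false positive rate
    }


# Hypothetical counts, for illustration only.
metrics = matching_metrics(tp=80, fp=20, tn=880, fn=20)
```

Accuracy and fall-out both depend on the true-negative count, which is why the abstract stresses ground truth that includes true negatives.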

Pose Estimation Errors, the Ultimate Diagnosis

This paper proposes a thorough diagnosis for the problem of object detection and pose estimation. We provide a diagnostic tool to examine the impact of the different types of false positives on performance, and the effects of the main object characteristics. We focus our study on the PASCAL 3D+ dataset, developing a complete diagnosis of four different state-of-the-art approaches, which span from hand-crafted models to deep learning solutions. We show that gaining a clear understanding of typical failure cases and the effects of object characteristics on the performance of the models is fundamental in order to facilitate further progress towards more accurate solutions for this challenging task.

Carolina Redondo-Cabrera, Roberto J. López-Sastre, Yu Xiang, Tinne Tuytelaars, Silvio Savarese

A Siamese Long Short-Term Memory Architecture for Human Re-identification

Matching pedestrians across multiple camera views, known as human re-identification, is a challenging problem in visual surveillance. In the existing works concentrating on feature extraction, representations are formed locally and independent of other regions. We present a novel siamese Long Short-Term Memory (LSTM) architecture that can process image regions sequentially and enhance the discriminative capability of local feature representation by leveraging contextual information. The feedback connections and internal gating mechanism of the LSTM cells enable our model to memorize the spatial dependencies and selectively propagate relevant contextual information through the network. We demonstrate improved performance compared to the baseline algorithm with no LSTM units and promising results compared to state-of-the-art methods on Market-1501, CUHK03 and VIPeR datasets. Visualization of the internal mechanism of LSTM cells shows meaningful patterns can be learned by our method.

Rahul Rama Varior, Bing Shuai, Jiwen Lu, Dong Xu, Gang Wang

Integration of Probabilistic Pose Estimates from Multiple Views

We propose an approach to multi-view object detection and pose estimation that considers combinations of single-view estimates. It can be used with most existing single-view pose estimation systems, and can produce improved results even if the individual pose estimates are incoherent. The method is introduced in the context of an existing, probabilistic, view-based detection and pose estimation method (PAPE), which we here extend to incorporate diverse attributes of the scene. We tested the multiview approach with RGB-D cameras in different environments containing several cluttered test scenes and various textured and textureless objects. The results show that the accuracies of object detection and pose estimation increase significantly over single-view PAPE and over other multiple-view integration methods.

Özgür Erkent, Dadhichi Shukla, Justus Piater

SurfCut: Free-Boundary Surface Extraction

We present SurfCut, an algorithm for extracting a smooth simple surface with unknown boundary from a noisy 3D image and a seed point. In contrast to existing approaches that extract smooth simple surfaces with boundary, our method requires less user input, i.e., a seed point, rather than a 3D boundary curve. Our method is built on the novel observation that certain ridge curves of a front propagated using the Fast Marching algorithm are likely to lie on the surface. Using the framework of cubical complexes, we design a novel algorithm to robustly extract such ridge curves and form the surface of interest. Our algorithm automatically cuts these ridge curves to form the surface boundary, and then extracts the surface. Experiments show the robustness of our method to errors in the data, and that we achieve higher accuracy with lower computational cost than comparable methods.

Marei Algarni, Ganesh Sundaramoorthi

CATS: Co-saliency Activated Tracklet Selection for Video Co-localization

Video co-localization is the task of jointly localizing common objects across videos. Due to the appearance variations both across the videos and within the video, it is a challenging problem to identify and track them without any supervision. In contrast to previous joint frameworks that use bounding box proposals to attack the problem, we propose to leverage co-saliency activated tracklets to address the challenge. To identify the common visual object, we first explore inter-video commonness, intra-video commonness, and motion saliency to generate the co-saliency maps. Object proposals of high objectness and co-saliency scores are tracked across short video intervals to build tracklets. The best tube for a video is obtained through tracklet selection from these intervals based on confidence and smoothness between the adjacent tracklets, with the help of dynamic programming. Experimental results on the benchmark YouTube Object dataset show that the proposed method outperforms state-of-the-art methods.

Koteswar Rao Jerripothula, Jianfei Cai, Junsong Yuan
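
The tracklet selection step in the abstract above picks one tracklet per short interval so that confidence and smoothness between adjacent tracklets are jointly maximized with dynamic programming. The following is a rough Viterbi-style sketch of that idea under simplifying assumptions; it is not the authors' implementation. The function `select_tracklets`, the additive score, and the toy `centers` used for smoothness are all hypothetical.

```python
def select_tracklets(confidence, smoothness):
    """Pick one candidate per interval maximizing confidence plus smoothness.

    confidence[t][i]: score of candidate i in interval t.
    smoothness(t, i, j): score for linking candidate i in interval t
        to candidate j in interval t + 1.
    Returns (best total score, list of chosen candidate indices).
    """
    best = list(confidence[0])   # best score of any path ending at each candidate
    back = []                    # back[t - 1][j]: best predecessor of j in interval t
    for t in range(1, len(confidence)):
        new_best, pointers = [], []
        for j, conf in enumerate(confidence[t]):
            scores = [best[i] + smoothness(t - 1, i, j) for i in range(len(best))]
            i_star = max(range(len(scores)), key=scores.__getitem__)
            new_best.append(scores[i_star] + conf)
            pointers.append(i_star)
        best = new_best
        back.append(pointers)

    # Trace the best path backwards from the best final candidate.
    j = max(range(len(best)), key=best.__getitem__)
    total, path = best[j], [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return total, path


# Toy data (hypothetical): two candidates in each of three intervals.
confidence = [[0.9, 0.5], [0.4, 0.8], [0.7, 0.6]]
centers = [[0.0, 5.0], [4.8, 0.2], [0.1, 5.1]]  # 1-D box centers, illustration only


def center_smoothness(t, i, j):
    # Penalize large jumps between adjacent tracklets.
    return -abs(centers[t][i] - centers[t + 1][j])


total, path = select_tracklets(confidence, center_smoothness)
```

With K candidates per interval and T intervals, this runs in O(T K^2) time instead of enumerating all K^T combinations.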

Online Human Action Detection Using Joint Classification-Regression Recurrent Neural Networks

Human action recognition from well-segmented 3D skeleton data has been intensively studied and has been attracting increasing attention. Online action detection goes one step further and is more challenging: it identifies the action type and localizes the action positions on the fly from untrimmed stream data. In this paper, we study the problem of online action detection from streaming skeleton data. We propose a multi-task end-to-end Joint Classification-Regression Recurrent Neural Network to better explore the action type and temporal localization information. By employing a joint classification and regression optimization objective, this network is capable of automatically localizing the start and end points of actions more accurately. Specifically, by leveraging the merits of the deep Long Short-Term Memory (LSTM) subnetwork, the proposed model automatically captures the complex long-range temporal dynamics, which naturally avoids the typical sliding window design and thus ensures high computational efficiency. Furthermore, the subtask of regression optimization provides the ability to forecast the action prior to its occurrence. To evaluate our proposed model, we build a large streaming video dataset with annotations. Experimental results on our dataset and the public G3D dataset both demonstrate very promising performance of our scheme.

Yanghao Li, Cuiling Lan, Junliang Xing, Wenjun Zeng, Chunfeng Yuan, Jiaying Liu

Jensen Bregman LogDet Divergence Optimal Filtering in the Manifold of Positive Definite Matrices

In this paper, we consider the problem of optimal estimation of a time-varying positive definite matrix from a collection of noisy measurements. We assume that this positive definite matrix evolves according to an unknown GARCH (generalized auto-regressive conditional heteroskedasticity) model whose parameters must be estimated from experimental data. The main difficulty here, compared against traditional parameter estimation methods, is that the estimation algorithm should take into account the fact that the matrix evolves on the PD manifold. As we show in the paper, measuring the estimation error using the Jensen Bregman LogDet divergence leads to computationally tractable (and in many cases convex) problems that can be efficiently solved using first order methods. Further, since it is known that this metric provides a good surrogate of the Riemannian manifold metric, the resulting algorithm respects the non-Euclidean geometry of the manifold. In the second part of the paper we show how to exploit this model in a maximum likelihood setup to obtain optimal estimates of the unknown matrix. In this case, the use of the JBLD metric allows for obtaining an alternative representation of Gaussian conjugate priors that results in closed form solutions for the maximum likelihood estimate. In turn, this leads to computationally efficient algorithms that take into account the non-Euclidean geometry. These results are illustrated with several examples using both synthetic and real data.

Yin Wang, Octavia Camps, Mario Sznaier, Biel Roig Solvas
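
The Jensen Bregman LogDet (JBLD) divergence used in the abstract above has a simple closed form for two positive definite matrices X and Y: J(X, Y) = log det((X + Y) / 2) - (1/2) log det(X Y). The following is a small sketch of that formula only, not the filtering algorithm of the paper; the helper names `cholesky_logdet` and `jbld` are hypothetical, and matrices are plain nested lists so that the example needs no external library.

```python
import math


def cholesky_logdet(a):
    """Return log(det(a)) for a symmetric positive definite matrix (nested lists)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]  # lower-triangular Cholesky factor
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    # det(a) is the squared product of the diagonal of the factor.
    return 2.0 * sum(math.log(low[i][i]) for i in range(n))


def jbld(x, y):
    """Jensen Bregman LogDet divergence between two SPD matrices."""
    n = len(x)
    mean = [[0.5 * (x[i][j] + y[i][j]) for j in range(n)] for i in range(n)]
    return cholesky_logdet(mean) - 0.5 * (cholesky_logdet(x) + cholesky_logdet(y))


# Illustrative 2x2 matrices (hypothetical values).
identity = [[1.0, 0.0], [0.0, 1.0]]
scaled = [[4.0, 0.0], [0.0, 4.0]]
divergence = jbld(identity, scaled)
```

The divergence is symmetric in its arguments and zero when both matrices are equal; for the two matrices above it equals log(2.5^2 / 4) = log(1.5625).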

SyB3R: A Realistic Synthetic Benchmark for 3D Reconstruction from Images

Benchmark datasets are the foundation of experimental evaluation in almost all vision problems. In the context of 3D reconstruction these datasets are rather difficult to produce. The field is mainly divided into datasets created from real photos with difficult experimental setups and simple synthetic datasets which are easy to produce, but lack many of the real world characteristics. In this work, we seek to find a middle ground by introducing a framework for the synthetic creation of realistic datasets and their ground truths. We show the benefits of such a purely synthetic approach over real world datasets and discuss its limitations.

Andreas Ley, Ronny Hänsch, Olaf Hellwich

Poster Session 8


When is Rotations Averaging Hard?

Rotations averaging has become a key subproblem in global Structure from Motion methods. Several solvers exist, but they do not have guarantees of correctness. They can produce high-quality results, but also sometimes fail. Our understanding of what makes rotations averaging problems easy or hard is still very limited. To investigate the difficulty of rotations averaging, we perform a local convexity analysis under an $L_2$ cost function. Although a previous result has shown that in general, this problem is locally convex almost nowhere, we show how this negative conclusion can be reversed by considering the gauge ambiguity. Our theoretical analysis reveals the factors that determine local convexity (noise and graph structure) as well as how they interact, which we describe by a particular Laplacian matrix. Our results are useful for predicting the difficulty of problems, and we demonstrate this on practical datasets. Our work forms the basis of a deeper understanding of the key properties of rotations averaging problems, and we discuss how it can inform the design of future solvers for this important problem.

Kyle Wilson, David Bindel, Noah Snavely

Capturing Dynamic Textured Surfaces of Moving Targets

We present an end-to-end system for reconstructing complete watertight and textured models of moving subjects such as clothed humans and animals, using only three or four handheld sensors. The heart of our framework is a new pairwise registration algorithm that minimizes, using a particle swarm strategy, an alignment error metric based on mutual visibility and occlusion. We show that this algorithm reliably registers partial scans with as little as 15% overlap without requiring any initial correspondences, and outperforms alternative global registration algorithms. This registration algorithm allows us to reconstruct moving subjects from free-viewpoint video produced by consumer-grade sensors, without extensive sensor calibration, constrained capture volume, expensive arrays of cameras, or templates of the subject geometry.

Ruizhe Wang, Lingyu Wei, Etienne Vouga, Qixing Huang, Duygu Ceylan, Gérard Medioni, Hao Li

ShapeFit and ShapeKick for Robust, Scalable Structure from Motion

We introduce a new method for location recovery from pairwise directions that leverages an efficient convex program that comes with exact recovery guarantees, even in the presence of adversarial outliers. When pairwise directions represent scaled relative positions between pairs of views (estimated for instance with epipolar geometry) our method can be used for location recovery, that is the determination of relative pose up to a single unknown scale. For this task, our method yields performance comparable to the state-of-the-art with an order of magnitude speed-up. Our proposed numerical framework is flexible in that it accommodates other approaches to location recovery and can be used to speed up other methods. These properties are demonstrated by extensively testing against state-of-the-art methods for location recovery on 13 large, irregular collections of images of real scenes in addition to simulated data with ground truth.

Thomas Goldstein, Paul Hand, Choongbum Lee, Vladislav Voroninski, Stefano Soatto

Heat Diffusion Long-Short Term Memory Learning for 3D Shape Analysis

The heat kernel is a fundamental solution in mathematical physics to distribution measurement of heat energy within a fixed region over time, and due to its unique property of being invariant to isometric transformations, the heat kernel has been an effective feature descriptor for spectral shape analysis. The majority of prior heat kernel-based strategies of building 3D shape representations fail to investigate the temporal dynamics of heat flows on 3D shape surfaces over time. In this work, we address the temporal dynamics of heat flows on 3D shapes using the long-short term memory (LSTM). We guide 3D shape descriptors toward discriminative representations by feeding heat distributions throughout time as inputs to units of heat diffusion LSTM (HD-LSTM) blocks with a supervised learning structure. We further extend HD-LSTM to a cross-domain structure (CDHD-LSTM) for learning domain-invariant representations of multi-view data. We evaluate the effectiveness of both HD-LSTM and CDHD-LSTM on 3D shape retrieval and sketch-based 3D shape retrieval tasks respectively. Experimental results on McGill dataset and SHREC 2014 dataset suggest that both methods can achieve state-of-the-art performance.

Fan Zhu, Jin Xie, Yi Fang

Multi-view 3D Models from Single Images with a Convolutional Network

We present a convolutional network capable of inferring a 3D representation of a previously unseen object given a single image of this object. Concretely, the network can predict an RGB image and a depth map of the object as seen from an arbitrary view. Several of these depth maps fused together give a full point cloud of the object. The point cloud can in turn be transformed into a surface mesh. The network is trained on renderings of synthetic 3D models of cars and chairs. It successfully deals with objects on cluttered background and generates reasonable predictions for real images of cars.

Maxim Tatarchenko, Alexey Dosovitskiy, Thomas Brox

Extending Long Short-Term Memory for Multi-View Structured Learning

Long Short-Term Memory (LSTM) networks have been successfully applied to a number of sequence learning problems but they lack the design flexibility to model multiple view interactions, limiting their ability to exploit multi-view relationships. In this paper, we propose a Multi-View LSTM (MV-LSTM), which explicitly models the view-specific and cross-view interactions over time or structured outputs. We evaluate the MV-LSTM model on four publicly available datasets spanning two very different structured learning problems: multimodal behaviour recognition and image captioning. The experimental results show competitive performance on all four datasets when compared with state-of-the-art models.

Shyam Sundar Rajagopalan, Louis-Philippe Morency, Tadas Baltrušaitis, Roland Goecke

Gated Bi-directional CNN for Object Detection

The visual cues from multiple support regions of different sizes and resolutions are complementary in classifying a candidate box in object detection. How to effectively integrate local and contextual visual cues from these regions has become a fundamental problem in object detection. Most existing works simply concatenate features or scores obtained from support regions. In this paper, we propose a novel gated bi-directional CNN (GBD-Net) to pass messages between features from different support regions during both feature learning and feature extraction. Such message passing can be implemented through convolution in two directions and can be conducted in various layers. Therefore, local and contextual visual patterns can validate the existence of each other by learning their nonlinear relationships, and their close interactions are modeled in a much more complex way. It is also shown that message passing is not always helpful, depending on individual samples. Gated functions are further introduced to control message transmission, and their on-and-off state is controlled by extra visual evidence from the input sample. GBD-Net is implemented under the Fast RCNN detection framework. Its effectiveness is shown through experiments on three object detection datasets, ImageNet, Pascal VOC2007 and Microsoft COCO.

Xingyu Zeng, Wanli Ouyang, Bin Yang, Junjie Yan, Xiaogang Wang

Graph Based Skeleton Motion Representation and Similarity Measurement for Action Recognition

Most existing skeleton-based representations for action recognition cannot effectively capture the spatio-temporal motion characteristics of joints and are not robust enough to noise from depth sensors and estimation errors of joints. In this paper, we propose a novel low-level representation for the motion of each joint through tracking its trajectory and segmenting it into several semantic parts called motionlets. During this process, the disturbance of noise is reduced by trajectory fitting, sampling and segmentation. Then we construct an undirected complete labeled graph to represent a video by combining these motionlets and their spatio-temporal correlations. Furthermore, a new graph kernel called subgraph-pattern graph kernel (SPGK) is proposed to measure the similarity between graphs. Finally, the SPGK is directly used as the kernel of SVM to classify videos. In order to evaluate our method, we perform a series of experiments on several public datasets and our approach achieves a comparable performance to the state-of-the-art approaches.

Pei Wang, Chunfeng Yuan, Weiming Hu, Bing Li, Yanning Zhang

Reliable Fusion of ToF and Stereo Depth Driven by Confidence Measures

In this paper we propose a framework for the fusion of depth data produced by a Time-of-Flight (ToF) camera and a stereo vision system. Initially, depth data acquired by the ToF camera are upsampled by an ad-hoc algorithm based on image segmentation and bilateral filtering. In parallel, a dense disparity map is obtained using the Semi-Global Matching stereo algorithm. Reliable confidence measures are extracted for both the ToF and stereo depth data. In particular, ToF confidence also accounts for the mixed-pixel effect and the stereo confidence accounts for the relationship between the pointwise matching costs and the cost obtained by the semi-global optimization. Finally, the two depth maps are synergically fused by enforcing the local consistency of depth data accounting for the confidence of the two data sources at each location. Experimental results clearly show that the proposed method produces accurate high resolution depth maps and outperforms the compared fusion algorithms.

Giulio Marin, Pietro Zanuttigh, Stefano Mattoccia

Fast, Exact and Multi-scale Inference for Semantic Image Segmentation with Deep Gaussian CRFs

In this work we propose a structured prediction technique that combines the virtues of Gaussian Conditional Random Fields (G-CRF) with Deep Learning: (a) our structured prediction task has a unique global optimum that is obtained exactly from the solution of a linear system, (b) the gradients of our model parameters are analytically computed using closed form expressions, in contrast to the memory-demanding contemporary deep structured prediction approaches [1, 2] that rely on back-propagation-through-time, (c) our pairwise terms do not have to be simple hand-crafted expressions, as in the line of works building on the DenseCRF [1, 3], but can rather be ‘discovered’ from data through deep architectures, and (d) our system can be trained in an end-to-end manner. Building on standard tools from numerical analysis we develop very efficient algorithms for inference and learning, as well as a customized technique adapted to the semantic segmentation task. This efficiency allows us to explore more sophisticated architectures for structured prediction in deep learning: we introduce multi-resolution architectures to couple information across scales in a joint optimization framework, yielding systematic improvements. We demonstrate the utility of our approach on the challenging VOC PASCAL 2012 image segmentation benchmark, showing substantial improvements over strong baselines. We make all of our code and experiments available at

Siddhartha Chandra, Iasonas Kokkinos
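
Point (a) of the abstract above rests on a general property of Gaussian models: a quadratic energy E(x) = (1/2) x^T A x - b^T x with a symmetric positive definite A has a unique minimizer, the solution of the linear system A x = b. The following is a toy sketch of that property using a plain conjugate gradient solver. The energy form, the choice of solver and the 3-node chain example are assumptions made for illustration, not details taken from the paper.

```python
def conjugate_gradient(a, b, tol=1e-10, max_iter=100):
    """Solve a x = b for a symmetric positive definite matrix a (nested lists)."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                      # residual b - a x for the initial x = 0
    p = list(r)                      # search direction
    rs_old = sum(v * v for v in r)
    for _ in range(max_iter):
        if rs_old < tol * tol:       # residual norm below tolerance
            break
        ap = [sum(a[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs_old / sum(p[i] * ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        p = [r[i] + (rs_new / rs_old) * p[i] for i in range(n)]
        rs_old = rs_new
    return x


# Hypothetical 3-node chain: A couples neighbouring nodes, b holds unary scores.
A = [[2.0, -1.0, 0.0],
     [-1.0, 2.0, -1.0],
     [0.0, -1.0, 2.0]]
b = [1.0, 0.0, 1.0]
x_star = conjugate_gradient(A, b)    # minimizer of 0.5 * x^T A x - b^T x
```

Because the energy is strictly convex, no iterative approximation of a non-convex objective is needed; the only cost is solving one linear system per prediction.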

Kernel-Based Supervised Discrete Hashing for Image Retrieval

Recently hashing has become an important tool to tackle the problem of large-scale nearest neighbor searching in computer vision. However, learning discrete hashing codes is a very challenging task due to the NP-hard optimization problem. In this paper, we propose a novel yet simple kernel-based supervised discrete hashing method via an asymmetric relaxation strategy. Specifically, we present an optimization model that preserves the hashing function and the relaxed linear function simultaneously to reduce the accumulated quantization error between hashing and linear functions. Furthermore, we improve the hashing model by relaxing the hashing function into a general binary code matrix and introducing an additional regularization term. Then we solve these two optimization models via an alternating strategy, which can effectively and stably preserve the similarity of neighbors in a low-dimensional Hamming space. The proposed hashing method can produce informative short binary codes that require less storage volume and lower optimization time cost. Extensive experiments on multiple benchmark databases demonstrate the effectiveness of the proposed hashing method with short binary codes and its superior performance over the state of the art.

Xiaoshuang Shi, Fuyong Xing, Jinzheng Cai, Zizhao Zhang, Yuanpu Xie, Lin Yang

Iterative Reference Driven Metric Learning for Signer Independent Isolated Sign Language Recognition

Sign language recognition (SLR) is an interesting but difficult problem. One of the biggest challenges comes from the complex inter-signer variations. To address this problem, the basic idea in this paper is to learn a generic model which is robust to different signers. This generic model contains a group of sign references and a corresponding distance metric. The references are constructed by signer invariant representations of each sign class. Motivated by the fact that the probe samples should have high similarities with their own class references, we aim to learn a distance metric which pulls the samples and their true sign classes (references) closer and pushes the samples away from the false sign classes (references). Therefore, given a group of references, a distance metric can be exploited with our proposed Reference Driven Metric Learning (RDML). In a further step, to obtain more appropriate references, the references and the distance metric are updated alternately with iterative RDML (iRDML). The effectiveness and efficiency of the proposed method are evaluated extensively on several public databases for both SLR and human motion recognition tasks.

Fang Yin, Xiujuan Chai, Xilin Chen

Ask, Attend and Answer: Exploring Question-Guided Spatial Attention for Visual Question Answering

We address the problem of Visual Question Answering (VQA), which requires joint image and language understanding to answer a question about a given photograph. Recent approaches have applied deep image captioning methods based on convolutional-recurrent networks to this problem, but have failed to model spatial inference. To remedy this, we propose a model we call the Spatial Memory Network and apply it to the VQA task. Memory networks are recurrent neural networks with an explicit attention mechanism that selects certain parts of the information stored in memory. Our Spatial Memory Network stores neuron activations from different spatial regions of the image in its memory, and uses attention to choose regions relevant for computing the answer. We propose a novel question-guided spatial attention architecture that looks for regions relevant to either individual words or the entire question, repeating the process over multiple recurrent steps, or “hops”. To better understand the inference process learned by the network, we design synthetic questions that specifically require spatial inference and visualize the network’s attention. We evaluate our model on two available visual question answering datasets and obtain improved results.

Huijuan Xu, Kate Saenko
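
The attention step described in the abstract above (choosing image regions relevant to the question) can be illustrated with a generic soft-attention hop. This is only a sketch under stated assumptions: the dot-product scoring, the function names `softmax` and `attend`, and the toy vectors are all illustrative and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(region_features, question_vector):
    """One attention hop: weight each region by its similarity to the question."""
    # Assumed scoring rule: dot product between region and question embeddings.
    scores = [sum(r * q for r, q in zip(region, question_vector))
              for region in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    # Attention-weighted sum of the region features.
    attended = [sum(w * region[d] for w, region in zip(weights, region_features))
                for d in range(dim)]
    return weights, attended


# Two toy regions and a question embedding aligned with the first region.
weights, attended = attend([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0])
print(weights, attended)
```

In a multi-hop setting, the attended vector from one hop would be combined with the question representation before the next hop; how that combination is done is not specified in the abstract.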

Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks

Learning deeper convolutional neural networks has become a trend in recent years. However, much empirical evidence suggests that performance improvement cannot be attained by simply stacking more layers. In this paper, we consider the issue from an information theoretical perspective, and propose a novel method, Relay Backpropagation, which encourages the propagation of effective information through the network during the training stage. By virtue of the method, we achieved first place in the ILSVRC 2015 Scene Classification Challenge. Extensive experiments on two large scale challenging datasets demonstrate that the effectiveness of our method is not restricted to a specific dataset or network architecture.

Li Shen, Zhouchen Lin, Qingming Huang

Counting in the Wild

In this paper we explore the scenario of learning to count multiple instances of objects from images that have been dot-annotated through crowdsourcing. Specifically, we work with a large and challenging image dataset of penguins in the wild, for which tens of thousands of volunteer annotators have placed dots on instances of penguins in tens of thousands of images. The dataset, introduced and released with this paper, shows such a high degree of object occlusion and scale variation that individual object detection or simple counting-density estimation is not able to estimate the bird counts reliably. To address the challenging counting task, we augment and interleave density estimation with foreground-background segmentation and explicit local uncertainty estimation. The three tasks are solved jointly by a new deep multi-task architecture. Using this multi-task learning, we show that the spread between the annotators can provide hints about local object scale and aid the foreground-background segmentation, which can then be used to set a better target density for learning density prediction. Considerable improvements in counting accuracy over a single-task density estimation approach are observed in our experiments.

Carlos Arteta, Victor Lempitsky, Andrew Zisserman

A Discriminative Feature Learning Approach for Deep Face Recognition

Convolutional neural networks (CNNs) have been widely used in the computer vision community, significantly improving the state-of-the-art. In most of the available CNNs, the softmax loss function is used as the supervision signal to train the deep model. In order to enhance the discriminative power of the deeply learned features, this paper proposes a new supervision signal, called center loss, for the face recognition task. Specifically, the center loss simultaneously learns a center for deep features of each class and penalizes the distances between the deep features and their corresponding class centers. More importantly, we prove that the proposed center loss function is trainable and easy to optimize in the CNNs. With the joint supervision of softmax loss and center loss, we can train robust CNNs to obtain the deep features with the two key learning objectives, inter-class dispersion and intra-class compactness as much as possible, which are very essential to face recognition. It is encouraging to see that our CNNs (with such joint supervision) achieve the state-of-the-art accuracy on several important face recognition benchmarks, Labeled Faces in the Wild (LFW), YouTube Faces (YTF), and MegaFace Challenge. Especially, our new approach achieves the best results on MegaFace (the largest public domain face benchmark) under the protocol of small training set (contains under 500000 images and under 20000 persons), significantly improving the previous results and setting new state-of-the-art for both face recognition and face verification tasks.

Yandong Wen, Kaipeng Zhang, Zhifeng Li, Yu Qiao
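
The abstract above describes center loss as a penalty on the distance between each deep feature and the center of its class. A minimal sketch of that penalty, assuming squared Euclidean distance with a factor of one half, plain Python lists as features, and a hypothetical function name `center_loss`:

```python
def center_loss(features, labels, centers):
    """Half the sum of squared distances between features and their class centers.

    features: list of feature vectors (lists of floats)
    labels:   list of class ids, one per feature
    centers:  dict mapping class id -> center vector
    """
    total = 0.0
    for x, y in zip(features, labels):
        c = centers[y]
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return 0.5 * total


# Toy example (illustrative values): two samples, two classes, centers at the origin.
features = [[1.0, 0.0], [0.0, 2.0]]
labels = [0, 1]
centers = {0: [0.0, 0.0], 1: [0.0, 0.0]}
print(center_loss(features, labels, centers))
```

In the joint supervision mentioned in the abstract, a term like this would be added to the softmax loss with a weighting factor; the weighting and the way the centers are updated during training are not given in the abstract and are left out here.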

Network of Experts for Large-Scale Image Categorization

We present a tree-structured network architecture for large-scale image classification. The trunk of the network contains convolutional layers optimized over all classes. At a given depth, the trunk splits into separate branches, each dedicated to discriminate a different subset of classes. Each branch acts as an expert classifying a set of categories that are difficult to tell apart, while the trunk provides common knowledge to all experts in the form of shared features. The training of our “network of experts” is completely end-to-end: the partition of categories into disjoint subsets is learned simultaneously with the parameters of the network trunk and the experts are trained jointly by minimizing a single learning objective over all classes. The proposed structure can be built from any existing convolutional neural network (CNN). We demonstrate its generality by adapting 4 popular CNNs for image categorization into the form of networks of experts. Our experiments on CIFAR100 and ImageNet show that in every case our method yields a substantial improvement in accuracy over the base CNN, and gives the best result achieved so far on CIFAR100. Finally, the improvement in accuracy comes at little additional cost: compared to the base network, the training time is only moderately increased and the number of parameters is comparable or in some cases even lower.

Karim Ahmed, Mohammad Haris Baig, Lorenzo Torresani

Zero-Shot Recognition via Structured Prediction

We develop a novel method for zero shot learning (ZSL) based on test-time adaptation of similarity functions learned using training data. Existing methods exclusively employ source-domain side information for recognizing unseen classes during test time. We show that for batch-mode applications, accuracy can be significantly improved by adapting these predictors to the observed test-time target-domain ensemble. We develop a novel structured prediction method for maximum a posteriori (MAP) estimation, where parameters account for test-time domain shift from what is predicted primarily using source domain information. We propose a Gaussian parameterization for the MAP problem and derive an efficient structure prediction algorithm. Empirically we test our method on four popular benchmark image datasets for ZSL, and show significant improvement over the state-of-the-art, on average, by 11.50 % and 30.12 % in terms of accuracy for recognition and mean average precision (mAP) for retrieval, respectively.

Ziming Zhang, Venkatesh Saligrama

What’s the Point: Semantic Segmentation with Point Supervision

The semantic image segmentation task presents a trade-off between test time accuracy and training time annotation cost. Detailed per-pixel annotations enable training accurate models but are very time-consuming to obtain; image-level class labels are an order of magnitude cheaper but result in less accurate models. We take a natural step from image-level annotation towards stronger supervision: we ask annotators to point to an object if one exists. We incorporate this point supervision along with a novel objectness potential in the training loss function of a CNN model. Experimental results on the PASCAL VOC 2012 benchmark reveal that the combined effect of point-level supervision and objectness potential yields an improvement of 12.9 % mIOU over image-level supervision. Further, we demonstrate that models trained with point-level supervision are more accurate than models trained with image-level, squiggle-level or full supervision given a fixed annotation budget.

Amy Bearman, Olga Russakovsky, Vittorio Ferrari, Li Fei-Fei
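
The mIOU figure quoted in the abstract above is a mean intersection-over-union score. As a rough illustration of how such a score is commonly computed (the exact evaluation protocol of the benchmark is not described in the abstract), here is a sketch over flattened per-pixel label lists. The function name `mean_iou` and the choice to skip classes absent from both prediction and ground truth are assumptions.

```python
def mean_iou(pred, target, num_classes):
    """Mean over classes of |pred == c AND target == c| / |pred == c OR target == c|."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes that appear in neither map (assumption)
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy 4-pixel example (illustrative values).
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
print(mean_iou(pred, target, num_classes=2))
```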

A Generalized Successive Shortest Paths Solver for Tracking Dividing Targets

Tracking-by-detection methods are prevailing in many tracking scenarios. One attractive property is that in the absence of additional constraints they can be solved optimally in polynomial time, e.g. by min-cost flow solvers. But when potentially dividing targets need to be tracked – as is the case for biological tasks like cell tracking – finding the solution to a global tracking-by-detection model is NP-hard. In this work, we present a flow-based approximate solution to a common cell tracking model that allows for objects to merge and split or divide. We build on the successive shortest path min-cost flow algorithm but alter the residual graph such that the flow through the graph obeys division constraints and always represents a feasible tracking solution. By conditioning the residual arc capacities on the flow along logically associated arcs we obtain a polynomial time heuristic that achieves close-to-optimal tracking results while exhibiting a good anytime performance. We also show that our method is a generalization of an approximate dynamic programming cell tracking solver by Magnusson et al. that stood out in the ISBI Cell Tracking Challenges.

Carsten Haubold, Janez Aleš, Steffen Wolf, Fred A. Hamprecht

Accurate and Linear Time Pose Estimation from Points and Lines

The Perspective-n-Point (PnP) problem seeks to estimate the pose of a calibrated camera from n 3D-to-2D point correspondences. There are situations, though, where PnP solutions are prone to fail because feature point correspondences cannot be reliably estimated (e.g. scenes with repetitive patterns or with low texture). In such scenarios, one can still exploit alternative geometric entities, such as lines, yielding the so-called Perspective-n-Line (PnL) algorithms. Unfortunately, existing PnL solutions are not as accurate and efficient as their point-based counterparts. In this paper we propose a novel approach to introduce 3D-to-2D line correspondences into a PnP formulation, allowing to simultaneously process points and lines. For this purpose we introduce an algebraic line error that can be formulated as linear constraints on the line endpoints, even when these are not directly observable. These constraints can then be naturally integrated within the linear formulations of two state-of-the-art point-based algorithms, the OPnP and the EPnP, allowing them to indistinctly handle points, lines, or a combination of them. Exhaustive experiments show that the proposed formulation brings remarkable boost in performance compared to only point or only line based solutions, with a negligible computational overhead compared to the original OPnP and EPnP.

Alexander Vakhitov, Jan Funke, Francesc Moreno-Noguer

Pseudo-geometric Formulation for Fitting Equidistant Parallel Lines

We present a novel pseudo-geometric formulation for equidistant parallel lines which allows direct linear evaluation against fitted lines in the image space thus providing improved robustness of fit and avoids the need for non-linear optimization. The key idea of our work is to determine an equidistant set of parallel lines which are at minimum orthogonal distance from the edge lines in the image. The comparative results on simulated and real datasets show that a linear solution using the pseudo-geometric formulation is superior to the previous algebraic solution and performs close to the non-linear solution of the true geometric error.

Faisal Azhar, Stephen Pollard

Towards Perspective-Free Object Counting with Deep Learning

In this paper we address the problem of counting object instances in images. Our models are able to precisely estimate the number of vehicles in a traffic congestion, or to count the humans in a very crowded scene. Our first contribution is the proposal of a novel convolutional neural network solution, named Counting CNN (CCNN). Essentially, the CCNN is formulated as a regression model where the network learns how to map the appearance of the image patches to their corresponding object density maps. Our second contribution consists of a scale-aware counting model, the Hydra CNN, able to estimate object densities in different very crowded scenarios where no geometric information of the scene can be provided. Hydra CNN learns a multiscale non-linear regression model which uses a pyramid of image patches extracted at multiple scales to perform the final density prediction. We report an extensive experimental evaluation, using up to three different object counting benchmarks, where we show how our solutions achieve state-of-the-art performance.

Daniel Oñoro-Rubio, Roberto J. López-Sastre
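
The abstract above frames counting as regression to object density maps. A minimal sketch of the target side of that idea, assuming the common convention of placing one normalized Gaussian per dot annotation so that the map sums to the object count. The kernel width `sigma`, the function names, and the toy coordinates are illustrative assumptions, not details from the paper.

```python
import math


def density_map(dots, height, width, sigma=1.5):
    """Build a density map in which each annotated dot contributes unit mass."""
    density = [[0.0] * width for _ in range(height)]
    for (r0, c0) in dots:
        kernel = [[math.exp(-((r - r0) ** 2 + (c - c0) ** 2) / (2 * sigma ** 2))
                   for c in range(width)] for r in range(height)]
        # Normalize inside the image so that border dots still integrate to 1.
        norm = sum(map(sum, kernel))
        for r in range(height):
            for c in range(width):
                density[r][c] += kernel[r][c] / norm
    return density


def count_from_density(density):
    """The estimated object count is the sum of the density map."""
    return sum(map(sum, density))


# Three toy dot annotations, given as (row, column), on a 10 x 10 image.
dots = [(2, 3), (5, 5), (7, 1)]
print(round(count_from_density(density_map(dots, 10, 10)), 6))
```

A regression network trained on such targets predicts a density map for a new image or patch, and summing that prediction gives the count without detecting individual objects.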

Information Bottleneck Domain Adaptation with Privileged Information for Visual Recognition

We address the unsupervised domain adaptation problem for visual recognition when an auxiliary data view is available during training. This is important because it allows improving the training of visual classifiers on a new target visual domain when paired additional source data is cheaply available. This is the case when we learn from a source of RGB plus depth data and then test on a new RGB domain. The problem is challenging because of the intrinsic asymmetry caused by the missing auxiliary view during testing, from which discriminative information should be carried over to the new domain. We jointly account for the auxiliary view during training and for the domain shift by extending the information bottleneck method, and by combining it with risk minimization. In this way, we establish an information theoretic principle for learning any type of visual classifier under this particular setting. We use this principle to design a multi-class large-margin classifier with an efficient optimization in the primal space. We extensively compare our method with the state-of-the-art on several datasets, by effectively learning from RGB plus depth data to recognize objects and gender from a new RGB domain.

Saeid Motiian, Gianfranco Doretto

Template-Free 3D Reconstruction of Poorly-Textured Nonrigid Surfaces

Two main classes of approaches have been studied to perform monocular nonrigid 3D reconstruction: Template-based methods and Non-rigid Structure from Motion techniques. While the first ones have been applied to reconstruct poorly-textured surfaces, they assume the availability of a 3D shape model prior to reconstruction. By contrast, the second ones do not require such a shape template, but, instead, rely on points being tracked throughout a video sequence, and are thus ill-suited to handle poorly-textured surfaces. In this paper, we introduce a template-free approach to reconstructing a poorly-textured, deformable surface. To this end, we leverage surface isometry and formulate 3D reconstruction as the joint problem of non-rigid image registration and depth estimation. Our experiments demonstrate that our approach yields much more accurate 3D reconstructions than state-of-the-art techniques.

Xuan Wang, Mathieu Salzmann, Fei Wang, Jizhong Zhao

FigureSeer: Parsing Result-Figures in Research Papers

‘Which are the pedestrian detectors that yield a precision above 95 % at 25 % recall?’ Answering such a complex query involves identifying and analyzing the results reported in figures within several research papers. Despite the availability of excellent academic search engines, retrieving such information poses a cumbersome challenge today as these systems have primarily focused on understanding the text content of scholarly documents. In this paper, we introduce FigureSeer, an end-to-end framework for parsing result-figures, that enables powerful search and retrieval of results in research papers. Our proposed approach automatically localizes figures from research papers, classifies them, and analyses the content of the result-figures. The key challenge in analyzing the figure content is the extraction of the plotted data and its association with the legend entries. We address this challenge by formulating a novel graph-based reasoning approach using a CNN-based similarity metric. We present a thorough evaluation on a real-world annotated dataset to demonstrate the efficacy of our approach.

Noah Siegel, Zachary Horvitz, Roie Levin, Santosh Divvala, Ali Farhadi

Approximate Search with Quantized Sparse Representations

This paper tackles the task of storing a large collection of vectors, such as visual descriptors, and of searching in it. To this end, we propose to approximate database vectors by constrained sparse coding, where possible atom weights are restricted to belong to a finite subset. This formulation encompasses, as particular cases, previous state-of-the-art methods such as product or residual quantization. As opposed to traditional sparse coding methods, quantized sparse coding includes memory usage as a design constraint, thereby allowing us to index a large collection such as the BIGANN billion-sized benchmark. Our experiments, carried out on standard benchmarks, show that our formulation leads to competitive solutions when considering different trade-offs between learning/coding time, index size and search quality.

Himalaya Jain, Patrick Pérez, Rémi Gribonval, Joaquin Zepeda, Hervé Jégou

Sympathy for the Details: Dense Trajectories and Hybrid Classification Architectures for Action Recognition

Action recognition in videos is a challenging task due to the complexity of the spatio-temporal patterns to model and the difficulty to acquire and learn on large quantities of video data. Deep learning, although a breakthrough for image classification and showing promise for videos, has still not clearly superseded action recognition methods using hand-crafted features, even when training on massive datasets. In this paper, we introduce hybrid video classification architectures based on carefully designed unsupervised representations of hand-crafted spatio-temporal features classified by supervised deep networks. As we show in our experiments on five popular benchmarks for action recognition, our hybrid model combines the best of both worlds: it is data efficient (trained on 150 to 10000 short clips) and yet improves significantly on the state of the art, including recent deep models trained on millions of manually labelled images and videos.

César Roberto de Souza, Adrien Gaidon, Eleonora Vig, Antonio Manuel López

Human Pose Estimation via Convolutional Part Heatmap Regression

This paper is on human pose estimation using Convolutional Neural Networks. Our main contribution is a CNN cascaded architecture specifically designed for learning part relationships and spatial context, and robustly inferring pose even for the case of severe part occlusions. To this end, we propose a detection-followed-by-regression CNN cascade. The first part of our cascade outputs part detection heatmaps and the second part performs regression on these heatmaps. The benefits of the proposed architecture are multi-fold: It guides the network where to focus in the image and effectively encodes part constraints and context. More importantly, it can effectively cope with occlusions because part detection heatmaps for occluded parts provide low confidence scores which subsequently guide the regression part of our network to rely on contextual information in order to predict the location of these parts. Additionally, we show that the proposed cascade is flexible enough to readily allow the integration of various CNN architectures for both detection and regression, including recent ones based on residual learning. Finally, we illustrate that our cascade achieves top performance on the MPII and LSP data sets. Code can be downloaded from

Adrian Bulat, Georgios Tzimiropoulos

Collaborative Layer-Wise Discriminative Learning in Deep Neural Networks

Intermediate features at different layers of a deep neural network are known to be discriminative for visual patterns of different complexities. However, most existing works ignore such cross-layer heterogeneities when classifying samples of different complexities. For example, if a training sample has already been correctly classified at a specific layer with high confidence, we argue that it is unnecessary to force the remaining layers to classify this sample correctly and a better strategy is to encourage those layers to focus on other samples. In this paper, we propose a layer-wise discriminative learning method to enhance the discriminative capability of a deep network by allowing its layers to work collaboratively for classification. Towards this target, we introduce multiple classifiers on top of multiple layers. Each classifier not only tries to correctly classify the features from its input layer, but also coordinates with other classifiers to jointly maximize the final classification performance. Guided by the other companion classifiers, each classifier learns to concentrate on certain training examples and boosts the overall performance. Allowing for end-to-end training, our method can be conveniently embedded into state-of-the-art deep networks. Experiments with multiple popular deep networks, including Network in Network, GoogLeNet and VGGNet, on scale-various object classification benchmarks, including CIFAR100, MNIST and ImageNet, and scene classification benchmarks, including MIT67, SUN397 and Places205, demonstrate the effectiveness of our method. In addition, we also analyze the relationship between the proposed method and classical conditional random fields models.

Xiaojie Jin, Yunpeng Chen, Jian Dong, Jiashi Feng, Shuicheng Yan

Deep Decoupling of Defocus and Motion Blur for Dynamic Segmentation

We address the challenging problem of segmenting dynamic objects given a single space-variantly blurred image of a 3D scene captured using a hand-held camera. The blur induced at a particular pixel on a moving object is due to the combined effects of camera motion, the object’s own independent motion during exposure, its relative depth in the scene, and defocusing due to lens settings. We develop a deep convolutional neural network (CNN) to predict the probabilistic distribution of the composite kernel which is the convolution of motion blur and defocus kernels at each pixel. Based on the defocus component, we segment the image into different depth layers. We then judiciously exploit the motion component present in the composite kernels to automatically segment dynamic objects at each depth layer. Jointly handling defocus and motion blur enables us to resolve depth-motion ambiguity which has been a major limitation of the existing segmentation algorithms. Experimental evaluations on synthetic and real data reveal that our method significantly outperforms contemporary techniques.

Abhijith Punnappurath, Yogesh Balaji, Mahesh Mohan, Ambasamudram Narayanan Rajagopalan

Video Summarization with Long Short-Term Memory

We propose a novel supervised learning technique for summarizing videos by automatically selecting keyframes or key subshots. Casting the task as a structured prediction problem, our main idea is to use Long Short-Term Memory (LSTM) to model the variable-range temporal dependency among video frames, so as to derive both representative and compact video summaries. The proposed model successfully accounts for the sequential structure crucial to generating meaningful video summaries, leading to state-of-the-art results on two benchmark datasets. In addition to advances in modeling techniques, we introduce a strategy to address the need for a large amount of annotated data for training complex learning approaches to summarization. There, our main idea is to exploit auxiliary annotated video summarization datasets, in spite of their heterogeneity in visual styles and contents. Specifically, we show that domain adaptation techniques can improve learning by reducing the discrepancies in the original datasets’ statistical properties.

Ke Zhang, Wei-Lun Chao, Fei Sha, Kristen Grauman

Leaving Some Stones Unturned: Dynamic Feature Prioritization for Activity Detection in Streaming Video

Current approaches for activity recognition often ignore constraints on computational resources: (1) they rely on extensive feature computation to obtain rich descriptors on all frames, and (2) they assume batch-mode access to the entire test video at once. We propose a new active approach to activity recognition that prioritizes “what to compute when” in order to make timely predictions. The main idea is to learn a policy that dynamically schedules the sequence of features to compute on selected frames of a given test video. In contrast to traditional static feature selection, our approach continually re-prioritizes computation based on the accumulated history of observations and accounts for the transience of those observations in ongoing video. We develop variants to handle both the batch and streaming settings. On two challenging datasets, our method provides significantly better accuracy than alternative techniques for a wide range of computational budgets.

Yu-Chuan Su, Kristen Grauman

Robust and Accurate Line- and/or Point-Based Pose Estimation without Manhattan Assumptions

Usual Structure from Motion techniques based on feature points have a hard time on scenes with little texture or presenting a single plane, as in indoor environments. Line segments are more robust features in this case. We propose a novel geometrical criterion for two-view pose estimation using lines, that does not assume a Manhattan world. We also define a parameterless (a contrario) RANSAC-like method to discard calibration outliers and provide more robust pose estimations, possibly using points as well when available. Finally, we provide quantitative experimental data that illustrate failure cases of other methods and that show how our approach outperforms them, both in robustness and precision.

Yohann Salaün, Renaud Marlet, Pascal Monasse

MARLow: A Joint Multiplanar Autoregressive and Low-Rank Approach for Image Completion

In this paper, we propose a novel multiplanar autoregressive (AR) model to exploit the correlation in cross-dimensional planes of a similar patch group collected in an image, which has long been neglected by previous AR models. On that basis, we then present a joint multiplanar AR and low-rank based approach (MARLow) for image completion from random sampling, which exploits the nonlocal self-similarity within natural images more effectively. Specifically, the multiplanar AR model constrains the local stationarity in different cross-sections of the patch group, while the low-rank minimization captures the intrinsic coherence of nonlocal patches. The proposed approach can be readily extended to multichannel images (e.g. color images), by simultaneously considering the correlation in different channels. Experimental results demonstrate that the proposed approach significantly outperforms state-of-the-art methods, even if the pixel missing rate is as high as 90 %.

Mading Li, Jiaying Liu, Zhiwei Xiong, Xiaoyan Sun, Zongming Guo

An Uncertain Future: Forecasting from Static Images Using Variational Autoencoders

In a given scene, humans can easily predict a set of immediate future events that might happen. However, pixel-level anticipation in computer vision is difficult because machine learning struggles with the ambiguity in predicting the future. In this paper, we focus on predicting the dense trajectory of pixels in a scene—what will move in the scene, where it will travel, and how it will deform over the course of one second. We propose a conditional variational autoencoder as a solution to this problem. In this framework, direct inference from the image shapes the distribution of possible trajectories while latent variables encode information that is not available in the image. We show that our method predicts events in a variety of scenes and can produce multiple different predictions for an ambiguous future. We also find that our method learns a representation that is applicable to semantic vision tasks.

Jacob Walker, Carl Doersch, Abhinav Gupta, Martial Hebert

Carried Object Detection Based on an Ensemble of Contour Exemplars

We study the challenging problem of detecting carried objects (CO) in surveillance videos. For this purpose, we formulate CO detection in terms of determining a person’s contour hypothesis and detecting CO by exploiting the remaining contours. A hypothesis mask for a person’s contours is generated based on an ensemble of contour exemplars of humans with different standing and walking poses. Contours that are not falling in a person’s contour hypothesis mask are considered as candidates for CO contours. Then, a region is assigned to each CO candidate contour using biased normalized cut and is scored by a weighted function of its overlap with the person’s contour hypothesis mask and segmented foreground. To detect COs from obtained candidate regions, a non-maximum suppression method is applied to eliminate the low score candidates. We detect COs without protrusion assumption from a normal silhouette as well as without any prior information about the COs. Experimental results show that our method outperforms state-of-the-art methods even if we are using fewer assumptions.

Farnoosh Ghadiri, Robert Bergevin, Guillaume-Alexandre Bilodeau

