Skip to main content

Über dieses Buch

The eight-volume set comprising LNCS volumes 9905-9912 constitutes the refereed proceedings of the 14th European Conference on Computer Vision, ECCV 2016, held in Amsterdam, The Netherlands, in October 2016.
The 415 revised papers presented were carefully reviewed and selected from 1480 submissions. The papers cover all aspects of computer vision and pattern recognition such as 3D computer vision; computational photography, sensing and display; face and gesture; low-level vision and image processing; motion and tracking; optimization methods; physics-based vision, photometry and shape-from-X; recognition: detection, categorization, indexing, matching; segmentation, grouping and shape representation; statistical methods and learning; video: events, activities and surveillance; applications. They are organized in topical sections on detection, recognition and retrieval; scene understanding; optimization; image and video processing; learning; action, activity and tracking; 3D; and 9 poster sessions.



Detection, Recognition and Retrieval


CNN Image Retrieval Learns from BoW: Unsupervised Fine-Tuning with Hard Examples

Convolutional Neural Networks (CNNs) achieve state-of-the-art performance in many computer vision tasks. However, this achievement is preceded by extreme manual annotation in order to perform either training from scratch or fine-tuning for the target task. In this work, we propose to fine-tune CNN for image retrieval from a large collection of unordered images in a fully automated manner. We employ state-of-the-art retrieval and Structure-from-Motion (SfM) methods to obtain 3D models, which are used to guide the selection of the training data for CNN fine-tuning. We show that both hard positive and hard negative examples enhance the final performance in particular object retrieval with compact codes.

Filip Radenović, Giorgos Tolias, Ondřej Chum

SSD: Single Shot MultiBox Detector

We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in each default box and produces adjustments to the box to better match the object shape. Additionally, the network combines predictions from multiple feature maps with different resolutions to naturally handle objects of various sizes. SSD is simple relative to methods that require object proposals because it completely eliminates proposal generation and subsequent pixel or feature resampling stages and encapsulates all computation in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component. Experimental results on the PASCAL VOC, COCO, and ILSVRC datasets confirm that SSD has competitive accuracy to methods that utilize an additional object proposal step and is much faster, while providing a unified framework for both training and inference. For $$300 \times 300$$300×300 input, SSD achieves 74.3 % mAP on VOC2007 test at 59 FPS on a Nvidia Titan X and for $$512 \times 512$$512×512 input, SSD achieves 76.9 % mAP, outperforming a comparable state of the art Faster R-CNN model. Compared to other single stage methods, SSD has much better accuracy even with a smaller input image size. Code is available at

Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg

A Recurrent Encoder-Decoder Network for Sequential Face Alignment

We propose a novel recurrent encoder-decoder network model for real-time video-based face alignment. Our proposed model predicts 2D facial point maps regularized by a regression loss, while uniquely exploiting recurrent learning at both spatial and temporal dimensions. At the spatial level, we add a feedback loop connection between the combined output response map and the input, in order to enable iterative coarse-to-fine face alignment using a single network model. At the temporal level, we first decouple the features in the bottleneck of the network into temporal-variant factors, such as pose and expression, and temporal-invariant factors, such as identity information. Temporal recurrent learning is then applied to the decoupled temporal-variant features, yielding better generalization and significantly more accurate results at test time. We perform a comprehensive experimental analysis, showing the importance of each component of our proposed model, as well as superior results over the state-of-the-art in standard datasets.

Xi Peng, Rogerio S. Feris, Xiaoyu Wang, Dimitris N. Metaxas

Robust Facial Landmark Detection via Recurrent Attentive-Refinement Networks

In this work, we introduce a novel Recurrent Attentive-Refinement (RAR) network for facial landmark detection under unconstrained conditions, suffering from challenges like facial occlusions and/or pose variations. RAR follows the pipeline of cascaded regressions that refines landmark locations progressively. However, instead of updating all the landmark locations together, RAR refines the landmark locations sequentially at each recurrent stage. In this way, more reliable landmark points are refined earlier and help to infer locations of other challenging landmarks that may stay with occlusions and/or extreme poses. RAR can thus effectively control detection errors from those challenging landmarks and improve overall performance even in presence of heavy occlusions and/or extreme conditions. To determine the sequence of landmarks, RAR employs an attentive-refinement mechanism. The attention LSTM (A-LSTM) and refinement LSTM (R-LSTM) models are introduced in RAR. At each recurrent stage, A-LSTM implicitly identifies a reliable landmark as the attention center. Following the sequence of attention centers, R-LSTM sequentially refines the landmarks near or correlated with the attention centers and provides ultimate detection results finally. To further enhance algorithmic robustness, instead of using mean shape for initialization, RAR adaptively determines the initialization by selecting from a pool of shape centers clustered from all training shapes. As an end-to-end trainable model, RAR demonstrates superior performance in detecting challenging landmarks in comprehensive experiments and it also establishes new state-of-the-arts on the 300-W, COFW and AFLW benchmark datasets.

Shengtao Xiao, Jiashi Feng, Junliang Xing, Hanjiang Lai, Shuicheng Yan, Ashraf Kassim

Poster Session 1


Learning to Refine Object Segments

Object segmentation requires both object-level information and low-level pixel data. This presents a challenge for feedforward networks: lower layers in convolutional nets capture rich spatial information, while upper layers encode object-level knowledge but are invariant to factors such as pose and appearance. In this work we propose to augment feedforward nets for object segmentation with a novel top-down refinement approach. The resulting bottom-up/top-down architecture is capable of efficiently generating high-fidelity object masks. Similarly to skip connections, our approach leverages features at all layers of the net. Unlike skip connections, our approach does not attempt to output independent predictions at each layer. Instead, we first output a coarse ‘mask encoding’ in a feedforward pass, then refine this mask encoding in a top-down pass utilizing features at successively lower layers. The approach is simple, fast, and effective. Building on the recent DeepMask network for generating object proposals, we show accuracy improvements of 10–20% in average recall for various setups. Additionally, by optimizing the overall network architecture, our approach, which we call SharpMask, is 50 % faster than the original DeepMask network (under .8 s per image).

Pedro O. Pinheiro, Tsung-Yi Lin, Ronan Collobert, Piotr Dollár

Deep Automatic Portrait Matting

We propose an automatic image matting method for portrait images. This method does not need user interaction, which was however essential in most previous approaches. In order to accomplish this goal, a new end-to-end convolutional neural network (CNN) based framework is proposed taking the input of a portrait image. It outputs the matte result. Our method considers not only image semantic prediction but also pixel-level image matte optimization. A new portrait image dataset is constructed with our labeled matting ground truth. Our automatic method achieves comparable results with state-of-the-art methods that require specified foreground and background regions or pixels. Many applications are enabled given the automatic nature of our system.

Xiaoyong Shen, Xin Tao, Hongyun Gao, Chao Zhou, Jiaya Jia

Segmentation from Natural Language Expressions

In this paper we approach the novel problem of segmenting an image based on a natural language expression. This is different from traditional semantic segmentation over a predefined set of semantic classes, as e.g., the phrase “two men sitting on the right bench” requires segmenting only the two people on the right bench and no one standing or sitting on another bench. Previous approaches suitable for this task were limited to a fixed set of categories and/or rectangular regions. To produce pixelwise segmentation for the language expression, we propose an end-to-end trainable recurrent and convolutional network model that jointly learns to process visual and linguistic information. In our model, a recurrent neural network is used to encode the referential expression into a vector representation, and a fully convolutional network is used to a extract a spatial feature map from the image and output a spatial response map for the target object. We demonstrate on a benchmark dataset that our model can produce quality segmentation output from the natural language expression, and outperforms baseline methods by a large margin.

Ronghang Hu, Marcus Rohrbach, Trevor Darrell

Semantic Object Parsing with Graph LSTM

By taking the semantic object parsing task as an exemplar application scenario, we propose the Graph Long Short-Term Memory (Graph LSTM) network, which is the generalization of LSTM from sequential data or multi-dimensional data to general graph-structured data. Particularly, instead of evenly and fixedly dividing an image to pixels or patches in existing multi-dimensional LSTM structures (e.g., Row, Grid and Diagonal LSTMs), we take each arbitrary-shaped superpixel as a semantically consistent node, and adaptively construct an undirected graph for each image, where the spatial relations of the superpixels are naturally used as edges. Constructed on such an adaptive graph topology, the Graph LSTM is more naturally aligned with the visual patterns in the image (e.g., object boundaries or appearance similarities) and provides a more economical information propagation route. Furthermore, for each optimization step over Graph LSTM, we propose to use a confidence-driven scheme to update the hidden and memory states of nodes progressively till all nodes are updated. In addition, for each node, the forgets gates are adaptively learned to capture different degrees of semantic correlation with neighboring nodes. Comprehensive evaluations on four diverse semantic object parsing datasets well demonstrate the significant superiority of our Graph LSTM over other state-of-the-art solutions.

Xiaodan Liang, Xiaohui Shen, Jiashi Feng, Liang Lin, Shuicheng Yan

SSHMT: Semi-supervised Hierarchical Merge Tree for Electron Microscopy Image Segmentation

Region-based methods have proven necessary for improving segmentation accuracy of neuronal structures in electron microscopy (EM) images. Most region-based segmentation methods use a scoring function to determine region merging. Such functions are usually learned with supervised algorithms that demand considerable ground truth data, which are costly to collect. We propose a semi-supervised approach that reduces this demand. Based on a merge tree structure, we develop a differentiable unsupervised loss term that enforces consistent predictions from the learned function. We then propose a Bayesian model that combines the supervised and the unsupervised information for probabilistic learning. The experimental results on three EM data sets demonstrate that by using a subset of only $$3\,\%$$3% to $$7\,\%$$7% of the entire ground truth data, our approach consistently performs close to the state-of-the-art supervised method with the full labeled data set, and significantly outperforms the supervised method with the same labeled subset.

Ting Liu, Miaomiao Zhang, Mehran Javanmardi, Nisha Ramesh, Tolga Tasdizen

Towards Viewpoint Invariant 3D Human Pose Estimation

We propose a viewpoint invariant model for 3D human pose estimation from a single depth image. To achieve this, our discriminative model embeds local regions into a learned viewpoint invariant feature space. Formulated as a multi-task learning problem, our model is able to selectively predict partial poses in the presence of noise and occlusion. Our approach leverages a convolutional and recurrent network architecture with a top-down error feedback mechanism to self-correct previous pose estimates in an end-to-end manner. We evaluate our model on a previously published depth dataset and a newly collected human pose dataset containing 100 K annotated depth images from extreme viewpoints. Experiments show that our model achieves competitive performance on frontal views while achieving state-of-the-art performance on alternate viewpoints.

Albert Haque, Boya Peng, Zelun Luo, Alexandre Alahi, Serena Yeung, Li Fei-Fei

Person Re-Identification by Unsupervised Graph Learning

Most existing person re-identification (Re-ID) methods are based on supervised learning of a discriminative distance metric. They thus require a large amount of labelled training image pairs which severely limits their scalability. In this work, we propose a novel unsupervised Re-ID approach which requires no labelled training data yet is able to capture discriminative information for cross-view identity matching. Our model is based on a new graph regularised dictionary learning algorithm. By introducing a $$\ell _1$$ℓ1-norm graph Laplacian term, instead of the conventional squared $$\ell _2$$ℓ2-norm, our model is robust against outliers caused by dramatic changes in background, pose, and occlusion typical in a Re-ID scenario. Importantly we propose to learn jointly the graph and representation resulting in further alleviation of the effects of data outliers. Experiments on four benchmark datasets demonstrate that the proposed model significantly outperforms the state-of-the-art unsupervised learning based alternatives whilst being extremely efficient to compute.

Elyor Kodirov, Tao Xiang, Zhenyong Fu, Shaogang Gong

Deep Learning the City: Quantifying Urban Perception at a Global Scale

Computer vision methods that quantify the perception of urban environment are increasingly being used to study the relationship between a city’s physical appearance and the behavior and health of its residents. Yet, the throughput of current methods is too limited to quantify the perception of cities across the world. To tackle this challenge, we introduce a new crowdsourced dataset containing 110,988 images from 56 cities, and 1,170,000 pairwise comparisons provided by 81,630 online volunteers along six perceptual attributes: safe, lively, boring, wealthy, depressing, and beautiful. Using this data, we train a Siamese-like convolutional neural architecture, which learns from a joint classification and ranking loss, to predict human judgments of pairwise image comparisons. Our results show that crowdsourcing combined with neural networks can produce urban perception data at the global scale.

Abhimanyu Dubey, Nikhil Naik, Devi Parikh, Ramesh Raskar, César A. Hidalgo

4D Match Trees for Non-rigid Surface Alignment

This paper presents a method for dense 4D temporal alignment of partial reconstructions of non-rigid surfaces observed from single or multiple moving cameras of complex scenes. 4D Match Trees are introduced for robust global alignment of non-rigid shape based on the similarity between images across sequences and views. Wide-timeframe sparse correspondence between arbitrary pairs of images is established using a segmentation-based feature detector (SFD) which is demonstrated to give improved matching of non-rigid shape. Sparse SFD correspondence allows the similarity between any pair of image frames to be estimated for moving cameras and multiple views. This enables the 4D Match Tree to be constructed which minimises the observed change in non-rigid shape for global alignment across all images. Dense 4D temporal correspondence across all frames is then estimated by traversing the 4D Match tree using optical flow initialised from the sparse feature matches. The approach is evaluated on single and multiple view images sequences for alignment of partial surface reconstructions of dynamic objects in complex indoor and outdoor scenes to obtain a temporally consistent 4D representation. Comparison to previous 2D and 3D scene flow demonstrates that 4D Match Trees achieve reduced errors due to drift and improved robustness to large non-rigid deformations.

Armin Mustafa, Hansung Kim, Adrian Hilton

Eigen Appearance Maps of Dynamic Shapes

We address the problem of building efficient appearance representations of shapes observed from multiple viewpoints and in several movements. Multi-view systems now allow the acquisition of spatio-temporal models of such moving objects. While efficient geometric representations for these models have been widely studied, appearance information, as provided by the observed images, is mainly considered on a per frame basis, and no global strategy yet addresses the case where several temporal sequences of a shape are available. We propose a per subject representation that builds on PCA to identify the underlying manifold structure of the appearance information relative to a shape. The resulting eigen representation encodes shape appearance variabilities due to viewpoint and motion, with Eigen textures, and due to local inaccuracies in the geometric model, with Eigen warps. In addition to providing compact representations, such decompositions also allow for appearance interpolation and appearance completion. We evaluate their performances over different characters and with respect to their ability to reproduce compelling appearances in a compact way.

Adnane Boukhayma, Vagia Tsiminaki, Jean-Sébastien Franco, Edmond Boyer

Learnable Histogram: Statistical Context Features for Deep Neural Networks

Statistical features, such as histogram, Bag-of-Words (BoW) and Fisher Vector, were commonly used with hand-crafted features in conventional classification methods, but attract less attention since the popularity of deep learning methods. In this paper, we propose a learnable histogram layer, which learns histogram features within deep neural networks in end-to-end training. Such a layer is able to back-propagate (BP) errors, learn optimal bin centers and bin widths, and be jointly optimized with other layers in deep networks during training. Two vision problems, semantic segmentation and object detection, are explored by integrating the learnable histogram layer into deep networks, which show that the proposed layer could be well generalized to different applications. In-depth investigations are conducted to provide insights on the newly introduced layer.

Zhe Wang, Hongsheng Li, Wanli Ouyang, Xiaogang Wang

Pedestrian Behavior Understanding and Prediction with Deep Neural Networks

In this paper, a deep neural network (Behavior-CNN) is proposed to model pedestrian behaviors in crowded scenes, which has many applications in surveillance. A pedestrian behavior encoding scheme is designed to provide a general representation of walking paths, which can be used as the input and output of CNN. The proposed Behavior-CNN is trained with real-scene crowd data and then thoroughly investigated from multiple aspects, including the location map and location awareness property, semantic meanings of learned filters, and the influence of receptive fields on behavior modeling. Multiple applications, including walking path prediction, destination prediction, and tracking, demonstrate the effectiveness of Behavior-CNN on pedestrian behavior modeling.

Shuai Yi, Hongsheng Li, Xiaogang Wang

Real-Time RGB-D Activity Prediction by Soft Regression

In this paper, we propose a novel approach for predicting ongoing activities captured by a low-cost depth camera. Our approach avoids a usual assumption in existing activity prediction systems that the progress level of ongoing sequence is given. We overcome this limitation by learning a soft label for each subsequence and develop a soft regression framework for activity prediction to learn both predictor and soft labels jointly. In order to make activity prediction work in a real-time manner, we introduce a new RGB-D feature called “local accumulative frame feature (LAFF)”, which can be computed efficiently by constructing an integral feature map. Our experiments on two RGB-D benchmark datasets demonstrate that the proposed regression-based activity prediction model outperforms existing models significantly and also show that the activity prediction on RGB-D sequence is more accurate than that on RGB channel.

Jian-Fang Hu, Wei-Shi Zheng, Lianyang Ma, Gang Wang, Jianhuang Lai

A 3D Morphable Eye Region Model for Gaze Estimation

Morphable face models are a powerful tool, but have previously failed to model the eye accurately due to complexities in its material and motion. We present a new multi-part model of the eye that includes a morphable model of the facial eye region, as well as an anatomy-based eyeball model. It is the first morphable model that accurately captures eye region shape, since it was built from high-quality head scans. It is also the first to allow independent eyeball movement, since we treat it as a separate part. To showcase our model we present a new method for illumination- and head-pose–invariant gaze estimation from a single RGB image. We fit our model to an image through analysis-by-synthesis, solving for eye region shape, texture, eyeball pose, and illumination simultaneously. The fitted eyeball pose parameters are then used to estimate gaze direction. Through evaluation on two standard datasets we show that our method generalizes to both webcam and high-quality camera images, and outperforms a state-of-the-art CNN method achieving a gaze estimation accuracy of $$9.44^\circ $$9.44∘ in a challenging user-independent scenario.

Erroll Wood, Tadas Baltrušaitis, Louis-Philippe Morency, Peter Robinson, Andreas Bulling

Foreground Segmentation via Dynamic Tree-Structured Sparse RPCA

Video analysis often begins with background subtraction which consists of creation of a background model, followed by a regularization scheme. Recent evaluation of representative background subtraction techniques demonstrated that there are still considerable challenges facing these methods. We present a new method in which we regard the image sequence as being made up of the sum of a low-rank background matrix and a dynamic tree-structured sparse outlier matrix and solve the decomposition using our approximated Robust Principal Component Analysis method extended to handle camera motion. Our contribution lies in dynamically estimating the support of the foreground regions via a superpixel generation step, so as to impose spatial coherence on these regions, and to obtain crisp and meaningful foreground regions. These advantages enable our method to outperform state-of-the-art alternatives in three benchmark datasets.

Salehe Erfanian Ebadi, Ebroul Izquierdo

Contextual Priming and Feedback for Faster R-CNN

The field of object detection has seen dramatic performance improvements in the last few years. Most of these gains are attributed to bottom-up, feedforward ConvNet frameworks. However, in case of humans, top-down information, context and feedback play an important role in doing object detection. This paper investigates how we can incorporate top-down information and feedback in the state-of-the-art Faster R-CNN framework. Specifically, we propose to: (a) augment Faster R-CNN with a semantic segmentation network; (b) use segmentation for top-down contextual priming; (c) use segmentation to provide top-down iterative feedback using two stage training. Our results indicate that all three contributions improve the performance on object detection, semantic segmentation and region proposal generation.

Abhinav Shrivastava, Abhinav Gupta

Efficient Multi-view Surface Refinement with Adaptive Resolution Control

The existing stereo refinement methods optimize a surface representation using a multi-view photo-consistency functional. Such optimization is iterative and requires repeated computation of gradients over all surface regions, which is the bottleneck affecting adversely the computational efficiency of the refinement. In this paper, we present a flexible and efficient framework for mesh surface refinement in multi-view stereo. The newly proposed Adaptive Resolution Control (ARC) evaluates an optimal trade-off between the geometry accuracy and the performance via curve analysis. Then, it classifies the regions into the significant and insignificant ones using a graph-cut optimization. After that, each region is subdivided and simplified accordingly in the remaining refinement process, producing a triangular mesh in adaptive resolutions. Consequently, the ARC accelerates the stereo refinement by severalfold by culling out most insignificant regions, while still maintaining a similar level of geometry details that the state-of-the-art methods could achieve. We have implemented the ARC and demonstrated intensively on both public benchmarks and private datasets, which all confirm the effectiveness and the robustness of the ARC.

Shiwei Li, Sing Yu Siu, Tian Fang, Long Quan

Gaussian Process Density Counting from Weak Supervision

As a novel learning setup, we introduce learning to count objects within an image from only region-level count information. This level of supervision is weaker than earlier approaches that require segmenting, drawing bounding boxes, or putting dots on centroids of all objects within training images. We devise a weakly supervised kernel learner that achieves higher count accuracies than previous counting models. We achieve this by placing a Gaussian process prior on a latent function the square of which is the count density. We impose non-negativeness and smooth the GP response as an intermediary step in model inference. We illustrate the effectiveness of our model on two benchmark applications: (i) synthetic cell and (ii) pedestrian counting, and one novel application: (iii) erythrocyte counting on blood samples of malaria patients.

Matthias von Borstel, Melih Kandemir, Philip Schmidt, Madhavi K. Rao, Kumar Rajamani, Fred A. Hamprecht

Region-Based Semantic Segmentation with End-to-End Training

We propose a novel method for semantic segmentation, the task of labeling each pixel in an image with a semantic class. Our method combines the advantages of the two main competing paradigms. Methods based on region classification offer proper spatial support for appearance measurements, but typically operate in two separate stages, none of which targets pixel labeling performance at the end of the pipeline. More recent fully convolutional methods are capable of end-to-end training for the final pixel labeling, but resort to fixed patches as spatial support. We show how to modify modern region-based approaches to enable end-to-end training for semantic segmentation. This is achieved via a differentiable region-to-pixel layer and a differentiable free-form Region-of-Interest pooling layer. Our method improves the state-of-the-art in terms of class-average accuracy with $$64.0\,\%$$64.0% on SIFT Flow and $$49.9\,\%$$49.9% on PASCAL Context, and is particularly accurate at object boundaries.

Holger Caesar, Jasper Uijlings, Vittorio Ferrari

Fast 6D Pose Estimation from a Monocular Image Using Hierarchical Pose Trees

It has been shown that the template based approaches could quickly estimate 6D pose of texture-less objects from a monocular image. However, they tend to be slow when the number of templates amounts to tens of thousands for handling a wider range of 3D object pose. To alleviate this problem, we propose a novel image feature and a tree-structured model. Our proposed perspectively cumulated orientation feature (PCOF) is based on the orientation histograms extracted from randomly generated 2D projection images using 3D CAD data, and the template using PCOF explicitly handle a certain range of 3D object pose. The hierarchical pose trees (HPT) is built by clustering 3D object pose and reducing the resolutions of templates, and HPT accelerates 6D pose estimation based on a coarse-to-fine strategy with an image pyramid. In the experimental evaluation on our texture-less object dataset, the combination of PCOF and HPT showed higher accuracy and faster speed in comparison with state-of-the-art techniques.

Yoshinori Konishi, Yuki Hanzawa, Masato Kawade, Manabu Hashimoto

Learning Models for Actions and Person-Object Interactions with Transfer to Question Answering

This paper proposes deep convolutional network models that utilize local and global context to make human activity label predictions in still images, achieving state-of-the-art performance on two recent datasets with hundreds of labels each. We use multiple instance learning to handle the lack of supervision on the level of individual person instances, and weighted loss to handle unbalanced training data. Further, we show how specialized features trained on these datasets can be used to improve accuracy on the Visual Question Answering (VQA) task, in the form of multiple choice fill-in-the-blank questions (Visual Madlibs). Specifically, we tackle two types of questions on person activity and person-object relationship and show improvements over generic features trained on the ImageNet classification task

Arun Mallya, Svetlana Lazebnik

A Software Platform for Manipulating the Camera Imaging Pipeline

There are a number of processing steps applied onboard a digital camera that collectively make up the camera imaging pipeline. Unfortunately, the imaging pipeline is typically embedded in a camera’s hardware making it difficult for researchers working on individual components to do so within the proper context of the full pipeline. This not only hinders research, it makes evaluating the effects from modifying an individual pipeline component on the final camera output challenging, if not impossible. This paper presents a new software platform that allows easy access to each stage of the camera imaging pipeline. The platform allows modification of the parameters for individual components as well as the ability to access and manipulate the intermediate images as they pass through different stages. We detail our platform design and demonstrate its usefulness on a number of examples.

Hakki Can Karaimer, Michael S. Brown

A Benchmark and Simulator for UAV Tracking

In this paper, we propose a new aerial video dataset and benchmark for low altitude UAV target tracking, as well as, a photo-realistic UAV simulator that can be coupled with tracking methods. Our benchmark provides the first evaluation of many state-of-the-art and popular trackers on 123 new and fully annotated HD video sequences captured from a low-altitude aerial perspective. Among the compared trackers, we determine which ones are the most suitable for UAV tracking both in terms of tracking accuracy and run-time. The simulator can be used to evaluate tracking algorithms in real-time scenarios before they are deployed on a UAV “in the field”, as well as, generate synthetic but photo-realistic tracking datasets with automatic ground truth annotations to easily extend existing real-world datasets. Both the benchmark and simulator are made publicly available to the vision community on our website to further research in the area of object tracking from UAVs. (

Matthias Mueller, Neil Smith, Bernard Ghanem

Scene Depth Profiling Using Helmholtz Stereopsis

Helmholtz stereopsis is a 3D reconstruction technique, capturing surface depth independent of the reflection properties of the material by using Helmholtz reciprocity. In this paper we are interested in studying the applicability of Helmholtz stereopsis for surface and depth profiling of objects and general scenes in the context of perspective stereo imaging. Helmholtz stereopsis captures a pair of reciprocal images by exchanging the position of light source and camera. The resulting image pair relates the image intensities and scene depth profile by a partial differential equation. The solution of this differential equation depends on the boundary conditions provided by the scene. We propose to limit the illumination angle of the light source, such that only mutually visible parts are imaged, resulting in stable boundary conditions. By simulation and experiment we show that a unique depth profile can be recovered for a large class of scenes including multiple occluding objects.

Hironori Mori, Roderick Köhle, Markus Kamm

Projective Bundle Adjustment from Arbitrary Initialization Using the Variable Projection Method

Bundle adjustment is used in structure-from-motion pipelines as final refinement stage requiring a sufficiently good initialization to reach a useful local mininum. Starting from an arbitrary initialization almost always gets trapped in a poor minimum. In this work we aim to obtain an initialization-free approach which returns global minima from a large proportion of purely random starting points. Our key inspiration lies in the success of the Variable Projection (VarPro) method for affine factorization problems, which have close to 100 % chance of reaching a global minimum from random initialization. We find empirically that this desirable behaviour does not directly carry over to the projective case, and we consequently design and evaluate strategies to overcome this limitation. Also, by unifying the affine and the projective camera settings, we obtain numerically better conditioned reformulations of original bundle adjustment algorithms.

Je Hyeong Hong, Christopher Zach, Andrew Fitzgibbon, Roberto Cipolla

Localizing and Orienting Street Views Using Overhead Imagery

In this paper we aim to determine the location and orientation of a ground-level query image by matching to a reference database of overhead (e.g. satellite) images. For this task we collect a new dataset with one million pairs of street view and overhead images sampled from eleven U.S. cities. We explore several deep CNN architectures for cross-domain matching – Classification, Hybrid, Siamese, and Triplet networks. Classification and Hybrid architectures are accurate but slow since they allow only partial feature precomputation. We propose a new loss function which significantly improves the accuracy of Siamese and Triplet embedding networks while maintaining their applicability to large-scale retrieval tasks like image geolocalization. This image matching task is challenging not just because of the dramatic viewpoint difference between ground-level and overhead imagery but because the orientation (i.e. azimuth) of the street views is unknown making correspondence even more difficult. We examine several mechanisms to match in spite of this – training for rotation invariance, sampling possible rotations at query time, and explicitly predicting relative rotation of ground and overhead images with our deep networks. It turns out that explicit orientation supervision also improves location prediction accuracy. Our best performing architectures are roughly 2.5 times as accurate as the commonly used Siamese network baseline.

Nam N. Vo, James Hays

Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding

Computer vision has a great potential to help our daily lives by searching for lost keys, watering flowers or reminding us to take a pill. To succeed with such tasks, computer vision methods need to be trained from real and diverse examples of our daily dynamic scenes. While most of such scenes are not particularly exciting, they typically do not appear on YouTube, in movies or TV broadcasts. So how do we collect sufficiently many diverse but boring samples representing our lives? We propose a novel Hollywood in Homes approach to collect such data. Instead of shooting videos in the lab, we ensure diversity by distributing and crowdsourcing the whole process of video creation from script writing to video recording and annotation. Following this procedure we collect a new dataset, Charades, with hundreds of people recording videos in their own homes, acting out casual everyday activities. The dataset is composed of 9,848 annotated videos with an average length of 30 s, showing activities of 267 people from three continents. Each video is annotated by multiple free-text descriptions, action labels, action intervals and classes of interacted objects. In total, Charades provides 27,847 video descriptions, 66,500 temporally localized intervals for 157 action classes and 41,104 labels for 46 object classes. Using this rich data, we evaluate and provide baseline results for several tasks including action recognition and automatic description generation. We believe that the realism, diversity, and casual nature of this dataset will present unique challenges and new opportunities for computer vision community.

Gunnar A. Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, Abhinav Gupta

Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification

In this paper, we present an approach for learning a visual representation from the raw spatiotemporal signals in videos. Our representation is learned without supervision from semantic labels. We formulate our method as an unsupervised sequential verification task, i.e., we determine whether a sequence of frames from a video is in the correct temporal order. With this simple task and no semantic labels, we learn a powerful visual representation using a Convolutional Neural Network (CNN). The representation contains complementary information to that learned from supervised image datasets like ImageNet. Qualitative results show that our method captures information that is temporally varying, such as human pose. When used as pre-training for action recognition, our method gives significant gains over learning without external data on benchmark datasets like UCF101 and HMDB51. To demonstrate its sensitivity to human pose, we show results for pose estimation on the FLIC and MPII datasets that are competitive, or better than approaches using significantly more supervision. Our method can be combined with supervised representations to provide an additional boost in accuracy.

Ishan Misra, C. Lawrence Zitnick, Martial Hebert

DOC: Deep OCclusion Estimation from a Single Image

In this paper, we propose a deep convolutional network architecture, called DOC, which detects object boundaries and estimates the occlusion relationships (i.e. which side of the boundary is foreground and which is background). Specifically, we first represent occlusion relations by a binary edge indicator, to indicate the object boundary, and an occlusion orientation variable whose direction specifies the occlusion relationships by a left-hand rule, see Fig. 1. Then, our DOC networks exploit local and non-local image cues to learn and estimate this representation and hence recover occlusion relations. To train and test DOC, we construct a large-scale instance occlusion boundary dataset using PASCAL VOC images, which we call the PASCAL instance occlusion dataset (PIOD). It contains 10,000 images and hence is two orders of magnitude larger than existing occlusion datasets for outdoor images. We test two variants of DOC on PIOD and on the BSDS ownership dataset and show they outperform state-of-the-art methods typically by more than 5AP. Finally, we perform numerous experiments investigating multiple settings of DOC and transfer between BSDS and PIOD, which provides more insights for further study of occlusion estimation.

Peng Wang, Alan Yuille

RepMatch: Robust Feature Matching and Pose for Reconstructing Modern Cities

A perennial problem in recovering 3-D models from images is repeated structures common in modern cities. The problem can be traced to the feature matcher which needs to match less distinctive features (permitting wide-baselines and avoiding broken sequences), while simultaneously avoiding incorrect matching of ambiguous repeated features. To meet this need, we develop RepMatch, an epipolar guided (assumes predominately camera motion) feature matcher that accommodates both wide-baselines and repeated structures. RepMatch is based on using RANSAC to guide the training of match consistency curves for differentiating true and false matches. By considering the set of all nearest-neighbor matches, RepMatch can procure very large numbers of matches over wide baselines. This in turn lends stability to pose estimation. RepMatch’s performance compares favorably on standard datasets and enables more complete reconstructions of modern architectures.

Wen-Yan Lin, Siying Liu, Nianjuan Jiang, Minh. N. Do, Ping Tan, Jiangbo Lu

Convolutional Oriented Boundaries

We present Convolutional Oriented Boundaries (COB), which produces multiscale oriented contours and region hierarchies starting from generic image classification Convolutional Neural Networks (CNNs). COB is computationally efficient, because it requires a single CNN forward pass for contour detection and it uses a novel sparse boundary representation for hierarchical segmentation; it gives a significant leap in performance over the state-of-the-art, and it generalizes very well to unseen categories and datasets. Particularly, we show that learning to estimate not only contour strength but also orientation provides more accurate results. We perform extensive experiments on BSDS, PASCAL Context, PASCAL Segmentation, and MS-COCO, showing that COB provides state-of-the-art contours, region hierarchies, and object proposals in all datasets.

Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Pablo Arbeláez, Luc Van Gool

Superpixel Convolutional Networks Using Bilateral Inceptions

In this paper we propose a CNN architecture for semantic image segmentation. We introduce a new “bilateral inception” module that can be inserted in existing CNN architectures and performs bilateral filtering, at multiple feature-scales, between superpixels in an image. The feature spaces for bilateral filtering and other parameters of the module are learned end-to-end using standard backpropagation techniques. The bilateral inception module addresses two issues that arise with general CNN segmentation architectures. First, this module propagates information between (super) pixels while respecting image edges, thus using the structured information of the problem for improved results. Second, the layer recovers a full resolution segmentation result from the lower resolution solution of a CNN. In the experiments, we modify several existing CNN architectures by inserting our inception module between the last CNN ($$1\times 1$$1×1 convolution) layers. Empirical results on three different datasets show reliable improvements not only in comparison to the baseline networks, but also in comparison to several dense-pixel prediction techniques such as CRFs, while being competitive in time.

Raghudeep Gadde, Varun Jampani, Martin Kiefel, Daniel Kappler, Peter V. Gehler

Sublabel-Accurate Convex Relaxation of Vectorial Multilabel Energies

Convex relaxations of multilabel problems have been demonstrated to produce provably optimal or near-optimal solutions to a variety of computer vision problems. Yet, they are of limited practical use as they require a fine discretization of the label space, entailing a huge demand in memory and runtime. In this work, we propose the first sublabel accurate convex relaxation for vectorial multilabel problems. Our key idea is to approximate the dataterm in a piecewise convex (rather than piecewise linear) manner. As a result we have a more faithful approximation of the original cost function that provides a meaningful interpretation for fractional solutions of the relaxed convex problem.

Emanuel Laude, Thomas Möllenhoff, Michael Moeller, Jan Lellmann, Daniel Cremers

Building Dual-Domain Representations for Compression Artifacts Reduction

We propose a highly accurate approach to remove artifacts of JPEG-compressed images. Our approach jointly learns a very deep convolutional network in both DCT and pixel domains. The dual-domain representation can make full use of DCT-domain prior knowledge of JPEG compression, which is usually lacking in traditional network-based approaches. At the same time, it can also benefit from the prowess and the efficiency of the deep feed-forward architecture, in comparison to capacity-limited sparse-coding-based approaches. Two simple strategies, i.e., Adam and residual learning, are adopted to train the very deep network and later proved to be a success. Extensive experiments demonstrate the large improvements of our approach over the state of the arts.

Jun Guo, Hongyang Chao

Geometric Neural Phrase Pooling: Modeling the Spatial Co-occurrence of Neurons

Deep Convolutional Neural Networks (CNNs) are playing important roles in state-of-the-art visual recognition. This paper focuses on modeling the spatial co-occurrence of neuron responses, which is less studied in the previous work. For this, we consider the neurons in the hidden layer as neural words, and construct a set of geometric neural phrases on top of them. The idea that grouping neural words into neural phrases is borrowed from the Bag-of-Visual-Words (BoVW) model. Next, the Geometric Neural Phrase Pooling (GNPP) algorithm is proposed to efficiently encode these neural phrases. GNPP acts as a new type of hidden layer, which punishes the isolated neuron responses after convolution, and can be inserted into a CNN model with little extra computational overhead. Experimental results show that GNPP produces significant and consistent accuracy gain in image classification.

Lingxi Xie, Qi Tian, John Flynn, Jingdong Wang, Alan Yuille

Photo Aesthetics Ranking Network with Attributes and Content Adaptation

Real-world applications could benefit from the ability to automatically generate a fine-grained ranking of photo aesthetics. However, previous methods for image aesthetics analysis have primarily focused on the coarse, binary categorization of images into high- or low-aesthetic categories. In this work, we propose to learn a deep convolutional neural network to rank photo aesthetics in which the relative ranking of photo aesthetics are directly modeled in the loss function. Our model incorporates joint learning of meaningful photographic attributes and image content information which can help regularize the complicated photo aesthetics rating problem.To train and analyze this model, we have assembled a new aesthetics and attributes database (AADB) which contains aesthetic scores and meaningful attributes assigned to each image by multiple human raters. Anonymized rater identities are recorded across images allowing us to exploit intra-rater consistency using a novel sampling strategy when computing the ranking loss of training image pairs. We show the proposed sampling strategy is very effective and robust in face of subjective judgement of image aesthetics by individuals with different aesthetic tastes. Experiments demonstrate that our unified model can generate aesthetic rankings that are more consistent with human ratings. To further validate our model, we show that by simply thresholding the estimated aesthetic scores, we are able to achieve state-or-the-art classification performance on the existing AVA dataset benchmark.

Shu Kong, Xiaohui Shen, Zhe Lin, Radomir Mech, Charless Fowlkes

SDF-2-SDF: Highly Accurate 3D Object Reconstruction

This paper addresses the problem of 3D object reconstruction using RGB-D sensors. Our main contribution is a novel implicit-to-implicit surface registration scheme between signed distance fields (SDFs), utilized both for the real-time frame-to-frame camera tracking and for the subsequent global optimization. SDF-2-SDF registration circumvents expensive correspondence search and allows for incorporation of multiple geometric constraints without any dependence on texture, yielding highly accurate 3D models. An extensive quantitative evaluation on real and synthetic data demonstrates improved tracking and higher fidelity reconstructions than a variety of state-of-the-art methods. We make our data publicly available, creating the first object reconstruction dataset to include ground-truth CAD models and RGB-D sequences from sensors of various quality.

Miroslava Slavcheva, Wadim Kehl, Nassir Navab, Slobodan Ilic

Knowledge Transfer for Scene-Specific Motion Prediction

When given a single frame of the video, humans can not only interpret the content of the scene, but also they are able to forecast the near future. This ability is mostly driven by their rich prior knowledge about the visual world, both in terms of (i) the dynamics of moving agents, as well as (ii) the semantic of the scene. In this work we exploit the interplay between these two key elements to predict scene-specific motion patterns. First, we extract patch descriptors encoding the probability of moving to the adjacent patches, and the probability of being in that particular patch or changing behavior. Then, we introduce a Dynamic Bayesian Network which exploits this scene specific knowledge for trajectory prediction. Experimental results demonstrate that our method is able to accurately predict trajectories and transfer predictions to a novel scene characterized by similar elements.

Lamberto Ballan, Francesco Castaldo, Alexandre Alahi, Francesco Palmieri, Silvio Savarese

Weakly Supervised Localization Using Deep Feature Maps

Object localization is an important computer vision problem with a variety of applications. The lack of large scale object-level annotations and the relative abundance of image-level labels makes a compelling case for weak supervision in the object localization task. Deep Convolutional Neural Networks are a class of state-of-the-art methods for the related problem of object recognition. In this paper, we describe a novel object localization algorithm which uses classification networks trained on only image labels. This weakly supervised method leverages local spatial and semantic patterns captured in the convolutional layers of classification networks. We propose an efficient beam search based approach to detect and localize multiple objects in images. The proposed method significantly outperforms the state-of-the-art in standard object localization data-sets.

Archith John Bency, Heesung Kwon, Hyungtae Lee, S. Karthikeyan, B. S. Manjunath

Embedding Deep Metric for Person Re-identification: A Study Against Large Variations

Person re-identification is challenging due to the large variations of pose, illumination, occlusion and camera view. Owing to these variations, the pedestrian data is distributed as highly-curved manifolds in the feature space, despite the current convolutional neural networks (CNN)’s capability of feature extraction. However, the distribution is unknown, so it is difficult to use the geodesic distance when comparing two samples. In practice, the current deep embedding methods use the Euclidean distance for the training and test. On the other hand, the manifold learning methods suggest to use the Euclidean distance in the local range, combining with the graphical relationship between samples, for approximating the geodesic distance. From this point of view, selecting suitable positive (i.e. intra-class) training samples within a local range is critical for training the CNN embedding, especially when the data has large intra-class variations. In this paper, we propose a novel moderate positive sample mining method to train robust CNN for person re-identification, dealing with the problem of large variation. In addition, we improve the learning by a metric weight constraint, so that the learned metric has a better generalization ability. Experiments show that these two strategies are effective in learning robust deep metrics for person re-identification, and accordingly our deep model significantly outperforms the state-of-the-art methods on several benchmarks of person re-identification. Therefore, the study presented in this paper may be useful in inspiring new designs of deep models for person re-identification.

Hailin Shi, Yang Yang, Xiangyu Zhu, Shengcai Liao, Zhen Lei, Weishi Zheng, Stan Z. Li

Learning to Track at 100 FPS with Deep Regression Networks

Machine learning techniques are often used in computer vision due to their ability to leverage large amounts of training data to improve performance. Unfortunately, most generic object trackers are still trained from scratch online and do not benefit from the large number of videos that are readily available for offline training. We propose a method for offline training of neural networks that can track novel objects at test-time at 100 fps. Our tracker is significantly faster than previous methods that use neural networks for tracking, which are typically very slow to run and not practical for real-time applications. Our tracker uses a simple feed-forward network with no online training required. The tracker learns a generic relationship between object motion and appearance and can be used to track novel objects that do not appear in the training set. We test our network on a standard tracking benchmark to demonstrate our tracker’s state-of-the-art performance. Further, our performance improves as we add more videos to our offline training set. To the best of our knowledge, our tracker (Our tracker is available at is the first neural-network tracker that learns to track generic objects at 100 fps.

David Held, Sebastian Thrun, Silvio Savarese

Matching Handwritten Document Images

We address the problem of predicting similarity between a pair of handwritten document images written by potentially different individuals. This has applications related to matching and mining in image collections containing handwritten content. A similarity score is computed by detecting patterns of text re-usages between document images irrespective of the minor variations in word morphology, word ordering, layout and paraphrasing of the content. Our method does not depend on an accurate segmentation of words and lines. We formulate the document matching problem as a structured comparison of the word distributions across two document images. To match two word images, we propose a convolutional neural network (cnn) based feature descriptor. Performance of this representation surpasses the state-of-the-art on handwritten word spotting. Finally, we demonstrate the applicability of our method on a practical problem of matching handwritten assignments.

Praveen Krishnan, C. V. Jawahar

Semantic Clustering for Robust Fine-Grained Scene Recognition

In domain generalization, the knowledge learnt from one or multiple source domains is transferred to an unseen target domain. In this work, we propose a novel domain generalization approach for fine-grained scene recognition. We first propose a semantic scene descriptor that jointly captures the subtle differences between fine-grained scenes, while being robust to varying object configurations across domains. We model the occurrence patterns of objects in scenes, capturing the informativeness and discriminability of each object for each scene. We then transform such occurrences into scene probabilities for each scene image. Second, we argue that scene images belong to hidden semantic topics that can be discovered by clustering our semantic descriptors. To evaluate the proposed method, we propose a new fine-grained scene dataset in cross-domain settings. Extensive experiments on the proposed dataset and three benchmark scene datasets show the effectiveness of the proposed approach for fine-grained scene transfer, where we outperform state-of-the-art scene recognition and domain generalization methods.

Marian George, Mandar Dixit, Gábor Zogg, Nuno Vasconcelos

Scene Understanding


Ambient Sound Provides Supervision for Visual Learning

The sound of crashing waves, the roar of fast-moving cars – sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds.

Andrew Owens, Jiajun Wu, Josh H. McDermott, William T. Freeman, Antonio Torralba

Grounding of Textual Phrases in Images by Reconstruction

Grounding (i.e. localizing) arbitrary, free-form textual phrases in visual content is a challenging problem with many applications for human-computer interaction and image-text reference resolution. Few datasets provide the ground truth spatial localization of phrases, thus it is desirable to learn from data with no or little grounding supervision. We propose a novel approach which learns grounding by reconstructing a given phrase using an attention mechanism, which can be either latent or optimized directly. During training our approach encodes the phrase using a recurrent network language model and then learns to attend to the relevant image region in order to reconstruct the input phrase. At test time, the correct attention, i.e., the grounding, is evaluated. If grounding supervision is available it can be directly applied via a loss over the attention mechanism. We demonstrate the effectiveness of our approach on the Flickr30k Entities and ReferItGame datasets with different levels of supervision, ranging from no supervision over partial supervision to full supervision. Our supervised variant improves by a large margin over the state-of-the-art on both datasets.

Anna Rohrbach, Marcus Rohrbach, Ronghang Hu, Trevor Darrell, Bernt Schiele

Improving Multi-label Learning with Missing Labels by Structured Semantic Correlations

Multi-label learning has attracted significant interests in computer vision recently, finding applications in many vision tasks such as multiple object recognition and automatic image annotation. Associating multiple labels to a complex image is very difficult, not only due to the intricacy of describing the image, but also because of the incompleteness nature of the observed labels. Existing works on the problem either ignore the label-label and instance-instance correlations or just assume these correlations are linear and unstructured. Considering that semantic correlations between images are actually structured, in this paper we propose to incorporate structured semantic correlations to solve the missing label problem of multi-label learning. Specifically, we project images to the semantic space with an effective semantic descriptor. A semantic graph is then constructed on these images to capture the structured correlations between them. We utilize the semantic graph Laplacian as a smooth term in the multi-label learning formulation to incorporate the structured semantic correlations. Experimental results demonstrate the effectiveness of the proposed semantic descriptor and the usefulness of incorporating the structured semantic correlations. We achieve better results than state-of-the-art multi-label learning methods on four benchmark datasets.

Hao Yang, Joey Tianyi Zhou, Jianfei Cai

Visual Relationship Detection with Language Priors

Visual relationships capture a wide variety of interactions between pairs of objects in images (e.g. “man riding bicycle” and “man pushing bicycle”). Consequently, the set of possible relationships is extremely large and it is difficult to obtain sufficient training examples for all possible relationships. Because of this limitation, previous work on visual relationship detection has concentrated on predicting only a handful of relationships. Though most relationships are infrequent, their objects (e.g. “man” and “bicycle”) and predicates (e.g. “riding” and “pushing”) independently occur more frequently. We propose a model that uses this insight to train visual models for objects and predicates individually and later combines them together to predict multiple relationships per image. We improve on prior work by leveraging language priors from semantic word embeddings to finetune the likelihood of a predicted relationship. Our model can scale to predict thousands of types of relationships from a few examples. Additionally, we localize the objects in the predicted relationships as bounding boxes in the image. We further demonstrate that understanding relationships can improve content based image retrieval.

Cewu Lu, Ranjay Krishna, Michael Bernstein, Li Fei-Fei


Weitere Informationen

Premium Partner