main-content

Über dieses Buch

The eight-volume set comprising LNCS volumes 9905-9912 constitutes the refereed proceedings of the 14th European Conference on Computer Vision, ECCV 2016, held in Amsterdam, The Netherlands, in October 2016.
The 415 revised papers presented were carefully reviewed and selected from 1480 submissions. The papers cover all aspects of computer vision and pattern recognition such as 3D computer vision; computational photography, sensing and display; face and gesture; low-level vision and image processing; motion and tracking; optimization methods; physics-based vision, photometry and shape-from-X; recognition: detection, categorization, indexing, matching; segmentation, grouping and shape representation; statistical methods and learning; video: events, activities and surveillance; applications. They are organized in topical sections on detection, recognition and retrieval; scene understanding; optimization; image and video processing; learning; action activity and tracking; 3D; and 9 poster sessions.

Inhaltsverzeichnis

Generating Visual Explanations

Clearly explaining a rationale for a classification decision to an end user can be as important as the decision itself. Existing approaches for deep visual recognition are generally opaque and do not output any justification text; contemporary vision-language models can describe image content but fail to take into account class-discriminative image aspects which justify visual predictions. We propose a new model that focuses on the discriminating properties of the visible object, jointly predicts a class label, and explains why the predicted label is appropriate for the image. Through a novel loss function based on sampling and reinforcement learning, our model learns to generate sentences that realize a global sentence property, such as class specificity. Our results on the CUB dataset show that our model is able to generate explanations which are not only consistent with an image but also more discriminative than descriptions produced by existing captioning methods.

Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, Trevor Darrell

Marker-Less 3D Human Motion Capture with Monocular Image Sequence and Height-Maps

The recovery of 3D human pose with monocular camera is an inherently ill-posed problem due to the large number of possible projections from the same 2D image to 3D space. Aimed at improving the accuracy of 3D motion reconstruction, we introduce the additional built-in knowledge, namely height-map, into the algorithmic scheme of reconstructing the 3D pose/motion under a single-view calibrated camera. Our novel proposed framework consists of two major contributions. Firstly, the RGB image and its calculated height-map are combined to detect the landmarks of 2D joints with a dual-stream deep convolution network. Secondly, we formulate a new objective function to estimate 3D motion from the detected 2D joints in the monocular image sequence, which reinforces the temporal coherence constraints on both the camera and 3D poses. Experiments with HumanEva, Human3.6M, and MCAD dataset validate that our method outperforms the state-of-the-art algorithms on both 2D joints localization and 3D motion recovery. Moreover, the evaluation results on HumanEva indicates that the performance of our proposed single-view approach is comparable to that of the multi-view deep learning counterpart.

Yu Du, Yongkang Wong, Yonghao Liu, Feilin Han, Yilin Gui, Zhen Wang, Mohan Kankanhalli, Weidong Geng

Tensor Representations via Kernel Linearization for Action Recognition from 3D Skeletons

In this paper, we explore tensor representations that can compactly capture higher-order relationships between skeleton joints for 3D action recognition. We first define RBF kernels on 3D joint sequences, which are then linearized to form kernel descriptors. The higher-order outer-products of these kernel descriptors form our tensor representations. We present two different kernels for action recognition, namely (i) a sequence compatibility kernel that captures the spatio-temporal compatibility of joints in one sequence against those in the other, and (ii) a dynamics compatibility kernel that explicitly models the action dynamics of a sequence. Tensors formed from these kernels are then used to train an SVM. We present experiments on several benchmark datasets and demonstrate state of the art results, substantiating the effectiveness of our representations.

Piotr Koniusz, Anoop Cherian, Fatih Porikli

Manhattan-World Urban Reconstruction from Point Clouds

Manhattan-world urban scenes are common in the real world. We propose a fully automatic approach for reconstructing such scenes from 3D point samples. Our key idea is to represent the geometry of the buildings in the scene using a set of well-aligned boxes. We first extract plane hypothesis from the points followed by an iterative refinement step. Then, candidate boxes are obtained by partitioning the space of the point cloud into a non-uniform grid. After that, we choose an optimal subset of the candidate boxes to approximate the geometry of the buildings. The contribution of our work is that we transform scene reconstruction into a labeling problem that is solved based on a novel Markov Random Field formulation. Unlike previous methods designed for particular types of input point clouds, our method can obtain faithful reconstructions from a variety of data sources. Experiments demonstrate that our method is superior to state-of-the-art methods.

Minglei Li, Peter Wonka, Liangliang Nan

From Multiview Image Curves to 3D Drawings

Reconstructing 3D scenes from multiple views has made impressive strides in recent years, chiefly by correlating isolated feature points, intensity patterns, or curvilinear structures. In the general setting – without controlled acquisition, abundant texture, curves and surfaces following specific models or limiting scene complexity – most methods produce unorganized point clouds, meshes, or voxel representations, with some exceptions producing unorganized clouds of 3D curve fragments. Ideally, many applications require structured representations of curves, surfaces and their spatial relationships. This paper presents a step in this direction by formulating an approach that combines 2D image curves into a collection of 3D curves, with topological connectivity between them represented as a 3D graph. This results in a 3D drawing, which is complementary to surface representations in the same sense as a 3D scaffold complements a tent taut over it. We evaluate our results against truth on synthetic and real datasets.

Anil Usumezbas, Ricardo Fabbri, Benjamin B. Kimia

Shape from Selfies: Human Body Shape Estimation Using CCA Regression Forests

In this work, we revise the problem of human body shape estimation from monocular imagery. Starting from a statistical human shape model that describes a body shape with shape parameters, we describe a novel approach to automatically estimate these parameters from a single input shape silhouette using semi-supervised learning. By utilizing silhouette features that encode local and global properties robust to noise, pose and view changes, and projecting them to lower dimensional spaces obtained through multi-view learning with canonical correlation analysis, we show how regression forests can be used to compute an accurate mapping from the silhouette to the shape parameter space. This results in a very fast, robust and automatic system under mild self-occlusion assumptions. We extensively evaluate our method on thousands of synthetic and real data and compare it to the state-of-art approaches that operate under more restrictive assumptions.

Endri Dibra, Cengiz Öztireli, Remo Ziegler, Markus Gross

Can We Jointly Register and Reconstruct Creased Surfaces by Shape-from-Template Accurately?

Shape-from-Template (SfT) aims to reconstruct a deformable object from a single image using a texture-mapped 3D model of the object in a reference position. Most existing SfT methods require well-textured surfaces that deform smoothly, which is a significant limitation. Due to the sparsity of correspondence constraint and strong regularizations, they usually fail to reconstruct strong changes of surface curvature such as surface creases. We investigate new ways to solve SfT for creased surfaces. Our main idea is to implicitly model creases with a dense mesh-based surface representation with an associated robust bending energy term, which deactivates curvature smoothing automatically where needed. Crucially, the crease locations are not required a priori since they emerge as the lowest-energy state during optimization. We show with real data that by combining this model with correspondence and surface boundary constraints we can successfully reconstruct creases while also preserving smooth regions.

Mathias Gallardo, Toby Collins, Adrien Bartoli

Distractor-Supported Single Target Tracking in Extremely Cluttered Scenes

This paper presents a novel method for single target tracking in RGB images under conditions of extreme clutter and camouflage, including frequent occlusions by objects with similar appearance as the target. In contrast to conventional single target trackers, which only maintain the estimated target status, we propose a multi-level clustering-based robust estimation for online detection and learning of multiple target-like regions, called distractors, when they appear near to the true target. To distinguish the target from these distractors, we exploit a global dynamic constraint (derived from the target and the distractors) in a feedback loop to improve single target tracking performance in situations where the target is camouflaged in highly cluttered scenes. Our proposed method successfully prevents the estimated target location from erroneously jumping to a distractor during occlusion or extreme camouflage interactions. To gain an insightful understanding of the evaluated trackers, we have augmented publicly available benchmark videos, by proposing a new set of clutter and camouflage sub-attributes, and annotating these sub-attributes for all frames in all sequences. Using this dataset, we first evaluate the effect of each key component of the tracker on the overall performance. Then, the proposed tracker is compared to other highly ranked single target tracking algorithms in the literature. The experimental results show that applying the proposed global dynamic constraint in a feedback loop can improve single target tracker performance, and demonstrate that the overall algorithm significantly outperforms other state-of-the-art single target trackers in highly cluttered scenes.

Jingjing Xiao, Linbo Qiao, Rustam Stolkin, Aleš Leonardis

Connectionist Temporal Modeling for Weakly Supervised Action Labeling

We propose a weakly-supervised framework for action labeling in video, where only the order of occurring actions is required during training time. The key challenge is that the per-frame alignments between the input (video) and label (action) sequences are unknown during training. We address this by introducing the Extended Connectionist Temporal Classification (ECTC) framework to efficiently evaluate all possible alignments via dynamic programming and explicitly enforce their consistency with frame-to-frame visual similarities. This protects the model from distractions of visually inconsistent or degenerated alignments without the need of temporal supervision. We further extend our framework to the semi-supervised case when a few frames are sparsely annotated in a video. With less than 1 % of labeled frames per video, our method is able to outperform existing semi-supervised approaches and achieve comparable performance to that of fully supervised approaches.

De-An Huang, Li Fei-Fei, Juan Carlos Niebles

Deep Joint Image Filtering

Joint image filters can leverage the guidance image as a prior and transfer the structural details from the guidance image to the target image for suppressing noise or enhancing spatial resolution. Existing methods rely on various kinds of explicit filter construction or hand-designed objective functions. It is thus difficult to understand, improve, and accelerate them in a coherent framework. In this paper, we propose a learning-based approach to construct a joint filter based on Convolutional Neural Networks. In contrast to existing methods that consider only the guidance image, our method can selectively transfer salient structures that are consistent in both guidance and target images. We show that the model trained on a certain type of data, e.g., RGB and depth images, generalizes well for other modalities, e.g., Flash/Non-Flash and RGB/NIR images. We validate the effectiveness of the proposed joint filter through extensive comparisons with state-of-the-art methods.

Yijun Li, Jia-Bin Huang, Narendra Ahuja, Ming-Hsuan Yang

Efficient Multi-frequency Phase Unwrapping Using Kernel Density Estimation

In this paper we introduce an efficient method to unwrap multi-frequency phase estimates for time-of-flight ranging. The algorithm generates multiple depth hypotheses and uses a spatial kernel density estimate (KDE) to rank them. The confidence produced by the KDE is also an effective means to detect outliers. We also introduce a new closed-form expression for phase noise prediction, that better fits real data. The method is applied to depth decoding for the Kinect v2 sensor, and compared to the Microsoft Kinect SDK and to the open source driver libfreenect2. The intended Kinect v2 use case is scenes with less than 8 m range, and for such cases we observe consistent improvements, while maintaining real-time performance. When extending the depth range to the maximal value of 18.75 m, we get about $$52\,\%$$52% more valid measurements than libfreenect2. The effect is that the sensor can now be used in large depth scenes, where it was previously not a good choice.

Felix Järemo Lawin, Per-Erik Forssén, Hannes Ovrén

A Multi-scale CNN for Affordance Segmentation in RGB Images

Given a single RGB image our goal is to label every pixel with an affordance type. By affordance, we mean an object’s capability to readily support a certain human action, without requiring precursor actions. We focus on segmenting the following five affordance types in indoor scenes: ‘walkable’, ‘sittable’, ‘lyable’, ‘reachable’, and ‘movable’. Our approach uses a deep architecture, consisting of a number of multi-scale convolutional neural networks, for extracting mid-level visual cues and combining them toward affordance segmentation. The mid-level cues include depth map, surface normals, and segmentation of four types of surfaces – namely, floor, structure, furniture and props. For evaluation, we augmented the NYUv2 dataset with new ground-truth annotations of the five affordance types. We are not aware of prior work which starts from pixels, infers mid-level cues, and combines them in a feed-forward fashion for predicting dense affordance maps of a single RGB image.

Anirban Roy, Sinisa Todorovic

Hierarchical Dynamic Parsing and Encoding for Action Recognition

A video action generally exhibits quite complex rhythms and non-stationary dynamics. To model such non-uniform dynamics, this paper describes a novel hierarchical dynamic encoding method to capture both the locally smooth dynamics and globally drastic dynamic changes. It provides a multi-layer joint representation for temporal modeling for action recognition. At the first layer, the action sequence is parsed in an unsupervised manner into several smooth-changing stages corresponding to different key poses or temporal structures. The dynamics within each stage are encoded by mean-pooling or learning to rank based encoding. At the second layer, the temporal information of the ordered dynamics extracted from the previous layer is encoded again to form the overall representation. Extensive experiments on a gesture action dataset (Chalearn) and several generic action datasets (Olympic Sports and Hollywood2) have demonstrated the effectiveness of the proposed method.

Bing Su, Jiahuan Zhou, Xiaoqing Ding, Hao Wang, Ying Wu

Distinct Class-Specific Saliency Maps for Weakly Supervised Semantic Segmentation

In this paper, we deal with a weakly supervised semantic segmentation problem where only training images with image-level labels are available. We propose a weakly supervised semantic segmentation method which is based on CNN-based class-specific saliency maps and fully-connected CRF. To obtain distinct class-specific saliency maps which can be used as unary potentials of CRF, we propose a novel method to estimate class saliency maps which improves the method proposed by Simonyan et al. (2014) significantly by the following improvements: (1) using CNN derivatives with respect to feature maps of the intermediate convolutional layers with up-sampling instead of an input image; (2) subtracting the saliency maps of the other classes from the saliency maps of the target class to differentiate target objects from other objects; (3) aggregating multiple-scale class saliency maps to compensate lower resolution of the feature maps. After obtaining distinct class saliency maps, we apply fully-connected CRF by using the class maps as unary potentials. By the experiments, we show that the proposed method has outperformed state-of-the-art results with the PASCAL VOC 2012 dataset under the weakly-supervised setting.

Wataru Shimoda, Keiji Yanai

A Diagram is Worth a Dozen Images

Diagrams are common tools for representing complex concepts, relationships and events, often when it would be difficult to portray the same information with natural images. Understanding natural images has been extensively studied in computer vision, while diagram understanding has received little attention. In this paper, we study the problem of diagram interpretation, the challenging task of identifying the structure of a diagram and the semantics of its constituents and their relationships. We introduce Diagram Parse Graphs (DPG) as our representation to model the structure of diagrams. We define syntactic parsing of diagrams as learning to infer DPGs for diagrams and study semantic interpretation and reasoning of diagrams in the context of diagram question answering. We devise an LSTM-based method for syntactic parsing of diagrams and introduce a DPG-based attention model for diagram question answering. We compile a new dataset of diagrams with exhaustive annotations of constituents and relationships for about 5,000 diagrams and 15,000 questions and answers. Our results show the significance of our models for syntactic parsing and question answering in diagrams using DPGs.

Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi

Automatic Attribute Discovery with Neural Activations

How can a machine learn to recognize visual attributes emerging out of online community without a definitive supervised dataset? This paper proposes an automatic approach to discover and analyze visual attributes from a noisy collection of image-text data on the Web. Our approach is based on the relationship between attributes and neural activations in the deep network. We characterize the visual property of the attribute word as a divergence within weakly-annotated set of images. We show that the neural activations are useful for discovering and learning a classifier that well agrees with human perception from the noisy real-world Web data. The empirical study suggests the layered structure of the deep neural networks also gives us insights into the perceptual depth of the given word. Finally, we demonstrate that we can utilize highly-activating neurons for finding semantically relevant regions.

Sirion Vittayakorn, Takayuki Umeda, Kazuhiko Murasaki, Kyoko Sudo, Takayuki Okatani, Kota Yamaguchi

“What Happens If...” Learning to Predict the Effect of Forces in Images

What happens if one pushes a cup sitting on a table toward the edge of the table? How about pushing a desk against a wall? In this paper, we study the problem of understanding the movements of objects as a result of applying external forces to them. For a given force vector applied to a specific location in an image, our goal is to predict long-term sequential movements caused by that force. Doing so entails reasoning about scene geometry, objects, their attributes, and the physical rules that govern the movements of objects. We design a deep neural network model that learns long-term sequential dependencies of object movements while taking into account the geometry and appearance of the scene by combining Convolutional and Recurrent Neural Networks. Training our model requires a large-scale dataset of object movements caused by external forces. To build a dataset of forces in scenes, we reconstructed all images in SUN RGB-D dataset in a physics simulator to estimate the physical movements of objects caused by external forces applied to them. Our Forces in Scenes (ForScene) dataset contains 65,000 object movements in 3D which represent a variety of external forces applied to different types of objects. Our experimental evaluations show that the challenging task of predicting long-term movements of objects as their reaction to external forces is possible from a single image. The code and dataset are available at: http://allenai.org/plato/forces.

View Synthesis by Appearance Flow

We address the problem of novel view synthesis: given an input image, synthesizing new images of the same object or scene observed from arbitrary viewpoints. We approach this as a learning task but, critically, instead of learning to synthesize pixels from scratch, we learn to copy them from the input image. Our approach exploits the observation that the visual appearance of different views of the same instance is highly correlated, and such correlation could be explicitly learned by training a convolutional neural network (CNN) to predict appearance flows – 2-D coordinate vectors specifying which pixels in the input view could be used to reconstruct the target view. Furthermore, the proposed framework easily generalizes to multiple input views by learning how to optimally combine single-view predictions. We show that for both objects and scenes, our approach is able to synthesize novel views of higher perceptual quality than previous CNN-based techniques.

Tinghui Zhou, Shubham Tulsiani, Weilun Sun, Jitendra Malik, Alexei A. Efros

Top-Down Learning for Structured Labeling with Convolutional Pseudoprior

Current practice in convolutional neural networks (CNN) remains largely bottom-up and the role of top-down process in CNN for pattern analysis and visual inference is not very clear. In this paper, we propose a new method for structured labeling by developing convolutional pseudoprior (ConvPP) on the ground-truth labels. Our method has several interesting properties: (1) compared with classic machine learning algorithms like CRFs and Structural SVM, ConvPP automatically learns rich convolutional kernels to capture both short- and long- range contexts; (2) compared with cascade classifiers like Auto-Context, ConvPP avoids the iterative steps of learning a series of discriminative classifiers and automatically learns contextual configurations; (3) compared with recent efforts combining CNN models with CRFs and RNNs, ConvPP learns convolution in the labeling space with improved modeling capability and less manual specification; (4) compared with Bayesian models like MRFs, ConvPP capitalizes on the rich representation power of convolution by automatically learning priors built on convolutional filters. We accomplish our task using pseudo-likelihood approximation to the prior under a novel fixed-point network structure that facilitates an end-to-end learning process. We show state-of-the-art results on sequential labeling and image labeling benchmarks.

Saining Xie, Xun Huang, Zhuowen Tu

Generative Image Modeling Using Style and Structure Adversarial Networks

Current generative frameworks use end-to-end learning and generate images by sampling from uniform noise distribution. However, these approaches ignore the most basic principle of image formation: images are product of: (a) Structure: the underlying 3D model; (b) Style: the texture mapped onto structure. In this paper, we factorize the image generation process and propose Style and Structure Generative Adversarial Network ($${\text {S}^2}$$S2-GAN). Our $${\text {S}^2}$$S2-GAN has two components: the Structure-GAN generates a surface normal map; the Style-GAN takes the surface normal map as input and generates the 2D image. Apart from a real vs. generated loss function, we use an additional loss with computed surface normals from generated images. The two GANs are first trained independently, and then merged together via joint learning. We show our $${\text {S}^2}$$S2-GAN model is interpretable, generates more realistic images and can be used to learn unsupervised RGBD representations.

Xiaolong Wang, Abhinav Gupta

Joint Learning of Semantic and Latent Attributes

As mid-level semantic properties shared across object categories, attributes have been studied extensively. Recent approaches have attempted joint modelling of multiple attributes together with class labels so as to exploit their correlations for better attribute prediction and object recognition. However, they often ignore the fact that there exist some shared properties other than nameable/semantic attributes, which we call latent attributes. Basically, they can be further divided into discriminative and non-discriminative parts depending on whether they can contribute to an object recognition task. We argue that learning the latent attributes jointly with user-defined semantic attributes not only leads to better representation for object recognition but also helps with semantic attribute prediction. A novel dictionary learning model is proposed which decomposes the dictionary space into three parts corresponding to semantic, latent discriminative and latent background attributes respectively. An efficient algorithm is then formulated to solve the resultant optimization problem. Extensive experiments show that the proposed attribute learning method produces state-of-the-art results on both attribute prediction and attribute-based person re-identification.

Peixi Peng, Yonghong Tian, Tao Xiang, Yaowei Wang, Tiejun Huang

A Unified Multi-scale Deep Convolutional Neural Network for Fast Object Detection

A unified deep neural network, denoted the multi-scale CNN (MS-CNN), is proposed for fast multi-scale object detection. The MS-CNN consists of a proposal sub-network and a detection sub-network. In the proposal sub-network, detection is performed at multiple output layers, so that receptive fields match objects of different scales. These complementary scale-specific detectors are combined to produce a strong multi-scale object detector. The unified network is learned end-to-end, by optimizing a multi-task loss. Feature upsampling by deconvolution is also explored, as an alternative to input upsampling, to reduce the memory and computation costs. State-of-the-art object detection performance, at up to 15 fps, is reported on datasets, such as KITTI and Caltech, containing a substantial number of small objects.

Zhaowei Cai, Quanfu Fan, Rogerio S. Feris, Nuno Vasconcelos

Deep Specialized Network for Illuminant Estimation

Illuminant estimation to achieve color constancy is an ill-posed problem. Searching the large hypothesis space for an accurate illuminant estimation is hard due to the ambiguities of unknown reflections and local patch appearances. In this work, we propose a novel Deep Specialized Network (DS-Net) that is adaptive to diverse local regions for estimating robust local illuminants. This is achieved through a new convolutional network architecture with two interacting sub-networks, i.e. an hypotheses network (HypNet) and a selection network (SelNet). In particular, HypNet generates multiple illuminant hypotheses that inherently capture different modes of illuminants with its unique two-branch structure. SelNet then adaptively picks for confident estimations from these plausible hypotheses. Extensive experiments on the two largest color constancy benchmark datasets show that the proposed ‘hypothesis selection’ approach is effective to overcome erroneous estimation. Through the synergy of HypNet and SelNet, our approach outperforms state-of-the-art methods such as [1–3].

Wu Shi, Chen Change Loy, Xiaoou Tang

Weakly-Supervised Semantic Segmentation Using Motion Cues

Fully convolutional neural networks (FCNNs) trained on a large number of images with strong pixel-level annotations have become the new state of the art for the semantic segmentation task. While there have been recent attempts to learn FCNNs from image-level weak annotations, they need additional constraints, such as the size of an object, to obtain reasonable performance. To address this issue, we present motion-CNN (M-CNN), a novel FCNN framework which incorporates motion cues and is learned from video-level weak annotations. Our learning scheme to train the network uses motion segments as soft constraints, thereby handling noisy motion information. When trained on weakly-annotated videos, our method outperforms the state-of-the-art approach [1] on the PASCAL VOC 2012 image segmentation benchmark. We also demonstrate that the performance of M-CNN learned with 150 weak video annotations is on par with state-of-the-art weakly-supervised methods trained with thousands of images. Finally, M-CNN substantially outperforms recent approaches in a related task of video co-localization on the YouTube-Objects dataset.

Pavel Tokmakov, Karteek Alahari, Cordelia Schmid

Human-in-the-Loop Person Re-identification

Current person re-identification (re-id) methods assume that (1) pre-labelled training data are available for every camera pair, (2) the gallery size for re-identification is moderate. Both assumptions scale poorly to real-world applications when camera network size increases and gallery size becomes large. Human verification of automatic model ranked re-id results becomes inevitable. In this work, a novel human-in-the-loop re-id model based on Human Verification Incremental Learning (HVIL) is formulated which does not require any pre-labelled training data to learn a model, therefore readily scalable to new camera pairs. This HVIL model learns cumulatively from human feedback to provide instant improvement to re-id ranking of each probe on-the-fly enabling the model scalable to large gallery sizes. We further formulate a Regularised Metric Ensemble Learning (RMEL) model to combine a series of incrementally learned HVIL models into a single ensemble model to be used when human feedback becomes unavailable.

Hanxiao Wang, Shaogang Gong, Xiatian Zhu, Tao Xiang

Real-Time Monocular Segmentation and Pose Tracking of Multiple Objects

We present a real-time system capable of segmenting multiple 3D objects and tracking their pose using a single RGB camera, based on prior shape knowledge. The proposed method uses twist-coordinates for pose parametrization and a pixel-wise second-order optimization approach which lead to major improvements in terms of tracking robustness, especially in cases of fast motion and scale changes, compared to previous region-based approaches. Our implementation runs at about 50–100 Hz on a commodity laptop when tracking a single object without relying on GPGPU computations. We compare our method to the current state of the art in various experiments involving challenging motion sequences and different complex objects.

Henning Tjaden, Ulrich Schwanecke, Elmar Schömer

Estimation of Human Body Shape in Motion with Wide Clothing

Estimating 3D human body shape in motion from a sequence of unstructured oriented 3D point clouds is important for many applications. We propose the first automatic method to solve this problem that works in the presence of loose clothing. The problem is formulated as an optimization problem that solves for identity and posture parameters in a shape space capturing likely body shape variations. The automation is achieved by leveraging a recent robust pose detection method [1]. To account for clothing, we take advantage of motion cues by encouraging the estimated body shape to be inside the observations. The method is evaluated on a new benchmark containing different subjects, motions, and clothing styles that allows to quantitatively measure the accuracy of body shape estimates. Furthermore, we compare our results to existing methods that require manual input and demonstrate that results of similar visual quality can be obtained.

Jinlong Yang, Jean-Sébastien Franco, Franck Hétroy-Wheeler, Stefanie Wuhrer

A Shape-Based Approach for Salient Object Detection Using Deep Learning

Salient object detection is a key step in many image analysis tasks as it not only identifies relevant parts of a visual scene but may also reduce computational complexity by filtering out irrelevant segments of the scene. In this paper, we propose a novel salient object detection method that combines a shape prediction driven by a convolutional neural network with the mid and low-region preserving image information. Our model learns a shape of a salient object using a CNN model for a target region and estimates the full but coarse saliency map of the target image. The map is then refined using image specific low-to-mid level information. Experimental results show that the proposed method outperforms previous state-of-the-arts methods in salient object detection.

Fast Optical Flow Using Dense Inverse Search

Most recent works in optical flow extraction focus on the accuracy and neglect the time complexity. However, in real-life visual applications, such as tracking, activity detection and recognition, the time complexity is critical. We propose a solution with very low time complexity and competitive accuracy for the computation of dense optical flow. It consists of three parts: (1) inverse search for patch correspondences; (2) dense displacement field creation through patch aggregation along multiple scales; (3) variational refinement. At the core of our Dense Inverse Search-based method (DIS) is the efficient search of correspondences inspired by the inverse compositional image alignment proposed by Baker and Matthews (2001, 2004). DIS is competitive on standard optical flow benchmarks. DIS runs at 300 Hz up to 600 Hz on a single CPU core (1024 $$\times$$× 436 resolution. 42 Hz/46 Hz when including preprocessing: disk access, image re-scaling, gradient computation. More details in Sect. 3.1.), reaching the temporal resolution of human’s biological vision system. It is order(s) of magnitude faster than state-of-the-art methods in the same range of accuracy, making DIS ideal for real-time applications.

Till Kroeger, Radu Timofte, Dengxin Dai, Luc Van Gool

Global Registration of 3D Point Sets via LRS Decomposition

This paper casts the global registration of multiple 3D point-sets into a low-rank and sparse decomposition problem. This neat mathematical formulation caters for missing data, outliers and noise, and it benefits from a wealth of available decomposition algorithms that can be plugged-in. Experimental results show that this approach compares favourably to the state of the art in terms of precision and speed, and it outperforms all the analysed techniques as for robustness to outliers.

Federica Arrigoni, Beatrice Rossi, Andrea Fusiello

Recognition from Hand Cameras: A Revisit with Deep Learning

We revisit the study of a wrist-mounted camera system (referred to as HandCam) for recognizing activities of hands. HandCam has two unique properties as compared to egocentric systems (referred to as HeadCam): (1) it avoids the need to detect hands; (2) it more consistently observes the activities of hands. By taking advantage of these properties, we propose a deep-learning-based method to recognize hand states (free vs. active hands, hand gestures, object categories), and discover object categories. Moreover, we propose a novel two-streams deep network to further take advantage of both HandCam and HeadCam. We have collected a new synchronized HandCam and HeadCam dataset with 20 videos captured in three scenes for hand states recognition. Experiments show that our HandCam system consistently outperforms a deep-learning-based HeadCam method (with estimated manipulation regions) and a dense-trajectory-based HeadCam method in all tasks. We also show that HandCam videos captured by different users can be easily aligned to improve free vs. active recognition accuracy ($$3.3\,\%$$3.3% improvement) in across-scenes use case. Moreover, we observe that finetuning Convolutional Neural Network consistently improves accuracy. Finally, our novel two-streams deep network combining HandCam and HeadCam achieves the best performance in four out of five tasks. With more data, we believe a joint HandCam and HeadCam system can robustly log hand states in daily life.

Cheng-Sheng Chan, Shou-Zhong Chen, Pei-Xuan Xie, Chiung-Chih Chang, Min Sun

XNOR-Net: ImageNet Classification Using Binary Convolutional Neural Networks

We propose two efficient approximations to standard convolutional neural networks: Binary-Weight-Networks and XNOR-Networks. In Binary-Weight-Networks, the filters are approximated with binary values resulting in 32$$\times$$× memory saving. In XNOR-Networks, both the filters and the input to convolutional layers are binary. XNOR-Networks approximate convolutions using primarily binary operations. This results in 58$$\times$$× faster convolutional operations (in terms of number of the high precision operations) and 32$$\times$$× memory savings. XNOR-Nets offer the possibility of running state-of-the-art networks on CPUs (rather than GPUs) in real-time. Our binary networks are simple, accurate, efficient, and work on challenging visual tasks. We evaluate our approach on the ImageNet classification task. The classification accuracy with a Binary-Weight-Network version of AlexNet is the same as the full-precision AlexNet. We compare our method with recent network binarization methods, BinaryConnect and BinaryNets, and outperform these methods by large margins on ImageNet, more than $$16\,\%$$16% in top-1 accuracy. Our code is available at: http://allenai.org/plato/xnornet.

Top-Down Neural Attention by Excitation Backprop

We aim to model the top-down attention of a Convolutional Neural Network (CNN) classifier for generating task-specific attention maps. Inspired by a top-down human visual attention model, we propose a new backpropagation scheme, called Excitation Backprop, to pass along top-down signals downwards in the network hierarchy via a probabilistic Winner-Take-All process. Furthermore, we introduce the concept of contrastive attention to make the top-down attention maps more discriminative. In experiments, we demonstrate the accuracy and generalizability of our method in weakly supervised localization tasks on the MS COCO, PASCAL VOC07 and ImageNet datasets. The usefulness of our method is further validated in the text-to-region association task. On the Flickr30k Entities dataset, we achieve promising performance in phrase localization by leveraging the top-down attention of a CNN model that has been trained on weakly labeled web images.

Jianming Zhang, Zhe Lin, Jonathan Brandt, Xiaohui Shen, Stan Sclaroff

Learning Recursive Filters for Low-Level Vision via a Hybrid Neural Network

In this paper, we consider numerous low-level vision problems (e.g., edge-preserving filtering and denoising) as recursive image filtering via a hybrid neural network. The network contains several spatially variant recurrent neural networks (RNN) as equivalents of a group of distinct recursive filters for each pixel, and a deep convolutional neural network (CNN) that learns the weights of RNNs. The deep CNN can learn regulations of recurrent propagation for various tasks and effectively guides recurrent propagation over an entire image. The proposed model does not need a large number of convolutional channels nor big kernels to learn features for low-level vision filters. It is significantly smaller and faster in comparison with a deep CNN based image filter. Experimental results show that many low-level vision tasks can be effectively learned and carried out in real-time by the proposed algorithm.

Sifei Liu, Jinshan Pan, Ming-Hsuan Yang

Learning Representations for Automatic Colorization

We develop a fully automatic image colorization system. Our approach leverages recent advances in deep networks, exploiting both low-level and semantic representations. As many scene elements naturally appear according to multimodal color distributions, we train our model to predict per-pixel color histograms. This intermediate output can be used to automatically generate a color image, or further manipulated prior to image formation. On both fully and partially automatic colorization tasks, we outperform existing methods. We also explore colorization as a vehicle for self-supervised visual representation learning.

Gustav Larsson, Michael Maire, Gregory Shakhnarovich

Deep Reconstruction-Classification Networks for Unsupervised Domain Adaptation

In this paper, we propose a novel unsupervised domain adaptation algorithm based on deep learning for visual object recognition. Specifically, we design a new model called Deep Reconstruction-Classification Network (DRCN), which jointly learns a shared encoding representation for two tasks: (i) supervised classification of labeled source data, and (ii) unsupervised reconstruction of unlabeled target data. In this way, the learnt representation not only preserves discriminability, but also encodes useful information from the target domain. Our new DRCN model can be optimized by using backpropagation similarly as the standard neural networks.We evaluate the performance of $$DRCN$$DRCN on a series of cross-domain object recognition tasks, where $$DRCN$$DRCN provides a considerable improvement (up to $$\sim$$∼8$$\%$$% in accuracy) over the prior state-of-the-art algorithms. Interestingly, we also observe that the reconstruction pipeline of $$DRCN$$DRCN transforms images from the source domain into images whose appearance resembles the target dataset. This suggests that $$DRCN$$DRCN’s performance is due to constructing a single composite representation that encodes information about both the structure of target images and the classification of source images. Finally, we provide a formal analysis to justify the algorithm’s objective in domain adaptation context.

Muhammad Ghifary, W. Bastiaan Kleijn, Mengjie Zhang, David Balduzzi, Wen Li

Learning Without Forgetting

When building a unified vision system or gradually adding new capabilities to a system, the usual assumption is that training data for all tasks is always available. However, as the number of tasks grows, storing and retraining on such data becomes infeasible. A new problem arises where we add new capabilities to a Convolutional Neural Network (CNN), but the training data for its existing capabilities are unavailable. We propose our Learning without Forgetting method, which uses only new task data to train the network while preserving the original capabilities. Our method performs favorably compared to commonly used feature extraction and fine-tuning adaption techniques and performs similarly to multitask learning that uses original task data we assume unavailable. A more surprising observation is that Learning without Forgetting may be able to replace fine-tuning as standard practice for improved new task performance.

Zhizhong Li, Derek Hoiem

Identity Mappings in Deep Residual Networks

Deep residual networks have emerged as a family of extremely deep architectures showing compelling accuracy and nice convergence behaviors. In this paper, we analyze the propagation formulations behind the residual building blocks, which suggest that the forward and backward signals can be directly propagated from one block to any other block, when using identity mappings as the skip connections and after-addition activation. A series of ablation experiments support the importance of these identity mappings. This motivates us to propose a new residual unit, which makes training easier and improves generalization. We report improved results using a 1001-layer ResNet on CIFAR-10 (4.62 % error) and CIFAR-100, and a 200-layer ResNet on ImageNet. Code is available at: https://github.com/KaimingHe/resnet-1k-layers.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun

Deep Networks with Stochastic Depth

Very deep convolutional networks with hundreds of layers have led to significant reductions in error on competitive benchmarks. Although the unmatched expressiveness of the many layers can be highly desirable at test time, training very deep networks comes with its own set of challenges. The gradients can vanish, the forward flow often diminishes, and the training time can be painfully slow. To address these problems, we propose stochastic depth, a training procedure that enables the seemingly contradictory setup to train short networks and use deep networks at test time. We start with very deep networks but during training, for each mini-batch, randomly drop a subset of layers and bypass them with the identity function. This simple approach complements the recent success of residual networks. It reduces training time substantially and improves the test error significantly on almost all data sets that we used for evaluation. With stochastic depth we can increase the depth of residual networks even beyond 1200 layers and still yield meaningful improvements in test error (4.91 % on CIFAR-10).

Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, Kilian Q. Weinberger

Less Is More: Towards Compact CNNs

To attain a favorable performance on large-scale datasets, convolutional neural networks (CNNs) are usually designed to have very high capacity involving millions of parameters. In this work, we aim at optimizing the number of neurons in a network, thus the number of parameters. We show that, by incorporating sparse constraints into the objective function, it is possible to decimate the number of neurons during the training stage. As a result, the number of parameters and the memory footprint of the neural network are also reduced, which is also desirable at the test time. We evaluated our method on several well-known CNN structures including AlexNet, and VGG over different datasets including ImageNet. Extensive experimental results demonstrate that our method leads to compact networks. Taking first fully connected layer as an example, our compact CNN contains only $$30\,\%$$30% of the original neurons without any degradation of the top-1 classification accuracy.

Hao Zhou, Jose M. Alvarez, Fatih Porikli

Unsupervised Visual Representation Learning by Graph-Based Consistent Constraints

Learning rich visual representations often require training on datasets of millions of manually annotated examples. This substantially limits the scalability of learning effective representations as labeled data is expensive or scarce. In this paper, we address the problem of unsupervised visual representation learning from a large, unlabeled collection of images. By representing each image as a node and each nearest-neighbor matching pair as an edge, our key idea is to leverage graph-based analysis to discover positive and negative image pairs (i.e., pairs belonging to the same and different visual categories). Specifically, we propose to use a cycle consistency criterion for mining positive pairs and geodesic distance in the graph for hard negative mining. We show that the mined positive and negative image pairs can provide accurate supervisory signals for learning effective representations using Convolutional Neural Networks (CNNs). We demonstrate the effectiveness of the proposed unsupervised constraint mining method in two settings: (1) unsupervised feature learning and (2) semi-supervised learning. For unsupervised feature learning, we obtain competitive performance with several state-of-the-art approaches on the PASCAL VOC 2007 dataset. For semi-supervised learning, we show boosted performance by incorporating the mined constraints on three image classification datasets.

Dong Li, Wei-Chih Hung, Jia-Bin Huang, Shengjin Wang, Narendra Ahuja, Ming-Hsuan Yang

Seed, Expand and Constrain: Three Principles for Weakly-Supervised Image Segmentation

We introduce a new loss function for the weakly-supervised training of semantic image segmentation models based on three guiding principles: to seed with weak localization cues, to expand objects based on the information about which classes can occur in an image, and to constrain the segmentations to coincide with object boundaries. We show experimentally that training a deep convolutional neural network using the proposed loss function leads to substantially better segmentations than previous state-of-the-art methods on the challenging PASCAL VOC 2012 dataset. We furthermore give insight into the working mechanism of our method by a detailed experimental study that illustrates how the segmentation quality is affected by each term of the proposed loss function as well as their combinations.

Alexander Kolesnikov, Christoph H. Lampert

Patch-Based Low-Rank Matrix Completion for Learning of Shape and Motion Models from Few Training Samples

Statistical models have opened up new possibilities for the automated analysis of images. However, the limited availability of representative training data, e.g. segmented images, leads to a bottleneck for the application of statistical models in practice. In this paper, we propose a novel patch-based technique that enables to learn representative statistical models of shape, appearance, or motion with a high grade of detail from a small number of observed training samples using low-rank matrix completion methods. Our method relies on the assumption that local variations have limited effects in distant areas. We evaluate our approach on three exemplary applications: (1) 2D shape modeling of faces, (2) 3D modeling of human lung shapes, and (3) population-based modeling of respiratory organ deformation. A comparison with the classical PCA-based modeling approach and FEM-PCA shows an improved generalization ability for small training sets indicating the improved flexibility of the model.

Jan Ehrhardt, Matthias Wilms, Heinz Handels

Chained Predictions Using Convolutional Neural Networks

In this work, we present an adaptation of the sequence-to-sequence model for structured vision tasks. In this model, the output variables for a given input are predicted sequentially using neural networks. The prediction for each output variable depends not only on the input but also on the previously predicted output variables. The model is applied to spatial localization tasks and uses convolutional neural networks (CNNs) for processing input images and a multi-scale deconvolutional architecture for making spatial predictions at each step. We explore the impact of weight sharing with a recurrent connection matrix between consecutive predictions, and compare it to a formulation where these weights are not tied. Untied weights are particularly suited for problems with a fixed sized structure, where different classes of output are predicted at different steps. We show that chain models achieve top performing results on human pose estimation from images and videos.

Georgia Gkioxari, Alexander Toshev, Navdeep Jaitly

Multi-region Two-Stream R-CNN for Action Detection

We propose a multi-region two-stream R-CNN model for action detection in realistic videos. We start from frame-level action detection based on faster R-CNN, and make three contributions: (1) we show that a motion region proposal network generates high-quality proposals, which are complementary to those of an appearance region proposal network; (2) we show that stacking optical flow over several frames significantly improves frame-level action detection; and (3) we embed a multi-region scheme in the faster R-CNN model, which adds complementary information on body parts. We then link frame-level detections with the Viterbi algorithm, and temporally localize an action with the maximum subarray method. Experimental results on the UCF-Sports, J-HMDB and UCF101 action detection datasets show that our approach outperforms the state of the art with a significant margin in both frame-mAP and video-mAP.

Xiaojiang Peng, Cordelia Schmid

Semantic Co-segmentation in Videos

Discovering and segmenting objects in videos is a challenging task due to large variations of objects in appearances, deformed shapes and cluttered backgrounds. In this paper, we propose to segment objects and understand their visual semantics from a collection of videos that link to each other, which we refer to as semantic co-segmentation. Without any prior knowledge on videos, we first extract semantic objects and utilize a tracking-based approach to generate multiple object-like tracklets across the video. Each tracklet maintains temporally connected segments and is associated with a predicted category. To exploit rich information from other videos, we collect tracklets that are assigned to the same category from all videos, and co-select tracklets that belong to true objects by solving a submodular function. This function accounts for object properties such as appearances, shapes and motions, and hence facilitates the co-segmentation process. Experiments on three video object segmentation datasets show that the proposed algorithm performs favorably against the other state-of-the-art methods.

Yi-Hsuan Tsai, Guangyu Zhong, Ming-Hsuan Yang

Attribute2Image: Conditional Image Generation from Visual Attributes

This paper investigates a novel problem of generating images from visual attributes. We model the image as a composite of foreground and background and develop a layered generative model with disentangled latent variables that can be learned end-to-end using a variational auto-encoder. We experiment with natural images of faces and birds and demonstrate that the proposed models are capable of generating realistic and diverse samples with disentangled latent representations. We use a general energy minimization algorithm for posterior inference of latent variables given novel images. Therefore, the learned generative models show excellent quantitative and visual results in the tasks of attribute-conditioned image reconstruction and completion.

Xinchen Yan, Jimei Yang, Kihyuk Sohn, Honglak Lee

Modeling Context Between Objects for Referring Expression Understanding

Referring expressions usually describe an object using properties of the object and relationships of the object with other objects. We propose a technique that integrates context between objects to understand referring expressions. Our approach uses an LSTM to learn the probability of a referring expression, with input features from a region and a context region. The context regions are discovered using multiple-instance learning (MIL) since annotations for context objects are generally not available for training. We utilize max-margin based MIL objective functions for training the LSTM. Experiments on the Google RefExp and UNC RefExp datasets show that modeling context between objects provides better performance than modeling only object properties. We also qualitatively show that our technique can ground a referring expression to its referred region along with the supporting context region.

Varun K. Nagaraja, Vlad I. Morariu, Larry S. Davis

Friction from Reflectance: Deep Reflectance Codes for Predicting Physical Surface Properties from One-Shot In-Field Reflectance

Images are the standard input for vision algorithms, but one-shot in-field reflectance measurements are creating new opportunities for recognition and scene understanding. In this work, we address the question of what reflectance can reveal about materials in an efficient manner. We go beyond the question of recognition and labeling and ask the question: What intrinsic physical properties of the surface can be estimated using reflectance? We introduce a framework that enables prediction of actual friction values for surfaces using one-shot reflectance measurements. This work is a first of its kind vision-based friction estimation. We develop a novel representation for reflectance disks that capture partial BRDF measurements instantaneously. Our method of deep reflectance codes combines CNN features and fisher vector pooling with optimal binary embedding to create codes that have sufficient discriminatory power and have important properties of illumination and spatial invariance. The experimental results demonstrate that reflectance can play a new role in deciphering the underlying physical properties of real-world scenes.

Hang Zhang, Kristin Dana, Ko Nishino

Saliency Detection with Recurrent Fully Convolutional Networks

Deep networks have been proved to encode high level semantic features and delivered superior performance in saliency detection. In this paper, we go one step further by developing a new saliency model using recurrent fully convolutional networks (RFCNs). Compared with existing deep network based methods, the proposed network is able to incorporate saliency prior knowledge for more accurate inference. In addition, the recurrent architecture enables our method to automatically learn to refine the saliency map by correcting its previous errors. To train such a network with numerous parameters, we propose a pre-training strategy using semantic segmentation data, which simultaneously leverages the strong supervision of segmentation tasks for better training and enables the network to capture generic representations of objects for saliency detection. Through extensive experimental evaluations, we demonstrate that the proposed method compares favorably against state-of-the-art approaches, and that the proposed recurrent deep model as well as the pre-training method can significantly improve performance.

Linzhao Wang, Lijun Wang, Huchuan Lu, Pingping Zhang, Xiang Ruan

Deep3D: Fully Automatic 2D-to-3D Video Conversion with Deep Convolutional Neural Networks

As 3D movie viewing becomes mainstream and the Virtual Reality (VR) market emerges, the demand for 3D contents is growing rapidly. Producing 3D videos, however, remains challenging. In this paper we propose to use deep neural networks to automatically convert 2D videos and images to a stereoscopic 3D format. In contrast to previous automatic 2D-to-3D conversion algorithms, which have separate stages and need ground truth depth map as supervision, our approach is trained end-to-end directly on stereo pairs extracted from existing 3D movies. This novel training scheme makes it possible to exploit orders of magnitude more data and significantly increases performance. Indeed, Deep3D outperforms baselines in both quantitative and human subject evaluations.

Junyuan Xie, Ross Girshick, Ali Farhadi

Temporal Model Adaptation for Person Re-identification

Person re-identification is an open and challenging problem in computer vision. Majority of the efforts have been spent either to design the best feature representation or to learn the optimal matching metric. Most approaches have neglected the problem of adapting the selected features or the learned model over time. To address such a problem, we propose a temporal model adaptation scheme with human in the loop. We first introduce a similarity-dissimilarity learning method which can be trained in an incremental fashion by means of a stochastic alternating directions methods of multipliers optimization procedure. Then, to achieve temporal adaptation with limited human effort, we exploit a graph-based approach to present the user only the most informative probe-gallery matches that should be used to update the model. Results on three datasets have shown that our approach performs on par or even better than state-of-the-art approaches while reducing the manual pairwise labeling effort by about $$80\,\%$$80%.

Niki Martinel, Abir Das, Christian Micheloni, Amit K. Roy-Chowdhury

Backmatter

Weitere Informationen