## About this book

The eight-volume set comprising LNCS volumes 9905-9912 constitutes the refereed proceedings of the 14th European Conference on Computer Vision, ECCV 2016, held in Amsterdam, The Netherlands, in October 2016.
The 415 revised papers presented were carefully reviewed and selected from 1480 submissions. The papers cover all aspects of computer vision and pattern recognition such as 3D computer vision; computational photography, sensing and display; face and gesture; low-level vision and image processing; motion and tracking; optimization methods; physics-based vision, photometry and shape-from-X; recognition: detection, categorization, indexing, matching; segmentation, grouping and shape representation; statistical methods and learning; video: events, activities and surveillance; applications. They are organized in topical sections on detection, recognition and retrieval; scene understanding; optimization; image and video processing; learning; action, activity and tracking; 3D; and 9 poster sessions.

## Table of contents

### Query-Focused Extractive Video Summarization

Video data is explosively growing. As a result of the “big video data”, intelligent algorithms for automatic video summarization have (re-)emerged as a pressing need. We develop a probabilistic model, Sequential and Hierarchical Determinantal Point Process (SH-DPP), for query-focused extractive video summarization. Given a user query and a long video sequence, our algorithm returns a summary by selecting key shots from the video. The decision to include a shot in the summary depends on the shot’s relevance to the user query and importance in the context of the video, jointly. We verify our approach on two densely annotated video datasets. The query-focused video summarization is particularly useful for search engines, e.g., to display snippets of videos.

Aidean Sharghi, Boqing Gong, Mubarak Shah
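
As a rough illustration of the determinantal point process idea named in the abstract above, the sketch below scores a subset of shots by the determinant of the kernel restricted to that subset. It is a plain DPP, not the sequential and hierarchical SH-DPP of the paper, and it ignores the user query; the kernel values and the helper names `determinant` and `subset_score` are made up for illustration.

```python
def determinant(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]  # work on a copy
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det  # a row swap flips the sign
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def subset_score(kernel, subset):
    """Unnormalized DPP probability: determinant of the kernel restricted to subset."""
    sub = [[kernel[i][j] for j in subset] for i in subset]
    return determinant(sub)


# Hypothetical 3-shot similarity kernel: shots 0 and 1 are near-duplicates,
# shot 2 shows something different.
KERNEL = [
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]

redundant = subset_score(KERNEL, [0, 1])  # about 0.19
diverse = subset_score(KERNEL, [0, 2])    # about 0.99
```

The diverse pair scores higher than the redundant pair, which is the property that makes DPP-style models attractive for choosing summary shots that do not repeat each other.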

### Temporal Segment Networks: Towards Good Practices for Deep Action Recognition

Deep convolutional networks have achieved great success for visual recognition in still images. However, for action recognition in videos, the advantage over traditional methods is not so evident. This paper aims to discover the principles to design effective ConvNet architectures for action recognition in videos and learn these models given limited training samples. Our first contribution is temporal segment network (TSN), a novel framework for video-based action recognition, which is based on the idea of long-range temporal structure modeling. It combines a sparse temporal sampling strategy and video-level supervision to enable efficient and effective learning using the whole action video. The other contribution is our study on a series of good practices in learning ConvNets on video data with the help of temporal segment network. Our approach obtains state-of-the-art performance on the HMDB51 ($$69.4\,\%$$) and UCF101 ($$94.2\,\%$$) datasets. We also visualize the learned ConvNet models, which qualitatively demonstrates the effectiveness of temporal segment network and the proposed good practices (Models and code at https://github.com/yjxiong/temporal-segment-networks).

Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao, Dahua Lin, Xiaoou Tang, Luc Van Gool
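
The sparse temporal sampling and video-level supervision described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the segments are assumed to be of equal length, and plain averaging is assumed as the way to combine snippet scores into a video-level score.

```python
import random


def sample_snippet_indices(num_frames, num_segments, rng=None):
    """Pick one random frame index from each of num_segments equal-length segments."""
    if num_segments <= 0 or num_frames < num_segments:
        raise ValueError("need at least one frame per segment")
    rng = rng or random.Random()
    indices = []
    for k in range(num_segments):
        start = (k * num_frames) // num_segments
        end = ((k + 1) * num_frames) // num_segments  # exclusive bound
        indices.append(rng.randrange(start, end))
    return indices


def segmental_consensus(snippet_scores):
    """Average per-snippet class scores into one video-level score vector
    (averaging is an assumed choice of consensus function)."""
    num_snippets = len(snippet_scores)
    num_classes = len(snippet_scores[0])
    return [
        sum(scores[c] for scores in snippet_scores) / num_snippets
        for c in range(num_classes)
    ]
```

For a 90-frame video and three segments, `sample_snippet_indices(90, 3)` returns one index from each of the ranges 0-29, 30-59 and 60-89, so the sampled snippets span the whole video while the cost stays independent of its length.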

### PlaNet - Photo Geolocation with Convolutional Neural Networks

Is it possible to determine the location of a photo from just its pixels? While the general problem seems exceptionally difficult, photos often contain cues such as landmarks, weather patterns, vegetation, road markings, or architectural details, which in combination allow one to infer where the photo was taken. Previously, this problem has been approached using image retrieval methods. In contrast, we pose the problem as one of classification by subdividing the surface of the earth into thousands of multi-scale geographic cells, and train a deep network using millions of geotagged images. We show that the resulting model, called PlaNet, outperforms previous approaches and even attains superhuman accuracy in some cases. Moreover, we extend our model to photo albums by combining it with a long short-term memory (LSTM) architecture. By learning to exploit temporal coherence to geolocate uncertain photos, this model achieves a 50 % performance improvement over the single-image model.

Tobias Weyand, Ilya Kostrikov, James Philbin
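
To make the idea of multi-scale geographic cells as classification targets more concrete, here is a small sketch under strong simplifying assumptions: it splits a flat latitude/longitude box into four until each cell holds at most `max_photos` photos. The quadtree scheme, the thresholds and the function names are illustrative assumptions; the abstract does not specify how the cells are built.

```python
def build_cells(photos, max_photos, max_depth,
                bounds=(-90.0, 90.0, -180.0, 180.0), depth=0):
    """Recursively split a lat/lng box into four until each cell holds at most
    max_photos photos or max_depth is reached. Empty cells are dropped.
    Boxes are half-open, so lat == 90 or lng == 180 exactly is not covered.
    Returns a list of (bounds, photo_count) tuples."""
    lat_min, lat_max, lng_min, lng_max = bounds
    inside = [(lat, lng) for lat, lng in photos
              if lat_min <= lat < lat_max and lng_min <= lng < lng_max]
    if not inside:
        return []
    if len(inside) <= max_photos or depth == max_depth:
        return [(bounds, len(inside))]
    lat_mid = (lat_min + lat_max) / 2
    lng_mid = (lng_min + lng_max) / 2
    cells = []
    for child in ((lat_min, lat_mid, lng_min, lng_mid),
                  (lat_min, lat_mid, lng_mid, lng_max),
                  (lat_mid, lat_max, lng_min, lng_mid),
                  (lat_mid, lat_max, lng_mid, lng_max)):
        cells.extend(build_cells(inside, max_photos, max_depth, child, depth + 1))
    return cells


def cell_label(cells, lat, lng):
    """Class index of the cell containing (lat, lng), or None if no cell covers it."""
    for index, ((lat_min, lat_max, lng_min, lng_max), _) in enumerate(cells):
        if lat_min <= lat < lat_max and lng_min <= lng < lng_max:
            return index
    return None


# Hypothetical geotags: three photos near Amsterdam, one in Sydney.
PHOTOS = [(52.37, 4.90), (52.36, 4.88), (52.38, 4.91), (-33.87, 151.21)]
CELLS = build_cells(PHOTOS, max_photos=2, max_depth=6)
```

Densely photographed regions end up in small cells and sparse regions in large ones, so every class has a comparable amount of training data; a classifier would then be trained to predict `cell_label` for each image.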

### Detecting Text in Natural Image with Connectionist Text Proposal Network

We propose a novel Connectionist Text Proposal Network (CTPN) that accurately localizes text lines in natural images. The CTPN detects a text line in a sequence of fine-scale text proposals directly in convolutional feature maps. We develop a vertical anchor mechanism that jointly predicts location and text/non-text score of each fixed-width proposal, considerably improving localization accuracy. The sequential proposals are naturally connected by a recurrent neural network, which is seamlessly incorporated into the convolutional network, resulting in an end-to-end trainable model. This allows the CTPN to explore rich context information of the image, making it powerful at detecting extremely ambiguous text. The CTPN works reliably on multi-scale and multi-language text without further post-processing, departing from previous bottom-up methods requiring multi-step post filtering. It achieves 0.88 and 0.61 F-measure on the ICDAR 2013 and 2015 benchmarks, surpassing recent results [8, 35] by a large margin. The CTPN is computationally efficient with 0.14 s/image, by using the very deep VGG16 model [27]. An online demo is available at http://textdet.com/.

Zhi Tian, Weilin Huang, Tong He, Pan He, Yu Qiao

### Face Recognition Using a Unified 3D Morphable Model

We address the problem of 3D-assisted 2D face recognition in scenarios when the input image is subject to degradations or exhibits intra-personal variations not captured by the 3D model. The proposed solution involves a novel approach to learn a subspace spanned by perturbations caused by the missing modes of variation and image degradations, using 3D face data reconstructed from 2D images rather than 3D capture. This is accomplished by modelling the difference in the texture map of the 3D aligned input and reference images. A training set of these texture maps then defines a perturbation space which can be represented using PCA bases. Assuming that the image perturbation subspace is orthogonal to the 3D face model space, these additive components can be recovered from an unseen input image, resulting in an improved fit of the 3D face model. The linearity of the model leads to efficient fitting. Experiments show that our method achieves very competitive face recognition performance on Multi-PIE and AR databases. We also present baseline face recognition results on a new data set exhibiting combined pose and illumination variations as well as occlusion.

Guosheng Hu, Fei Yan, Chi-Ho Chan, Weihong Deng, William Christmas, Josef Kittler, Neil M. Robertson

### Augmented Feedback in Semantic Segmentation Under Image Level Supervision

Training neural networks for semantic segmentation is data hungry. Meanwhile annotating a large number of pixel-level segmentation masks needs enormous human effort. In this paper, we propose a framework with only image-level supervision. It unifies semantic segmentation and object localization with important proposal aggregation and selection modules. They greatly reduce the notorious error accumulation problem that commonly arises in weakly supervised learning. Our proposed training algorithm progressively improves segmentation performance with augmented feedback in iterations. Our method achieves decent results on the PASCAL VOC 2012 segmentation data, outperforming previous image-level supervised methods by a large margin.

Xiaojuan Qi, Zhengzhe Liu, Jianping Shi, Hengshuang Zhao, Jiaya Jia

### Linear Depth Estimation from an Uncalibrated, Monocular Polarisation Image

We present a method for estimating surface height directly from a single polarisation image simply by solving a large, sparse system of linear equations. To do so, we show how to express polarisation constraints as equations that are linear in the unknown depth. The ambiguity in the surface normal azimuth angle is resolved globally when the optimal surface height is reconstructed. Our method is applicable to objects with uniform albedo exhibiting diffuse and specular reflectance. We extend it to an uncalibrated scenario by demonstrating that the illumination (point source or first/second order spherical harmonics) can be estimated from the polarisation image, up to a binary convex/concave ambiguity. We believe that our method is the first monocular, passive shape-from-x technique that enables well-posed depth estimation with only a single, uncalibrated illumination condition. We present results on glossy objects, including in uncontrolled, outdoor illumination.

William A. P. Smith, Ravi Ramamoorthi, Silvia Tozza

### Online Variational Bayesian Motion Averaging

In this paper, we propose a novel algorithm dedicated to online motion averaging for large scale problems. To this end, we design a filter that continuously approximates the posterior distribution of the estimated transformations. In order to deal with large scale problems, we associate a variational Bayesian approach with a relative parametrization of the absolute transformations. Such an association allows our algorithm to simultaneously possess two features that are essential for an algorithm dedicated to large scale online motion averaging: (1) a low computational time, (2) the ability to detect wrong loop closure measurements. We extensively demonstrate on several applications (binocular SLAM, monocular SLAM and video mosaicking) that our approach not only exhibits a low computational time and detects wrong loop closures but also significantly outperforms the state-of-the-art algorithm in terms of RMSE.

Guillaume Bourmaud

### Unified Depth Prediction and Intrinsic Image Decomposition from a Single Image via Joint Convolutional Neural Fields

We present a method for jointly predicting a depth map and intrinsic images from single-image input. The two tasks are formulated in a synergistic manner through a joint conditional random field (CRF) that is solved using a novel convolutional neural network (CNN) architecture, called the joint convolutional neural field (JCNF) model. Tailored to our joint estimation problem, JCNF differs from previous CNNs in its sharing of convolutional activations and layers between networks for each task, its inference in the gradient domain where there exists greater correlation between depth and intrinsic images, and the incorporation of a gradient scale network that learns the confidence of estimated gradients in order to effectively balance them in the solution. This approach is shown to surpass state-of-the-art methods both on single-image depth estimation and on intrinsic image decomposition.

Seungryong Kim, Kihong Park, Kwanghoon Sohn, Stephen Lin

### ObjectNet3D: A Large Scale Database for 3D Object Recognition

We contribute a large scale database for 3D object recognition, named ObjectNet3D, that consists of 100 categories, 90,127 images, 201,888 objects in these images and 44,147 3D shapes. Objects in the 2D images in our database are aligned with the 3D shapes, and the alignment provides both accurate 3D pose annotation and the closest 3D shape annotation for each 2D object. Consequently, our database is useful for recognizing the 3D pose and 3D shape of objects from 2D images. We also provide baseline experiments on four tasks: region proposal generation, 2D object detection, joint 2D detection and 3D object pose estimation, and image-based 3D shape retrieval, which can serve as baselines for future research using our database. Our database is available online at http://cvgl.stanford.edu/projects/objectnet3d.

Yu Xiang, Wonhui Kim, Wei Chen, Jingwei Ji, Christopher Choy, Hao Su, Roozbeh Mottaghi, Leonidas Guibas, Silvio Savarese

### Branching Gaussian Processes with Applications to Spatiotemporal Reconstruction of 3D Trees

We propose a robust method for estimating dynamic 3D curvilinear branching structure from monocular images. While 3D reconstruction from images has been widely studied, estimating thin structure has received less attention. This problem becomes more challenging in the presence of camera error, scene motion, and a constraint that curves are attached in a branching structure. We propose a new general-purpose prior, a branching Gaussian process (BGP), that models spatial smoothness and temporal dynamics of curves while enforcing attachment between them. We apply this prior to fit 3D trees directly to image data, using an efficient scheme for approximate inference based on expectation propagation. The BGP prior’s Gaussian form allows us to approximately marginalize over 3D trees with a given model structure, enabling principled comparison between tree models with varying complexity. We test our approach on a novel multi-view dataset depicting plants with known 3D structures and topologies undergoing small nonrigid motion. Our method outperforms a state-of-the-art 3D reconstruction method designed for non-moving thin structure. We evaluate under several common measures, and we propose a new measure for reconstructions of branching multi-part 3D scenes under motion.

Kyle Simek, Ravishankar Palanivelu, Kobus Barnard

### Tracking Completion

A fundamental component of modern trackers is an online learned tracking model, which is typically modeled either globally or locally. The two kinds of models perform differently in terms of effectiveness and robustness under different challenging situations. This work exploits the advantages of both models. A subspace model, from a global perspective, is learned from previously obtained targets via rank-minimization to address the tracking, and a pixel-level local observation is leveraged simultaneously, from a local point of view, to augment the subspace model. A matrix completion method is employed to integrate the two models. Unlike previous tracking methods, which locate the target among all fully observed target candidates, the proposed approach first estimates an expected target via the matrix completion through partially observed target candidates, and then, identifies the target according to the estimation accuracy with respect to the target candidates. Specifically, the tracking is formulated as a problem of target appearance estimation. Extensive experiments on various challenging video sequences verify the effectiveness of the proposed approach and demonstrate that the proposed tracker outperforms other popular state-of-the-art trackers.

Yao Sui, Guanghui Wang, Yafei Tang, Li Zhang

### Inter-battery Topic Representation Learning

In this paper, we present the Inter-Battery Topic Model (IBTM). Our approach extends traditional topic models by learning a factorized latent variable representation. The structured representation leads to a model that marries benefits traditionally associated with a discriminative approach, such as feature selection, with those of a generative model, such as principled regularization and ability to handle missing data. The factorization is provided by representing data in terms of aligned pairs of observations as different views. This provides means for selecting a representation that separately models topics that exist in both views from the topics that are unique to a single view. This structured consolidation allows for efficient and robust inference and provides a compact and efficient representation. Learning is performed in a Bayesian fashion by maximizing a rigorous bound on the log-likelihood. Firstly, we illustrate the benefits of the model on a synthetic dataset. The model is then evaluated in both uni- and multi-modality settings on two different classification tasks with off-the-shelf convolutional neural network (CNN) features which generate state-of-the-art results with extremely compact representations.

Cheng Zhang, Hedvig Kjellström, Carl Henrik Ek

### Online Adaptation for Joint Scene and Object Classification

Recent efforts in computer vision consider joint scene and object classification by exploiting mutual relationships (often termed as context) between them to achieve higher accuracy. On the other hand, there is also a lot of interest in online adaptation of recognition models as new data becomes available. In this paper, we address the problem of how models for joint scene and object classification can be learned online. A major motivation for this approach is to exploit the hierarchical relationships between scenes and objects, represented as a graphical model, in an active learning framework. To select the samples on the graph, which need to be labeled by a human, we use an information theoretic approach that reduces the joint entropy of scene and object variables. This leads to a significant reduction in the amount of manual labeling effort for similar or better performance when compared with a model trained with the full dataset. This is demonstrated through rigorous experimentation on three datasets.

Jawadul H. Bappy, Sujoy Paul, Amit K. Roy-Chowdhury
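
The entropy-based sample selection mentioned in the abstract above can be illustrated with a much simpler stand-in. The sketch below ranks unlabeled images by the sum of the Shannon entropies of their predicted scene and object distributions, which equals the joint entropy only if the variables are assumed independent; the paper instead works with a graphical model over scenes and objects. The data layout and function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_labeling(predictions, budget):
    """Return the ids of the `budget` most uncertain samples.

    predictions maps a sample id to a list of distributions: the scene
    distribution first, then one distribution per detected object.
    Uncertainty is the sum of entropies (an independence assumption).
    """
    ranked = sorted(
        predictions,
        key=lambda sid: sum(entropy(d) for d in predictions[sid]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical model outputs for three unlabeled images.
PREDICTIONS = {
    "a": [[0.5, 0.5], [0.5, 0.5]],    # uncertain scene and object
    "b": [[0.99, 0.01], [0.9, 0.1]],  # confident on both
    "c": [[0.5, 0.5], [1.0, 0.0]],    # uncertain scene only
}
```

With a budget of two, `select_for_labeling(PREDICTIONS, 2)` asks the annotator about images `a` and `c` and leaves the confidently classified image `b` unlabeled.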

### Real-Time Facial Segmentation and Performance Capture from RGB Input

We introduce the concept of unconstrained real-time 3D facial performance capture through explicit semantic segmentation in the RGB input. To ensure robustness, cutting edge supervised learning approaches rely on large training datasets of face images captured in the wild. While impressive tracking quality has been demonstrated for faces that are largely visible, any occlusion due to hair, accessories, or hand-to-face gestures would result in significant visual artifacts and loss of tracking accuracy. The modeling of occlusions has been mostly avoided due to its immense space of appearance variability. To address this curse of high dimensionality, we perform tracking in unconstrained images assuming non-face regions can be fully masked out. Along with recent breakthroughs in deep learning, we demonstrate that pixel-level facial segmentation is possible in real-time by repurposing convolutional neural networks designed originally for general semantic segmentation. We develop an efficient architecture based on a two-stream deconvolution network with complementary characteristics, and introduce carefully designed training samples and data augmentation strategies for improved segmentation accuracy and robustness. We adopt a state-of-the-art regression-based facial tracking framework with segmented face images as training, and demonstrate accurate and uninterrupted facial performance capture in the presence of extreme occlusion and even side views. Furthermore, the resulting segmentation can be directly used to composite partial 3D face models on the input images and enable seamless facial manipulation tasks, such as virtual make-up or face replacement.

Shunsuke Saito, Tianye Li, Hao Li

### Learning Temporal Transformations from Time-Lapse Videos

Based on life-long observations of physical, chemical, and biologic phenomena in the natural world, humans can often easily picture in their minds what an object will look like in the future. But, what about computers? In this paper, we learn computational models of object transformations from time-lapse videos. In particular, we explore the use of generative models to create depictions of objects at future times. These models explore several different prediction tasks: generating a future state given a single depiction of an object, generating a future state given two depictions of an object at different times, and generating future states recursively in a recurrent framework. We provide both qualitative and quantitative evaluations of the generated results, and also conduct a human evaluation to compare variations of our models.

Yipin Zhou, Tamara L. Berg

### Interactive Image Segmentation Using Constrained Dominant Sets

We propose a new approach to interactive image segmentation based on some properties of a family of quadratic optimization problems related to dominant sets, a well-known graph-theoretic notion of a cluster which generalizes the concept of a maximal clique to edge-weighted graphs. In particular, we show that by properly controlling a regularization parameter which determines the structure and the scale of the underlying problem, we are in a position to extract groups of dominant-set clusters which are constrained to contain user-selected elements. The resulting algorithm can deal naturally with any type of input modality, including scribbles, sloppy contours, and bounding boxes, and is able to robustly handle noisy annotations on the part of the user. Experiments on standard benchmark datasets show the effectiveness of our approach as compared to state-of-the-art algorithms on a variety of natural images under several input conditions.

Eyasu Zemene, Marcello Pelillo

### Deep Markov Random Field for Image Modeling

Markov Random Fields (MRFs), a formulation widely used in generative image modeling, have long been plagued by the lack of expressive power. This issue is primarily due to the fact that conventional MRF formulations tend to use simplistic factors to capture local patterns. In this paper, we move beyond such limitations, and propose a novel MRF model that uses fully-connected neurons to express the complex interactions among pixels. Through theoretical analysis, we reveal an inherent connection between this model and recurrent neural networks, and thereon derive an approximated feed-forward network that couples multiple RNNs along opposite directions. This formulation combines the expressive power of deep neural networks and the cyclic dependency structure of MRF in a unified model, bringing the modeling capability to a new level. The feed-forward approximation also allows it to be efficiently learned from data. Experimental results on a variety of low-level vision tasks show notable improvement over the state of the art.

Zhirong Wu, Dahua Lin, Xiaoou Tang

### A Symmetry Prior for Convex Variational 3D Reconstruction

We propose a novel prior for variational 3D reconstruction that favors symmetric solutions when dealing with noisy or incomplete data. We detect symmetries from incomplete data while explicitly handling unexplored areas to allow for plausible scene completions. The set of detected symmetries is then enforced on their respective support domain within a variational reconstruction framework. This formulation also handles multiple symmetries sharing the same support. The proposed approach is able to denoise and complete surface geometry and even hallucinate large scene parts. We demonstrate in several experiments the benefit of harnessing symmetries when regularizing a surface.

Pablo Speciale, Martin R. Oswald, Andrea Cohen, Marc Pollefeys

### SPLeaP: Soft Pooling of Learned Parts for Image Classification

The aggregation of image statistics – the so-called pooling step of image classification algorithms – as well as the construction of part-based models, are two distinct and well-studied topics in the literature. The former aims at leveraging a whole set of local descriptors that an image can contain (through spatial pyramids or Fisher vectors for instance) while the latter argues that only a few of the regions an image contains are actually useful for its classification. This paper bridges the two worlds by proposing a new pooling framework based on the discovery of useful parts involved in the pooling of local representations. The key contribution lies in a model integrating a boosted non-linear part classifier as well as a parametric soft-max pooling component, both trained jointly with the image classifier. The experimental validation shows that the proposed model not only consistently surpasses standard pooling approaches but also improves over state-of-the-art part-based models, on several different and challenging classification tasks.

Praveen Kulkarni, Frédéric Jurie, Joaquin Zepeda, Patrick Pérez, Louis Chevallier

### Spatial Attention Deep Net with Partial PSO for Hierarchical Hybrid Hand Pose Estimation

Discriminative methods often generate kinematically implausible hand poses, so generative methods are used to correct (or verify) these results in a hybrid method. Estimating 3D hand pose in a hierarchy, where the high-dimensional output space is decomposed into smaller ones, has been shown to be effective. Existing hierarchical methods mainly focus on the decomposition of the output space while the input space remains almost the same along the hierarchy. In this paper, a hybrid hand pose estimation method is proposed by applying the kinematic hierarchy strategy to the input space (as well as the output space) of the discriminative method by a spatial attention mechanism and to the optimization of the generative method by hierarchical Particle Swarm Optimization (PSO). The spatial attention mechanism integrates cascaded and hierarchical regression into a CNN framework by transforming both the input (and feature space) and the output space, which greatly reduces the viewpoint and articulation variations. Between the levels in the hierarchy, the hierarchical PSO enforces the kinematic constraints on the results of the CNNs. The experimental results show that our method significantly outperforms four state-of-the-art methods and three baselines on three public benchmarks.

Qi Ye, Shanxin Yuan, Tae-Kyun Kim

### VolumeDeform: Real-Time Volumetric Non-rigid Reconstruction

We present a novel approach for the reconstruction of dynamic geometric shapes using a single hand-held consumer-grade RGB-D sensor at real-time rates. Our method builds up the scene model from scratch during the scanning process, thus it does not require a pre-defined shape template to start with. Geometry and motion are parameterized in a unified manner by a volumetric representation that encodes a distance field of the surface geometry as well as the non-rigid space deformation. Motion tracking is based on a set of extracted sparse color features in combination with a dense depth constraint. This enables accurate tracking and drastically reduces drift inherent to standard model-to-depth alignment. We cast finding the optimal deformation of space as a non-linear regularized variational optimization problem by enforcing local smoothness and proximity to the input constraints. The problem is tackled in real-time at the camera’s capture rate using a data-parallel flip-flop optimization strategy. Our results demonstrate robust tracking even for fast motion and scenes that lack geometric features.

Matthias Innmann, Michael Zollhöfer, Matthias Nießner, Christian Theobalt, Marc Stamminger

### $$\pi$$Match: Monocular vSLAM and Piecewise Planar Reconstruction Using Fast Plane Correspondences

This paper proposes $$\pi$$Match, a monocular SLAM pipeline that, in contrast to current state-of-the-art feature-based methods, provides a dense Piecewise Planar Reconstruction (PPR) of the scene. It builds on recent advances in planar segmentation from affine correspondences (ACs) for generating motion hypotheses that are fed to a PEaRL framework which merges close motions and decides about multiple motion situations. Among the selected motions, the camera motion is identified and refined, allowing the subsequent refinement of the initial plane estimates. The high accuracy of this two-view approach allows a good scale estimation and a small drift in scale is observed, when compared to prior monocular methods. The final discrete optimization step provides an improved PPR of the scene. Experiments on the KITTI dataset show the accuracy of $$\pi$$Match and that it robustly handles situations of multiple motions and pure rotation of the camera. A Matlab implementation of the pipeline runs in about 0.7 s per frame.

Carolina Raposo, João P. Barreto

### Peripheral Expansion of Depth Information via Layout Estimation with Fisheye Camera

Consumer RGB-D cameras have become very useful in recent years, but their field of view is too narrow for certain applications. We propose a new hybrid camera system composed of a conventional RGB-D camera and a fisheye camera to extend the field of view over 180$$^{\circ}$$. With this system we have a region of the hemispherical image with depth certainty, and color data in the periphery that is used to extend the structural information of the scene. We have developed a new method to generate scaled layout hypotheses from relevant corners, combining the extraction of lines in the fisheye image and the depth information. Experiments with real images from different scenarios validate our layout recovery method and the advantages of this camera system, which is also able to overcome severe occlusions. As a result, we obtain a scaled 3D model expanding the original depth information with the wide scene reconstruction. Our proposal successfully expands the depth map more than eleven times in a single shot.

Alejandro Perez-Yus, Gonzalo Lopez-Nicolas, Jose J. Guerrero

### Built-in Foreground/Background Prior for Weakly-Supervised Semantic Segmentation

Pixel-level annotations are expensive and time consuming to obtain. Hence, weak supervision using only image tags could have a significant impact in semantic segmentation. Recently, CNN-based methods have proposed to fine-tune pre-trained networks using image tags. Without additional information, this leads to poor localization accuracy. This problem, however, was alleviated by making use of objectness priors to generate foreground/background masks. Unfortunately these priors either require training pixel-level annotations/bounding boxes, or still yield inaccurate object boundaries. Here, we propose a novel method to extract markedly more accurate masks from the pre-trained network itself, forgoing external objectness modules. This is accomplished using the activations of the higher-level convolutional layers, smoothed by a dense CRF. We demonstrate that our method, based on these masks and a weakly-supervised loss, outperforms the state-of-the-art tag-based weakly-supervised semantic segmentation techniques. Furthermore, we introduce a new form of inexpensive weak supervision yielding an additional accuracy boost.

### It’s Moving! A Probabilistic Model for Causal Motion Segmentation in Moving Camera Videos

The human ability to detect and segment moving objects works in the presence of multiple objects, complex background geometry, motion of the observer, and even camouflage. In addition to all of this, the ability to detect motion is nearly instantaneous. While there has been much recent progress in motion segmentation, it still appears we are far from human capabilities. In this work, we derive from first principles a likelihood function for assessing the probability of an optical flow vector given the 2D motion direction of an object. This likelihood uses a novel combination of the angle and magnitude of the optical flow to maximize the information about how objects are moving differently. Using this new likelihood and several innovations in initialization, we develop a motion segmentation algorithm that beats current state-of-the-art methods by a large margin. We compare to five state-of-the-art methods on two established benchmarks, and a third new data set of camouflaged animals, which we introduce to push motion segmentation to the next level.

Pia Bideau, Erik Learned-Miller

### Kernelized Subspace Ranking for Saliency Detection

In this paper, we propose a novel saliency method that takes advantage of object-level proposals and region-based convolutional neural network (R-CNN) features. We follow the learning-to-rank methodology, and solve a ranking problem satisfying the constraint that positive samples have higher scores than negative ones. As the dimensionality of the deep features is high and the amount of training data is low, ranking in the primal space is suboptimal. A new kernelized subspace ranking model is proposed by jointly learning a Rank-SVM classifier and a subspace projection. The projection aims to measure the pairwise distances in a low-dimensional space. For an image, the ranking score of each proposal is assigned by the learnt ranker. The final saliency map is generated by a weighted fusion of the top-ranked candidates. Experimental results show that the proposed algorithm performs favorably against the state-of-the-art methods on four benchmark datasets.

Tiantian Wang, Lihe Zhang, Huchuan Lu, Chong Sun, Jinqing Qi
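
The ranking constraint in the abstract above (positive samples should score higher than negative ones) can be illustrated with a pairwise hinge loss. This is only a minimal sketch of the generic learning-to-rank idea with a linear scorer, not the kernelized subspace model of the paper; the function names, the margin value, and the feature layout are assumptions made for illustration.

```python
def linear_score(weights, features):
    """Score one proposal with a plain linear ranker (illustrative stand-in)."""
    return sum(w * x for w, x in zip(weights, features))


def pairwise_hinge_loss(pos_scores, neg_scores, margin=1.0):
    """Average hinge penalty over all (positive, negative) score pairs.

    A pair contributes zero once the positive score exceeds the
    negative score by at least `margin` (hypothetical default of 1.0).
    """
    pairs = [(p, n) for p in pos_scores for n in neg_scores]
    if not pairs:
        return 0.0
    return sum(max(0.0, margin - (p - n)) for p, n in pairs) / len(pairs)
```

A loss of zero means every positive proposal already outranks every negative one by the margin; a ranker would be trained to push the loss toward that state.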

### Depth-Aware Motion Magnification

This paper adds depth to motion magnification. With the rise of cheap RGB+D cameras, depth information is readily available. We make use of depth to make motion magnification robust to occlusion and large motions. Current approaches require a manually drawn pixel mask over all frames in the area of interest, which is cumbersome and error-prone. By including depth, we avoid manual annotation and magnify motions at similar depth levels while ignoring occlusions at distant depth pixels. To achieve this, we propose an extension to the bilateral filter for non-Gaussian filters which allows us to treat pixels at very different depth layers as missing values. As our experiments will show, these missing values should be ignored, and not inferred with inpainting. We show results for a medical application (tremors) where we improve current baselines for motion magnification and motion measurements.

Julian F. P. Kooij, Jan C. van Gemert
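
The idea of treating pixels at very different depths as missing values, as described in the abstract above, can be sketched on a single scanline. The sketch below uses a simple box average with a hard depth threshold, which is an assumption for illustration and not the paper's bilateral-filter extension; `depth_masked_average` and `max_depth_gap` are hypothetical names.

```python
def depth_masked_average(values, depths, center, radius, max_depth_gap):
    """Average `values` around `center`, skipping pixels at a different depth.

    Pixels whose depth differs from the center pixel by more than
    `max_depth_gap` are treated as missing: they are ignored rather than
    filled in, and the result is normalized by the number of pixels kept.
    """
    ref_depth = depths[center]
    total = 0.0
    kept = 0
    start = max(0, center - radius)
    stop = min(len(values), center + radius + 1)
    for i in range(start, stop):
        if abs(depths[i] - ref_depth) <= max_depth_gap:
            total += values[i]
            kept += 1
    # The center pixel always passes the depth test, so kept >= 1.
    return total / kept
```

Normalizing by the number of retained pixels is what keeps an occluder at another depth from leaking into the filtered signal.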

### Stacked Hourglass Networks for Human Pose Estimation

This work introduces a novel convolutional network architecture for the task of human pose estimation. Features are processed across all scales and consolidated to best capture the various spatial relationships associated with the body. We show how repeated bottom-up, top-down processing used in conjunction with intermediate supervision is critical to improving the performance of the network. We refer to the architecture as a “stacked hourglass” network based on the successive steps of pooling and upsampling that are done to produce a final set of predictions. State-of-the-art results are achieved on the FLIC and MPII benchmarks, outcompeting all recent methods.

Alejandro Newell, Kaiyu Yang, Jia Deng

### Real-Time Large-Scale Dense 3D Reconstruction with Loop Closure

In the highly active research field of dense 3D reconstruction and modelling, loop closure is still a largely unsolved problem. While a number of previous works show how to accumulate keyframes, globally optimize their pose on closure, and compute a dense 3D model as a post-processing step, in this paper we propose an online framework which delivers a consistent 3D model to the user in real time. This is achieved by splitting the scene into submaps, and adjusting the poses of the submaps as and when required. We present a novel technique for accumulating relative pose constraints between the submaps at very little computational cost, and demonstrate how to maintain a lightweight, scalable global optimization of submap poses. In contrast to previous works, the number of submaps grows with the observed 3D scene surface, rather than with time. In addition to loop closure, the paper incorporates relocalization and provides a novel way of assessing tracking quality.

Olaf Kähler, Victor A. Prisacariu, David W. Murray

### Pixel-Level Domain Transfer

We present an image-conditional image generation model. The model transfers an input domain to a target domain at the semantic level, and generates the target image at the pixel level. To generate realistic target images, we employ the real/fake-discriminator as in Generative Adversarial Nets [6], but also introduce a novel domain-discriminator to make the generated image relevant to the input image. We verify our model through a challenging task of generating a piece of clothing from an input image of a dressed person. We present a high-quality clothing dataset containing the two domains, and succeed in demonstrating decent results.

Donggeun Yoo, Namil Kim, Sunggyun Park, Anthony S. Paek, In So Kweon

### Accelerating Convolutional Neural Networks with Dominant Convolutional Kernel and Knowledge Pre-regression

To accelerate deep convolutional neural networks (CNNs) at test time, we propose a model compression method that combines a novel dominant kernel (DK) with a new training method called knowledge pre-regression (KP). In the combined model DK$$^2$$PNet, DK performs a low-rank decomposition of convolutional kernels, while KP transfers knowledge of intermediate hidden layers from a larger teacher network to its compressed student network on the basis of a cross-entropy loss function instead of the previously used Euclidean distance. Compared with the latest results, experiments on the CIFAR-10, CIFAR-100, MNIST, and SVHN benchmarks show that our DK$$^2$$PNet method performs best in the sense of coming close to state-of-the-art accuracy while requiring dramatically fewer model parameters.

Zhenyang Wang, Zhidong Deng, Shiyao Wang

### Learning Social Etiquette: Human Trajectory Understanding In Crowded Scenes

Humans navigate crowded spaces such as a university campus by following common sense rules based on social etiquette. In this paper, we argue that in order to enable the design of new target tracking or trajectory forecasting methods that can take full advantage of these rules, we need to have access to better data in the first place. To that end, we contribute a new large-scale dataset that collects videos of various types of targets (not just pedestrians, but also bikers, skateboarders, cars, buses, golf carts) that navigate in a real-world outdoor environment such as a university campus. Moreover, we introduce a new characterization that describes the “social sensitivity” at which two targets interact. We use this characterization to define “navigation styles” and improve both forecasting models and state-of-the-art multi-target tracking, whereby the learnt forecasting models help the data association step.

Alexandre Robicquet, Amir Sadeghian, Alexandre Alahi, Silvio Savarese

### Bayesian Image Based 3D Pose Estimation

We introduce a 3D human pose estimation method from a single image, based on a hierarchical Bayesian non-parametric model. The proposed model relies on a representation of the idiosyncratic motion of human body parts, which is captured by a subdivision of the human skeleton joints into groups. A dictionary of motion snapshots for each group is generated. The hierarchy ensures that the visual features are integrated within the pose dictionary. Given a query image, the learned dictionary is used to estimate the likelihood of the group pose based on its visual features. The full-body pose is reconstructed taking into account the consistency of the connected group poses. The results show that the proposed approach is able to accurately reconstruct the 3D pose of previously unseen subjects.

Marta Sanzari, Valsamis Ntouskos, Fiora Pirri

### Efficient and Robust Semi-supervised Learning Over a Sparse-Regularized Graph

Graph-based Semi-Supervised Learning (GSSL) has limitations in widespread applicability due to its computationally prohibitive large-scale inference, sensitivity to data incompleteness, and inability to handle time-evolving characteristics in an open set. To address these issues, we propose a novel GSSL based on a batch of informative beacons with sparsity appropriately harnessed, rather than constructing the pairwise affinity graph over all of the original samples. Specifically, (1) beacons are placed automatically by unifying the consistency of both data features and labels, and subsequently act as indicators during the inference; (2) leveraging the information carried by beacons, the sample labels are interpreted as the weighted combination of a subset of characteristics-specified beacons; (3) if unfamiliar samples are encountered in an open set, we seek to expand the beacon set incrementally and update their parameters by incorporating additional human interventions if necessary. Experimental results on real datasets validate that our algorithm is effective and efficient at scalable inference, robust to sample corruptions, and capable of boosting the performance incrementally in an open set by updating the beacon-related parameters.

Hang Su, Jun Zhu, Zhaozheng Yin, Yinpeng Dong, Bo Zhang

### Novel Coplanar Line-Points Invariants for Robust Line Matching Across Views

Robust line matching across wide-baseline views is a challenging task in computer vision. Most of the existing methods depend heavily on the positional relationships between lines and the associated textures. These cues are sensitive to various image transformations, especially perspective deformations, and are likely to fail in scenarios where little texture is present. In this paper, we construct a new coplanar line-points invariant upon a newly developed projective invariant, named characteristic number, and propose a line matching algorithm using the invariant. The construction of this invariant uses intersections of coplanar lines instead of endpoints, rendering more robust matching across views. Additionally, a series of line-points invariant values generate the similarity metric for matching that is less affected by mismatched interest points than traditional approaches. Accurate homography recovered from the invariant allows all lines, even those without interest points around them, a chance to be matched. Extensive comparisons with the state-of-the-art validate the matching accuracy and robustness of the proposed method to projective transformations. The method also performs well for image pairs with few textures and similar textures.

Qi Jia, Xinkai Gao, Xin Fan, Zhongxuan Luo, Haojie Li, Ziyao Chen

### Sparse Representation Based Complete Kernel Marginal Fisher Analysis Framework for Computational Art Painting Categorization

This paper presents a sparse representation based complete kernel marginal Fisher analysis (SCMFA) framework for categorizing fine art images. First, we introduce several Fisher vector based features for feature extraction so as to extract and encode important discriminatory information of the painting image. Second, we propose a complete marginal Fisher analysis method so as to extract two kinds of discriminant information, regular and irregular. In particular, the regular discriminant features are extracted from the range space of the intraclass compactness using the marginal Fisher discriminant criterion whereas the irregular discriminant features are extracted from the null space of the intraclass compactness using the marginal interclass separability criterion. The motivation for extracting two kinds of discriminant information is that the traditional MFA method uses a PCA projection in the initial step that may discard the null space of the intraclass compactness which may contain useful discriminatory information. Finally, we learn a discriminative sparse representation model with the objective to integrate the representation criterion with the discriminant criterion in order to enhance the discriminative ability of the proposed method. The effectiveness of the proposed SCMFA method is assessed on the challenging Painting-91 dataset. Experimental results show that our proposed method is able to (i) achieve the state-of-the-art performance for painting artist and style classification, (ii) outperform other popular image descriptors and deep learning methods, (iii) improve upon the traditional MFA method as well as (iv) discover the artist and style influence to understand their connections in different art movement periods.

Ajit Puthenputhussery, Qingfeng Liu, Chengjun Liu

### 3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction

Inspired by the recent success of methods that employ shape priors to achieve robust 3D reconstructions, we propose a novel recurrent neural network architecture that we call the 3D Recurrent Reconstruction Neural Network (3D-R2N2). The network learns a mapping from images of objects to their underlying 3D shapes from a large collection of synthetic data [13]. Our network takes in one or more images of an object instance from arbitrary viewpoints and outputs a reconstruction of the object in the form of a 3D occupancy grid. Unlike most of the previous works, our network does not require any image annotations or object class labels for training or testing. Our extensive experimental analysis shows that our reconstruction framework (i) outperforms the state-of-the-art methods for single view reconstruction, and (ii) enables the 3D reconstruction of objects in situations when traditional SFM/SLAM methods fail (because of lack of texture and/or wide baseline).

Christopher B. Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, Silvio Savarese

### Cascaded Continuous Regression for Real-Time Incremental Face Tracking

This paper introduces a novel real-time algorithm for facial landmark tracking. Compared to detection, tracking has both additional challenges and opportunities. Arguably the most important aspect in this domain is updating a tracker’s models as tracking progresses, also known as incremental (face) tracking. While this should result in more accurate localisation, how to do this online and in real time without causing a tracker to drift is still an important open research question. We address this question in the cascaded regression framework, the state-of-the-art approach for facial landmark localisation. Because incremental learning for cascaded regression is costly, we propose a much more efficient yet equally accurate alternative using continuous regression. More specifically, we first propose cascaded continuous regression (CCR) and show its accuracy is equivalent to the Supervised Descent Method. We then derive the incremental learning updates for CCR (iCCR) and show that it is an order of magnitude faster than standard incremental learning for cascaded regression, bringing the time required for the update from seconds down to a fraction of a second, thus enabling real-time tracking. Finally, we evaluate iCCR and show the importance of incremental learning in achieving state-of-the-art performance. Code for our iCCR is available from http://www.cs.nott.ac.uk/~psxes1.

Enrique Sánchez-Lozano, Brais Martinez, Georgios Tzimiropoulos, Michel Valstar

### Real-Time Visual Tracking: Promoting the Robustness of Correlation Filter Learning

Correlation-filter-based tracking models have received a lot of attention and achieved great success in real-time tracking; however, the loss function in the current correlation filtering paradigm cannot respond reliably to the appearance changes caused by occlusion and illumination variations. This study intends to promote the robustness of the correlation filter learning. By exploiting the anisotropy of the filter response, three sparsity-related loss functions are proposed to alleviate the overfitting issue of previous methods and improve the overall tracking performance. As a result, three real-time trackers are implemented. Extensive experiments in various challenging situations demonstrate that the robustness of the learned correlation filter has been greatly improved via the designed loss functions. In addition, the study reveals, from an experimental perspective, how different loss functions essentially influence the tracking performance. An important conclusion is that the sensitivity of the peak values of the filter in successive frames is consistent with the tracking performance. This is a useful reference criterion in designing a robust correlation filter for visual tracking.

Yao Sui, Ziming Zhang, Guanghui Wang, Yafei Tang, Li Zhang

### Deep Self-correlation Descriptor for Dense Cross-Modal Correspondence

We present a novel descriptor, called deep self-correlation (DSC), designed for establishing dense correspondences between images taken under different imaging modalities, such as different spectral ranges or lighting conditions. Motivated by local self-similarity (LSS), we formulate a novel descriptor by leveraging LSS in a deep architecture, leading to better discriminative power and greater robustness to non-rigid image deformations than state-of-the-art descriptors. The DSC first computes self-correlation surfaces over a local support window for randomly sampled patches, and then builds hierarchical self-correlation surfaces by performing an average pooling within a deep architecture. Finally, the feature responses on the self-correlation surfaces are encoded through a spatial pyramid pooling in a circular configuration. In contrast to convolutional neural networks (CNNs) based descriptors, the DSC is training-free, is robust to cross-modal imaging, and can be densely computed in an efficient manner that significantly reduces computational redundancy. The state-of-the-art performance of DSC on challenging cases of cross-modal image pairs is demonstrated through extensive experiments.

Seungryong Kim, Dongbo Min, Stephen Lin, Kwanghoon Sohn

### Structured Matching for Phrase Localization

In this paper we introduce a new approach to phrase localization: grounding phrases in sentences to image regions. We propose a structured matching of phrases and regions that encourages the semantic relations between phrases to agree with the visual relations between regions. We formulate structured matching as a discrete optimization problem and relax it to a linear program. We use neural networks to embed regions and phrases into vectors, which then define the similarities (matching weights) between regions and phrases. We integrate structured matching with neural networks to enable end-to-end training. Experiments on Flickr30K Entities demonstrate the empirical effectiveness of our approach.

Mingzhe Wang, Mahmoud Azab, Noriyuki Kojima, Rada Mihalcea, Jia Deng

### Crossing-Line Crowd Counting with Two-Phase Deep Neural Networks

In this paper, we propose a deep Convolutional Neural Network (CNN) for counting the number of people across a line-of-interest (LOI) in surveillance videos. It is a challenging problem and has many potential applications. Observing the limitations of temporal slices used by state-of-the-art LOI crowd counting methods, our proposed CNN directly estimates the crowd counts with pairs of video frames as inputs and is trained with pixel-level supervision maps. Such rich supervision information helps our CNN learn more discriminative feature representations. A two-phase training scheme is adopted, which decomposes the original counting problem into two easier sub-problems: estimating the crowd density map and estimating the crowd velocity map. Learning to solve the sub-problems provides a good initial point for our CNN model, which is then fine-tuned to solve the original counting problem. A new dataset with pedestrian trajectory annotations is introduced for evaluating LOI crowd counting methods and has more annotations than any existing one. Our extensive experiments show that our proposed method is robust to variations of crowd density, crowd velocity, and directions of the LOI, and outperforms state-of-the-art LOI counting methods.

Zhuoyi Zhao, Hongsheng Li, Rui Zhao, Xiaogang Wang
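
One way to see how the two sub-problems in the abstract above (a crowd density map and a crowd velocity map) could combine into a line-crossing count is the sketch below. It assumes that the per-pixel flow across the line is the density times the velocity component along the line normal; this combination rule, the function name, and the data layout are illustrative assumptions, not details confirmed by the abstract.

```python
def crossing_counts(density, velocity, line_pixels, normal):
    """Estimate people crossing a line of interest between two frames.

    density[row][col]  : estimated people per pixel (assumed input)
    velocity[row][col] : (vx, vy) displacement in pixels per frame pair
    line_pixels        : list of (row, col) pixels lying on the line
    normal             : unit normal (nx, ny) of the line

    Returns (forward, backward) counts for the two crossing directions.
    """
    nx, ny = normal
    forward = 0.0
    backward = 0.0
    for row, col in line_pixels:
        vx, vy = velocity[row][col]
        # Density transported across the line at this pixel.
        flow = density[row][col] * (vx * nx + vy * ny)
        if flow > 0:
            forward += flow
        else:
            backward -= flow
    return forward, backward
```

Keeping the two directions separate reflects that an LOI counter typically reports traffic each way rather than a net flow.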

### Revisiting Visual Question Answering Baselines

Visual question answering (VQA) is an interesting learning setting for evaluating the abilities and shortcomings of current systems for image understanding. Many of the recently proposed VQA systems include attention or memory mechanisms designed to perform “reasoning”. Furthermore, for the task of multiple-choice VQA, nearly all of these systems train a multi-class classifier on image and question features to predict an answer. This paper questions the value of these common practices and develops a simple alternative model based on binary classification. Instead of treating answers as competing choices, our model receives the answer as input and predicts whether or not an image-question-answer triplet is correct. We evaluate our model on the Visual7W Telling and the VQA Real Multiple Choice tasks, and find that even simple versions of our model perform competitively. Our best model achieves state-of-the-art performance of 65.8% accuracy on the Visual7W Telling task and compares surprisingly well with the most complex systems proposed for the VQA Real Multiple Choice task. Additionally, we explore variants of the model and study the transferability of the model between both datasets. We also present an error analysis of our best model, the results of which suggest that a key problem of current VQA systems lies in the lack of visual grounding and localization of concepts that occur in the questions and answers.

Allan Jabri, Armand Joulin, Laurens van der Maaten
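
The binary formulation in the abstract above, where the answer is an input and the model predicts whether an image-question-answer triplet is correct, can be sketched as follows. The logistic scorer over concatenated features is an assumption chosen for brevity; the abstract does not specify the model's architecture, and all names here are hypothetical.

```python
import math


def triplet_score(weights, bias, image_feat, question_feat, answer_feat):
    """Probability that an (image, question, answer) triplet is correct.

    Uses a logistic unit over the concatenated feature vectors, as an
    illustrative stand-in for whatever classifier is actually trained.
    """
    features = list(image_feat) + list(question_feat) + list(answer_feat)
    logit = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-logit))


def pick_answer(weights, bias, image_feat, question_feat, candidates):
    """Return the name of the candidate answer with the highest score.

    `candidates` maps an answer string to its feature vector.
    """
    return max(
        candidates,
        key=lambda name: triplet_score(
            weights, bias, image_feat, question_feat, candidates[name]
        ),
    )
```

For multiple-choice evaluation, each candidate is scored independently and the highest-scoring one is returned, so the number of answer options can vary per question without changing the model.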

### Unsupervised CNN for Single View Depth Estimation: Geometry to the Rescue

A significant weakness of most current deep Convolutional Neural Networks is the need to train them using vast amounts of manually labelled data. In this work we propose an unsupervised framework to learn a deep convolutional neural network for single view depth prediction, without requiring a pre-training stage or annotated ground-truth depths. We achieve this by training the network in a manner analogous to an autoencoder. At training time we consider a pair of images, source and target, with small, known camera motion between the two such as a stereo pair. We train the convolutional encoder for the task of predicting the depth map for the source image. To do so, we explicitly generate an inverse warp of the target image using the predicted depth and known inter-view displacement, to reconstruct the source image; the photometric error in the reconstruction is the reconstruction loss for the encoder. The acquisition of this training data is considerably simpler than for equivalent systems, requiring no manual annotation, nor calibration of depth sensor to camera. We show that our network trained on less than half of the KITTI dataset gives comparable performance to that of the state-of-the-art supervised methods for single view depth estimation.

Ravi Garg, Vijay Kumar B.G., Gustavo Carneiro, Ian Reid
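
The reconstruction loss described in the abstract above (inverse-warp the target image with the predicted depth, then compare with the source) can be sketched on a single scanline of a rectified stereo pair. The sign convention (sampling the target at `x + d`), the linear interpolation, the border clamping, and the squared-error loss are assumptions made for illustration; the function names are hypothetical.

```python
def disparity_from_depth(depth, focal_length, baseline):
    """Standard rectified-stereo relation: disparity = f * B / depth."""
    return focal_length * baseline / depth


def warp_row(target_row, disparities):
    """Reconstruct the source row by sampling the target row at x + d.

    Sample positions are clamped to the row and linearly interpolated.
    """
    n = len(target_row)
    reconstructed = []
    for x, d in enumerate(disparities):
        pos = min(max(x + d, 0.0), n - 1.0)
        left = int(pos)
        right = min(left + 1, n - 1)
        frac = pos - left
        value = (1.0 - frac) * target_row[left] + frac * target_row[right]
        reconstructed.append(value)
    return reconstructed


def photometric_loss(source_row, reconstructed_row):
    """Mean squared intensity difference between source and reconstruction."""
    diffs = [(s - r) ** 2 for s, r in zip(source_row, reconstructed_row)]
    return sum(diffs) / len(diffs)
```

Because the loss depends on depth only through the warp, a depth prediction that explains the stereo pair well yields a low photometric error without any ground-truth depth.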

### A Continuous Optimization Approach for Efficient and Accurate Scene Flow

We propose a continuous optimization method for solving dense 3D scene flow problems from stereo imagery. As in recent work, we represent the dynamic 3D scene as a collection of rigidly moving planar segments. The scene flow problem then becomes the joint estimation of pixel-to-segment assignment, 3D position, normal vector and rigid motion parameters for each segment, leading to a complex and expensive discrete-continuous optimization problem. In contrast, we propose a purely continuous formulation which can be solved more efficiently. Using a fine superpixel segmentation that is fixed a priori, we propose a factor graph formulation that decomposes the problem into photometric, geometric, and smoothing constraints. We initialize the solution with a novel, high-quality initialization method, then independently refine the geometry and motion of the scene, and finally perform a global non-linear refinement using Levenberg-Marquardt. We evaluate our method on the challenging KITTI Scene Flow benchmark, ranking in third position, while being 3 to 30 times faster than the top competitors (x37 [10] and x3.75 [24]).

Zhaoyang Lv, Chris Beall, Pablo F. Alcantarilla, Fuxin Li, Zsolt Kira, Frank Dellaert

### Improving Multi-frame Data Association with Sparse Representations for Robust Near-online Multi-object Tracking

Multiple Object Tracking remains a difficult problem due to appearance variations and occlusions of the targets or detection failures. Using sophisticated appearance models or performing data association over multiple frames are two common approaches that lead to performance gains. Inspired by the success of sparse representations in Single Object Tracking, we propose to formulate the multi-frame data association step as an energy minimization problem, designing an energy that efficiently exploits sparse representations of all detections. Furthermore, we propose to use a structured sparsity-inducing norm to compute representations more suited to the tracking context. We perform extensive experiments to demonstrate the effectiveness of the proposed formulation, and evaluate our approach on two public authoritative benchmarks in order to compare it with several state-of-the-art methods.

Loïc Fagot-Bouquet, Romaric Audigier, Yoann Dhome, Frédéric Lerasle

### Gated Siamese Convolutional Neural Network Architecture for Human Re-identification

Matching pedestrians across multiple camera views, known as human re-identification, is a challenging research problem that has numerous applications in visual surveillance. With the resurgence of Convolutional Neural Networks (CNNs), several end-to-end deep Siamese CNN architectures have been proposed for human re-identification with the objective of projecting the images of similar pairs (i.e. same identity) to be closer to each other and those of dissimilar pairs to be distant from each other. However, current networks extract fixed representations for each image regardless of the other images paired with it, and the comparison with other images is done only at the final level. In this setting, the network is at risk of failing to extract finer local patterns that may be essential to distinguish positive pairs from hard negative pairs. In this paper, we propose a gating function to selectively emphasize such fine common local patterns by comparing the mid-level features across pairs of images. This produces flexible representations for the same image according to the images it is paired with. We conduct experiments on the CUHK03, Market-1501 and VIPeR datasets and demonstrate improved performance compared to a baseline Siamese CNN architecture.

Rahul Rama Varior, Mrinal Haloi, Gang Wang

### Saliency Detection via Combining Region-Level and Pixel-Level Predictions with CNNs

This paper proposes a novel saliency detection method by combining region-level saliency estimation and pixel-level saliency prediction with CNNs (denoted as CRPSD). For pixel-level saliency prediction, a fully convolutional neural network (called pixel-level CNN) is constructed by modifying the VGGNet architecture to perform multi-scale feature learning, based on which an image-to-image prediction is conducted to accomplish the pixel-level saliency detection. For region-level saliency estimation, an adaptive superpixel-based region generation technique is first designed to partition an image into regions, based on which the region-level saliency is estimated by using a CNN model (called region-level CNN). The pixel-level and region-level saliencies are fused to form the final saliency map by using another CNN (called fusion CNN). The pixel-level CNN and the fusion CNN are learned jointly. Extensive quantitative and qualitative experiments on four public benchmark datasets demonstrate that the proposed method greatly outperforms the state-of-the-art saliency detection approaches.

Youbao Tang, Xiangqian Wu

### Backmatter
