
About this book

The eight-volume set comprising LNCS volumes 9905-9912 constitutes the refereed proceedings of the 14th European Conference on Computer Vision, ECCV 2016, held in Amsterdam, The Netherlands, in October 2016.
The 415 revised papers presented were carefully reviewed and selected from 1480 submissions. The papers cover all aspects of computer vision and pattern recognition such as 3D computer vision; computational photography, sensing and display; face and gesture; low-level vision and image processing; motion and tracking; optimization methods; physics-based vision, photometry and shape-from-X; recognition: detection, categorization, indexing, matching; segmentation, grouping and shape representation; statistical methods and learning; video: events, activities and surveillance; applications. They are organized in topical sections on detection, recognition and retrieval; scene understanding; optimization; image and video processing; learning; action, activity and tracking; 3D; and 9 poster sessions.

Table of Contents


Poster Session 5 (Continued)


Image Quality Assessment Using Similar Scene as Reference

Most Image Quality Assessment (IQA) methods require the reference image to be pixel-wise aligned with the distorted image, which limits the applicability of reference-based IQA methods. In this paper, we show that a non-aligned image of a similar scene can serve well as a reference, using a proposed Dual-path deep Convolutional Neural Network (DCNN). Analysis indicates that the model captures both the structural information of the scene and the non-structural “naturalness” shared by the pair for quality assessment. As shown in the experiments, our proposed DCNN model handles the IQA problem well. With an aligned reference image, our predictions outperform many state-of-the-art methods. In the more general case where the reference image contains a similar scene but is not aligned with the distorted one, DCNN still achieves better consistency with subjective evaluation than many existing methods that use aligned reference images.

Yudong Liang, Jinjun Wang, Xingyu Wan, Yihong Gong, Nanning Zheng

MOON: A Mixed Objective Optimization Network for the Recognition of Facial Attributes

Attribute recognition, particularly facial, extracts many labels for each image. While some multi-task vision problems can be decomposed into separate tasks and stages, e.g., training independent models for each task, for a growing set of problems joint optimization across all tasks has been shown to improve performance. We show that for deep convolutional neural network (DCNN) facial attribute extraction, multi-task optimization is better. Unfortunately, it can be difficult to apply joint optimization to DCNNs when training data is imbalanced, and re-balancing multi-label data directly is structurally infeasible, since adding/removing data to balance one label will change the sampling of the other labels. This paper addresses the multi-label imbalance problem by introducing a novel mixed objective optimization network (MOON) with a loss function that mixes multiple task objectives with domain adaptive re-weighting of propagated loss. Experiments demonstrate that not only does MOON advance the state of the art in facial attribute recognition, but it also outperforms independently trained DCNNs using the same data. When using facial attributes for the LFW face recognition task, we show that our balanced (domain adapted) network outperforms the unbalanced trained network.

Ethan M. Rudd, Manuel Günther, Terrance E. Boult

Degeneracies in Rolling Shutter SfM

We address the problem of Structure from Motion (SfM) with rolling shutter cameras. We first show that many common camera configurations, e.g. cameras with parallel readout directions, become critical and allow for a large class of ambiguities in multi-view reconstruction. We provide a mathematical analysis for one-, two- and some multi-view cases and verify it with synthetic experiments. Next, we demonstrate that bundle adjustment with rolling shutter cameras that are close to critical configurations may still produce drastically deformed reconstructions. Finally, we provide practical recipes for how to photograph with rolling shutter cameras so as to avoid scene deformations in SfM. We evaluate the recipes and provide a quantitative analysis of their performance in real experiments. Our results show how to reconstruct correct 3D models with rolling shutter cameras.

Cenek Albl, Akihiro Sugimoto, Tomas Pajdla

Deep Deformation Network for Object Landmark Localization

We propose a novel cascaded framework, namely deep deformation network (DDN), for localizing landmarks in non-rigid objects. The hallmarks of DDN are its incorporation of geometric constraints within a convolutional neural network (CNN) framework, ease and efficiency of training, as well as generality of application. A novel shape basis network (SBN) forms the first stage of the cascade, whereby landmarks are initialized by combining the benefits of CNN features and a learned shape basis to reduce the complexity of the highly nonlinear pose manifold. In the second stage, a point transformer network (PTN) estimates local deformation parameterized as thin-plate spline transformation for a finer refinement. Our framework does not incorporate either handcrafted features or part connectivity, which enables an end-to-end shape prediction pipeline during both training and testing. In contrast to prior cascaded networks for landmark localization that learn a mapping from feature space to landmark locations, we demonstrate that the regularization induced through geometric priors in the DDN makes it easier to train, yet produces superior results. The efficacy and generality of the architecture is demonstrated through state-of-the-art performances on several benchmarks for multiple tasks such as facial landmark localization, human body pose estimation and bird part localization.

Xiang Yu, Feng Zhou, Manmohan Chandraker

Learning Visual Storylines with Skipping Recurrent Neural Networks

What does a typical visit to Paris look like? Do people first take photos of the Louvre and then the Eiffel Tower? Can we visually model a temporal event like “Paris Vacation” using current frameworks? In this paper, we explore how we can automatically learn the temporal aspects, or storylines of visual concepts from web data. Previous attempts focus on consecutive image-to-image transitions and are unsuccessful at recovering the long-term underlying story. Our novel Skipping Recurrent Neural Network (S-RNN) model does not attempt to predict each and every data point in the sequence, like classic RNNs. Rather, S-RNN uses a framework that skips through the images in the photo stream to explore the space of all ordered subsets of the albums via an efficient sampling procedure. This approach reduces the negative impact of strong short-term correlations, and recovers the latent story more accurately. We show how our learned storylines can be used to analyze, predict, and summarize photo albums from Flickr. Our experimental results provide strong qualitative and quantitative evidence that S-RNN is significantly better than other candidate methods such as LSTMs on learning long-term correlations and recovering latent storylines. Moreover, we show how storylines can help machines better understand and summarize photo streams by inferring a brief personalized story of each individual album.

Gunnar A. Sigurdsson, Xinlei Chen, Abhinav Gupta
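The core idea in the abstract above is that S-RNN trains on ordered subsets sampled from an album rather than on every consecutive transition. As a rough illustration (our own sampler, not the authors' procedure or their efficient sampling scheme), one can draw a random ordered subset of a photo stream like this:

```python
# Minimal sketch of the "skipping" idea behind S-RNN: rather than modeling
# every image-to-image transition, sample ordered subsets of an album so a
# sequence model sees long-range transitions. This uniform sampler is an
# illustrative assumption only.
import random

def sample_ordered_subset(album, k, rng):
    """Pick k distinct indices at random, then keep them in album order."""
    idx = sorted(rng.sample(range(len(album)), k))
    return [album[i] for i in idx]

rng = random.Random(0)
album = [f"photo_{i}" for i in range(10)]
subset = sample_ordered_subset(album, 4, rng)
print(subset)  # an ordered subset that skips most consecutive pairs
```

Because the subset preserves the album's temporal order while skipping most neighbors, short-term correlations (bursts of near-duplicate photos) contribute less to what the model learns.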

Towards Large-Scale City Reconstruction from Satellites

Automatic city modeling from satellite imagery is one of the biggest challenges in urban reconstruction. Existing methods produce at best rough and dense Digital Surface Models. Inspired by recent works on semantic 3D reconstruction and region-based stereovision, we propose a method for producing compact, semantic-aware and geometrically accurate 3D city models from a stereo pair of satellite images. Our approach relies on two key ingredients. First, geometry and semantics are retrieved simultaneously, bringing robustness to occlusions and to low image quality. Second, we operate at the scale of geometric atomic regions, which allows the shape of urban objects to be well preserved and brings a gain in scalability and efficiency. We demonstrate the potential of our algorithm by reconstructing different cities around the world in a few minutes.

Liuyun Duan, Florent Lafarge

Weakly Supervised Object Localization Using Size Estimates

We present a technique for weakly supervised object localization (WSOL), building on the observation that WSOL algorithms usually work better on images with bigger objects. Instead of training the object detector on the entire training set at the same time, we propose a curriculum learning strategy to feed training images into the WSOL learning loop in an order from images containing bigger objects down to smaller ones. To automatically determine the order, we train a regressor to estimate the size of the object given the whole image as input. Furthermore, we use these size estimates to further improve the re-localization step of WSOL by assigning weights to object proposals according to how close their size matches the estimated object size. We demonstrate the effectiveness of using size order and size weighting on the challenging PASCAL VOC 2007 dataset, where we achieve a significant improvement over existing state-of-the-art WSOL techniques.

Miaojing Shi, Vittorio Ferrari
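The abstract above describes two uses of the size estimate: ordering training images from big objects to small (curriculum), and weighting proposals by how well their size matches the estimate. A toy sketch of both steps (function names and the Gaussian weighting kernel are our own assumptions, not the authors' implementation):

```python
# Toy sketch of size-based curriculum ordering and proposal weighting for
# weakly supervised object localization. Illustrative only.
import math

def curriculum_order(images, size_estimates):
    """Return images sorted from largest to smallest estimated object size."""
    return [img for img, _ in sorted(zip(images, size_estimates),
                                     key=lambda p: -p[1])]

def proposal_weight(proposal_size, estimated_size, bandwidth=0.5):
    """Gaussian-like weight: a close size match gives a weight near 1."""
    rel = (proposal_size - estimated_size) / (estimated_size * bandwidth)
    return math.exp(-0.5 * rel * rel)

imgs = ["a", "b", "c"]
est = [0.1, 0.8, 0.4]                       # estimated object area fractions
print(curriculum_order(imgs, est))          # biggest estimated object first
print(round(proposal_weight(0.8, 0.8), 3))  # perfect size match -> 1.0
```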

Supervised Transformer Network for Efficient Face Detection

Large pose variations remain a challenge for real-world face detection. We propose a new cascaded Convolutional Neural Network, dubbed the Supervised Transformer Network, to address this challenge. The first stage is a multi-task Region Proposal Network (RPN), which simultaneously predicts candidate face regions along with associated facial landmarks. The candidate regions are then warped by mapping the detected facial landmarks to their canonical positions to better normalize the face patterns. The second stage, an RCNN, then verifies whether the warped candidate regions are valid faces. We conduct end-to-end learning of the cascaded network, including optimizing the canonical positions of the facial landmarks. This supervised learning of the transformations automatically selects the best scale to differentiate face/non-face patterns. By combining feature maps from both stages of the network, we achieve state-of-the-art detection accuracies on several public benchmarks. For real-time performance, we run the cascaded network only on regions of interest produced by a boosting cascade face detector. Our detector runs at 30 FPS on a single CPU core for a VGA-resolution image.

Dong Chen, Gang Hua, Fang Wen, Jian Sun

A Geometric Approach to Image Labeling

We introduce a smooth non-convex approach in a novel geometric framework which complements established convex and non-convex approaches to image labeling. The major underlying concept is a smooth manifold of probabilistic assignments of a prespecified set of prior data (the “labels”) to given image data. The Riemannian gradient flow with respect to a corresponding objective function evolves on the manifold and terminates, for any $\delta > 0$, within a $\delta$-neighborhood of a unique assignment (labeling). As a consequence, unlike with convex outer relaxation approaches to (non-submodular) image labeling problems, no post-processing step is needed for the rounding of fractional solutions. Our approach is numerically implemented with sparse, highly-parallel interior-point updates that efficiently converge, largely independent from the number of labels. Experiments with noisy labeling and inpainting problems demonstrate competitive performance.

Freddie Åström, Stefania Petra, Bernhard Schmitzer, Christoph Schnörr

ActionSnapping: Motion-Based Video Synchronization

Video synchronization is a fundamental step for many applications in computer vision, ranging from video morphing to motion analysis. We present a novel method for synchronizing action videos where a similar action is performed by different people at different times and different locations with different local speed changes, e.g., as in sports like weightlifting, baseball pitch, or dance. Our approach extends the popular “snapping” tool of video editing software and allows users to automatically snap action videos together in a timeline based on their content. Since the action can take place at different locations, existing appearance-based methods are not appropriate. Our approach leverages motion information, and computes a nonlinear synchronization of the input videos to establish frame-to-frame temporal correspondences. We demonstrate our approach can be applied for video synchronization, video annotation, and action snapshots. Our approach has been successfully evaluated with ground truth data and a user study.

Jean-Charles Bazin, Alexander Sorkine-Hornung

A Minimal Solution for Non-perspective Pose Estimation from Line Correspondences

In this paper, we study and propose solutions to the relatively uninvestigated problem of non-perspective pose estimation from line correspondences. Specifically, we represent the 2D and 3D line correspondences as Plücker lines and derive the minimal solution for the minimal problem of three line correspondences with the Gröbner basis method. Our minimal 3-Line algorithm, which gives up to eight solutions, is well-suited for robust estimation with RANSAC. We show that our algorithm also works as a least-squares solver that takes in more than three line correspondences without any reformulation. In addition, our algorithm does not require initialization in either the minimal 3-Line or the least-squares n-Line case. Furthermore, it works without reformulation under the special case of perspective pose estimation, when all line correspondences are observed from one single camera. We verify our algorithms with both simulated and real-world data.

Gim Hee Lee

Natural Image Stitching with the Global Similarity Prior

This paper proposes a method for stitching multiple images together so that the stitched image looks as natural as possible. Our method adopts the local warp model and guides the warping of each image with a grid mesh. An objective function is designed for specifying the desired characteristics of the warps. In addition to good alignment and minimal local distortion, we add a global similarity prior in the objective function. This prior constrains the warp of each image so that it resembles a similarity transformation as a whole. The selection of the similarity transformation is crucial to the naturalness of the results. We propose methods for selecting the proper scale and rotation for each image. The warps of all images are solved together for minimizing the distortion globally. A comprehensive evaluation shows that the proposed method consistently outperforms several state-of-the-art methods, including AutoStitch, APAP, SPHP and ANNAP.

Yu-Sheng Chen, Yung-Yu Chuang

Minimal Solvers for Generalized Pose and Scale Estimation from Two Rays and One Point

Estimating the poses of a moving camera with respect to a known 3D map is a key problem in robotics and Augmented Reality applications. Instead of solving for each pose individually, the trajectory can be considered as a generalized camera. Thus, all poses can be jointly estimated by solving a generalized PnP (gPnP) problem. In this paper, we show that the gPnP problem for camera trajectories permits an extremely efficient minimal solution when exploiting the fact that pose tracking allows us to locally triangulate 3D points. We present a problem formulation based on one point-point and two point-ray correspondences that encompasses both the case where the scale of the trajectory is known and where it is unknown. Our formulation leads to closed-form solutions that are orders of magnitude faster to compute than the current state-of-the-art, while resulting in a similar or better pose accuracy.

Federico Camposeco, Torsten Sattler, Marc Pollefeys

Learning to Hash with Binary Deep Neural Network

This work proposes deep network models and learning algorithms for unsupervised and supervised binary hashing. Our novel network design constrains one hidden layer to directly output the binary codes. This addresses a challenging issue in some previous works: optimizing non-smooth objective functions due to binarization. Moreover, we incorporate independence and balance properties in the direct and strict forms in the learning. Furthermore, we include similarity preserving property in our objective function. Our resulting optimization with these binary, independence, and balance constraints is difficult to solve. We propose to attack it with alternating optimization and careful relaxation. Experimental results on three benchmark datasets show that our proposed methods compare favorably with the state of the art.

Thanh-Toan Do, Anh-Dzung Doan, Ngai-Man Cheung
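The "independence" and "balance" properties named in the abstract above have simple interpretations for ±1 codes: each bit should be on half the time, and different bits should be uncorrelated. A minimal sketch of how one might measure both properties on a learned code matrix (our own simple formulations, for illustration only, not the paper's constraints in their exact form):

```python
# Check "balance" (per-bit mean near 0 for +/-1 codes) and "independence"
# ((1/n) B^T B near the identity) of a binary code matrix. Illustrative only.

def balance(codes):
    """Per-bit mean of +/-1 codes; 0.0 means a perfectly balanced bit."""
    n, d = len(codes), len(codes[0])
    return [sum(row[j] for row in codes) / n for j in range(d)]

def correlation(codes):
    """(1/n) B^T B for +/-1 codes; the identity means independent bits."""
    n, d = len(codes), len(codes[0])
    return [[sum(row[i] * row[j] for row in codes) / n for j in range(d)]
            for i in range(d)]

B = [[1, 1], [1, -1], [-1, 1], [-1, -1]]   # ideal 2-bit codebook
print(balance(B))        # [0.0, 0.0]
print(correlation(B))    # [[1.0, 0.0], [0.0, 1.0]]
```

The difficulty the paper addresses is enforcing such properties while the codes themselves must remain strictly binary, which makes the objective non-smooth.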

Automatically Selecting Inference Algorithms for Discrete Energy Minimisation

Minimisation of discrete energies defined over factors is an important problem in computer vision, and a vast number of MAP inference algorithms have been proposed. Different inference algorithms perform better on factor graph models (GMs) from different underlying problem classes, and in general it is difficult to know which algorithm will yield the lowest energy for a given GM. To mitigate this difficulty, survey papers [1–3] advise the practitioner on what algorithms perform well on what classes of models. We take the next step forward, and present a technique to automatically select the best inference algorithm for an input GM. We validate our method experimentally on an extended version of the OpenGM2 benchmark [3], containing a diverse set of vision problems. On average, our method selects an inference algorithm yielding labellings with 96% of variables the same as the best available algorithm.

Paul Henderson, Vittorio Ferrari
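A crude way to picture per-instance algorithm selection as described above: describe each factor-graph model by a small feature vector and return the algorithm that performed best on the most similar training model. This 1-NN selector is our own illustrative stand-in; the paper's actual selection technique may differ.

```python
# Illustrative 1-NN algorithm selector for MAP inference on factor graphs.
# Feature choices and algorithm names here are assumptions for the sketch.

def select_algorithm(features, train_feats, train_best):
    """Return the best-known algorithm for the nearest training GM."""
    def dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    i = min(range(len(train_feats)),
            key=lambda k: dist(features, train_feats[k]))
    return train_best[i]

train_feats = [[10, 2], [1000, 4]]   # e.g. (#variables, max factor order)
train_best = ["BP", "TRWS"]          # best algorithm seen for each model
print(select_algorithm([900, 4], train_feats, train_best))  # "TRWS"
```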

Ego2Top: Matching Viewers in Egocentric and Top-View Videos

Egocentric cameras are becoming increasingly popular and provide us with large amounts of videos, captured from the first person perspective. At the same time, surveillance cameras and drones offer an abundance of visual information, often captured from top-view. Although these two sources of information have been separately studied in the past, they have not been collectively studied and related. Having a set of egocentric cameras and a top-view camera capturing the same area, we propose a framework to identify the egocentric viewers in the top-view video. We utilize two types of features for our assignment procedure. Unary features encode what a viewer (seen from top-view or recording an egocentric video) visually experiences over time. Pairwise features encode the relationship between the visual content of a pair of viewers. Modeling each view (egocentric or top) by a graph, the assignment process is formulated as spectral graph matching. Evaluating our method over a dataset of 50 top-view and 188 egocentric videos taken in different scenarios demonstrates the effectiveness of the proposed approach in assigning egocentric viewers to identities present in the top-view camera. We also study the effect of different parameters such as the number of egocentric viewers and visual features.

Shervin Ardeshir, Ali Borji

Online Action Detection

In online action detection, the goal is to detect the start of an action in a video stream as soon as it happens. For instance, if a child is chasing a ball, an autonomous car should recognize what is going on and respond immediately. This is a very challenging problem for four reasons. First, only partial actions are observed. Second, there is a large variability in negative data. Third, the start of the action is unknown, so it is unclear over what time window the information should be integrated. Finally, in real world data, large within-class variability exists. This problem has been addressed before, but only to some extent. Our contributions to online action detection are threefold. First, we introduce a realistic dataset composed of 27 episodes from 6 popular TV series. The dataset spans over 16 h of footage annotated with 30 action classes, totaling 6,231 action instances. Second, we analyze and compare various baseline methods, showing this is a challenging problem for which none of the methods provides a good solution. Third, we analyze the change in performance when there is a variation in viewpoint, occlusion, truncation, etc. We introduce an evaluation protocol for fair comparison. The dataset, the baselines and the models will all be made publicly available to encourage (much needed) further research on online action detection on realistic data.

Roeland De Geest, Efstratios Gavves, Amir Ghodrati, Zhenyang Li, Cees Snoek, Tinne Tuytelaars

Cross-Modal Supervision for Learning Active Speaker Detection in Video

In this paper, we show how to use audio to supervise the learning of active speaker detection in video. Voice Activity Detection (VAD) guides the learning of the vision-based classifier in a weakly supervised manner. The classifier uses spatio-temporal features to encode upper body motion: the facial expressions and gesticulations associated with speaking. We further improve a generic model for active speaker detection by learning person specific models. Finally, we demonstrate the online adaptation of generic models learnt on one dataset, to previously unseen people in a new dataset, again using audio (VAD) for weak supervision. The use of temporal continuity overcomes the lack of clean training data. We are the first to present an active speaker detection system that learns on one audio-visual dataset and automatically adapts to speakers in a new dataset. This work can be seen as an example of how the availability of multi-modal data allows us to learn a model without the need for supervision, by transferring knowledge from one modality to another.

Punarjay Chakravarty, Tinne Tuytelaars

Recurrent Temporal Deep Field for Semantic Video Labeling

This paper specifies a new deep architecture, called Recurrent Temporal Deep Field (RTDF), for semantic video labeling. RTDF is a conditional random field (CRF) that combines a deconvolution neural network (DeconvNet) and a recurrent temporal restricted Boltzmann machine (RTRBM). DeconvNet is grounded onto pixels of a new frame for estimating the unary potential of the CRF. RTRBM estimates a high-order potential of the CRF by capturing long-term spatiotemporal dependencies of pixel labels that RTDF has already predicted in previous frames. We derive a mean-field inference algorithm to jointly predict all latent variables in both RTRBM and CRF. We also conduct end-to-end joint training of all DeconvNet, RTRBM, and CRF parameters. The joint learning and inference integrate the three components into a unified deep model – RTDF. Our evaluation on the benchmark Youtube Face Database (YFDB) and Cambridge-driving Labeled Video Database (Camvid) demonstrates that RTDF outperforms the state of the art both qualitatively and quantitatively.

Peng Lei, Sinisa Todorovic

Ultra-Resolving Face Images by Discriminative Generative Networks

Conventional face super-resolution methods, also known as face hallucination, are limited to $2 \sim 4\times$ scaling factors, where $4 \sim 16$ additional pixels are estimated for each given pixel. Besides, they become very fragile when the input low-resolution image is so small that only little information is available. To address these shortcomings, we present a discriminative generative network that can ultra-resolve a very low resolution face image of size $16 \times 16$ pixels to an $8\times$ larger version by reconstructing 64 pixels from a single pixel. We introduce a pixel-wise $\ell_2$ regularization term to the generative model and exploit the feedback of the discriminative network to make the upsampled face images more similar to real ones. In our framework, the discriminative network learns the essential constituent parts of the faces and the generative network blends these parts in the most accurate fashion to the input image. Since only frontal and ordinary aligned images are used in training, our method can ultra-resolve a wide range of very low-resolution images directly regardless of pose and facial expression variations. Our extensive experimental evaluations demonstrate that the presented ultra-resolution by discriminative generative networks (UR-DGN) achieves more appealing results than the state-of-the-art.

Xin Yu, Fatih Porikli
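The training signal described above combines a pixel-wise reconstruction term with adversarial feedback from the discriminator. A toy scalar version of such a generator loss (the function name, weighting factor, and exact adversarial term are our assumptions; the paper's formulation may differ):

```python
# Toy generator loss: pixel-wise L2 reconstruction plus a -log D(G(x))
# style adversarial term, where disc_score is the probability the
# discriminator assigns to the upsampled face being real. Illustrative only.
import math

def generator_loss(upsampled, target, disc_score, lam=0.01):
    """Mean squared pixel error plus weighted adversarial feedback."""
    l2 = sum((u - t) ** 2 for u, t in zip(upsampled, target)) / len(target)
    adv = -math.log(max(disc_score, 1e-12))  # 0 when D is fully fooled
    return l2 + lam * adv

# identical images and a fully fooled discriminator give zero loss
print(round(generator_loss([0.5, 0.5], [0.5, 0.5], 1.0), 6))  # 0.0
```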

A Discriminative Framework for Anomaly Detection in Large Videos

We address an anomaly detection setting in which training sequences are unavailable and anomalies are scored independently of temporal ordering. Current algorithms in anomaly detection are based on the classical density estimation approach of learning high-dimensional models and finding low-probability events. These algorithms are sensitive to the order in which anomalies appear and require either training data or early context assumptions that do not hold for longer, more complex videos. By defining anomalies as examples that can be distinguished from other examples in the same video, our definition inspires a shift in approaches from classical density estimation to simple discriminative learning. Our contributions include a novel framework for anomaly detection that is (1) independent of temporal ordering of anomalies, and (2) unsupervised, requiring no separate training sequences. We show that our algorithm can achieve state-of-the-art results even when we adjust the setting by removing training sequences from standard datasets.

Allison Del Giorno, J. Andrew Bagnell, Martial Hebert

ContextLocNet: Context-Aware Deep Network Models for Weakly Supervised Localization

We aim to localize objects in images using image-level supervision only. Previous approaches to this problem mainly focus on discriminative object regions and often fail to locate precise object boundaries. We address this problem by introducing two types of context-aware guidance models, additive and contrastive models, that leverage their surrounding context regions to improve localization. The additive model encourages the predicted object region to be supported by its surrounding context region. The contrastive model encourages the predicted object region to stand out from its surrounding context region. Our approach benefits from the recent success of convolutional neural networks for object recognition and extends Fast R-CNN to weakly supervised object localization. Extensive experimental evaluation on the PASCAL VOC 2007 and 2012 benchmarks shows that our context-aware approach significantly improves weakly supervised localization and detection.

Vadim Kantorov, Maxime Oquab, Minsu Cho, Ivan Laptev

Network Flow Formulations for Learning Binary Hashing

The problem of learning binary hashing seeks the identification of a binary mapping for a set of n examples such that the corresponding Hamming distances preserve high fidelity with a given $n \times n$ matrix of distances (or affinities). This formulation has numerous applications in efficient search and retrieval of images (and other high dimensional data) on devices with storage/processing constraints. As a result, the problem has received much attention recently in vision and machine learning, and a number of interesting solutions have been proposed. A common feature of most existing solutions is that they adopt continuous iterative optimization schemes, which are then followed by a post-hoc rounding process to recover a feasible discrete solution. In this paper, we present a fully combinatorial network-flow based formulation for a relaxed version of this problem. The main maximum flow/minimum cut modules which drive our algorithm can be solved efficiently and can directly learn the binary codes. Despite its simplicity, we show that on most widely used benchmarks, our proposal yields competitive performance relative to a suite of nine different state of the art algorithms.

Lopamudra Mukherjee, Jiming Peng, Trevor Sigmund, Vikas Singh

SPICE: Semantic Propositional Image Caption Evaluation

There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram overlap, which is neither necessary nor sufficient for the task of simulating human judgment. We hypothesize that semantic propositional content is an important component of human caption evaluation, and propose a new automated caption evaluation metric defined over scene graphs coined SPICE. Extensive evaluations across a range of models and datasets indicate that SPICE captures human judgments over model-generated captions better than other automatic metrics (e.g., system-level correlation of 0.88 with human judgments on the MS COCO dataset, versus 0.43 for CIDEr and 0.53 for METEOR). Furthermore, SPICE can answer questions such as “which caption-generator best understands colors?” and “can caption-generators count?”

Peter Anderson, Basura Fernando, Mark Johnson, Stephen Gould

Transfer Neural Trees for Heterogeneous Domain Adaptation

Heterogeneous domain adaptation (HDA) addresses the task of associating data not only across dissimilar domains but also described by different types of features. Inspired by the recent advances of neural networks and deep learning, we propose Transfer Neural Trees (TNT) which jointly solves cross-domain feature mapping, adaptation, and classification in a NN-based architecture. As the prediction layer in TNT, we further propose Transfer Neural Decision Forest (Transfer-NDF), which effectively adapts the neurons in TNT for adaptation by stochastic pruning. Moreover, to address semi-supervised HDA, a unique embedding loss term for preserving prediction and structural consistency between target-domain data is introduced into TNT. Experiments on classification tasks across features, datasets, and modalities successfully verify the effectiveness of our TNT.

Wei-Yu Chen, Tzu-Ming Harry Hsu, Yao-Hung Hubert Tsai, Yu-Chiang Frank Wang, Ming-Syan Chen

Tracking Persons-of-Interest via Adaptive Discriminative Features

Multi-face tracking in unconstrained videos is a challenging problem as faces of one person often appear drastically different in multiple shots due to significant variations in scale, pose, expression, illumination, and make-up. Low-level features used in existing multi-target tracking methods are not effective for identifying faces with such large appearance variations. In this paper, we tackle this problem by learning discriminative, video-specific face features using convolutional neural networks (CNNs). Unlike existing CNN-based approaches that are only trained on large-scale face image datasets offline, we further adapt the pre-trained face CNN to specific videos using automatically discovered training samples from tracklets. Our network directly optimizes the embedding space so that the Euclidean distances correspond to a measure of semantic face similarity. This is technically realized by minimizing an improved triplet loss function. With the learned discriminative features, we apply the Hungarian algorithm to link tracklets within each shot and the hierarchical clustering algorithm to link tracklets across multiple shots to form final trajectories. We extensively evaluate the proposed algorithm on a set of TV sitcoms and music videos and demonstrate significant performance improvement over existing techniques.

Shun Zhang, Yihong Gong, Jia-Bin Huang, Jongwoo Lim, Jinjun Wang, Narendra Ahuja, Ming-Hsuan Yang
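The embedding objective described above is realized by minimizing an improved triplet loss. As a rough illustration, the standard hinge form of the triplet loss (the paper's "improved" variant differs; this plain-Python version is for illustration only) pulls an anchor toward a positive face of the same identity and pushes it away from a negative:

```python
# Standard hinge triplet loss on embedding vectors: small distance to the
# positive and large distance to the negative (beyond a margin) give zero
# loss. Illustrative sketch, not the paper's improved formulation.

def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, ||a - p||^2 - ||a - n||^2 + margin)."""
    d_ap = sum((a - p) ** 2 for a, p in zip(anchor, positive))
    d_an = sum((a - n) ** 2 for a, n in zip(anchor, negative))
    return max(0.0, d_ap - d_an + margin)

a, p, n = [0.0, 0.0], [0.1, 0.0], [1.0, 1.0]
print(triplet_loss(a, p, n))  # 0.0: negative already far beyond the margin
```

Once such an embedding is learned, Euclidean distances between face features act as a semantic similarity measure, which is what makes the subsequent Hungarian matching and hierarchical clustering of tracklets meaningful.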

Action, Activity and Tracking


Spot On: Action Localization from Pointly-Supervised Proposals

We strive for spatio-temporal localization of actions in videos. The state-of-the-art relies on action proposals at test time and selects the best one with a classifier trained on carefully annotated box annotations. Annotating action boxes in video is cumbersome, tedious, and error prone. Rather than annotating boxes, we propose to annotate actions in video with points on a sparse subset of frames only. We introduce an overlap measure between action proposals and points and incorporate them all into the objective of a non-convex Multiple Instance Learning optimization. Experimental evaluation on the UCF Sports and UCF 101 datasets shows that (i) spatio-temporal proposals can be used to train classifiers while retaining the localization performance, (ii) point annotations yield results comparable to box annotations while being significantly faster to annotate, (iii) with a minimum amount of supervision our approach is competitive to the state-of-the-art. Finally, we introduce spatio-temporal action annotations on the train and test videos of Hollywood2, resulting in Hollywood2Tubes, available at

Pascal Mettes, Jan C. van Gemert, Cees G. M. Snoek
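
One simplified reading of an overlap measure between a spatio-temporal proposal and point annotations is the fraction of annotated frames whose point falls inside the proposal's box on that frame; a toy sketch with hypothetical data structures (the paper's actual measure may differ):

```python
def point_overlap(proposal, points):
    """Fraction of annotated frames whose point lies inside the
    proposal's box on that frame.

    `proposal`: dict frame -> (x1, y1, x2, y2) box per frame.
    `points`:   dict frame -> (x, y), a sparse subset of frames.
    Both structures are illustrative stand-ins.
    """
    hits = 0
    for frame, (px, py) in points.items():
        if frame in proposal:
            x1, y1, x2, y2 = proposal[frame]
            if x1 <= px <= x2 and y1 <= py <= y2:
                hits += 1
    return hits / max(len(points), 1)

proposal = {0: (0, 0, 10, 10), 1: (0, 0, 10, 10)}
points = {0: (5, 5), 1: (20, 20)}   # second point misses the box
print(point_overlap(proposal, points))  # 0.5
```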

Detecting Engagement in Egocentric Video

In a wearable camera video, we see what the camera wearer sees. While this makes it easy to know roughly what he chose to look at, it does not immediately reveal when he was engaged. Specifically, at what moments did his focus linger, as he paused to gather more information about something he saw? Knowing this answer would benefit various applications in video summarization and augmented reality, yet prior work focuses solely on the “what” question (estimating saliency, gaze) without considering the “when” (engagement). We propose a learning-based approach that uses long-term egomotion cues to detect engagement, specifically in browsing scenarios where one frequently takes in new visual information (e.g., shopping, touring). We introduce a large, richly annotated dataset for ego-engagement that is the first of its kind. Our approach outperforms a wide array of existing methods. We show engagement can be detected well independent of both scene appearance and the camera wearer’s identity.

Yu-Chuan Su, Kristen Grauman

Beyond Correlation Filters: Learning Continuous Convolution Operators for Visual Tracking

Discriminative Correlation Filters (DCF) have demonstrated excellent performance for visual object tracking. The key to their success is the ability to efficiently exploit available negative data by including all shifted versions of a training sample. However, the underlying DCF formulation is restricted to single-resolution feature maps, significantly limiting its potential. In this paper, we go beyond the conventional DCF framework and introduce a novel formulation for training continuous convolution filters. We employ an implicit interpolation model to pose the learning problem in the continuous spatial domain. Our proposed formulation enables efficient integration of multi-resolution deep feature maps, leading to superior results on three object tracking benchmarks: OTB-2015 (+5.1% in mean OP), Temple-Color (+4.6% in mean OP), and VOT2015 (20% relative reduction in failure rate). Additionally, our approach is capable of sub-pixel localization, crucial for the task of accurate feature point tracking. We also demonstrate the effectiveness of our learning formulation in extensive feature point tracking experiments.

Martin Danelljan, Andreas Robinson, Fahad Shahbaz Khan, Michael Felsberg
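
The single-resolution baseline that the paper generalizes can be sketched in a few lines: a classical MOSSE-style correlation filter, solved per frequency as ridge regression in the Fourier domain (a conventional formulation for illustration, not the paper's continuous operator):

```python
import numpy as np

def train_dcf(x, y, lam=1e-2):
    """Closed-form single-channel discriminative correlation filter.

    x: training patch (2-D array); y: desired response (e.g. a peak at
    the target location). All cyclic shifts of x act as negatives
    implicitly, which is what makes the Fourier-domain solution cheap.
    """
    X, Y = np.fft.fft2(x), np.fft.fft2(y)
    return np.conj(X) * Y / (np.conj(X) * X + lam)  # per-frequency ridge regression

def detect(H, z):
    """Correlate the filter with a test patch z; the response peak
    gives the estimated target shift."""
    resp = np.real(np.fft.ifft2(H * np.fft.fft2(z)))
    return np.unravel_index(np.argmax(resp), resp.shape)

# Toy check: testing on the training patch should peak at zero shift.
rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))
y = np.zeros((32, 32)); y[0, 0] = 1.0   # desired peak at the origin
H = train_dcf(x, y)
print(detect(H, x))  # (0, 0)
```

Cyclically shifting the test patch moves the detected peak by the same amount, which is exactly the shift-exploiting property the abstract refers to.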

Look-Ahead Before You Leap: End-to-End Active Recognition by Forecasting the Effect of Motion

Visual recognition systems mounted on autonomous moving agents face the challenge of unconstrained data, but simultaneously have the opportunity to improve their performance by moving to acquire new views of test data. In this work, we first show how a recurrent neural network-based system may be trained to perform end-to-end learning of motion policies suited for this “active recognition” setting. Further, we hypothesize that active vision requires an agent to have the capacity to reason about the effects of its motions on its view of the world. To verify this hypothesis, we attempt to induce this capacity in our active recognition pipeline, by simultaneously learning to forecast the effects of the agent’s motions on its internal representation of the environment conditional on all past views. Results across two challenging datasets confirm both that our end-to-end system successfully learns meaningful policies for active category recognition, and that “learning to look ahead” further boosts recognition performance.

Dinesh Jayaraman, Kristen Grauman

Poster Session 6


General Automatic Human Shape and Motion Capture Using Volumetric Contour Cues

Markerless motion capture algorithms require a 3D body with properly personalized skeleton dimension and/or body shape and appearance to successfully track a person. Unfortunately, many tracking methods consider model personalization a different problem and use manual or semi-automatic model initialization, which greatly reduces applicability. In this paper, we propose a fully automatic algorithm that jointly creates a rigged actor model commonly used for animation – skeleton, volumetric shape, appearance, and optionally a body surface – and estimates the actor’s motion from multi-view video input only. The approach is rigorously designed to work on footage of general outdoor scenes recorded with very few cameras and without background subtraction. Our method uses a new image formation model with analytic visibility and analytically differentiable alignment energy. For reconstruction, 3D body shape is approximated as a Gaussian density field. For pose and shape estimation, we minimize a new edge-based alignment energy inspired by volume ray casting in an absorbing medium. We further propose a new statistical human body model that represents the body surface, volumetric Gaussian density, and variability in skeleton shape. Given any multi-view sequence, our method jointly optimizes the pose and shape parameters of this model fully automatically in a spatiotemporal way.

Helge Rhodin, Nadia Robertini, Dan Casas, Christian Richardt, Hans-Peter Seidel, Christian Theobalt

Globally Continuous and Non-Markovian Crowd Activity Analysis from Videos

Automatically recognizing activities in video is a classic problem in vision and helps to understand behaviors, describe scenes and detect anomalies. We propose an unsupervised method for such purposes. Given video data, we discover recurring activity patterns that appear, peak, wane and disappear over time. By using non-parametric Bayesian methods, we learn coupled spatial and temporal patterns with minimum prior knowledge. To model the temporal changes of patterns, previous works compute Markovian progressions or locally continuous motifs whereas we model time in a globally continuous and non-Markovian way. Visually, the patterns depict flows of major activities. Temporally, each pattern has its own unique appearance-disappearance cycles. To compute compact pattern representations, we also propose a hybrid sampling method. By combining these patterns with detailed environment information, we interpret the semantics of activities and report anomalies. Also, our method fits data better and detects anomalies that were difficult to detect previously.

He Wang, Carol O’Sullivan

Joint Face Alignment and 3D Face Reconstruction

We present an approach to simultaneously solve the two problems of face alignment and 3D face reconstruction from an input 2D face image of arbitrary poses and expressions. The proposed method iteratively and alternately applies two sets of cascaded regressors, one for updating 2D landmarks and the other for updating reconstructed pose-expression-normalized (PEN) 3D face shape. The 3D face shape and the landmarks are correlated via a 3D-to-2D mapping matrix. In each iteration, adjustment to the landmarks is firstly estimated via a landmark regressor, and this landmark adjustment is also used to estimate 3D face shape adjustment via a shape regressor. The 3D-to-2D mapping is then computed based on the adjusted 3D face shape and 2D landmarks, and it further refines the 2D landmarks. An effective algorithm is devised to learn these regressors based on a training dataset of paired annotated 3D face shapes and 2D face images. Compared with existing methods, the proposed method can fully automatically generate PEN 3D face shapes in real time from a single 2D face image and locate both visible and invisible 2D landmarks. Extensive experiments show that the proposed method can achieve the state-of-the-art accuracy in both face alignment and 3D face reconstruction, and benefit face recognition owing to its reconstructed PEN 3D face shapes.

Feng Liu, Dan Zeng, Qijun Zhao, Xiaoming Liu

Keep It SMPL: Automatic Estimation of 3D Human Pose and Shape from a Single Image

We describe the first method to automatically estimate the 3D pose of the human body as well as its 3D shape from a single unconstrained image. We estimate a full 3D mesh and show that 2D joints alone carry a surprising amount of information about body shape. The problem is challenging because of the complexity of the human body, articulation, occlusion, clothing, lighting, and the inherent ambiguity in inferring 3D from 2D. To solve this, we first use a recently published CNN-based method, DeepCut, to predict (bottom-up) the 2D body joint locations. We then fit (top-down) a recently published statistical body shape model, called SMPL, to the 2D joints. We do so by minimizing an objective function that penalizes the error between the projected 3D model joints and detected 2D joints. Because SMPL captures correlations in human shape across the population, we are able to robustly fit it to very little data. We further leverage the 3D model to prevent solutions that cause interpenetration. We evaluate our method, SMPLify, on the Leeds Sports, HumanEva, and Human3.6M datasets, showing superior pose accuracy with respect to the state of the art.

Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, Michael J. Black
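
The fitting described above can be written schematically as a robust reprojection term plus pose and shape priors (illustrative notation, not the paper's exact formulation):

```latex
E(\beta, \theta) \;=\; \sum_{i} w_i\, \rho\!\Big( \Pi_K\big( R_\theta(J(\beta))_i \big) - \hat{x}_i \Big)
\;+\; \lambda_\theta\, E_{\text{pose}}(\theta)
\;+\; \lambda_{sp}\, E_{\text{interp}}(\theta; \beta)
\;+\; \lambda_\beta\, E_{\text{shape}}(\beta)
```

Here $J(\beta)$ are the model's 3D joints for shape $\beta$, $R_\theta$ poses them, $\Pi_K$ projects them with camera $K$, $\hat{x}_i$ are the detected 2D joints with confidences $w_i$, $\rho$ is a robust penalty, and the remaining terms penalize unlikely poses, self-interpenetration, and unlikely shapes.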

Do We Really Need to Collect Millions of Faces for Effective Face Recognition?

Face recognition capabilities have recently made extraordinary leaps. Though this progress is at least partially due to ballooning training set sizes – huge numbers of face images downloaded and labeled for identity – it is not clear if the formidable task of collecting so many images is truly necessary. We propose a far more accessible means of increasing training data sizes for face recognition systems: Domain specific data augmentation. We describe novel methods of enriching an existing dataset with important facial appearance variations by manipulating the faces it contains. This synthesis is also used when matching query images represented by standard convolutional neural networks. The effect of training and testing with synthesized images is tested on the LFW and IJB-A (verification and identification) benchmarks and Janus CS2. The performances obtained by our approach match state of the art results reported by systems trained on millions of downloaded images.

Iacopo Masi, Anh Tuấn Trần, Tal Hassner, Jatuporn Toy Leksut, Gérard Medioni

Generative Visual Manipulation on the Natural Image Manifold

Realistic image manipulation is challenging because it requires modifying the image appearance in a user-controlled way, while preserving the realism of the result. Unless the user has considerable artistic skill, it is easy to “fall off” the manifold of natural images while editing. In this paper, we propose to learn the natural image manifold directly from data using a generative adversarial neural network. We then define a class of image editing operations, and constrain their output to lie on that learned manifold at all times. The model automatically adjusts the output keeping all edits as realistic as possible. All our manipulations are expressed in terms of constrained optimization and are applied in near-real time. We evaluate our algorithm on the task of realistic photo manipulation of shape and color. The presented method can further be used for changing one image to look like another, as well as for generating novel imagery from scratch based on a user’s scribbles.

Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, Alexei A. Efros

Deep Cascaded Bi-Network for Face Hallucination

We present a novel framework for hallucinating faces of unconstrained poses and with very low resolution (face size as small as 5 px inter-ocular distance). In contrast to existing studies that mostly ignore or assume pre-aligned face spatial configuration (e.g. facial landmarks localization or dense correspondence field), we alternatingly optimize two complementary tasks, namely face hallucination and dense correspondence field estimation, in a unified framework. In addition, we propose a new gated deep bi-network that contains two functionality-specialized branches to recover different levels of texture details. Extensive experiments demonstrate that such a formulation allows exceptional hallucination quality on in-the-wild low-res faces with significant pose and illumination variations.

Shizhan Zhu, Sifei Liu, Chen Change Loy, Xiaoou Tang

Cluster Sparsity Field for Hyperspectral Imagery Denoising

Hyperspectral images (HSIs) can facilitate extensive computer vision applications with the extra spectral information. However, HSIs often suffer from noise corruption during the practical imaging procedure. Though it has been testified that intrinsic correlation across spectrum and spatial similarity (i.e., local similarity in locally smooth areas and non-local similarity among recurrent patterns) in HSIs are useful for denoising, how to fully exploit them together to obtain a good denoising model is seldom studied. In this study, we present an effective cluster sparsity field based HSI denoising (CSFHD) method by exploiting those two characteristics simultaneously. Firstly, a novel Markov random field prior, named cluster sparsity field (CSF), is proposed for the sparse representation of an HSI. By grouping pixels into several clusters with spectral similarity, the CSF prior defines both a structured sparsity potential and a graph structure potential on each cluster to model the correlation across spectrum and spatial similarity in the HSI, respectively. Then, the CSF prior learning and the image denoising are unified into a variational framework for optimization, where all unknown variables are learned directly from the noisy observation. This guarantees learning a data-dependent image model, thus producing satisfying denoising results. Extensive experiments on denoising synthetic and real noisy HSIs validated that the proposed CSFHD outperforms several state-of-the-art methods.

Lei Zhang, Wei Wei, Yanning Zhang, Chunhua Shen, Anton van den Hengel, Qinfeng Shi

Zoom Better to See Clearer: Human and Object Parsing with Hierarchical Auto-Zoom Net

Parsing articulated objects, e.g. humans and animals, into semantic parts (e.g. head, body and arms, etc.) from natural images is a challenging and fundamental problem in computer vision. A big difficulty is the large variability of scale and location for objects and their corresponding parts. Even limited mistakes in estimating scale and location will degrade the parsing output and cause errors in boundary details. To tackle this difficulty, we propose a “Hierarchical Auto-Zoom Net” (HAZN) for object part parsing which adapts to the local scales of objects and parts. HAZN is a sequence of two “Auto-Zoom Nets” (AZNs), each employing fully convolutional networks for two tasks: (1) predict the locations and scales of object instances (the first AZN) or their parts (the second AZN); (2) estimate the part scores for predicted object instance or part regions. Our model can adaptively “zoom” (resize) predicted image regions into their proper scales to refine the parsing. We conduct extensive experiments over the PASCAL part datasets on humans, horses, and cows. In all three categories, our approach significantly outperforms alternative state-of-the-art methods by more than 5% mIOU and is especially better at segmenting small instances and small parts. In summary, our strategy of first zooming into objects and then zooming into parts is very effective. It also enables us to process different regions of the image at different scales adaptively so that we do not need to waste computational resources scaling the entire image.

Fangting Xia, Peng Wang, Liang-Chieh Chen, Alan L. Yuille

Learning Common and Specific Features for RGB-D Semantic Segmentation with Deconvolutional Networks

In this paper, we tackle the problem of RGB-D semantic segmentation of indoor images. We take advantage of deconvolutional networks which can predict pixel-wise class labels, and develop a new structure for deconvolution of multiple modalities. We propose a novel feature transformation network to bridge the convolutional networks and deconvolutional networks. In the feature transformation network, we correlate the two modalities by discovering common features between them, as well as characterize each modality by discovering modality specific features. With the common features, we not only closely correlate the two modalities, but also allow them to borrow features from each other to enhance the representation of shared information. With specific features, we capture the visual patterns that are only visible in one modality. The proposed network achieves competitive segmentation accuracy on NYU depth dataset V1 and V2.

Jinghua Wang, Zhenhua Wang, Dacheng Tao, Simon See, Gang Wang

MADMM: A Generic Algorithm for Non-smooth Optimization on Manifolds

Numerous problems in computer vision, pattern recognition, and machine learning are formulated as optimization with manifold constraints. In this paper, we propose the Manifold Alternating Directions Method of Multipliers (MADMM), an extension of the classical ADMM scheme for manifold-constrained non-smooth optimization problems. To our knowledge, MADMM is the first generic non-smooth manifold optimization method. We showcase our method on several challenging problems in dimensionality reduction, non-rigid correspondence, multi-modal clustering, and multidimensional scaling.

Artiom Kovnatsky, Klaus Glashoff, Michael M. Bronstein
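
The splitting idea can be illustrated on a toy instance: minimizing the l1 norm over the unit sphere, where the non-smooth term is handled by its proximal operator on a splitting variable while the manifold constraint is enforced in the x-step (closed-form on the sphere). This is a sketch of the ADMM pattern only; MADMM itself handles general manifolds by running a smooth manifold solver in the x-step.

```python
import math

def soft(v, t):
    """Soft-thresholding: proximal operator of the l1 norm."""
    return [math.copysign(max(abs(c) - t, 0.0), c) for c in v]

def normalize(v):
    """Closest point on the unit sphere (the manifold step here)."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def madmm_l1_sphere(x0, rho=5.0, iters=50):
    """ADMM-style splitting for  min ||x||_1  s.t.  ||x||_2 = 1."""
    x = normalize(x0)
    z, u = x[:], [0.0] * len(x)
    for _ in range(iters):
        x = normalize([zi - ui for zi, ui in zip(z, u)])          # x-step (manifold)
        z = soft([xi + ui for xi, ui in zip(x, u)], 1.0 / rho)    # z-step (l1 prox)
        u = [ui + xi - zi for ui, xi, zi in zip(u, x, z)]         # dual update
    return x

x = madmm_l1_sphere([1.0, 0.3, 0.2])
print([round(c, 3) for c in x])  # ≈ [1.0, 0.0, 0.0]: the sparsest unit vector
```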

Interpreting the Ratio Criterion for Matching SIFT Descriptors

Matching keypoints by minimizing the Euclidean distance between their SIFT descriptors is an effective and extremely popular technique. Using the ratio between distances, as suggested by Lowe, is even more effective and leads to excellent matching accuracy. Probabilistic approaches that model the distribution of the distances were found effective as well. This work focuses, for the first time, on analyzing Lowe’s ratio criterion using a probabilistic approach. We provide two alternative interpretations of this criterion, which show that it is not only an effective heuristic but can also be formally justified. The first interpretation shows that Lowe’s ratio corresponds to a conditional probability that the match is incorrect. The second shows that the ratio corresponds to the Markov bound on this probability. The interpretations make it possible to slightly increase the effectiveness of the ratio criterion, and to obtain matching performance that exceeds all previous (non-learning based) results.

Avi Kaplan, Tamar Avraham, Michael Lindenbaum
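
The criterion under analysis is itself only a few lines of code; a minimal sketch with plain Euclidean distances and the 0.8 threshold suggested by Lowe:

```python
def ratio_test(descriptor, candidates, threshold=0.8):
    """Lowe's ratio criterion: accept the nearest neighbour only if it
    is sufficiently closer than the second-nearest one. The paper
    interprets this ratio as (a bound on) the probability that the
    match is incorrect."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    order = sorted(range(len(candidates)),
                   key=lambda i: dist(descriptor, candidates[i]))
    best, second = order[0], order[1]
    if dist(descriptor, candidates[best]) < threshold * dist(descriptor, candidates[second]):
        return best     # confident match
    return None         # ambiguous: reject

# A query clearly closer to candidate 0 than to candidate 1 is accepted.
print(ratio_test([0.0, 0.0], [[0.1, 0.0], [1.0, 1.0]]))  # 0
```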

Semi-supervised Learning Based on Joint Diffusion of Graph Functions and Laplacians

We observe the distances between estimated function outputs on data points to create an anisotropic graph Laplacian which, through an iterative process, can itself be regularized. Our algorithm is instantiated as a discrete regularizer on a graph’s diffusivity operator. This idea is grounded in the theory that regularizing the diffusivity operator corresponds to regularizing the metric on Riemannian manifolds, which further corresponds to regularizing the anisotropic Laplace-Beltrami operator. We show that our discrete regularization framework is consistent in the sense that it converges to (continuous) regularization on underlying data generating manifolds. In semi-supervised learning experiments, across ten standard datasets, our diffusion of Laplacian approach has the lowest average error rate of eight different established and state-of-the-art approaches, which shows the promise of our approach.

Kwang In Kim

Improving Semantic Embedding Consistency by Metric Learning for Zero-Shot Classification

This paper addresses the task of zero-shot image classification. The key contribution of the proposed approach is to control the semantic embedding of images – one of the main ingredients of zero-shot learning – by formulating it as a metric learning problem. The optimized empirical criterion associates two types of sub-task constraints: metric discriminating capacity and accurate attribute prediction. This results in a novel expression of zero-shot learning not requiring the notion of class in the training phase: only pairs of image/attributes, augmented with a consistency indicator, are given as ground truth. At test time, the learned model can predict the consistency of a test image with a given set of attributes, allowing flexible ways to produce recognition inferences. Despite its simplicity, the proposed approach gives state-of-the-art results on four challenging datasets used for zero-shot recognition evaluation.

Maxime Bucher, Stéphane Herbin, Frédéric Jurie

A Sequential Approach to 3D Human Pose Estimation: Separation of Localization and Identification of Body Joints

In this paper, we propose a new approach to 3D human pose estimation from a single depth image. Conventionally, 3D human pose estimation is formulated as a detection problem of the desired list of body joints. Most of the previous methods attempted to simultaneously localize and identify body joints, with the expectation that the accomplishment of one task would facilitate the accomplishment of the other. However, we believe that identification hampers localization; therefore, the two tasks should be solved separately for enhanced pose estimation performance. We propose a two-stage framework that initially estimates all the locations of joints and subsequently identifies the estimated joints for a specific pose. The locations of joints are estimated by regressing the K closest joints from every pixel with the use of a random tree. The identification of joints is realized by transferring labels from a retrieved nearest exemplar model. Once the 3D configuration of all the joints is derived, identification becomes much easier than when it is done simultaneously with localization, exploiting the reduced solution space. Our proposed method achieves significant performance gain on pose estimation accuracy, thereby improving both localization and identification. Experimental results show that the proposed method exhibits an accuracy significantly higher than those of previous approaches that simultaneously localize and identify the body parts.

Ho Yub Jung, Yumin Suh, Gyeongsik Moon, Kyoung Mu Lee

A Novel Tiny Object Recognition Algorithm Based on Unit Statistical Curvature Feature

To recognize tiny objects whose sizes are in the range of 15×15 to 40×40 pixels, a novel image feature descriptor, the unit statistical curvature feature (USCF), is proposed based on the statistics of the unit curvature distribution. USCF can represent the local general invariant features of the image texture. Because curvature features are independent of image size, the USCF algorithm achieves a high recognition rate for object images of any size, including tiny object images. USCF is invariant to rotation and linear illumination variation, and is partially invariant to viewpoint variation. Experimental results showed that the recognition rate of the USCF algorithm was the highest for tiny object recognition compared to nine other typical object recognition algorithms under complex test conditions with simultaneous rotation, illumination, viewpoint variation and background interference.

Yimei Kang, Xiang Li

Fine-Grained Material Classification Using Micro-geometry and Reflectance

In this paper we focus on an understudied computer vision problem, particularly how the micro-geometry and the reflectance of a surface can be used to infer its material. To this end, we introduce a new, publicly available database for fine-grained material classification, consisting of over 2000 surfaces of fabrics. The database has been collected using a custom-made portable but cheap and easy to assemble photometric stereo sensor. We use the normal map and the albedo of each surface to recognize its material via the use of handcrafted and learned features and various feature encodings. We also perform garment classification using the same approach. We show that the fusion of normals and albedo information outperforms standard methods which rely only on the use of texture information. Our methodologies, both for data collection and for material classification, can be applied easily to many real-world scenarios including the design of new robots able to sense materials and industrial inspection.

Christos Kampouris, Stefanos Zafeiriou, Abhijeet Ghosh, Sotiris Malassiotis

The Conditional Lucas & Kanade Algorithm

The Lucas & Kanade (LK) algorithm is the method of choice for efficient dense image and object alignment. The approach is efficient as it attempts to model the connection between appearance and geometric displacement through a linear relationship that assumes independence across pixel coordinates. A drawback of the approach, however, is its generative nature. Specifically, its performance is tightly coupled with how well the linear model can synthesize appearance from geometric displacement, even though the alignment task itself is associated with the inverse problem. In this paper, we present a new approach, referred to as the Conditional LK algorithm, which: (i) directly learns linear models that predict geometric displacement as a function of appearance, and (ii) employs a novel strategy for ensuring that the generative pixel independence assumption can still be taken advantage of. We demonstrate that our approach exhibits superior performance to classical generative forms of the LK algorithm. Furthermore, we demonstrate its comparable performance to state-of-the-art methods such as the Supervised Descent Method with substantially less training examples, as well as the unique ability to “swap” geometric warp functions without having to retrain from scratch. Finally, from a theoretical perspective, our approach hints at possible redundancies that exist in current state-of-the-art methods for alignment that could be leveraged in vision systems of the future.

Chen-Hsuan Lin, Rui Zhu, Simon Lucey
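
The discriminative direction, regressing geometric displacement directly from appearance rather than synthesizing appearance from displacement, can be illustrated on a 1-D toy alignment problem (an illustrative setup, not the paper's algorithm):

```python
import math, random

random.seed(0)
N = 64
template = [math.sin(2 * math.pi * k / N) for k in range(N)]

def shift(signal, d):
    """Circular sub-pixel shift of a 1-D signal via linear interpolation."""
    n = len(signal)
    out = []
    for k in range(n):
        s = (k - d) % n
        i, frac = int(s) % n, s - int(s)
        out.append((1 - frac) * signal[i] + frac * signal[(i + 1) % n])
    return out

# Fixed projection direction: the template's spatial derivative.
g = [(template[(k + 1) % N] - template[(k - 1) % N]) / 2 for k in range(N)]

# Training pairs (projected appearance difference, true displacement) ...
samples = []
for _ in range(200):
    d = random.uniform(-2.0, 2.0)
    diff = [a - b for a, b in zip(shift(template, d), template)]
    samples.append((sum(gi * xi for gi, xi in zip(g, diff)), d))

# ... and the conditional model: a least-squares fit of
# displacement = alpha * projected-difference (closed form).
alpha = sum(s * d for s, d in samples) / sum(s * s for s, _ in samples)

# The learned regressor recovers an unseen sub-pixel displacement.
diff = [a - b for a, b in zip(shift(template, 1.3), template)]
pred = alpha * sum(gi * xi for gi, xi in zip(g, diff))
print(round(pred, 2))  # ≈ 1.3
```

Generative LK would instead model how appearance changes as a function of displacement and invert that model at alignment time; here the inverse map is what gets learned.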

Where Should Saliency Models Look Next?

Recently, large breakthroughs have been observed in saliency modeling. The top scores on saliency benchmarks have become dominated by neural network models of saliency, and some evaluation scores have begun to saturate. Large jumps in performance relative to previous models can be found across datasets, image types, and evaluation metrics. Have saliency models begun to converge on human performance? In this paper, we re-examine the current state-of-the-art using a fine-grained analysis on image types, individual images, and image regions. Using experiments to gather annotations for high-density regions of human eye fixations on images in two established saliency datasets, MIT300 and CAT2000, we quantify up to 60% of the remaining errors of saliency models. We argue that to continue to approach human-level performance, saliency models will need to discover higher-level concepts in images: text, objects of gaze and action, locations of motion, and expected locations of people in images. Moreover, they will need to reason about the relative importance of image regions, such as focusing on the most important person in the room or the most informative sign on the road. More accurately tracking performance will require finer-grained evaluations and metrics. Pushing performance further will require higher-level image understanding.

Zoya Bylinskii, Adrià Recasens, Ali Borji, Aude Oliva, Antonio Torralba, Frédo Durand

Robust Face Alignment Using a Mixture of Invariant Experts

Face alignment, which is the task of finding the locations of a set of facial landmark points in an image of a face, is useful in widespread application areas. Face alignment is particularly challenging when there are large variations in pose (in-plane and out-of-plane rotations) and facial expression. To address this issue, we propose a cascade in which each stage consists of a mixture of regression experts. Each expert learns a customized regression model that is specialized to a different subset of the joint space of pose and expressions. The system is invariant to a predefined class of transformations (e.g., affine), because the input is transformed to match each expert’s prototype shape before the regression is applied. We also present a method to include deformation constraints within the discriminative alignment framework, which makes our algorithm more robust. Our algorithm significantly outperforms previous methods on publicly available face alignment datasets.

Oncel Tuzel, Tim K. Marks, Salil Tambe

Partial Linearization Based Optimization for Multi-class SVM

We propose a novel partial linearization based approach for optimizing the multi-class SVM learning problem. Our method is an intuitive generalization of the Frank-Wolfe and the exponentiated gradient algorithms. In particular, it allows us to combine several of their desirable qualities into one approach: (i) the use of an expectation oracle (which provides the marginals over each output class) in order to estimate an informative descent direction, similar to exponentiated gradient; (ii) analytical computation of the optimal step-size in the descent direction that guarantees an increase in the dual objective, similar to Frank-Wolfe; and (iii) a block coordinate formulation similar to the one proposed for Frank-Wolfe, which allows us to solve large-scale problems. Using the challenging computer vision problems of action classification, object recognition and gesture recognition, we demonstrate the efficacy of our approach on training multi-class SVMs with standard, publicly available, machine learning datasets.

Pritish Mohapatra, Puneet Kumar Dokania, C. V. Jawahar, M. Pawan Kumar
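
The analytic step-size ingredient can be shown on a toy Frank-Wolfe instance over the probability simplex (a sketch under simplified assumptions, not the paper's block-coordinate solver):

```python
def frank_wolfe_simplex(b, iters=2000):
    """Frank-Wolfe with the analytic optimal step size, minimizing the
    quadratic f(x) = 0.5 * ||x - b||^2 over the probability simplex.

    For a quadratic objective the exact line-search step has a closed
    form, clipped to [0, 1]; the linear minimization oracle over the
    simplex is just a one-hot vertex pick.
    """
    n = len(b)
    x = [1.0] + [0.0] * (n - 1)                     # start at a vertex
    for _ in range(iters):
        grad = [xi - bi for xi, bi in zip(x, b)]
        i = min(range(n), key=lambda j: grad[j])    # LMO: best vertex
        s = [1.0 if j == i else 0.0 for j in range(n)]
        d = [si - xi for si, xi in zip(s, x)]       # descent direction
        denom = sum(di * di for di in d)
        if denom == 0:
            break
        # Analytic step: argmin_gamma f(x + gamma * d), clipped to [0, 1].
        gamma = -sum(gi * di for gi, di in zip(grad, d)) / denom
        gamma = max(0.0, min(1.0, gamma))
        x = [xi + gamma * di for xi, di in zip(x, d)]
    return x

x = frank_wolfe_simplex([0.6, 0.3, 0.1])
print([round(c, 2) for c in x])  # ≈ [0.6, 0.3, 0.1]
```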

Search-Based Depth Estimation via Coupled Dictionary Learning with Large-Margin Structure Inference

Depth estimation from a single image is an emerging topic in computer vision and beyond. To this end, the existing works typically train a depth regressor from visual appearance. However, the state-of-the-art performance of these schemes is still far from satisfactory, mainly because of the over-fitting and under-fitting problems in regressor training. In this paper, we offer a different data-driven paradigm of estimating depth from a single image, which formulates depth estimation from a search-based perspective. In particular, we handle the depth estimation of local patches via a novel cross-modality retrieval scheme, which searches for the 3D patches with similar structure/appearance to the 2D query from a dataset with 2D-3D mappings. To that effect, a coupled dictionary learning formulation is proposed that links the 2D query with the 3D patches through their reconstruction coefficients, which capture the cross-modality similarity and yield a rough local depth estimate. In addition, consistency on spatial context is further introduced to refine the local depth estimation using a Conditional Random Field. We demonstrate the efficacy of the proposed method by comparing it with the state-of-the-art approaches on popular public datasets such as Make3D and NYUv2, upon which significant performance gains are reported.

Yan Zhang, Rongrong Ji, Xiaopeng Fan, Yan Wang, Feng Guo, Yue Gao, Debin Zhao

Scalable Metric Learning via Weighted Approximate Rank Component Analysis

We are interested in the large-scale learning of Mahalanobis distances, with a particular focus on person re-identification. We propose a metric learning formulation called Weighted Approximate Rank Component Analysis (WARCA). WARCA optimizes the precision at top ranks by combining the WARP loss with a regularizer that favors orthonormal linear mappings and avoids rank-deficient embeddings. Using this new regularizer allows us to adapt the large-scale WSABIE procedure and to leverage the Adam stochastic optimization algorithm, which results in an algorithm that scales gracefully to very large data-sets. Also, we derive a kernelized version which allows us to take advantage of state-of-the-art features for re-identification when data-set size permits kernel computation. Benchmarks on recent and standard re-identification data-sets show that our method beats existing state-of-the-art techniques both in terms of accuracy and speed. We also provide experimental analysis to shed light on the properties of the regularizer we use, and how it improves performance.

Cijo Jose, François Fleuret

