2014 | Book

Computer Vision – ECCV 2014

13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V

Editors: David Fleet, Tomas Pajdla, Bernt Schiele, Tinne Tuytelaars

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

The seven-volume set comprising LNCS volumes 8689-8695 constitutes the refereed proceedings of the 13th European Conference on Computer Vision, ECCV 2014, held in Zurich, Switzerland, in September 2014. The 363 revised papers presented were carefully reviewed and selected from 1444 submissions. The papers are organized in topical sections on tracking and activity recognition; recognition; learning and inference; structure from motion and feature matching; computational photography and low-level vision; vision; segmentation and saliency; context and 3D scenes; motion and 3D scene analysis; and poster sessions.

Table of Contents


Poster Session 5 (continued)

Video Registration to SfM Models

Registering image data to Structure from Motion (SfM) point clouds is widely used to find precise camera location and orientation with respect to a world model. In the case of videos, one constraint has previously been unexploited: temporal smoothness. Without temporal smoothness, the magnitude of the pose error in each frame of a video will often dominate the magnitude of frame-to-frame pose change. This hinders the application of methods requiring stable pose estimates (e.g. tracking, augmented reality). We incorporate temporal constraints into the image-based registration setting and solve the problem by pose regularization with model fitting and smoothing methods. This leads to accurate, gap-free and smooth poses for all frames. We evaluate different methods on challenging synthetic and real street-view SfM data for varying scenarios of motion speed, outlier contamination, pose estimation failures and 2D-3D correspondence noise. For all test cases a 2 to 60-fold reduction in root mean squared (RMS) positional error is observed, depending on pose estimation difficulty. For varying scenarios, different methods perform best. We give guidance on which methods should be preferred depending on circumstances and requirements.
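
To make the idea of gap-free, smoothed poses concrete, the following is a minimal sketch and not one of the methods evaluated in the paper. It assumes a single position coordinate per frame, with `None` marking frames where registration failed; the helper names `fill_gaps`, `moving_average` and `rms_error` are hypothetical. Gaps are filled by linear interpolation and the track is then smoothed with a centered moving average.

```python
from typing import List, Optional


def fill_gaps(track: List[Optional[float]]) -> List[float]:
    """Linearly interpolate frames where pose estimation failed (None)."""
    known = [i for i, v in enumerate(track) if v is not None]
    if not known:
        raise ValueError("need at least one successfully registered frame")
    filled = []
    for i, v in enumerate(track):
        if v is not None:
            filled.append(float(v))
            continue
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is None:            # gap at the start: copy the next known value
            filled.append(float(track[nxt]))
        elif nxt is None:           # gap at the end: copy the last known value
            filled.append(float(track[prev]))
        else:                       # interior gap: linear interpolation
            t = (i - prev) / (nxt - prev)
            filled.append((1 - t) * track[prev] + t * track[nxt])
    return filled


def moving_average(track: List[float], radius: int = 1) -> List[float]:
    """Centered moving average; the window is truncated at the ends."""
    out = []
    for i in range(len(track)):
        lo, hi = max(0, i - radius), min(len(track), i + radius + 1)
        window = track[lo:hi]
        out.append(sum(window) / len(window))
    return out


def rms_error(estimate: List[float], truth: List[float]) -> float:
    """Root mean squared positional error against a reference track."""
    return (sum((e - t) ** 2 for e, t in zip(estimate, truth)) / len(truth)) ** 0.5


# Toy data (an assumption): constant-velocity motion with alternating noise
# and one failed frame.
truth = [float(i) for i in range(10)]
noisy = [t + (0.5 if i % 2 == 0 else -0.5) for i, t in enumerate(truth)]
noisy[4] = None
filled = fill_gaps(noisy)
smoothed = moving_average(filled, radius=1)
```

On this toy track the smoothed positions have a lower RMS error than the gap-filled raw positions, which mirrors the qualitative claim of the abstract without reproducing its numbers.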

Till Kroeger, Luc Van Gool
Soft Cost Aggregation with Multi-resolution Fusion

This paper presents a simple and effective cost volume aggregation framework for addressing pixel labeling problems. Our idea is based on the observation that incorrect labelings are greatly reduced in cost volume aggregation results from low resolutions. However, image details may be lost in the low resolution results. To take advantage of the results from low resolution for reducing these incorrect labelings while preserving details, we propose a multi-resolution cost aggregation method (MultiAgg) by using a soft fusion scheme based on min-convolution. We implement our MultiAgg in applications on stereo matching and interactive image segmentation. Experimental results show that our method significantly outperforms conventional cost aggregation methods in labeling accuracy. Moreover, although MultiAgg is a simple and straightforward method, it produces results which are close to or even better than those from iterative methods based on global optimization.
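
As an illustration of min-convolution, here is a small sketch for a single pixel. The linear label penalty, the weight `lam`, and the additive fusion rule in `soft_fuse` are assumptions made for this example and are not taken from the paper. The min-convolution itself, `out[l] = min over l' of (cost[l'] + lam * |l - l'|)`, is computed with the standard two-pass scheme.

```python
def min_convolution_linear(cost, lam):
    """Min-convolution of a 1D cost vector with the penalty lam * |l - l'|.

    Two linear passes (forward, then backward) give the exact result.
    """
    out = list(cost)
    for l in range(1, len(out)):               # propagate from lower labels
        out[l] = min(out[l], out[l - 1] + lam)
    for l in range(len(out) - 2, -1, -1):      # propagate from higher labels
        out[l] = min(out[l], out[l + 1] + lam)
    return out


def soft_fuse(fine_cost, coarse_cost, lam):
    """Hypothetical soft fusion: add the min-convolved coarse cost to the fine cost."""
    softened = min_convolution_linear(coarse_cost, lam)
    return [f + s for f, s in zip(fine_cost, softened)]


# Toy per-pixel costs over four labels (assumed values).
coarse = [5, 1, 4, 6]            # low-resolution result prefers label 1
fine = [2.0, 1.0, 2.4, 0.5]      # full-resolution result has a spurious minimum at label 3
fused = soft_fuse(fine, coarse, lam=1)
best_label = min(range(len(fused)), key=fused.__getitem__)
```

Because the coarse cost is softened rather than imposed as a hard constraint, labels near the coarse minimum stay cheap, so fine-resolution detail can still move the final label by a small amount.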

Xiao Tan, Changming Sun, Dadong Wang, Yi Guo, Tuan D. Pham
Inverse Kernels for Fast Spatial Deconvolution

Deconvolution is an indispensable tool in image processing and computer vision. It commonly employs the fast Fourier transform (FFT) to simplify computation. This operator, however, needs to transform from and to the frequency domain and loses spatial information when processing irregular regions. We propose an efficient spatial deconvolution method that can incorporate sparse priors to suppress noise and visual artifacts. It is based on estimating inverse kernels that are decomposed into a series of 1D kernels. An augmented Lagrangian method is adopted, so that the inverse kernels need to be estimated only once for each optimization process. Our method is fully parallelizable and its speed is comparable to or even faster than other strategies employing FFTs.

Li Xu, Xin Tao, Jiaya Jia
Deep Network Cascade for Image Super-resolution

In this paper, we propose a new model called deep network cascade (DNC) to gradually upscale low-resolution images layer by layer, each layer with a small scale factor. DNC is a cascade of multiple stacked collaborative local auto-encoders. In each layer of the cascade, non-local self-similarity search is first performed to enhance high-frequency texture details of the partitioned patches in the input image. The enhanced image patches are then input into a collaborative local auto-encoder (CLA) to suppress noise and to enforce compatibility among the overlapping patches. By closing the loop on non-local self-similarity search and CLA in a cascade layer, we can refine the super-resolution result, which is further fed into the next layer until the required image scale is reached. Experiments on image super-resolution demonstrate that the proposed DNC can gradually upscale a low-resolution image with the increase of network layers and achieve more promising results in visual quality as well as quantitative performance.

Zhen Cui, Hong Chang, Shiguang Shan, Bineng Zhong, Xilin Chen
Spectral Edge Image Fusion: Theory and Applications

This paper describes a novel approach to the fusion of multidimensional images for colour displays. The goal of the method is to generate an output image whose gradient matches that of the input as closely as possible. It achieves this using a constrained contrast mapping paradigm in the gradient domain, where the structure tensor of a high-dimensional gradient representation is mapped to that of a low-dimensional gradient field which is subsequently reintegrated to generate an output. Constraints on the output colours are provided by an initial RGB rendering to produce ‘naturalistic’ colours: we provide a theorem for projecting higher-D contrast onto the initial colour gradients such that they remain close to the original gradients whilst maintaining exact high-D contrast. The solution to this constrained optimisation is closed-form, allowing for a very simple and hence fast and efficient algorithm. Our approach is generic in that it can map any N-D image data to any M-D output, and can be used in a variety of applications using the same basic algorithm. In this paper we focus on the problem of mapping N-D inputs to 3-D colour outputs. We present results in three applications: hyperspectral remote sensing, fusion of colour and near-infrared images, and colour visualisation of MRI Diffusion-Tensor imaging.

David Connah, Mark Samuel Drew, Graham David Finlayson
Spatio-chromatic Opponent Features

This work proposes colour opponent features that are based on low-level models of mammalian colour visual processing. A key step is the construction of opponent spatio-chromatic feature maps by filtering colour planes with Gaussians of unequal spreads. Weighted combination of these planes yields a spatial center-surround effect across chromatic channels. The resulting feature spaces – substantially different to CIELAB and other colour-opponent spaces obtained by colour-plane differencing – are further processed to assign local spatial orientations. The nature of the initial spatio-chromatic processing requires a customised approach to generating gradient-like fields, which is also described. The resulting direction-encoding responses are then pooled to form compact descriptors. The individual performance of the new descriptors was found to be substantially higher than those arising from spatial processing of standard opponent colour spaces, and these are the first chromatic descriptors that appear to achieve such performance levels individually. For all stages, parametrisations are suggested that allow successful optimisation using categorization performance as an objective. Classification benchmarks on Pascal VOC 2007 and Bird-200-2011 are presented to show the merits of these new features.

Ioannis Alexiou, Anil A. Bharath
Modeling Perceptual Color Differences by Local Metric Learning

Having perceptual differences between scene colors is key in many computer vision applications such as image segmentation or visual salient region detection. Nevertheless, most of the time, we only have access to the rendered image colors, without any means to go back to the true scene colors. The main existing approaches propose either to compute a perceptual distance between the rendered image colors, or to estimate the scene colors from the rendered image colors and then to evaluate perceptual distances. However, the first approach provides distances that can be far from the scene color differences, while the second requires knowledge of the acquisition conditions, which are unavailable for most applications. In this paper, we design a new local Mahalanobis-like metric learning algorithm that aims at approximating a perceptual scene color difference that is invariant to the acquisition conditions and computed only from rendered image colors. Using the theoretical framework of uniform stability, we provide consistency guarantees on the learned model. Moreover, our experimental evaluation shows its great ability (i) to generalize to new colors and devices and (ii) to deal with segmentation tasks.
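
The form of a local Mahalanobis-like distance can be sketched as follows. This shows only how such a metric is evaluated, not how it is learned; the region layout, the rule of picking the matrix of the region nearest to the midpoint of the two colors, and the names `mahalanobis` and `local_distance` are assumptions for illustration.

```python
import math


def mahalanobis(x, y, M):
    """sqrt((x - y)^T M (x - y)) for a positive semi-definite matrix M."""
    d = [a - b for a, b in zip(x, y)]
    Md = [sum(M[i][j] * d[j] for j in range(len(d))) for i in range(len(d))]
    return math.sqrt(sum(d[i] * Md[i] for i in range(len(d))))


def local_distance(x, y, regions):
    """Evaluate the distance with the matrix of the closest region.

    regions is a list of (centroid, M) pairs; the region is chosen by the
    midpoint of x and y (a hypothetical selection rule).
    """
    mid = [(a + b) / 2 for a, b in zip(x, y)]
    _, M = min(
        regions,
        key=lambda r: sum((m - c) ** 2 for m, c in zip(mid, r[0])),
    )
    return mahalanobis(x, y, M)


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
stretched = [[4, 0, 0], [0, 4, 0], [0, 0, 4]]
regions = [([0, 0, 0], identity), ([10, 10, 10], stretched)]

near_origin = local_distance([0, 0, 0], [3, 4, 0], regions)
near_second = local_distance([10, 10, 10], [13, 14, 10], regions)
```

The same color offset is measured differently in the two regions, which is the point of a local rather than a global metric.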

Michaël Perrot, Amaury Habrard, Damien Muselet, Marc Sebban
Online Graph-Based Tracking

Tracking by sequential Bayesian filtering relies on a graphical model with temporally ordered linear structure based on a temporal smoothness assumption. This framework is convenient for propagating the posterior through the first-order Markov chain. However, density propagation from a single immediately preceding frame may be unreliable, especially in challenging situations such as abrupt appearance changes, fast motion, occlusion, and so on. We propose a visual tracking algorithm based on more general graphical models, where multiple previous frames contribute to computing the posterior in the current frame and edges between frames are created upon inter-frame trackability. Such a data-driven graphical model reflects sequence structures as well as target characteristics, and is better suited to implementing a robust tracking algorithm. The proposed tracking algorithm runs online and achieves outstanding performance with respect to the state-of-the-art trackers. We illustrate quantitative and qualitative performance of our algorithm on all the sequences in the tracking benchmark and other challenging videos.

Hyeonseob Nam, Seunghoon Hong, Bohyung Han
Fast Visual Tracking via Dense Spatio-temporal Context Learning

In this paper, we present a simple yet fast and robust algorithm which exploits the dense spatio-temporal context for visual tracking. Our approach formulates the spatio-temporal relationships between the object of interest and its locally dense contexts in a Bayesian framework, which models the statistical correlation between the simple low-level features (i.e., image intensity and position) from the target and its surrounding regions. The tracking problem is then posed by computing a confidence map which takes into account the prior information of the target location and thereby alleviates target location ambiguity effectively. We further propose a novel explicit scale adaptation scheme, which is able to deal with target scale variations efficiently and effectively. The Fast Fourier Transform (FFT) is adopted for fast learning and detection in this work, which only needs 4 FFT operations. Implemented in MATLAB without code optimization, the proposed tracker runs at 350 frames per second on an i7 machine. Extensive experimental results show that the proposed algorithm performs favorably against state-of-the-art methods in terms of efficiency, accuracy and robustness.

Kaihua Zhang, Lei Zhang, Qingshan Liu, David Zhang, Ming-Hsuan Yang
Extended Lucas-Kanade Tracking

The Lucas-Kanade (LK) method is a classic tracking algorithm exploiting target structural constraints through template matching. Extended Lucas-Kanade (ELK) casts the original LK algorithm as a maximum likelihood optimization and then extends it by considering pixel object/background likelihoods in the optimization. Template matching and pixel-based object/background segregation are tied together by a unified Bayesian framework. In this framework two log-likelihood terms related to pixel object/background affiliation are introduced in addition to the standard LK template matching term. Tracking is performed using an EM algorithm, in which the E-step corresponds to pixel object/background inference, and the M-step to parameter optimization. The final algorithm, implemented using a classifier for object/background modeling and equipped with simple template update and occlusion handling logic, is evaluated on two challenging data-sets containing 50 sequences each. The first is a recently published benchmark where ELK ranks 3rd among 30 tracking methods evaluated. On the second data-set of vehicles undergoing severe view point changes, ELK ranks in 1st place, outperforming state-of-the-art methods.

Shaul Oron, Aharon Bar-Hillel, Shai Avidan
Appearances Can Be Deceiving: Learning Visual Tracking from Few Trajectory Annotations

Visual tracking is the task of estimating the trajectory of an object in a video given its initial location. This is usually done by combining at each step an appearance and a motion model. In this work, we learn from a small set of training trajectory annotations how the objects in the scene typically move. We learn the relative weight between the appearance and the motion model. We call this weight visual deceptiveness. At test time, we transfer the deceptiveness and the displacement from the closest trajectory annotation to infer the next location of the object. Further, we condition the transference on an event model. On a set of 161 manually annotated test trajectories, we show in our experiments that learning from just 10 trajectory annotations reduces the center location error and improves the success rate by about 10%.

Santiago Manen, Junseok Kwon, Matthieu Guillaumin, Luc Van Gool
Generalized Background Subtraction Using Superpixels with Label Integrated Motion Estimation

We propose an online background subtraction algorithm with superpixel-based density estimation for videos captured by a moving camera. Our algorithm maintains appearance and motion models of foreground and background for each superpixel, computes foreground and background likelihoods for each pixel based on the models, and determines pixelwise labels using binary belief propagation. The estimated labels trigger the update of appearance and motion models, and the above steps are performed iteratively in each frame. After convergence, appearance models are propagated through sequential Bayesian filtering, where predictions rely on motion fields of both labels whose computation exploits the segmentation mask. Superpixel-based modeling and label integrated motion estimation make propagated appearance models more accurate compared to existing methods since the models are constructed on visually coherent regions and the quality of estimated motion is improved by avoiding motion smoothing across regions with different labels. We evaluate our algorithm with challenging video sequences and present significant performance improvement over the state-of-the-art techniques quantitatively and qualitatively.

Jongwoo Lim, Bohyung Han
Spectra Estimation of Fluorescent and Reflective Scenes by Using Ordinary Illuminants

The spectral behavior of a typical fluorescent object is regulated by its reflectance, absorption and emission spectra. It was shown that two high-frequency and complementary illuminations in the spectral domain can be used to simultaneously estimate reflectance and emission spectra. Although this approach is accurate, such specialized illuminations are not easily accessible. This motivates us to explore the feasibility of using ordinary illuminants to achieve this task with comparable accuracy. We show that three hyperspectral images under wideband and independent illuminants are both necessary and sufficient, and successfully develop a convex optimization method for solving the estimation problem. We also disclose the reason why using one or two images is inadequate, although embedding the linear low-dimensional models of reflectance and emission would lead to an apparently overconstrained equation system. In addition, we propose a novel four-parameter model to express absorption and emission spectra, which is more compact and discriminative than the linear model. Based on this model, we present an absorption spectra estimation method in the presence of three illuminations. The correctness and accuracy of our proposed model and methods have been verified.

Yinqiang Zheng, Imari Sato, Yoichi Sato
Interreflection Removal Using Fluorescence

Interreflections exhibit a number of challenges for existing shape-from-intensity methods that only assume a direct lighting model. Removing the interreflections from scene observations is of broad interest since it enhances the accuracy of those methods. In this paper, we propose a method for removing interreflections from a single image using fluorescence. From a bispectral observation of reflective and fluorescent components recorded in distinct color channels, our method separates direct lighting from interreflections. Experimental results demonstrate the effectiveness of the proposed method on complex and dynamic scenes. In addition, we show how our method improves an existing photometric stereo method in shape recovery.

Ying Fu, Antony Lam, Yasuyuki Matsushita, Imari Sato, Yoichi Sato
Intrinsic Face Image Decomposition with Human Face Priors

We present a method for decomposing a single face photograph into its intrinsic image components. Intrinsic image decomposition has commonly been used to facilitate image editing operations such as relighting and re-texturing. Although current single-image intrinsic image methods are able to obtain an approximate decomposition, image operations involving the human face require greater accuracy since slight errors can lead to visually disturbing results. To improve decomposition for faces, we propose to utilize human face priors as constraints for intrinsic image estimation. These priors include statistics on skin reflectance and facial geometry. We also make use of a physically-based model of skin translucency to heighten accuracy, as well as to further decompose the reflectance image into a diffuse and a specular component. With the use of priors and a skin reflectance model for human faces, our method is able to achieve appreciable improvements in intrinsic image decomposition over more generic techniques.

Chen Li, Kun Zhou, Stephen Lin
Recovering Scene Geometry under Wavy Fluid via Distortion and Defocus Analysis

In this paper, we consider scenes that are immersed in transparent refractive media with a dynamic surface. We take the first steps to reconstruct both the 3D fluid surface shape and the 3D structure of the immersed scene simultaneously by utilizing distortion and defocus cues. We demonstrate that the images captured through a refractive dynamic fluid surface are the distorted and blurred versions of all-in-focus (AIF) images captured through a flat fluid surface. The amounts of distortion and refractive blur are formulated by the shape of the fluid surface, scene depth and camera parameters, based on our refractive geometry model of a finite aperture imaging system. An iterative optimization algorithm is proposed to reconstruct the distortion and immersed scene depth, which are then used to infer the 3D fluid surface. We validate and demonstrate the effectiveness of our approach on a variety of synthetic and real scenes under different fluid surfaces.

Mingjie Zhang, Xing Lin, Mohit Gupta, Jinli Suo, Qionghai Dai
Human Detection Using Learned Part Alphabet and Pose Dictionary

As structured data, the human body and text are similar in many respects. In this paper, we make use of the analogy between the human body and text to build a compositional model for human detection in natural scenes. Basic concepts and mature techniques in text recognition are introduced into this model. A discriminative alphabet, each grapheme of which is a mid-level element representing a body part, is automatically learned from bounding box labels. Based on this alphabet, the flexible structure of the human body is expressed by means of symbolic sequences, which correspond to various human poses and allow for robust, efficient matching. A pose dictionary is constructed from training examples, which is used to verify hypotheses at runtime. Experiments on standard benchmarks demonstrate that the proposed algorithm achieves state-of-the-art or competitive performance.

Cong Yao, Xiang Bai, Wenyu Liu, Longin Jan Latecki
SPADE: Scalar Product Accelerator by Integer Decomposition for Object Detection

We propose a method for accelerating computation of an object detector based on a linear classifier when objects are expressed by binary feature vectors. Our key idea is to decompose a real-valued weight vector of the linear classifier into a weighted sum of a few ternary basis vectors so as to preserve the original classification scores. Our data-dependent decomposition algorithm can approximate the original classification scores by a small number of the ternary basis vectors with an allowable error. Instead of using the original real-valued weight vector, the approximated classification score can be obtained by evaluating the few inner products between the binary feature vector and the ternary basis vectors, which can be computed using extremely fast logical operations. We also show that each evaluation of the inner products can be cascaded for incorporating early termination. Our experiments revealed that the linear filtering used in a HOG-based object detector becomes 36.9× faster than the original implementation with 1.5% loss of accuracy for 0.1 false positives per image in a pedestrian detection task.
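
The decomposition idea can be sketched in a few lines. The greedy residual fitting below is a simple data-independent stand-in, not the data-dependent algorithm of the paper, and the function names are hypothetical. A weight vector `w` is approximated as a sum of `c_i * m_i` with each `m_i` in {-1, 0, +1}, and the inner product with a binary feature vector is then evaluated with bitwise AND and popcount.

```python
def ternary_decompose(w, num_bases):
    """Greedily approximate w by a weighted sum of ternary basis vectors."""
    residual = list(w)
    coeffs, bases = [], []
    for _ in range(num_bases):
        # Consider supports made of the k largest-magnitude residual entries.
        order = sorted(range(len(residual)), key=lambda i: -abs(residual[i]))
        best_k, best_gain, running = 0, 0.0, 0.0
        for k, i in enumerate(order, start=1):
            running += abs(residual[i])
            gain = running * running / k     # reduction of the squared residual
            if gain > best_gain:
                best_k, best_gain = k, gain
        if best_k == 0:                       # residual is already zero
            break
        support = order[:best_k]
        m = [0] * len(residual)
        for i in support:
            m[i] = 1 if residual[i] > 0 else -1
        c = sum(abs(residual[i]) for i in support) / best_k  # least-squares coefficient
        coeffs.append(c)
        bases.append(m)
        residual = [r - c * mi for r, mi in zip(residual, m)]
    return coeffs, bases


def to_masks(m):
    """Encode a ternary vector as two bitmasks: positions of +1 and of -1."""
    pos = sum(1 << i for i, v in enumerate(m) if v == 1)
    neg = sum(1 << i for i, v in enumerate(m) if v == -1)
    return pos, neg


def popcount(v):
    return bin(v).count("1")


def approx_score(x_bits, coeffs, bases):
    """Approximate w . x for a binary x packed into the integer x_bits."""
    total = 0.0
    for c, m in zip(coeffs, bases):
        pos, neg = to_masks(m)
        total += c * (popcount(x_bits & pos) - popcount(x_bits & neg))
    return total


w = [0.9, -0.4, 0.1, -0.8]                 # toy weights (assumed values)
x = [1, 0, 1, 1]                           # binary feature vector
x_bits = sum(1 << i for i, v in enumerate(x) if v)
exact = sum(wi * xi for wi, xi in zip(w, x))
coarse = approx_score(x_bits, *ternary_decompose(w, 1))
fine = approx_score(x_bits, *ternary_decompose(w, 4))
```

Because bases are added one at a time, the partial sums can be inspected after each basis, which is what makes the cascaded early termination mentioned in the abstract possible.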

Mitsuru Ambai, Ikuro Sato
Detecting Snap Points in Egocentric Video with a Web Photo Prior

Wearable cameras capture a first-person view of the world, and offer a hands-free way to record daily experiences or special events. Yet, not every frame is worthy of being captured and stored. We propose to automatically predict “snap points” in unedited egocentric video, that is, those frames that look like they could have been intentionally taken photos. We develop a generative model for snap points that relies on a Web photo prior together with domain-adapted features. Critically, our approach avoids strong assumptions about the particular content of snap points, focusing instead on their composition. Using 17 hours of egocentric video from both human and mobile robot camera wearers, we show that the approach accurately isolates those frames that human judges would believe to be intentionally snapped photos. In addition, we demonstrate the utility of snap point detection for improving object detection and keyframe selection in egocentric video.

Bo Xiong, Kristen Grauman
Towards Unified Object Detection and Semantic Segmentation

Object detection and semantic segmentation are two strongly correlated tasks, yet they are typically solved separately or sequentially with substantially different techniques. Motivated by the complementary effect observed from the typical failure cases of the two tasks, we propose a unified framework for joint object detection and semantic segmentation. By enforcing the consistency between final detection and segmentation results, our unified framework can effectively leverage the advantages of leading techniques for these two tasks. Furthermore, both local and global context information are integrated into the framework to better distinguish the ambiguous samples. By jointly optimizing the model parameters for all the components, the relative importance of the different components is automatically learned for each category to guarantee the overall performance. Extensive experiments on the PASCAL VOC 2010 and 2012 datasets demonstrate encouraging performance of the proposed unified framework for both object detection and semantic segmentation tasks.

Jian Dong, Qiang Chen, Shuicheng Yan, Alan Yuille
Foreground Consistent Human Pose Estimation Using Branch and Bound

We propose a method for human pose estimation which extends common unary and pairwise terms of graphical models with a global foreground term. Given knowledge of per pixel foreground, a pose should not only be plausible according to the graphical model but also explain the foreground well.

However, while inference on a standard tree-structured graphical model for pose estimation can be computed easily and very efficiently using dynamic programming, this no longer holds when the global foreground term is added to the problem.

We therefore propose a branch and bound based algorithm to retrieve the globally optimal solution to our pose estimation problem. To keep inference tractable and avoid the obvious combinatorial explosion, we propose upper bounds allowing for an intelligent exploration of the solution space.

We evaluated our method on several publicly available datasets, showing the benefits of our method.
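
The role of upper bounds in branch and bound can be illustrated on a toy problem. Everything below is an assumption made for illustration and is not the paper's model or bound: each part picks one candidate position with a unary score, and a global term rewards the number of distinct foreground positions covered. The bound adds the best remaining unary scores to the largest coverage still achievable, so it never underestimates, and best-first search returns the global optimum.

```python
import heapq
from itertools import product


def score(assignment, unary, foreground, weight):
    """Unary scores plus a global foreground coverage term."""
    u = sum(unary[p][c] for p, c in enumerate(assignment))
    return u + weight * len(set(assignment) & foreground)


def upper_bound(partial, unary, foreground, weight):
    """Optimistic score of any completion of a partial assignment (weight >= 0)."""
    u = sum(unary[p][c] for p, c in enumerate(partial))
    rest = sum(max(unary[p].values()) for p in range(len(partial), len(unary)))
    covered = len(set(partial) & foreground)
    remaining = len(unary) - len(partial)
    return u + rest + weight * min(len(foreground), covered + remaining)


def branch_and_bound(unary, foreground, weight):
    """Best-first search; the first complete assignment popped is optimal."""
    heap = [(-upper_bound((), unary, foreground, weight), ())]
    expanded = 0
    while heap:
        neg_bound, partial = heapq.heappop(heap)
        if len(partial) == len(unary):
            return partial, -neg_bound, expanded
        expanded += 1
        for cand in unary[len(partial)]:
            child = partial + (cand,)
            bound = upper_bound(child, unary, foreground, weight)
            heapq.heappush(heap, (-bound, child))
    raise ValueError("no parts given")


# Toy instance (assumed values): three parts, candidate positions are integers.
unary = [{0: 1.0, 1: 0.9}, {0: 1.0, 2: 0.2}, {3: 0.5, 0: 0.6}]
foreground = {1, 2, 3}
best, best_value, expanded = branch_and_bound(unary, foreground, weight=0.5)

# Exhaustive check over all combinations for comparison.
brute_value = max(
    score(a, unary, foreground, 0.5) for a in product(*[list(u) for u in unary])
)
```

For a complete assignment the bound equals the true score, which is why the first complete assignment taken from the priority queue cannot be beaten by anything still in the queue.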

Jens Puwein, Luca Ballan, Remo Ziegler, Marc Pollefeys
Human Pose Estimation with Fields of Parts

This paper proposes a new formulation of the human pose estimation problem. We present the Fields of Parts model, a binary Conditional Random Field model designed to detect human body parts of articulated people in single images.

The Fields of Parts model is inspired by the idea of Pictorial Structures: it models local appearance and joint spatial configuration of the human body. However, the underlying graph structure is entirely different. The idea is simple: we model the presence and absence of a body part at every possible position, orientation, and scale in an image with a binary random variable. This results in a vast number of random variables; however, we show that approximate inference in this model is efficient. Moreover, we can encode the very same appearance and spatial structure as in Pictorial Structures models.

This approach allows us to combine ideas from segmentation and pose estimation into a single model. The Fields of Parts model can use evidence from the background, include local color information, and it is connected more densely than a kinematic chain structure. On the challenging Leeds Sports Poses dataset we improve over the Pictorial Structures counterpart by 6.0% in terms of Average Precision of Keypoints.

Martin Kiefel, Peter Vincent Gehler
Unsupervised Video Adaptation for Parsing Human Motion

In this paper, we propose a method to parse human motion in unconstrained Internet videos without labeling any videos for training. We use the training samples from a public image pose dataset to avoid the tediousness of labeling video streams. Two main problems arise. First, the distributions of images and videos differ. Second, no temporal information is available in the training images. To smooth the inconsistency between the labeled images and unlabeled videos, our algorithm iteratively incorporates the pose knowledge harvested from the testing videos into the image pose detector via an adjust-and-refine method. During this process, continuity and tracking constraints are imposed to leverage the spatio-temporal information only available in videos. For our experiments, we have collected two datasets from YouTube, and the experiments show that our method achieves good performance for parsing human motions. Furthermore, we found that our method achieves better performance by using unlabeled video than by adding more labeled pose images into the training set.

Haoquan Shen, Shoou-I Yu, Yi Yang, Deyu Meng, Alexander Hauptmann
Training Object Class Detectors from Eye Tracking Data

Training an object class detector typically requires a large set of images annotated with bounding-boxes, which is expensive and time consuming to create. We propose a novel approach to annotate object locations which can substantially reduce annotation time. We first track the eye movements of annotators instructed to find the object and then propose a technique for deriving object bounding-boxes from these fixations. To validate our idea, we collected eye tracking data for the trainval part of 10 object classes of Pascal VOC 2012 (6,270 images, 5 observers). Our technique correctly produces bounding-boxes in 50% of the images, while reducing the total annotation time by a factor of 6.8× compared to drawing bounding-boxes. Any standard object class detector can be trained on the bounding-boxes predicted by our model. Our large scale eye tracking dataset is available at


Dim P. Papadopoulos, Alasdair D. F. Clarke, Frank Keller, Vittorio Ferrari
Depth Based Object Detection from Partial Pose Estimation of Symmetric Objects

Category-level object detection, the task of locating object instances of a given category in images, has been tackled with many algorithms employing standard color images. Less attention has been given to solving it using range and depth data, which has lately become readily available using laser and RGB-D cameras. Exploiting the different nature of the depth modality, we propose a novel shape-based object detector with partial pose estimation for axial or reflection symmetric objects. We estimate this partial pose by detecting the target’s symmetry, which as a global mid-level feature provides us with a robust frame of reference with which shape features are represented for detection. Results are shown on a particularly challenging depth dataset and exhibit significant improvement compared to the prior art.

Ehud Barnea, Ohad Ben-Shahar
Edge Boxes: Locating Object Proposals from Edges

The use of object proposals is an effective recent approach for increasing the computational efficiency of object detection. We propose a novel method for generating object bounding box proposals using edges. Edges provide a sparse yet informative representation of an image. Our main observation is that the number of contours that are wholly contained in a bounding box is indicative of the likelihood of the box containing an object. We propose a simple box objectness score that measures the number of edges that exist in the box minus those that are members of contours that overlap the box’s boundary. Using efficient data structures, millions of candidate boxes can be evaluated in a fraction of a second, returning a ranked set of a few thousand top-scoring proposals. Using standard metrics, we show results that are significantly more accurate than the current state-of-the-art while being faster to compute. In particular, given just 1000 proposals we achieve over 96% object recall at an overlap threshold of 0.5 and over 75% recall at the more challenging overlap of 0.7. Our approach runs in 0.25 seconds and we additionally demonstrate a near real-time variant with only minor loss in accuracy.
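
The box objectness score described above can be written down directly in a simplified form. This sketch assumes contours are given as lists of edge pixel coordinates and boxes as inclusive `(x0, y0, x1, y1)` tuples; it counts raw pixels only and leaves out any weighting, normalization or efficient data structures, so it illustrates the scoring rule rather than the authors' implementation.

```python
def inside(point, box):
    """True if an (x, y) point lies in the inclusive box (x0, y0, x1, y1)."""
    x, y = point
    x0, y0, x1, y1 = box
    return x0 <= x <= x1 and y0 <= y <= y1


def box_score(contours, box):
    """Edges in the box minus edges of contours that cross the box boundary."""
    total = 0
    straddling = 0
    for contour in contours:
        n_in = sum(inside(p, box) for p in contour)
        total += n_in
        if 0 < n_in < len(contour):      # contour is partly in, partly out
            straddling += n_in
    return total - straddling


# Toy edge map (assumed): one short contour and one longer contour.
contours = [
    [(1, 1), (2, 1), (3, 1)],
    [(4, 4), (5, 4), (6, 4), (7, 4)],
]
tight = box_score(contours, (0, 0, 5, 5))     # second contour crosses the boundary
loose = box_score(contours, (0, 0, 10, 10))   # both contours are wholly contained
```

Only contours that lie wholly inside the box contribute to the score, which matches the observation that wholly contained contours indicate an object.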

C. Lawrence Zitnick, Piotr Dollár
Training Deformable Object Models for Human Detection Based on Alignment and Clustering

We propose a clustering method that considers non-rigid alignment of samples. The motivation for such a clustering is training of object detectors that consist of multiple mixture components. In particular, we consider the deformable part model (DPM) of Felzenszwalb et al., where each mixture component includes a learned deformation model. We show that alignment based clustering distributes the data better to the mixture components of the DPM than previous methods. Moreover, the alignment helps the non-convex optimization of the DPM find a consistent placement of its parts and, thus, learn more accurate part filters.

Benjamin Drayer, Thomas Brox
Predicting Actions from Static Scenes

Human actions naturally co-occur with scenes. In this work we aim to discover action-scene correlation for a large number of scene categories and to use such correlation for action prediction. Towards this goal, we collect a new SUN Action dataset with manual annotations of typical human actions for 397 scenes. We next discover action-scene associations and demonstrate that scene categories can be well identified from their associated actions. Using discovered associations, we address a new task of predicting human actions for images of static scenes. We evaluate prediction of 23 and 38 action classes for images of indoor and outdoor scenes respectively and show promising results. We also propose a new application of geo-localized action prediction and demonstrate the ability of our method to automatically answer queries such as “Where is a good place for a picnic?” or “Can I cycle along this path?”.

Tuan-Hung Vu, Catherine Olsson, Ivan Laptev, Aude Oliva, Josef Sivic
Exploiting Privileged Information from Web Data for Image Categorization

Relevant and irrelevant web images collected by tag-based image retrieval have been employed as loosely labeled training data for learning SVM classifiers for image categorization by only using the visual features. In this work, we propose a new image categorization method by incorporating the textual features extracted from the surrounding textual descriptions (tags, captions, categories, etc.) as privileged information and simultaneously coping with noise in the loose labels of training web images. When the training and test samples come from different datasets, our proposed method can be further extended to reduce the data distribution mismatch by adding a regularizer based on the Maximum Mean Discrepancy (MMD) criterion. Our comprehensive experiments on three benchmark datasets demonstrate the effectiveness of our proposed methods for image categorization and image retrieval by exploiting privileged information from web data.

Wen Li, Li Niu, Dong Xu
Multi-modal Unsupervised Feature Learning for RGB-D Scene Labeling

Most of the existing approaches for RGB-D indoor scene labeling employ hand-crafted features for each modality independently and combine them in a heuristic manner. There have been some attempts at directly learning features from raw RGB-D data, but the performance is not satisfactory. In this paper, we adapt the unsupervised feature learning technique for RGB-D labeling as a multi-modality learning problem. Our learning framework performs feature learning and feature encoding simultaneously, which significantly boosts the performance. By stacking the basic learning structure, higher-level features are derived and combined with lower-level features to better represent RGB-D data. Experimental results on the benchmark NYU depth dataset show that our method achieves competitive performance compared with the state of the art.

Anran Wang, Jiwen Lu, Gang Wang, Jianfei Cai, Tat-Jen Cham
Discriminatively Trained Dense Surface Normal Estimation

In this work we propose a method for a rather unexplored problem of computer vision: discriminatively trained dense surface normal estimation from a single image. Our method combines contextual and segment-based cues and builds a regressor in a boosting framework by transforming the problem into the regression of coefficients of a local coding. We apply our method to two challenging data sets containing images of man-made environments, the indoor NYU2 data set and the outdoor KITTI data set. Our surface normal predictor achieves results better than initially expected, significantly outperforming the state of the art.

L’ubor Ladický, Bernhard Zeisl, Marc Pollefeys
Numerical Inversion of SRNFs for Efficient Elastic Shape Analysis of Star-Shaped Objects

The elastic shape analysis of surfaces has proven useful in several application areas, including medical image analysis, vision, and graphics. This approach is based on defining new mathematical representations of parameterized surfaces, including the square root normal field (SRNF), and then using the $\mathbb{L}^2$ norm to compare their shapes. Past work is based on using the pullback of the $\mathbb{L}^2$ metric to the space of surfaces, performing statistical analysis under this induced Riemannian metric. However, if one can estimate the inverse of the SRNF mapping, even approximately, a very efficient framework results: the surfaces, represented by their SRNFs, can be efficiently analyzed using standard Euclidean tools, and only the final results need be mapped back to the surface space. Here we describe a procedure for inverting SRNF maps of star-shaped surfaces, a special case for which analytic results can be obtained. We test our method via the classification of 34 cases of ADHD (Attention Deficit Hyperactivity Disorder), plus controls, in the Detroit Fetal Alcohol and Drug Exposure Cohort study. We obtain state-of-the-art results.

Qian Xie, Ian Jermyn, Sebastian Kurtek, Anuj Srivastava
Non-associative Higher-Order Markov Networks for Point Cloud Classification

In this paper, we introduce a non-associative higher-order graphical model to tackle the problem of semantic labeling of 3D point clouds. For this task, existing higher-order models overlook the relationships between the different classes and simply encourage the nodes in the cliques to have consistent labelings. We address this issue by devising a set of non-associative context patterns that describe higher-order geometric relationships between different class labels within the cliques. To this end, we propose a method to extract informative cliques in 3D point clouds that provide more knowledge about the context of the scene. We evaluate our approach on three challenging outdoor point cloud datasets. Our experiments evidence the benefits of our non-associative higher-order Markov networks over state-of-the-art point cloud labeling techniques.

Mohammad Najafi, Sarah Taghavi Namin, Mathieu Salzmann, Lars Petersson
Learning Where to Classify in Multi-view Semantic Segmentation

There is an increasing interest in semantically annotated 3D models, e.g. of cities. The typical approaches start with the semantic labelling of all the images used for the 3D model. Such labelling tends to be very time consuming though. The inherent redundancy among the overlapping images calls for more efficient solutions. This paper proposes an alternative approach that exploits the geometry of a 3D mesh model obtained from multi-view reconstruction. Instead of clustering similar views, we predict the best view before the actual labelling. For this we find the single image part that best supports the correct semantic labelling of each face of the underlying 3D mesh. Moreover, our single-image approach may surprise because it tends to increase the accuracy of the model labelling when compared to approaches that fuse the labels from multiple images. As a matter of fact, we even go a step further, and only explicitly label a subset of faces (e.g. 10%), to subsequently fill in the labels of the remaining faces. This leads to a further reduction of computation time, again combined with a gain in accuracy. Compared to a process that starts from the semantic labelling of the images, our method to semantically label 3D models yields accelerations of about 2 orders of magnitude. We tested our multi-view semantic labelling on a variety of street scenes.

Hayko Riemenschneider, András Bódis-Szomorú, Julien Weissenberg, Luc Van Gool
Stixmantics: A Medium-Level Model for Real-Time Semantic Scene Understanding

In this paper we present Stixmantics, a novel medium-level scene representation for real-time visual semantic scene understanding. Relevant scene structure, motion and object class information is encoded using so-called Stixels as primitive elements. Sparse feature-point trajectories are used to estimate the 3D motion field and to enforce temporal consistency of semantic labels. Spatial label coherency is obtained by using a CRF framework.

The proposed model abstracts and aggregates low-level pixel information to gain robustness and efficiency. Yet, enough flexibility is retained to adequately model complex scenes, such as urban traffic. Our experimental evaluation focuses on semantic scene segmentation using a recently introduced dataset for urban traffic scenes. In comparison to our best baseline approach, we demonstrate state-of-the-art performance but reduce inference time by a factor of more than 2,000, requiring only 50 ms per image.

Timo Scharwächter, Markus Enzweiler, Uwe Franke, Stefan Roth
Sparse Dictionaries for Semantic Segmentation

A popular trend in semantic segmentation is to use top-down object information to improve bottom-up segmentation. For instance, the classification scores of the Bag of Features (BoF) model for image classification have been used to build a top-down categorization cost in a Conditional Random Field (CRF) model for semantic segmentation. Recent work shows that discriminative sparse dictionary learning (DSDL) can improve upon the unsupervised K-means dictionary learning method used in the BoF model due to the ability of DSDL to capture discriminative features from different classes. However, to the best of our knowledge, DSDL has not been used for building a top-down categorization cost for semantic segmentation. In this paper, we propose a CRF model that incorporates a DSDL based top-down cost for semantic segmentation. We show that the new CRF energy can be minimized using existing efficient discrete optimization techniques. Moreover, we propose a new method for jointly learning the CRF parameters, object classifiers and the visual dictionary. Our experiments demonstrate that by jointly learning these parameters, the feature representation becomes more discriminative and the segmentation performance improves with respect to that of state-of-the-art methods that use unsupervised K-means dictionary learning.

Lingling Tao, Fatih Porikli, René Vidal
Video Action Detection with Relational Dynamic-Poselets

Action detection is of great importance in understanding human motion from video. Compared with action recognition, it not only recognizes action type, but also localizes its spatiotemporal extent. This paper presents a relational model for action detection, which first decomposes human action into temporal “key poses” and then further into spatial “action parts”. Specifically, we start by clustering cuboids around each human joint into dynamic-poselets using a new descriptor. The cuboids from the same cluster share consistent geometric and dynamic structure, and each cluster acts as a mixture of body parts. We then propose a sequential skeleton model to capture the relations among dynamic-poselets. This model unifies the tasks of learning the composites of mixture dynamic-poselets, the spatiotemporal structures of action parts, and the local model for each action part in a single framework. Our model not only localizes the action in a video stream, but also enables detailed pose estimation of an actor. We formulate the model learning problem in a structured SVM framework and speed up model inference by dynamic programming. We conduct experiments on three challenging action detection datasets: the MSR-II dataset, the UCF Sports dataset, and the JHMDB dataset. The results show that our method achieves superior performance to the state-of-the-art methods on these datasets.

Limin Wang, Yu Qiao, Xiaoou Tang
Action Recognition with Stacked Fisher Vectors

Representation of video is a vital problem in action recognition. This paper proposes Stacked Fisher Vectors (SFV), a new representation with multi-layer nested Fisher vector encoding, for action recognition. In the first layer, we densely sample large subvolumes from input videos, extract local features, and encode them using Fisher vectors (FVs). The second layer compresses the FVs of subvolumes obtained in the previous layer, and then encodes them again with Fisher vectors. Compared with standard FV, SFV allows refining the representation and abstracting semantic information in a hierarchical way. Compared with recent mid-level based action representations, SFV need not mine discriminative action parts but can preserve mid-level information through Fisher vector encoding in the higher layer. We evaluate the proposed methods on three challenging datasets, namely Youtube, J-HMDB, and HMDB51. Experimental results demonstrate the effectiveness of SFV, and the combination of the traditional FV and SFV outperforms state-of-the-art methods on these datasets by a large margin.

Xiaojiang Peng, Changqing Zou, Yu Qiao, Qiang Peng
A Discriminative Model with Multiple Temporal Scales for Action Prediction

The speed with which intelligent systems can react to an action depends on how soon it can be recognized. The ability to recognize ongoing actions is critical in many applications, for example, spotting criminal activity. It is challenging, since decisions have to be made based on partial videos of temporally incomplete action executions. In this paper, we propose a novel discriminative multi-scale model for predicting the action class from a partially observed video. The proposed model captures temporal dynamics of human actions by explicitly considering all the history of observed features as well as features in smaller temporal segments. We develop a new learning formulation, which elegantly captures the temporal evolution over time, and enforces the label consistency between segments and corresponding partial videos. Experimental results on two public datasets show that the proposed approach outperforms state-of-the-art action prediction methods.

Yu Kong, Dmitry Kit, Yun Fu
Seeing is Worse than Believing: Reading People’s Minds Better than Computer-Vision Methods Recognize Actions

We had human subjects perform a one-out-of-six class action recognition task from video stimuli while undergoing functional magnetic resonance imaging (fMRI). Support-vector machines (SVMs) were trained on the recovered brain scans to classify actions observed during imaging, yielding average classification accuracy of 69.73% when tested on scans from the same subject and of 34.80% when tested on scans from different subjects. An apples-to-apples comparison was performed with all publicly available software that implements state-of-the-art action recognition on the same video corpus with the same cross-validation regimen and same partitioning into training and test sets, yielding classification accuracies between 31.25% and 52.34%. This indicates that one can read people’s minds better than state-of-the-art computer-vision methods can perform action recognition.

Andrei Barbu, Daniel P. Barrett, Wei Chen, Narayanaswamy Siddharth, Caiming Xiong, Jason J. Corso, Christiane D. Fellbaum, Catherine Hanson, Stephen José Hanson, Sébastien Hélie, Evguenia Malaia, Barak A. Pearlmutter, Jeffrey Mark Siskind, Thomas Michael Talavage, Ronnie B. Wilbur
Weakly Supervised Action Labeling in Videos under Ordering Constraints

We are given a set of video clips, each one annotated with an ordered list of actions, such as “walk” then “sit” then “answer phone” extracted from, for example, the associated text script. We seek to temporally localize the individual actions in each clip as well as to learn a discriminative classifier for each action. We formulate the problem as a weakly supervised temporal assignment with ordering constraints. Each video clip is divided into small time intervals and each time interval of each video clip is assigned one action label, while respecting the order in which the action labels appear in the given annotations. We show that the action label assignment can be determined together with learning a classifier for each action in a discriminative manner. We evaluate the proposed model on a new and challenging dataset of 937 video clips with a total of 787720 frames containing sequences of 16 different actions from 69 Hollywood movies.
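
The order-respecting assignment described above can be illustrated with a small dynamic program. This is only a sketch under stated assumptions, not the method of the paper, which determines the assignment jointly with the classifiers. Here the per-interval scores are assumed to be fixed and given, `scores[t][k]` is a hypothetical score for giving interval `t` the `k`-th action of the annotation, and every annotated action is assumed to occupy at least one interval.

```python
def ordered_assignment(scores):
    """Best label sequence that follows the annotated action order.

    scores[t][k] is an assumed, precomputed score for assigning the k-th
    annotated action to time interval t. Labels start at action 0, end at
    the last action, and may only stay or advance by one between intervals.
    """
    num_intervals, num_actions = len(scores), len(scores[0])
    if num_intervals < num_actions:
        raise ValueError("need at least one interval per annotated action")
    neg_inf = float("-inf")
    best = [[neg_inf] * num_actions for _ in range(num_intervals)]
    back = [[0] * num_actions for _ in range(num_intervals)]
    best[0][0] = scores[0][0]
    for t in range(1, num_intervals):
        for k in range(num_actions):
            stay = best[t - 1][k]
            advance = best[t - 1][k - 1] if k > 0 else neg_inf
            if advance > stay:
                prev, back[t][k] = advance, k - 1
            else:
                prev, back[t][k] = stay, k
            if prev > neg_inf:
                best[t][k] = prev + scores[t][k]
    # Trace the best path back from the last action at the last interval.
    labels = [num_actions - 1]
    for t in range(num_intervals - 1, 0, -1):
        labels.append(back[t][labels[-1]])
    labels.reverse()
    return labels, best[-1][-1]


# Hypothetical scores for three intervals and two annotated actions.
example_labels, example_total = ordered_assignment([[5, 0], [1, 4], [0, 3]])
```

With learned classifiers, such scores would change between iterations, which is why the paper treats assignment and classifier learning together rather than as a single pass like this one.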

Piotr Bojanowski, Rémi Lajugie, Francis Bach, Ivan Laptev, Jean Ponce, Cordelia Schmid, Josef Sivic
Active Random Forests: An Application to Autonomous Unfolding of Clothes

We present Active Random Forests, a novel framework to address active vision problems. State of the art focuses on best viewing parameters selection based on single view classifiers. We propose a multi-view classifier where the decision mechanism of optimally changing viewing parameters is inherent to the classification process. This has many advantages: a) the classifier exploits the entire set of captured images and does not simply aggregate probabilistically per view hypotheses; b) actions are based on learnt disambiguating features from all views and are optimally selected using the powerful voting scheme of Random Forests and c) the classifier can take into account the costs of actions. The proposed framework is applied to the task of autonomously unfolding clothes by a robot, addressing the problem of best viewpoint selection in classification, grasp point and pose estimation of garments. We show great performance improvement compared to state of the art methods.

Andreas Doumanoglou, Tae-Kyun Kim, Xiaowei Zhao, Sotiris Malassiotis
Model-Free Segmentation and Grasp Selection of Unknown Stacked Objects

We present a novel grasping approach for unknown stacked objects using RGB-D images of highly complex real-world scenes. Specifically, we propose a novel 3D segmentation algorithm to generate an efficient representation of the scene into segmented surfaces (known as surfels) and objects. Based on this representation, we next propose a novel grasp selection algorithm which generates potential grasp hypotheses and automatically selects the most appropriate grasp without requiring any prior information of the objects or the scene. We tested our algorithms in real-world scenarios using live video streams from Kinect and publicly available RGB-D object datasets. Our experimental results show that both our proposed segmentation and grasp selection algorithms consistently outperform the state-of-the-art methods.

Umar Asif, Mohammed Bennamoun, Ferdous Sohel

Segmentation and Saliency

Convexity Shape Prior for Segmentation

Convexity is known as an important cue in human vision. We propose shape convexity as a new high-order regularization constraint for binary image segmentation. In the context of discrete optimization, object convexity is represented as a sum of 3-clique potentials penalizing any 1-0-1 configuration on all straight lines. We show that these non-submodular interactions can be efficiently optimized using a trust region approach. While the quadratic number of all 3-cliques is prohibitively high, we designed a dynamic programming technique for evaluating and approximating these cliques in linear time. Our experiments demonstrate general usefulness of the proposed convexity constraint on synthetic and real image segmentation examples. Unlike standard second-order length regularization, our convexity prior is scale invariant, does not have shrinking bias, and is virtually parameter-free.
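
The 1-0-1 penalty mentioned above can be made concrete on a toy binary mask. The sketch below is a simplified illustration and not the paper's algorithm: it only looks at rows and columns (the paper considers all straight lines), and it counts violations exactly with a running sum rather than using the paper's dynamic programming approximation. The function names are assumptions.

```python
def count_101(line):
    """Count triples i < j < k with line[i] == 1, line[j] == 0, line[k] == 1.

    Runs in linear time: each background pixel contributes the number of
    object pixels before it times the number of object pixels after it.
    """
    total_ones = sum(line)
    ones_left = 0
    violations = 0
    for value in line:
        if value == 1:
            ones_left += 1
        else:
            violations += ones_left * (total_ones - ones_left)
    return violations


def convexity_penalty(mask):
    """Sum of 1-0-1 violations over all rows and columns of a 0/1 mask."""
    row_penalty = sum(count_101(row) for row in mask)
    col_penalty = sum(count_101(col) for col in zip(*mask))
    return row_penalty + col_penalty


# A filled square has no violations; a ring has one per row and column
# that crosses the hole.
square = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
ring = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]
square_penalty = convexity_penalty(square)
ring_penalty = convexity_penalty(ring)
```

A penalty of zero along every line direction corresponds to a convex shape, so restricting the check to two directions, as done here, only detects a subset of non-convex configurations.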

Lena Gorelick, Olga Veksler, Yuri Boykov, Claudia Nieuwenhuis
Pseudo-bound Optimization for Binary Energies

High-order and non-submodular pairwise energies are important for image segmentation, surface matching, deconvolution, tracking and other computer vision problems. Minimization of such energies is generally NP-hard. One standard approximation approach is to optimize an auxiliary function, an upper bound of the original energy across the entire solution space. This bound must be amenable to fast global solvers. Ideally, it should also closely approximate the original functional, but it is very difficult to find such upper bounds in practice.

Our main idea is to relax the upper-bound condition for an auxiliary function and to replace it with a family of pseudo-bounds, which can better approximate the original energy. We use fast polynomial parametric maxflow approach to explore all global minima for our family of submodular pseudo-bounds. The best solution is guaranteed to decrease the original energy because the family includes at least one auxiliary function. Our Pseudo-Bound Cuts algorithm improves the state-of-the-art in many applications: appearance entropy minimization, target distribution matching, curvature regularization, image deconvolution and interactive segmentation.

Meng Tang, Ismail Ben Ayed, Yuri Boykov
A Closer Look at Context: From Coxels to the Contextual Emergence of Object Saliency

Visual context is used in different forms for saliency computation. While its use in saliency models for fixation prediction is often reasoned, this is less so the case for approaches that aim to compute saliency at the object level. We argue that the types of context employed by these methods lack clear justification and may in fact interfere with the purpose of capturing the saliency of whole visual objects. In this paper we discuss the constraints that different types of context impose and suggest a new interpretation of visual context that allows the emergence of saliency for more complex, abstract, or multiple visual objects. Despite shying away from an explicit attempt to capture “objectness” (e.g., via segmentation), our results are qualitatively superior and quantitatively better than the state-of-the-art.

Rotem Mairon, Ohad Ben-Shahar
Geodesic Object Proposals

We present an approach for identifying a set of candidate objects in a given image. This set of candidates can be used for object recognition, segmentation, and other object-based image parsing tasks. To generate the proposals, we identify critical level sets in geodesic distance transforms computed for seeds placed in the image. The seeds are placed by specially trained classifiers that are optimized to discover objects. Experiments demonstrate that the presented approach achieves significantly higher accuracy than alternative approaches, at a fraction of the computational cost.

Philipp Krähenbühl, Vladlen Koltun
Microsoft COCO: Common Objects in Context

We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, C. Lawrence Zitnick

Poster Session 6

Efficient Joint Segmentation, Occlusion Labeling, Stereo and Flow Estimation

In this paper we propose a slanted plane model for jointly recovering an image segmentation, a dense depth estimate as well as boundary labels (such as occlusion boundaries) from a static scene given two frames of a stereo pair captured from a moving vehicle. Towards this goal we propose a new optimization algorithm for our SLIC-like objective which preserves connectedness of image segments and exploits shape regularization in the form of boundary length. We demonstrate the performance of our approach in the challenging stereo and flow KITTI benchmarks and show superior results to the state-of-the-art. Importantly, these results can be achieved an order of magnitude faster than competing approaches.

Koichiro Yamaguchi, David McAllester, Raquel Urtasun
Robust Bundle Adjustment Revisited

In this work we address robust estimation in the bundle adjustment procedure. Typically, bundle adjustment is not solved via a generic optimization algorithm, but usually cast as a nonlinear least-squares problem instance. In order to handle gross outliers in bundle adjustment the least-squares formulation must be robustified. We investigate several approaches to make least-squares objectives robust while retaining the least-squares nature to use existing efficient solvers. In particular, we highlight a method based on lifting a robust cost function into a higher dimensional representation, and show how the lifted formulation is efficiently implemented in a Gauss-Newton framework. In our experiments the proposed lifting-based approach almost always yields the best (i.e. lowest) objectives.

Christopher Zach
Accurate Intrinsic Calibration of Depth Camera with Cuboids

Because of its low precision, a consumer-grade depth sensor is often calibrated jointly with a color camera, and the joint calibration sometimes introduces undesired interactions. In this paper, we propose a novel method for high-accuracy intrinsic calibration of depth sensors using the depth camera alone. The traditional calibration rig, a checkerboard pattern, is replaced with a set of cuboids of known sizes, and the calibration objective function is based on the length, width, and height of the cuboids and the angles between their neighboring surfaces, which can be calculated directly and robustly from the depth map. We experimentally evaluate the accuracy of the calibrated depth camera by measuring the angles and sizes of cubic objects, and show empirically that the resulting calibration accuracy is higher than that of state-of-the-art calibration procedures, making commodity depth sensors applicable to more interesting application scenarios such as 3D measurement and shape modeling.

Bingwen Jin, Hao Lei, Weidong Geng
Statistical Pose Averaging with Non-isotropic and Incomplete Relative Measurements

In the last few years there has been a growing interest in optimization methods for averaging pose measurements between a set of cameras or objects (obtained, for instance, using epipolar geometry or pose estimation). Alas, existing approaches do not take into consideration that measurements might have different uncertainties (i.e., the noise might not be isotropically distributed), or that they might be incomplete (e.g., they might be known only up to a rotation around a fixed axis). We propose a Riemannian optimization framework which addresses these cases by using covariance matrices, and test it on synthetic and real data.

Roberto Tron, Kostas Daniilidis
A Pot of Gold: Rainbows as a Calibration Cue

Rainbows are a natural cue for calibrating outdoor imagery. While ephemeral, they provide unique calibration cues because they are centered exactly opposite the sun and have an outer radius of 42 degrees. In this work, we define the geometry of a rainbow and describe minimal sets of constraints that are sufficient for estimating camera calibration. We present both semi-automatic and fully automatic methods to calibrate a camera using an image of a rainbow. To demonstrate our methods, we have collected a large database of rainbow images and use these to evaluate calibration accuracy and to create an empirical model of rainbow appearance. We show how this model can be used to edit rainbow appearance in natural images and how rainbow geometry, in conjunction with a horizon line and capture time, provides an estimate of camera location. While we focus on rainbows, many of the geometric properties and algorithms we present also apply to other solar-refractive phenomena, such as parhelion, often called sun dogs, and the 22 degree solar halo.
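
The geometric constraint stated above, that a rainbow is centered on the point opposite the sun with an outer radius of 42 degrees, can be written as a simple angular test. The sketch below is an illustration under assumptions rather than the paper's calibration method: viewing rays and the sun direction are assumed to be 3D vectors in a common frame, `sun_dir` points from the camera toward the sun, and the tolerance is an arbitrary illustrative value.

```python
import math

RAINBOW_OUTER_RADIUS_DEG = 42.0  # outer radius quoted in the abstract


def angle_deg(u, v):
    """Angle in degrees between two non-zero 3D vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cosine))


def on_rainbow(ray, sun_dir, tol_deg=0.5):
    """True if a viewing ray lies on the 42 degree cone around the antisolar point.

    sun_dir is assumed to point from the camera toward the sun, so the
    antisolar direction is its negation. tol_deg is an illustrative tolerance.
    """
    antisolar = tuple(-s for s in sun_dir)
    return abs(angle_deg(ray, antisolar) - RAINBOW_OUTER_RADIUS_DEG) <= tol_deg


# Sun directly behind the camera along -z, so the antisolar point is along +z.
sun = (0.0, 0.0, -1.0)
theta = math.radians(42.0)
ray_on_bow = (math.sin(theta), 0.0, math.cos(theta))
ray_at_center = (0.0, 0.0, 1.0)
```

In a calibration setting the unknowns would be the camera parameters that map pixels to rays; each pixel observed on the rainbow then contributes one such angular constraint.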

Scott Workman, Radu Paul Mihail, Nathan Jacobs
Let There Be Color! Large-Scale Texturing of 3D Reconstructions

3D reconstruction pipelines using structure-from-motion and multi-view stereo techniques are today able to reconstruct impressive, large-scale geometry models from images but do not yield textured results. Current texture creation methods are unable to handle the complexity and scale of these models. We therefore present the first comprehensive texturing framework for large-scale, real-world 3D reconstructions. Our method addresses most challenges occurring in such reconstructions: the large number of input images, their drastically varying properties such as image scale, (out-of-focus) blur, exposure variation, and occluders (e.g., moving plants or pedestrians). Using the proposed technique, we are able to texture datasets that are several orders of magnitude larger and far more challenging than shown in related work.

Michael Waechter, Nils Moehrle, Michael Goesele