2014 | Book

Computer Vision – ECCV 2014

13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part III

Editors: David Fleet, Tomas Pajdla, Bernt Schiele, Tinne Tuytelaars

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science

About this book

The seven-volume set comprising LNCS volumes 8689-8695 constitutes the refereed proceedings of the 13th European Conference on Computer Vision, ECCV 2014, held in Zurich, Switzerland, in September 2014.

The 363 revised papers presented were carefully reviewed and selected from 1444 submissions. The papers are organized in topical sections on tracking and activity recognition; recognition; learning and inference; structure from motion and feature matching; computational photography and low-level vision; segmentation and saliency; context and 3D scenes; motion and 3D scene analysis; and poster sessions.

Table of Contents

Frontmatter

Poster Session 3

The 3D Jigsaw Puzzle: Mapping Large Indoor Spaces

We introduce an approach for analyzing annotated maps of a site, together with Internet photos, to reconstruct large indoor spaces of famous tourist sites. While current 3D reconstruction algorithms often produce a set of disconnected components (3D pieces) for indoor scenes due to scene coverage or matching failures, we make use of a provided map to lay out the 3D pieces in a global coordinate system. Our approach leverages position, orientation, and shape cues extracted from the map and 3D pieces and optimizes a global objective to recover the global layout of the pieces. We introduce a novel crowd flow cue that measures how people move across the site to recover 3D geometry orientation. We show compelling results on major tourist sites.

Ricardo Martin-Brualla, Yanling He, Bryan C. Russell, Steven M. Seitz
Pipe-Run Extraction and Reconstruction from Point Clouds

This paper presents automatic methods to extract and reconstruct industrial site pipe-runs from large-scale point clouds. We observe three key characteristics in this modeling problem, namely, primitives, similarities, and joints. While primitives capture the dominant cylindric shapes, similarities reveal the inter-primitive relations intrinsic to industrial structures because of human design and construction. Statistical analysis over point normals discovers primitive similarities from raw data to guide primitive fitting, increasing robustness to data noise and incompleteness. Finally, joints are automatically detected to close gaps and propagate connectivity information. The resulting model is more than a collection of 3D triangles, as it contains semantic labels for pipes as well as their connectivity.

Rongqi Qiu, Qian-Yi Zhou, Ulrich Neumann
Image-Based 4-d Reconstruction Using 3-d Change Detection

This paper describes an approach to reconstruct the complete history of a 3-d scene over time from imagery. The proposed approach avoids rebuilding 3-d models of the scene at each time instant. Instead, the approach employs an initial 3-d model which is continuously updated with changes in the environment to form a full 4-d representation. This updating scheme is enabled by a novel algorithm that infers 3-d changes with respect to the model at one time step from images taken at a subsequent time step. This algorithm can effectively detect changes even when the illumination conditions between image collections are significantly different. The performance of the proposed framework is demonstrated on four challenging datasets in terms of 4-d modeling accuracy as well as quantitative evaluation of 3-d change detection.

Ali Osman Ulusoy, Joseph L. Mundy
VocMatch: Efficient Multiview Correspondence for Structure from Motion

Feature matching between pairs of images is a main bottleneck of structure-from-motion computation from large, unordered image sets. We propose an efficient way to establish point correspondences between all pairs of images in a dataset, without having to test each individual pair. The principal message of this paper is that, given a sufficiently large visual vocabulary, feature matching can be cast as image indexing, subject to the additional constraints that index words must be rare in the database and unique in each image. We demonstrate that the proposed matching method, in conjunction with a standard inverted file, is 2-3 orders of magnitude faster than conventional pairwise matching. The proposed vocabulary-based matching has been integrated into a standard SfM pipeline, and delivers results similar to those of the conventional method in much less time.

Michal Havlena, Konrad Schindler
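
As an illustration of the indexing idea described in the abstract, the sketch below quantizes features to visual words, drops words that are repeated within an image or frequent in the database, and reads correspondences directly off an inverted file. This is a minimal toy version under our own assumptions (vocabulary ids given, a hand-set rarity threshold), not the authors' implementation.

```python
from collections import Counter, defaultdict

def vocmatch_style_matches(image_words, max_db_freq=2):
    """Toy vocabulary-based matching: image_words maps image_id -> list of
    visual word ids assigned to its features by a (very large) vocabulary.
    Words repeated inside an image are dropped (not unique); words present
    in more than max_db_freq images are dropped (not rare). Every surviving
    shared word is taken directly as a point correspondence."""
    inverted = defaultdict(list)               # word id -> images containing it
    for img, words in image_words.items():
        counts = Counter(words)
        for w, c in counts.items():
            if c == 1:                         # keep words unique in this image
                inverted[w].append(img)

    matches = defaultdict(int)                 # (img_a, img_b) -> #shared words
    for w, imgs in inverted.items():
        if len(imgs) > max_db_freq:            # word too common in the database
            continue
        for i, a in enumerate(imgs):
            for b in imgs[i + 1:]:
                matches[(a, b)] += 1
    return matches

# Toy usage: three images described by visual word ids.
print(vocmatch_style_matches({
    "A": [1, 5, 9, 9], "B": [1, 5, 7], "C": [2, 7, 8]}))

```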
Robust Global Translations with 1DSfM

We present a simple, effective method for solving structure from motion problems by averaging epipolar geometries. Based on recent successes in solving for global camera rotations using averaging schemes, we focus on the problem of solving for 3D camera translations given a network of noisy pairwise camera translation directions (or 3D point observations). To do this well, we have two main insights. First, we propose a method for removing outliers from problem instances by solving simpler low-dimensional subproblems, which we refer to as 1DSfM problems. Second, we present a simple, principled averaging scheme. We demonstrate this new method in the wild on Internet photo collections.

Kyle Wilson, Noah Snavely
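
A rough sketch of the 1D idea: projecting the pairwise direction estimates onto a single axis turns each edge into an ordering constraint, and edges that keep disagreeing with consistent orderings across many random axes are flagged as outliers. The greedy ordering heuristic and thresholds below are our own stand-ins for the paper's minimum-feedback-arc-set machinery.

```python
import numpy as np

def flag_outlier_directions(edges, dirs, n_axes=20, seed=0, bad_frac=0.5):
    """Toy 1DSfM-style outlier test (our simplification, not the paper's
    exact algorithm). edges: list of (i, j) camera pairs; dirs: unit 3D
    translation directions t_j - t_i, one per edge."""
    rng = np.random.default_rng(seed)
    n_cams = 1 + max(max(i, j) for i, j in edges)
    disagreements = np.zeros(len(edges))
    for _ in range(n_axes):
        v = rng.normal(size=3)
        v /= np.linalg.norm(v)
        w = dirs @ v                           # signed 1D constraint per edge
        # Greedy ordering heuristic: sort cameras by net outgoing weight.
        score = np.zeros(n_cams)
        for (i, j), wij in zip(edges, w):
            score[i] -= wij                    # wij > 0 pushes j after i
            score[j] += wij
        order = np.argsort(score)
        pos = np.empty(n_cams)
        pos[order] = np.arange(n_cams)         # position of each camera
        for k, ((i, j), wij) in enumerate(zip(edges, w)):
            if wij * (pos[j] - pos[i]) < 0:    # edge violates the ordering
                disagreements[k] += 1
    return disagreements / n_axes > bad_frac   # True = likely outlier

# Toy usage: 3 cameras on a line, plus one grossly wrong direction.
edges = [(0, 1), (1, 2), (0, 2), (0, 2)]
dirs = np.array([[1, 0, 0], [1, 0, 0], [1, 0, 0], [-1, 0, 0]], float)
print(flag_outlier_directions(edges, dirs))    # [False False False True]

```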
Comparing Salient Object Detection Results without Ground Truth

A wide variety of methods have been developed to approach the problem of salient object detection. The performance of these methods is often image-dependent. This paper aims to develop a method that is able to select, for an input image, the best salient object detection result from many results produced by different methods. This is a challenging task as different salient object detection results need to be compared without any ground truth. This paper addresses this challenge by designing a range of features to measure the quality of salient object detection results. These features are then used in various machine learning algorithms to rank different salient object detection results. Our experiments show that our method is promising for ranking salient object detection results and that it is also able to pick the best salient object detection result such that the overall salient object detection performance is better than that of each individual method.

Long Mai, Feng Liu
RGBD Salient Object Detection: A Benchmark and Algorithms

Although depth information plays an important role in the human vision system, it is not yet well explored in existing visual saliency computational models. In this work, we first introduce a large-scale RGBD image dataset to address the problem of data deficiency in current research on RGBD salient object detection. So that most existing RGB saliency models can still be applied in RGBD scenarios, we further provide a simple fusion framework that combines existing RGB-produced saliency with new depth-induced saliency; the former is estimated from existing RGB models while the latter is based on the proposed multi-contextual contrast model. Moreover, a specialized multi-stage RGBD model is also proposed which takes account of both depth and appearance cues derived from low-level feature contrast, mid-level region grouping and high-level prior enhancement. Extensive experiments show the effectiveness and superiority of our model, which can accurately locate the salient objects in RGBD images and assign consistent saliency values to the target objects.

Houwen Peng, Bing Li, Weihua Xiong, Weiming Hu, Rongrong Ji
Saliency Detection with Flash and No-flash Image Pairs

In this paper, we propose a new saliency detection method using a pair of flash and no-flash images. Our approach is inspired by two observations. First, only the foreground objects are significantly brightened by the flash as they are relatively nearer to the camera than the background. Second, the brightness variations introduced by the flash provide hints to surface orientation changes. Accordingly, the first observation is used to form a background prior that eliminates background distraction. The second observation provides a new orientation cue to compute surface orientation contrast. These photometric cues from the two observations are independent of visual attributes like color, and they provide new and robust distinctiveness to support salient object detection. The second observation further leads to the introduction of new spatial priors to constrain the regions rendered salient to be compact both in the image plane and in 3D space. We have constructed a new flash/no-flash image dataset. Experiments on this dataset show that the proposed method successfully identifies salient objects in various challenging scenes where state-of-the-art methods usually fail.

Shengfeng He, Rynson W. H. Lau
Alpha Matting of Motion-Blurred Objects in Bracket Sequence Images

We present a method that utilizes bracket sequence images to automatically extract the alpha matte of a motion-blurred object. This method makes use of a sharp, short-exposure snapshot in the sequence to help overcome major challenges in this task, including blurred object detection, spatially-variant object motion, and foreground/background color ambiguity. A key component of our matte estimation is the inference of approximate, spatially-varying motion of the blurred object with the help of the sharp snapshot, as this motion information provides important constraints on the aforementioned issues. In addition, we take advantage of other relationships that exist between a pair of consecutive short-exposure and long-exposure frames, such as common background areas and consistencies in foreground appearance. With this technique, we demonstrate successful alpha matting results on a variety of moving objects including non-rigid human motion.

Heesoo Myeong, Stephen Lin, Kyoung Mu Lee
An Active Patch Model for Real World Texture and Appearance Classification

This paper addresses the task of natural texture and appearance classification. Our goal is to develop a simple and intuitive method that performs at state of the art on datasets ranging from homogeneous texture (e.g., material texture), to less homogeneous texture (e.g., the fur of animals), and to inhomogeneous texture (the appearance patterns of vehicles). Our method uses a bag-of-words model where the features are based on a dictionary of active patches. Active patches are raw intensity patches which can undergo spatial transformations (e.g., rotation and scaling) and adjust themselves to best match the image regions. The dictionary of active patches is required to be compact and representative, in the sense that we can use it to approximately reconstruct the images that we want to classify. We propose a probabilistic model to quantify the quality of image reconstruction and design a greedy learning algorithm to obtain the dictionary. We classify images using the occurrence frequency of the active patches. Feature extraction is fast (about 100 ms per image) using the GPU. The experimental results show that our method improves the state of the art on a challenging material texture benchmark dataset (KTH-TIPS2). To test our method on less homogeneous or inhomogeneous images, we construct two new datasets consisting of appearance image patches of animals and vehicles cropped from the PASCAL VOC dataset. Our method outperforms competing methods on these datasets.

Junhua Mao, Jun Zhu, Alan L. Yuille
Material Classification Based on Training Data Synthesized Using a BTF Database

To cope with the richness in appearance variation found in real-world data under natural illumination, we propose to synthesize training data capturing these variations for material classification. Using synthetic training data created from separately acquired material and illumination characteristics allows us to overcome the problems of existing material databases, which include only a tiny fraction of the possible real-world conditions under controlled laboratory environments. However, it is essential to utilize a representation for material appearance which preserves fine details in the reflectance behavior of the digitized materials. As BRDFs are not sufficient for many materials due to the lack of modeling mesoscopic effects, we present a high-quality BTF database with 22,801 densely measured view-light configurations including surface geometry measurements for each of the 84 measured material samples. This representation is used to generate a database of synthesized images depicting the materials under different view-light conditions with their characteristic surface geometry using image-based lighting to simulate the complexity of real-world scenarios. We demonstrate that our synthesized data allows classifying materials under complex real-world scenarios.

Michael Weinmann, Juergen Gall, Reinhard Klein
Déjà Vu: Motion Prediction in Static Images

This paper proposes motion prediction in single still images by learning it from a set of videos. The underlying assumption is that similar motion is characterized by similar appearance. The proposed method learns local motion patterns given a specific appearance and adds the predicted motion in a number of applications. This work (i) introduces a novel method to predict motion from appearance in a single static image, (ii) to that end, extends the Structured Random Forest with regression derived from first principles, and (iii) shows the value of adding motion predictions in different tasks such as: weak frame-proposals containing unexpected events, action recognition, and motion saliency. Illustrative results indicate that motion prediction is not only feasible, but also provides valuable information for a number of applications.

Silvia L. Pintea, Jan C. van Gemert, Arnold W. M. Smeulders
Transfer Learning Based Visual Tracking with Gaussian Processes Regression

Modeling the target appearance is critical in many modern visual tracking algorithms. Many tracking-by-detection algorithms formulate the probability of target appearance as exponentially related to the confidence of a classifier output. By contrast, in this paper we directly analyze this probability using Gaussian Processes Regression (GPR), and introduce a latent variable to assist the tracking decision. Our observation model for regression is learnt in a semi-supervised fashion by using both labeled samples from previous frames and the unlabeled samples that are tracking candidates extracted from the current frame. We further divide the labeled samples into two categories: auxiliary samples collected from the very early frames and target samples from the most recent frames. The auxiliary samples are dynamically re-weighted by the regression, and the final tracking result is determined by fusing decisions from two individual trackers, one derived from the auxiliary samples and the other from the target samples. All these ingredients together enable our tracker, denoted as TGPR, to alleviate the drifting issue from various aspects. The effectiveness of TGPR is clearly demonstrated by its excellent performance on three recently proposed public benchmarks, involving 161 sequences in total, in comparison with state-of-the-art trackers.

Jin Gao, Haibin Ling, Weiming Hu, Junliang Xing
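
To give a sense of what the regression component does, here is a generic Gaussian Process regression sketch: labeled samples from earlier frames carry confidence labels, and the GP posterior mean scores the unlabeled candidates from the current frame. The RBF kernel and noise level are illustrative assumptions; the paper's semi-supervised model and sample re-weighting are not reproduced here.

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    """RBF kernel matrix between row-vector sample sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def gp_candidate_scores(X_lab, y_lab, X_cand, noise=1e-2):
    """Generic GP regression (not TGPR itself): posterior mean of the
    appearance confidence for each tracking candidate."""
    K = rbf(X_lab, X_lab) + noise * np.eye(len(X_lab))
    alpha = np.linalg.solve(K, y_lab)          # K^{-1} y
    return rbf(X_cand, X_lab) @ alpha          # k_*^T K^{-1} y

# Toy usage: 2D features; positives lie near the origin.
rng = np.random.default_rng(0)
X_lab = rng.normal(size=(20, 2))
y_lab = (np.linalg.norm(X_lab, axis=1) < 1).astype(float)
X_cand = np.array([[0.1, 0.0], [3.0, 3.0]])
print(gp_candidate_scores(X_lab, y_lab, X_cand))   # high score, low score

```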
Separable Spatiotemporal Priors for Convex Reconstruction of Time-Varying 3D Point Clouds

Reconstructing 3D motion data is highly under-constrained due to several common sources of data loss during measurement, such as projection, occlusion, or miscorrespondence. We present a statistical model of 3D motion data, based on the Kronecker structure of the spatiotemporal covariance of natural motion, as a prior on 3D motion. This prior is expressed as a matrix normal distribution, composed of separable and compact row and column covariances. We relate the marginals of the distribution to the shape, trajectory, and shape-trajectory models of prior art. When the marginal shape distribution is not available from training data, we show how placing a hierarchical prior over shapes results in a convex MAP solution in terms of the trace-norm. The matrix normal distribution, fit to a single sequence, outperforms state-of-the-art methods at reconstructing 3D motion data in the presence of significant data loss, while providing covariance estimates of the imputed points.

Tomas Simon, Jack Valmadre, Iain Matthews, Yaser Sheikh
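
The Kronecker-structured prior can be evaluated without ever forming the full spatiotemporal covariance, using the standard identities vec(X)^T (V ⊗ U)^{-1} vec(X) = tr(V^{-1} X^T U^{-1} X) and det(V ⊗ U) = det(V)^m det(U)^n. The sketch below computes the matrix normal log-density that way; variable names are ours and the row/column covariances are toy examples, not fitted motion models.

```python
import numpy as np

def matrix_normal_logpdf(X, U, V):
    """log N(vec(X); 0, V kron U) for an m x n matrix X, with m x m row
    covariance U and n x n column covariance V, evaluated via Kronecker
    identities instead of the mn x mn covariance."""
    m, n = X.shape
    U_inv_X = np.linalg.solve(U, X)
    quad = np.trace(np.linalg.solve(V, X.T) @ U_inv_X)  # tr(V^-1 X^T U^-1 X)
    _, logdet_U = np.linalg.slogdet(U)
    _, logdet_V = np.linalg.slogdet(V)
    logdet = m * logdet_V + n * logdet_U                # log det(V kron U)
    return -0.5 * (quad + logdet + m * n * np.log(2 * np.pi))

# Toy check against the explicit Kronecker construction.
rng = np.random.default_rng(1)
m, n = 3, 4
A, B = rng.normal(size=(m, m)), rng.normal(size=(n, n))
U, V = A @ A.T + m * np.eye(m), B @ B.T + n * np.eye(n)
X = rng.normal(size=(m, n))
Sigma = np.kron(V, U)                  # covariance of vec(X), column-stacked
x = X.flatten(order="F")
ref = -0.5 * (x @ np.linalg.solve(Sigma, x)
              + np.linalg.slogdet(Sigma)[1] + m * n * np.log(2 * np.pi))
print(np.isclose(matrix_normal_logpdf(X, U, V), ref))   # True

```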
Highly Overparameterized Optical Flow Using PatchMatch Belief Propagation

Motion in the image plane is ultimately a function of 3D motion in space. We propose to compute optical flow using what is ostensibly an extreme overparameterization: depth, surface normal, and frame-to-frame 3D rigid body motion at every pixel, giving a total of 9 DoF. The advantages of such an overparameterization are twofold: first, geometrically meaningful reasoning can be called upon in the optimization, reflecting possible 3D motion in the underlying scene; second, the ‘fronto-parallel’ assumption implicit in the use of traditional matching pixel windows is ameliorated because the parameterization determines a plane-induced homography at every pixel. We show that optimization over this high-dimensional, continuous state space can be carried out using an adaptation of the recently introduced PatchMatch Belief Propagation (PMBP) energy minimization algorithm, and that the resulting flow fields compare favorably to the state of the art on a number of small- and large-displacement datasets.

Michael Hornáček, Frederic Besse, Jan Kautz, Andrew Fitzgibbon, Carsten Rother
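
The plane-induced homography implied by the 9-DoF state at a pixel has the classical form H = K (R - t n^T / d) K^{-1}; the flow is then the displacement of the pixel under H. The sketch below computes this for one pixel under assumed intrinsics; it illustrates the parameterization only, not the PMBP optimization.

```python
import numpy as np

def flow_from_9dof(px, K, R, t, n, d):
    """Flow at pixel px implied by the per-pixel state (depth d, unit
    normal n, rigid motion R, t): warp by the plane-induced homography
    H = K (R - t n^T / d) K^{-1} and subtract the original position."""
    H = K @ (R - np.outer(t, n) / d) @ np.linalg.inv(K)
    p = np.array([px[0], px[1], 1.0])
    q = H @ p
    return q[:2] / q[2] - p[:2]

# Toy usage: sideways motion relative to a fronto-parallel plane.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R = np.eye(3)
t = np.array([0.1, 0.0, 0.0])     # relative camera/scene translation
n = np.array([0.0, 0.0, 1.0])     # plane normal (along the optical axis)
d = 2.0                           # plane depth along the normal
print(flow_from_9dof((320, 240), K, R, t, n, d))   # ~[-25, 0] pixels

```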
Local Estimation of High Velocity Optical Flow with Correlation Image Sensor

In this article, we address the problem of estimating high-velocity optical flow. When images are captured by conventional image sensors, optical flow estimation is ill-posed if the temporal constancy of image brightness is the only valid assumption. When images are captured by correlation image sensors, however, the problem becomes well-posed under a certain condition, and the unique optical flow can be estimated locally at each pixel in each single frame. That condition, though, is not satisfied when the flow velocity is high. We therefore propose a method that estimates the normal component of high-velocity optical flow using only the local image values in each single frame. The equation used for estimating the normal velocity is derived theoretically, and the condition under which it holds is also identified.

Hidekata Hontani, Go Oishi, Tomohiro Kitagawa
Rank Minimization with Structured Data Patterns

The problem of finding a low-rank approximation of a given measurement matrix is of key interest in computer vision. If all the elements of the measurement matrix are available, the problem can be solved using factorization. However, in the case of missing data no satisfactory solution exists. Recent approaches replace the rank term with the weaker (but convex) nuclear norm. In this paper we show that this heuristic works poorly on problems where the locations of the missing entries are highly correlated and structured, which is a common situation in many applications.

Our main contribution is the derivation of a much stronger convex relaxation that takes into account not only the rank function but also the data. We propose an algorithm which uses this relaxation to solve the rank approximation problem on matrices where the given measurements can be organized into overlapping blocks without missing data. The algorithm is computationally efficient and we have applied it to several classical problems including structure from motion and linear shape basis estimation. We demonstrate on both real and synthetic data that it outperforms state-of-the-art alternatives.

Viktor Larsson, Carl Olsson, Erik Bylow, Fredrik Kahl
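
For context, the nuclear norm baseline this paper argues against is typically minimized by singular value thresholding inside a proximal iteration. Below is a minimal matrix completion sketch of that standard convex relaxation (not the paper's stronger relaxation); step size and iteration count are arbitrary choices.

```python
import numpy as np

def svt(X, tau):
    """Proximal operator of tau * nuclear norm: shrink singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0)) @ Vt

def complete_nuclear(M, mask, lam=0.1, iters=300):
    """Proximal gradient for min_X 0.5*||mask*(X - M)||_F^2 + lam*||X||_*.
    The standard nuclear norm baseline, not the paper's relaxation."""
    X = np.zeros_like(M)
    for _ in range(iters):
        grad = mask * (X - M)          # gradient of the data term
        X = svt(X - grad, lam)         # gradient step + nuclear norm prox
    return X

# Toy usage: rank-1 matrix with random missing entries.
rng = np.random.default_rng(0)
M = np.outer(rng.normal(size=8), rng.normal(size=8))
mask = (rng.random(M.shape) < 0.6).astype(float)
X = complete_nuclear(mask * M, mask)
print(np.linalg.norm(mask * (X - M)) / np.linalg.norm(mask * M))

```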
Duality and the Continuous Graphical Model

Inspired by the Linear Programming based algorithms for discrete MRFs, we show how a corresponding infinite-dimensional dual for continuous-state MRFs can be approximated by a hierarchy of tractable relaxations. This hierarchy of dual programs includes as a special case the methods of Peng et al. [17] and Zach & Kohli [33]. We give approximation bounds for the tightness of our construction, study their relationship to discrete MRFs and give a generic optimization algorithm based on Nesterov’s dual-smoothing method [16].

Alexander Fix, Sameer Agarwal
Spectral Clustering with a Convex Regularizer on Millions of Images

This paper focuses on efficient algorithms for single and multi-view spectral clustering with a convex regularization term for very large scale image datasets. In computer vision applications, multiple views denote distinct image-derived feature representations that inform the clustering. Separately, the regularization encodes high level advice such as tags or user interaction in identifying similar objects across examples. Depending on the specific task, schemes to exploit such information may lead to a smooth or non-smooth regularization function. We present stochastic gradient descent methods for optimizing spectral clustering objectives with such convex regularizers for datasets with up to a hundred million examples. We prove that under mild conditions the local convergence rate is $O(1/\sqrt{T})$ where $T$ is the number of iterations; further, our analysis shows that the convergence improves linearly by increasing the number of threads. We give extensive experimental results on a range of vision datasets demonstrating the algorithm’s empirical behavior.

Maxwell D. Collins, Ji Liu, Jia Xu, Lopamudra Mukherjee, Vikas Singh
Riemannian Sparse Coding for Positive Definite Matrices

Inspired by the great success of sparse coding for vector valued data, our goal is to represent symmetric positive definite (SPD) data matrices as sparse linear combinations of atoms from a dictionary, where each atom itself is an SPD matrix. Since SPD matrices follow a non-Euclidean (in fact a Riemannian) geometry, existing sparse coding techniques for Euclidean data cannot be directly extended. Prior works have approached this problem by defining a sparse coding loss function using either extrinsic similarity measures (such as the log-Euclidean distance) or kernelized variants of statistical measures (such as the Stein divergence, Jeffrey’s divergence, etc.). In contrast, we propose to use the intrinsic Riemannian distance on the manifold of SPD matrices. Our main contribution is a novel mathematical model for sparse coding of SPD matrices; we also present a computationally simple algorithm for optimizing our model. Experiments on several computer vision datasets showcase superior classification and retrieval performance compared with state-of-the-art approaches.

Anoop Cherian, Suvrit Sra
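
The intrinsic distance referred to above is the affine-invariant Riemannian metric on SPD matrices, d(X, Y) = ||logm(X^{-1/2} Y X^{-1/2})||_F. A small sketch of that distance using scipy's matrix functions follows; the sparse coding objective built on top of it is not reproduced.

```python
import numpy as np
from scipy.linalg import sqrtm, logm

def airm_distance(X, Y):
    """Affine-invariant Riemannian distance between SPD matrices:
    d(X, Y) = || logm(X^{-1/2} Y X^{-1/2}) ||_F."""
    X_isqrt = np.linalg.inv(sqrtm(X))
    M = X_isqrt @ Y @ X_isqrt
    M = (M + M.T) / 2                        # symmetrize numerical noise
    return np.linalg.norm(logm(M).real, "fro")

def rand_spd(rng, n):
    B = rng.normal(size=(n, n))
    return B @ B.T + n * np.eye(n)

# Toy usage: the distance is invariant to congruence X -> A X A^T.
rng = np.random.default_rng(0)
X, Y = rand_spd(rng, 4), rand_spd(rng, 4)
A = rng.normal(size=(4, 4))
print(airm_distance(X, Y))
print(airm_distance(A @ X @ A.T, A @ Y @ A.T))   # same value

```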
Robust Sparse Coding and Compressed Sensing with the Difference Map

In compressed sensing, we wish to reconstruct a sparse signal x from observed data y. In sparse coding, on the other hand, we wish to find a representation of an observed signal y as a sparse linear combination, with coefficients x, of elements from an overcomplete dictionary. While many algorithms are competitive at both problems when x is very sparse, it can be challenging to recover x when it is less sparse. We present the Difference Map, which excels at sparse recovery when sparseness is lower. The Difference Map outperforms the state of the art in reconstruction from random measurements and in natural image reconstruction via sparse coding.

Will Landecker, Rick Chartrand, Simon DeDeo
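
The Difference Map alternates two constraint projections, here sparsity and data consistency. A minimal beta = 1 variant for compressed sensing is sketched below as our own toy instantiation, not the paper's tuned algorithm: P1 keeps the k largest-magnitude coefficients, P2 projects onto the affine set {x : Ax = y}.

```python
import numpy as np

def dm_recover(A, y, k, iters=500, seed=0):
    """Difference Map (beta = 1 variant) for compressed sensing: find a
    k-sparse x with A x = y. A toy sketch for illustration only."""
    Ap = np.linalg.pinv(A)

    def P1(x):                        # sparsity constraint
        z = np.zeros_like(x)
        idx = np.argsort(np.abs(x))[-k:]
        z[idx] = x[idx]
        return z

    def P2(x):                        # data-consistency constraint
        return x + Ap @ (y - A @ x)

    x = np.random.default_rng(seed).normal(size=A.shape[1])
    for _ in range(iters):
        p2 = P2(x)
        sol = P1(2 * p2 - x)          # current solution estimate
        x = x + sol - p2              # difference map update
    return sol

# Toy usage: recover a 5-sparse signal from 40 random measurements.
rng = np.random.default_rng(1)
A = rng.normal(size=(40, 100)) / np.sqrt(40)
x_true = np.zeros(100)
x_true[rng.choice(100, 5, False)] = rng.normal(size=5)
x_hat = dm_recover(A, A @ x_true, k=5)
print(np.linalg.norm(x_hat - x_true))   # small residual on success

```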
Object Co-detection via Efficient Inference in a Fully-Connected CRF

Object detection has seen a surge of interest in recent years, which has led to increasingly effective techniques. These techniques, however, still mostly perform detection based on local evidence in the input image. While some progress has been made towards exploiting scene context, the resulting methods typically only consider a single image at a time. Intuitively, however, the information contained jointly in multiple images should help overcome phenomena such as occlusion and poor resolution. In this paper, we address the co-detection problem, which aims to leverage this collective power to achieve object detection simultaneously in all the images of a set. To this end, we formulate object co-detection as inference in a fully-connected CRF whose edges model the similarity between object candidates. We then learn a similarity function that allows us to efficiently perform inference in this fully-connected graph, even in the presence of many object candidates. This is in contrast with existing co-detection techniques that rely on exhaustive or greedy search, and thus do not scale well. Our experiments demonstrate the benefits of our approach on several co-detection datasets.

Zeeshan Hayder, Mathieu Salzmann, Xuming He
Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. By removing the fixed-size limitation, we can improve all CNN-based image classification methods in general. Our SPP-net achieves state-of-the-art accuracy on the datasets of ImageNet 2012, Pascal VOC 2007, and Caltech101.

The power of SPP-net is more significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method computes convolutional features 30-170× faster than the recent leading method R-CNN (and 24-64× faster overall), while achieving better or comparable accuracy on Pascal VOC 2007.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
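
The pooling layer itself is easy to sketch: given a convolutional feature map of any spatial size, max-pool it over grids of, say, 1x1, 2x2 and 4x4 bins and concatenate, giving a (1+4+16) x channels vector regardless of input size. A numpy illustration under those assumed pyramid levels:

```python
import numpy as np

def spatial_pyramid_pool(fmap, levels=(1, 2, 4)):
    """Max-pool a C x H x W feature map over an l x l grid for each
    pyramid level l and concatenate, yielding a fixed-length vector of
    size C * sum(l*l) regardless of H and W."""
    C, H, W = fmap.shape
    out = []
    for l in levels:
        # Bin edges chosen so the l bins exactly cover the whole map.
        hs = np.linspace(0, H, l + 1).astype(int)
        ws = np.linspace(0, W, l + 1).astype(int)
        for i in range(l):
            for j in range(l):
                cell = fmap[:, hs[i]:hs[i + 1], ws[j]:ws[j + 1]]
                out.append(cell.max(axis=(1, 2)))   # C values per bin
    return np.concatenate(out)

# Two inputs of different sizes map to the same output length.
rng = np.random.default_rng(0)
print(spatial_pyramid_pool(rng.random((256, 13, 13))).shape)   # (5376,)
print(spatial_pyramid_pool(rng.random((256, 24, 17))).shape)   # (5376,)

```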
Context as Supervisory Signal: Discovering Objects with Predictable Context

This paper addresses the well-established problem of unsupervised object discovery with a novel method inspired by weakly-supervised approaches. In particular, the ability of an object patch to predict the rest of the object (its context) is used as supervisory signal to help discover visually consistent object clusters. The main contributions of this work are: 1) framing unsupervised clustering as a leave-one-out context prediction task; 2) evaluating the quality of context prediction by statistical hypothesis testing between thing and stuff appearance models; and 3) an iterative region prediction and context alignment approach that gradually discovers a visual object cluster together with a segmentation mask and fine-grained correspondences. The proposed method outperforms previous unsupervised as well as weakly-supervised object discovery approaches, and is shown to provide correspondences detailed enough to transfer keypoint annotations.

Carl Doersch, Abhinav Gupta, Alexei A. Efros
Learning to Hash with Partial Tags: Exploring Correlation between Tags and Hashing Bits for Large Scale Image Retrieval

Similarity search is an important technique in many large-scale vision applications. Hashing approaches have become popular for similarity search due to their computational and memory efficiency. Recently, it has been shown that hashing quality can be improved by incorporating supervised information, e.g. semantic tags/labels, into hashing function learning. However, tag information is not fully exploited in existing unsupervised and supervised hashing methods, especially when only partial tags are available. This paper proposes a novel semi-supervised tag hashing (SSTH) approach that fully incorporates tag information into learning an effective hashing function by exploring the correlation between tags and hashing bits. The hashing function is learned in a unified framework by simultaneously ensuring tag consistency and preserving the similarities between image examples. An iterative coordinate descent algorithm is designed as the optimization procedure. Furthermore, we improve the effectiveness of the hashing function through an orthogonal transformation that minimizes the quantization error. Extensive experiments on two large-scale image datasets demonstrate the superior performance of the proposed approach over several state-of-the-art hashing methods.

Qifan Wang, Luo Si, Dan Zhang
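
The final refinement step mentioned above, reducing quantization error with an orthogonal transformation, can be sketched in the style of iterative quantization: alternate between binarizing the rotated embedding and solving an orthogonal Procrustes problem. This is a generic sketch of that idea, not SSTH's full tag-aware objective.

```python
import numpy as np

def learn_rotation(V, iters=50, seed=0):
    """Alternate minimization of ||B - V R||_F^2 over binary codes B in
    {-1, +1} and orthogonal R (ITQ-style sketch of quantization error
    reduction, not the SSTH objective). V: n x c real embedding."""
    rng = np.random.default_rng(seed)
    c = V.shape[1]
    R, _ = np.linalg.qr(rng.normal(size=(c, c)))   # random orthogonal init
    for _ in range(iters):
        B = np.sign(V @ R)                         # best codes for fixed R
        U, _, Wt = np.linalg.svd(B.T @ V)          # Procrustes: fix B, solve R
        R = (U @ Wt).T
    return R, np.sign(V @ R)

# Toy usage: 6-bit codes for 100 random embeddings.
rng = np.random.default_rng(1)
V = rng.normal(size=(100, 6))
R, B = learn_rotation(V)
print(np.linalg.norm(B - V @ R))    # quantization error after refinement

```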
Multi-class Open Set Recognition Using Probability of Inclusion

The perceived success of recent visual recognition approaches has largely been derived from their performance on classification tasks, where all possible classes are known at training time. But what about open set problems, where unknown classes appear at test time? Intuitively, if we could accurately model just the positive data for any known class without overfitting, we could reject the large set of unknown classes even under an assumption of incomplete class knowledge. In this paper, we formulate the problem as one of modeling positive training data at the decision boundary, where we can invoke the statistical extreme value theory. A new algorithm called the PI-SVM is introduced for estimating the unnormalized posterior probability of class inclusion.

Lalit P. Jain, Walter J. Scheirer, Terrance E. Boult
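
The extreme value step can be illustrated generically: fit a Weibull to the positive class's decision scores nearest the boundary, and convert a test score into a probability of inclusion via the fitted CDF. The tail fraction and the use of scipy's weibull_min here are our own assumptions, not the paper's exact estimator.

```python
import numpy as np
from scipy.stats import weibull_min

def fit_inclusion_model(pos_scores, tail_frac=0.3):
    """Fit a Weibull to the lowest positive decision scores (the extreme
    values near the boundary). Generic EVT sketch, not PI-SVM itself."""
    tail = np.sort(pos_scores)[: max(3, int(tail_frac * len(pos_scores)))]
    shift = tail.min() - 1e-6                  # Weibull support starts at 0
    c, _, scale = weibull_min.fit(tail - shift, floc=0)
    return c, scale, shift

def prob_inclusion(score, model):
    """P(inclusion): CDF of the fitted lower-tail model at the score;
    scores above the modeled tail saturate near 1."""
    c, scale, shift = model
    return float(weibull_min.cdf(score - shift, c, scale=scale))

# Toy usage: positive SVM scores centered at 2.
rng = np.random.default_rng(0)
pos_scores = rng.normal(2.0, 0.5, size=200)
m = fit_inclusion_model(pos_scores)
print(prob_inclusion(2.5, m), prob_inclusion(-1.0, m))   # ~1 vs ~0

```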
Sequential Max-Margin Event Detectors

Many applications in computer vision (e.g., games, human computer interaction) require a reliable and early detector of visual events. Existing event detection methods rely on one-versus-all or multi-class classifiers that do not scale well to online detection of a large number of events. This paper proposes Sequential Max-Margin Event Detectors (SMMED) to efficiently detect an event in the presence of a large number of event classes. SMMED sequentially discards classes until only one class is identified as the detected class. This approach has two main benefits w.r.t. standard approaches: (1) it provides an efficient solution for early detection of events in the presence of a large number of classes, and (2) it is computationally efficient because only a subset of likely classes is evaluated. The benefits of SMMED in comparison with existing approaches are illustrated on three databases using different modalities: MSR Daily Activity (3D depth videos), UCF101 (RGB videos) and the CMU Multi-Modal Action Detection (MAD) database (depth, RGB and skeleton). CMU-MAD was recorded to target the problem of event detection (not classification), and the data and labels are available at http://humansensing.cs.cmu.edu/mad/.

Dong Huang, Shitong Yao, Yi Wang, Fernando De La Torre
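
The sequential discarding strategy can be sketched independently of the learning: as frames arrive, each still-active class accumulates its detector score, and classes whose cumulative score falls below a per-class margin are eliminated until one survives. The thresholding rule below is a simplified stand-in for the learned max-margin structure.

```python
def smmed_style_detect(frame_scores, thresholds):
    """Sequentially discard classes as frames arrive (simplified sketch of
    the SMMED idea, with hand-set thresholds instead of learned ones).
    frame_scores: iterable over frames, each a dict class -> score."""
    cumulative = {c: 0.0 for c in thresholds}
    active = set(thresholds)
    for t, scores in enumerate(frame_scores, start=1):
        for c in list(active):
            cumulative[c] += scores[c]
            if cumulative[c] < t * thresholds[c]:   # falls below its margin
                active.discard(c)                   # never evaluated again
        if len(active) == 1:
            return active.pop(), t                  # detected class, frame
    return None, None

# Toy usage: three classes; 'wave' keeps scoring high.
frames = [{"wave": 0.9, "clap": 0.6, "jump": 0.2},
          {"wave": 0.8, "clap": 0.3, "jump": 0.1},
          {"wave": 0.9, "clap": 0.2, "jump": 0.4}]
print(smmed_style_detect(frames, {"wave": 0.5, "clap": 0.5, "jump": 0.5}))

```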
Which Looks Like Which: Exploring Inter-class Relationships in Fine-Grained Visual Categorization

Fine-grained visual categorization aims at classifying visual data at a subordinate level, e.g., identifying different species of birds. It is a highly challenging topic receiving significant research attention recently. Most existing works focused on the design of more discriminative feature representations to capture the subtle visual differences among categories. Very limited efforts were spent on the design of robust model learning algorithms. In this paper, we treat the training of each category classifier as a single learning task, and formulate a generic multiple task learning (MTL) framework to train multiple classifiers simultaneously. Different from the existing MTL methods, the proposed generic MTL algorithm enforces no structure assumptions and thus is more flexible in handling complex inter-class relationships. In particular, it is able to automatically discover both clusters of similar categories and outliers. We show that the objective of our generic MTL formulation can be solved using an iteratively reweighted ℓ2 method. Through extensive experimental validation, we demonstrate that our method outperforms several state-of-the-art approaches.

Jian Pu, Yu-Gang Jiang, Jun Wang, Xiangyang Xue
Object Detection and Viewpoint Estimation with Auto-masking Neural Network

Simultaneously detecting an object and determining its pose has become a popular research topic in recent years. Due to the large variances of the object appearance in images, it is critical to capture the discriminative object parts that can provide key information about the object pose. Recent part-based models have obtained state-of-the-art results for this task. However, such models either require manually defined object parts with heavy supervision or a complicated algorithm to find discriminative object parts. In this study, we have designed a novel deep architecture, called Auto-masking Neural Network (ANN), for object detection and viewpoint estimation. ANN can automatically learn to select the most discriminative object parts across different viewpoints from training images. We also propose a method of accurate continuous viewpoint estimation based on the output of ANN. Experimental results on related datasets show that ANN outperforms previous methods.

Linjie Yang, Jianzhuang Liu, Xiaoou Tang
Statistical and Spatial Consensus Collection for Detector Adaptation

The increasing interest in automatic adaptation of pedestrian detectors toward specific scenarios is motivated by the drop in performance of common detectors, especially in low-resolution video-surveillance images. Different works have recently been proposed for unsupervised adaptation. However, most of these works do not completely solve the drifting problem: initial false positive target samples used for training can lead the model to drift. We propose to transform the outlier rejection problem into a weak classifier selection approach. A large set of weak classifiers are trained with random subsets of unsupervised target data and their performance is measured on a labeled source dataset. We can then select the most accurate classifiers in order to build an ensemble of weakly dependent detectors for the target domain. The experimental results we obtained on two benchmarks show that our system outperforms other state-of-the-art pedestrian adaptation methods.

Enver Sangineto
Deep Learning of Scene-Specific Classifier for Pedestrian Detection

The performance of a detector depends heavily on its training dataset and drops significantly when the detector is applied to a new scene, due to the large variations between the source training dataset and the target scene. In order to bridge this appearance gap, we propose a deep model to automatically learn scene-specific features and visual patterns in static video surveillance without any manual labels from the target scene. It jointly learns a scene-specific classifier and the distribution of the target samples. Both tasks share multi-scale feature representations with both discriminative and representative power. We also propose a cluster layer in the deep model that utilizes the scene-specific visual patterns for pedestrian detection. Our specifically designed objective function not only incorporates the confidence scores of target training samples but also automatically weights the importance of source training samples by fitting the marginal distributions of target samples. It significantly improves the detection rates at 1 FPPI by 10% compared with the state-of-the-art domain adaptation methods on the MIT Traffic Dataset and CUHK Square Dataset.

Xingyu Zeng, Wanli Ouyang, Meng Wang, Xiaogang Wang
A Contour Completion Model for Augmenting Surface Reconstructions

The availability of commodity depth sensors such as Kinect has enabled development of methods which can densely reconstruct arbitrary scenes. While the results of these methods are accurate and visually appealing, they are quite often incomplete. This is either due to the fact that only part of the space was visible during the data capture process or due to the surfaces being occluded by other objects in the scene. In this paper, we address the problem of completing and refining such reconstructions. We propose a method for scene completion that can infer the layout of the complete room and the full extent of partially occluded objects. We propose a new probabilistic model, Contour Completion Random Fields, that allows us to complete the boundaries of occluded surfaces. We evaluate our method on synthetic and real world reconstructions of 3D scenes and show that it quantitatively and qualitatively outperforms standard methods. We created a large dataset of partial and complete reconstructions which we will make available to the community as a benchmark for the scene completion task. Finally, we demonstrate the practical utility of our algorithm via an augmented-reality application where objects interact with the completed reconstructions inferred by our method.

Nathan Silberman, Lior Shapira, Ran Gal, Pushmeet Kohli
Interactive Object Counting

Our objective is to count (and localize) object instances in an image interactively. We target the regime where individual object detectors do not work reliably due to crowding, overlap, or the size of the instances, and take the approach of estimating an object density.

Our main contribution is an interactive counting system, along with solutions for its main components. Thus, we develop a feature vocabulary that can be efficiently learnt on-the-fly as a user provides dot annotations – this enables densities to be generated in an interactive system. Furthermore, we show that object density can be estimated simply, accurately and efficiently using ridge regression – this matches the counting accuracy of the much more costly learning-to-count method. Finally, we propose two novel visualization methods for region counts that are efficient and effective – these enable integral count regions to be displayed to quickly determine annotation points for relevance feedback.

The interactive system is demonstrated on a variety of visual material, including photographs, microscopy and satellite images.

Carlos Arteta, Victor Lempitsky, J. Alison Noble, Andrew Zisserman
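
The ridge regression component is simple enough to state directly: each pixel gets a feature vector, the target is a Gaussian-smoothed dot-annotation density, and a single linear solve gives the density predictor; the count is the sum of the predicted density. A minimal sketch under assumed features and smoothing (the interactively learnt vocabulary features from the paper are not reproduced):

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fit_density_ridge(feats, dots, sigma=2.0, lam=1e-2):
    """Ridge regression from per-pixel features to a smoothed dot density.
    feats: H x W x D feature maps; dots: H x W binary dot annotations."""
    target = gaussian_filter(dots.astype(float), sigma)   # density map
    X = feats.reshape(-1, feats.shape[-1])
    y = target.ravel()
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

def predict_count(feats, w):
    density = feats.reshape(-1, feats.shape[-1]) @ w
    return density.sum()    # estimated count = integral of the density

# Toy usage: intensity blobs serve as both feature and object evidence.
rng = np.random.default_rng(0)
img = np.zeros((64, 64))
dots = np.zeros((64, 64))
for _ in range(12):
    r, c = rng.integers(4, 60, size=2)
    img[r, c] = 1.0
    dots[r, c] = 1.0
img = gaussian_filter(img, 1.5)
feats = np.stack([img, img ** 2, np.ones_like(img)], axis=-1)
w = fit_density_ridge(feats, dots)
print(predict_count(feats, w))     # close to 12 on this toy data

```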
Recognizing City Identity via Attribute Analysis of Geo-tagged Images

After hundreds of years of human settlement, each city has formed a distinct identity, distinguishing itself from other cities. In this work, we propose to characterize the identity of a city via an attribute analysis of 2 million geo-tagged images from 21 cities over 3 continents. First, we estimate the scene attributes of these images and use this representation to build a higher-level set of 7 city attributes, tailored to the form and function of cities. Then, we conduct the city identity recognition experiments on the geo-tagged images and identify images with salient city identity on each city attribute. Based on the misclassification rate of the city identity recognition, we analyze the visual similarity among different cities. Finally, we discuss the potential application of computer vision to urban planning.

Bolei Zhou, Liu Liu, Aude Oliva, Antonio Torralba
A Fast and Simple Algorithm for Producing Candidate Regions

This paper addresses the task of producing candidate regions for detecting objects (e.g., car, cat) and background regions (e.g., sky, water). We describe a simple and rapid algorithm which generates a set of candidate regions $\mathcal{C_R}$ by combining up to three "selected-segments". These are obtained by a hierarchical merging algorithm which seeks to identify segments corresponding to roughly homogeneous regions, followed by a selection stage which removes most of the segments, yielding a small subset of selected-segments $\mathcal{S}$. The hierarchical merging makes a novel use of the PageRank algorithm. The selection stage also uses a new criterion based on entropy gain with non-parametric estimation of the segments' entropy. We evaluate on a new labeling of the Pascal VOC 2010 set where all pixels are labeled with one of 57 class labels. We show that most of the 57 objects and background regions can be largely covered by three of the selected-segments. We present a detailed per-object comparison on the task of proposing candidate regions with several state-of-the-art methods. Our performance is comparable to the best performing method in terms of coverage but is simpler and faster, and needs to output only half the number of candidate regions, which is critical for a subsequent stage (e.g., classification).

Boyan Bonev, Alan L. Yuille
Closed-Form Approximate CRF Training for Scalable Image Segmentation

We present LS-CRF, a new method for training cyclic Conditional Random Fields (CRFs) from large datasets that is inspired by classical closed-form expressions for the maximum likelihood parameters of a generative graphical model with tree topology. Training a CRF with LS-CRF requires only solving a set of independent regression problems, each of which can be solved efficiently in closed form or by an iterative solver. This makes LS-CRF orders of magnitude faster than classical CRF training based on probabilistic inference, and at the same time more flexible and easier to implement than other approximate techniques, such as pseudolikelihood or piecewise training. We apply LS-CRF to the task of semantic image segmentation, showing that it achieves accuracy on par with other training techniques at higher speed, thereby allowing efficient CRF training from very large training sets. For example, training a linearly parameterized pairwise CRF on 150,000 images requires less than one hour on a modern workstation.

Alexander Kolesnikov, Matthieu Guillaumin, Vittorio Ferrari, Christoph H. Lampert
A Graph Theoretic Approach for Object Shape Representation in Compositional Hierarchies Using a Hybrid Generative-Descriptive Model

A graph theoretic approach is proposed for object shape representation in a hierarchical compositional architecture called Compositional Hierarchy of Parts (CHOP). In the proposed approach, vocabulary learning is performed using a hybrid generative-descriptive model. First, statistical relationships between parts are learned using a Minimum Conditional Entropy Clustering algorithm. Then, selection of descriptive parts is defined as a frequent subgraph discovery problem, and solved using a Minimum Description Length (MDL) principle. Finally, part compositions are constructed using learned statistical relationships between parts and their description lengths. Shape representation and computational complexity properties of the proposed approach and algorithms are examined using six benchmark two-dimensional shape image datasets. Experiments show that CHOP can employ part shareability and indexing mechanisms for fast inference of part compositions using learned shape vocabularies. Additionally, CHOP provides better shape retrieval performance than the state-of-the-art shape retrieval methods.

Umit Rusen Aktas, Mete Ozay, Aleš Leonardis, Jeremy L. Wyatt
Finding Approximate Convex Shapes in RGBD Images

We propose a novel method to find approximate convex 3D shapes from single RGBD images. Convex shapes are more general than cuboids, cylinders, cones and spheres. Many real-world objects are near-convex and every non-convex object can be represented using convex parts. By finding approximate convex shapes in RGBD images, we extract important structures of a scene. From a large set of candidates generated from over-segmented superpixels, we globally optimize the selection of these candidates so that they are mostly convex, intersect little, are few in number, and mostly cover the scene. The optimization is formulated as a two-stage linear optimization and efficiently solved using a branch and bound method which is guaranteed to give the global optimal solution. Our experiments on thousands of RGBD images show that our method is fast, robust against clutter and is more accurate than competing methods.

Hao Jiang
ShapeForest: Building Constrained Statistical Shape Models with Decision Trees

Constrained local models (CLM) are frequently used to locate points on deformable objects. They usually consist of feature response images, which define the local updates of object points, and a shape prior used to regularize the final shape. Due to the complex shape variation within an object class, this is a challenging problem. However, in many segmentation tasks a simpler object representation is available in the form of sparse landmarks which can be reliably detected from images. In this work we propose ShapeForest, a novel shape representation which is able to model complex shape variation, preserves local shape information and incorporates prior knowledge during shape space inference. Based on a sparse landmark representation associated with each shape, ShapeForest, trained using decision trees and geometric features, selects a subset of relevant shapes to construct an instance-specific parametric shape model. In this way, ShapeForest learns the association between the geometric features and shape variability. During testing, based on the estimated sparse landmark representation, a constrained shape space is constructed and used for shape initialization and regularization during the iterative shape refinement within the CLM framework. We demonstrate the effectiveness of our approach on a set of medical segmentation problems where our database contains complex morphological and pathological variations of several anatomical structures.

Saša Grbić, Joshua K. Y. Swee, Razvan Ionasec
Optimizing Ranking Measures for Compact Binary Code Learning

Hashing has proven a valuable tool for large-scale information retrieval. Despite much success, existing hashing methods optimize over simple objectives such as the reconstruction error or graph Laplacian related loss functions, instead of the performance evaluation criteria of interest—multivariate performance measures such as the AUC and NDCG. Here we present a general framework (termed StructHash) that allows one to directly optimize multivariate performance measures. The resulting optimization problem can involve exponentially or infinitely many variables and constraints, which is more challenging than standard structured output learning. To solve the StructHash optimization problem, we use a combination of column generation and cutting-plane techniques. We demonstrate the generality of StructHash by applying it to ranking prediction and image retrieval, and show that it outperforms a few state-of-the-art hashing methods.

Guosheng Lin, Chunhua Shen, Jianxin Wu
Exploiting Low-Rank Structure from Latent Domains for Domain Generalization

In this paper, we propose a new approach for domain generalization by exploiting the low-rank structure from multiple latent source domains. Motivated by the recent work on exemplar-SVMs, we aim to train a set of exemplar classifiers with each classifier learnt by using only one positive training sample and all negative training samples. While positive samples may come from multiple latent domains, for the positive samples within the same latent domain, their likelihoods from each exemplar classifier are expected to be similar to each other. Based on this assumption, we formulate a new optimization problem by introducing the nuclear-norm based regularizer on the likelihood matrix to the objective function of exemplar-SVMs. We further extend Domain Adaptation Machine (DAM) to learn an optimal target classifier for domain adaptation. The comprehensive experiments for object recognition and action recognition demonstrate the effectiveness of our approach for domain generalization and domain adaptation.

Zheng Xu, Wen Li, Li Niu, Dong Xu
Sparse Additive Subspace Clustering

In this paper, we introduce and investigate a sparse additive model for subspace clustering problems. Our approach, named SASC (Sparse Additive Subspace Clustering), is essentially a functional extension of the Sparse Subspace Clustering (SSC) of Elhamifar & Vidal [7] to the additive nonparametric setting. To make our model computationally tractable, we express SASC in terms of a finite set of basis functions, and thus the formulated model can be estimated via solving a sequence of grouped Lasso optimization problems. We provide theoretical guarantees on the subspace recovery performance of our model. Empirical results on synthetic and real data demonstrate the effectiveness of SASC for clustering noisy data points into their original subspaces.

Xiao-Tong Yuan, Ping Li
Boosting VLAD with Supervised Dictionary Learning and High-Order Statistics

Recent studies show that aggregating local descriptors into a super vector yields an effective representation for retrieval and classification tasks. A popular method along this line is the vector of locally aggregated descriptors (VLAD), which aggregates the residuals between descriptors and visual words. However, the original VLAD ignores high-order statistics of local descriptors and its dictionary may not be optimal for classification tasks. In this paper, we address these problems by utilizing high-order statistics of local descriptors and performing supervised dictionary learning. The main contributions are twofold. Firstly, we propose a high-order VLAD (H-VLAD) for visual recognition, which leverages two kinds of high-order statistics in the VLAD-like framework, namely diagonal covariance and skewness. These high-order statistics provide complementary information for VLAD and allow for efficient computation. Secondly, to further boost the performance of H-VLAD, we design a supervised dictionary learning algorithm to discriminatively refine the dictionary, which can also be extended to other super-vector-based encoding methods. We examine the effectiveness of our methods in image-based object categorization and video-based action recognition. Extensive experiments on the PASCAL VOC 2007, HMDB51, and UCF101 datasets show that our method achieves state-of-the-art performance on both tasks.

Xiaojiang Peng, Limin Wang, Yu Qiao, Qiang Peng
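
To make the encoding concrete: standard VLAD stacks per-cluster residual sums, and the high-order variant described above additionally stacks the per-cluster diagonal second and third central moments. A numpy sketch under a pre-trained k-means dictionary follows; the supervised dictionary refinement is not shown.

```python
import numpy as np

def h_vlad(descs, centers, eps=1e-8):
    """High-order VLAD sketch: per visual word, concatenate the residual
    sum (VLAD), the diagonal covariance, and the skewness of the assigned
    descriptors. descs: N x D, centers: K x D (e.g. from k-means)."""
    K, D = centers.shape
    assign = np.argmin(((descs[:, None] - centers[None]) ** 2).sum(-1), 1)
    blocks = []
    for k in range(K):
        Xk = descs[assign == k]
        if len(Xk) == 0:
            blocks.append(np.zeros(3 * D))
            continue
        res = Xk - centers[k]
        mu = res.mean(0)
        var = res.var(0)                                 # diagonal covariance
        skew = ((res - mu) ** 3).mean(0) / (var + eps) ** 1.5
        blocks.append(np.concatenate([res.sum(0), var, skew]))
    v = np.concatenate(blocks)
    v = np.sign(v) * np.sqrt(np.abs(v))                  # power normalization
    return v / (np.linalg.norm(v) + eps)                 # L2 normalization

# Toy usage: 500 local descriptors, 8 visual words -> 3*8*16 dims.
rng = np.random.default_rng(0)
print(h_vlad(rng.normal(size=(500, 16)), rng.normal(size=(8, 16))).shape)

```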
Recognizing Complex Events in Videos by Learning Key Static-Dynamic Evidences

Complex events consist of various human interactions with different objects in diverse environments. The evidences needed to recognize events may occur in short time periods with variable lengths and can happen anywhere in a video. This fact prevents conventional machine learning algorithms from effectively recognizing the events. In this paper, we propose a novel method that can automatically identify the key evidences in videos for detecting complex events. Both static instances (objects) and dynamic instances (actions) are considered by sampling frames and temporal segments respectively. To compare the characteristic power of heterogeneous instances, we embed static and dynamic instances into a multiple instance learning framework via instance similarity measures, and cast the problem as an Evidence Selective Ranking (ESR) process. We impose an ℓ1 norm to select key evidences while using the Infinite Push Loss Function to enforce positive videos to have higher detection scores than negative videos. The Alternating Direction Method of Multipliers (ADMM) algorithm is used to solve the optimization problem. Experiments on large-scale video datasets show that our method can improve the detection accuracy while providing the unique capability of discovering key evidences of each complex event.

Kuan-Ting Lai, Dong Liu, Ming-Syan Chen, Shih-Fu Chang
A Hierarchical Representation for Future Action Prediction

We consider inferring the future actions of people from a still image or a short video clip. Predicting future actions before they are actually executed is a critical ingredient for enabling us to effectively interact with other humans on a daily basis. However, the challenges are twofold: first, we need to capture the subtle details inherent in human movements that may imply a future action; second, predictions usually need to be carried out as quickly as possible in the social world, when limited prior observations are available.

In this paper, we propose hierarchical movemes, a new representation to describe human movements at multiple levels of granularity, ranging from atomic movements (e.g. an open arm) to coarser movements that cover a larger temporal extent. We develop a max-margin learning framework for future action prediction, integrating a collection of moveme detectors in a hierarchical way. We validate our method on two publicly available datasets and show that it achieves very promising performance.

Tian Lan, Tsung-Chuan Chen, Silvio Savarese
Continuous Learning of Human Activity Models Using Deep Nets

Learning activity models continuously from streaming videos is an immensely important problem in video surveillance, video indexing, etc. Most of the research on human activity recognition has mainly focused on learning a static model, considering that all the training instances are labeled and present in advance, while in streaming videos new instances continuously arrive and are not labeled. In this work, we propose a continuous human activity learning framework from streaming videos by intricately tying together deep networks and active learning. This allows us to automatically select the most suitable features and to take advantage of incoming unlabeled instances to improve the existing model incrementally. Given the segmented activities from streaming videos, we learn features in an unsupervised manner using deep networks and use active learning to reduce the amount of manual labeling of classes. We conduct rigorous experiments on four challenging human activity datasets to demonstrate the effectiveness of our framework for learning human activity models continuously.

Mahmudul Hasan, Amit K. Roy-Chowdhury
DaMN – Discriminative and Mutually Nearest: Exploiting Pairwise Category Proximity for Video Action Recognition

We propose a method for learning discriminative category-level features and demonstrate state-of-the-art results on large-scale action recognition in video. The key observation is that one-vs-rest classifiers, which are ubiquitously employed for this task, face challenges in separating very similar categories (such as running vs. jogging). Our proposed method automatically identifies such pairs of categories using a criterion of mutual pairwise proximity in the (kernelized) feature space, using a category-level similarity matrix where each entry corresponds to the one-vs-one SVM margin for pairs of categories. We then exploit the observation that while splitting such “Siamese Twin” categories may be difficult, separating them from the remaining categories in a two-vs-rest framework is not. This enables us to augment one-vs-rest classifiers with a judicious selection of “two-vs-rest” classifier outputs, formed from such discriminative and mutually nearest (DaMN) pairs. By combining one-vs-rest and two-vs-rest features in a principled probabilistic manner, we achieve state-of-the-art results on the UCF101 and HMDB51 datasets. More importantly, the same DaMN features, when treated as a mid-level representation also outperform existing methods in knowledge transfer experiments, both cross-dataset from UCF101 to HMDB51 and to new categories with limited training data (one-shot and few-shot learning). Finally, we study the generality of the proposed approach by applying DaMN to other classification tasks; our experiments show that DaMN outperforms related approaches in direct comparisons, not only on video action recognition but also on their original image dataset tasks.

Rui Hou, Amir Roshan Zamir, Rahul Sukthankar, Mubarak Shah
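
The pair selection itself reduces to finding mutually nearest rows of the category similarity matrix. Given a matrix where larger entries mean more confusable categories (e.g., smaller one-vs-one margins mapped to similarities), the DaMN pairs are those where each category is the other's nearest neighbour; a sketch of that criterion only:

```python
import numpy as np

def mutually_nearest_pairs(S):
    """Return index pairs (i, j) that are each other's most similar
    category under similarity matrix S (diagonal ignored)."""
    S = S.copy().astype(float)
    np.fill_diagonal(S, -np.inf)
    nn = S.argmax(axis=1)                    # nearest neighbour per category
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i]         # mutual, reported once

# Toy usage: categories 0/1 ('running'/'jogging') are highly confusable.
S = np.array([[0.0, 0.9, 0.1, 0.2],
              [0.9, 0.0, 0.2, 0.1],
              [0.1, 0.2, 0.0, 0.4],
              [0.2, 0.1, 0.4, 0.0]])
print(mutually_nearest_pairs(S))   # [(0, 1), (2, 3)]

```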
Spatio-temporal Object Detection Proposals

Spatio-temporal detection of actions and events in video is a challenging problem. Besides the difficulties related to recognition, a major challenge for detection in video is the size of the search space defined by spatio-temporal tubes formed by sequences of bounding boxes along the frames. Recently methods that generate unsupervised detection proposals have proven to be very effective for object detection in still images. These methods open the possibility to use strong but computationally expensive features since only a relatively small number of detection hypotheses need to be assessed. In this paper we make two contributions towards exploiting detection proposals for spatio-temporal detection problems. First, we extend a recent 2D object proposal method, to produce spatio-temporal proposals by a randomized supervoxel merging process. We introduce spatial, temporal, and spatio-temporal pairwise supervoxel features that are used to guide the merging process. Second, we propose a new efficient supervoxel method. We experimentally evaluate our detection proposals, in combination with our new supervoxel method as well as existing ones. This evaluation shows that our supervoxels lead to more accurate proposals when compared to using existing state-of-the-art supervoxel methods.

Dan Oneata, Jerome Revaud, Jakob Verbeek, Cordelia Schmid

Computational Photography and Low-Level Vision

Depth-of-Field and Coded Aperture Imaging on XSlit Lens

Recent coded aperture imaging systems have shown great success in scene reconstruction, extended depth-of-field, and light field imaging. So far, nearly all solutions have been built on top of commodity cameras equipped with a single spherical lens. In this paper, we explore coded aperture solutions on a special non-centric lens called the crossed-slit (XSlit) lens. An XSlit lens uses a relay of two orthogonal cylindrical lenses, each coupled with a slit-shaped aperture. Through ray geometry analysis, we first show that the XSlit lens produces a different, and potentially advantageous, depth-of-field compared with the regular spherical lens. We then present a coded aperture strategy that encodes each slit aperture individually, one with a broadband code and the other with a high depth discrepancy code, for scene recovery. Synthetic and real experiments validate our theory and demonstrate the advantages of XSlit coded aperture solutions over spherical-lens ones.

Jinwei Ye, Yu Ji, Wei Yang, Jingyi Yu
Refraction Wiggles for Measuring Fluid Depth and Velocity from Video

We present principled algorithms for measuring the velocity and 3D location of refractive fluids, such as hot air or gas, from natural videos with textured backgrounds. Our main observation is that intensity variations related to movements of refractive fluid elements, as observed by one or more video cameras, are consistent over small space-time volumes. We call these intensity variations “refraction wiggles”, and use them as features for tracking and stereo fusion to recover the fluid motion and depth from video sequences. We give algorithms for (1) measuring the (2D, projected) motion of refractive fluids in monocular videos, and (2) recovering the 3D position of points on the fluid from stereo cameras. Unlike pixel intensities, wiggles can be extremely subtle and cannot be measured with the same level of confidence at all pixels, depending on factors such as background texture and the physical properties of the fluid. We thus carefully model uncertainty in our algorithms for robust estimation of fluid motion and depth. We show results on controlled sequences, synthetic simulations, and natural videos. Unlike previous approaches for measuring refractive flow, our methods operate directly on videos captured with ordinary cameras, do not require auxiliary sensors, light sources, or designed backgrounds, and can correctly detect the motion and location of refractive fluids even when they are invisible to the naked eye.
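
The uncertainty modeling could plausibly enter as per-pixel weights in the motion estimate. The sketch below assumes a confidence-weighted least-squares (Lucas-Kanade style) solve, which is our illustrative formulation rather than the paper's exact algorithm.

```python
import numpy as np

def weighted_flow(Ix, Iy, It, conf):
    """Confidence-weighted least-squares motion estimate: solve
    argmin_v  sum_i conf_i * (Ix_i * u + Iy_i * v + It_i)^2
    so that unreliable pixels (weak background texture, faint wiggles)
    contribute little to the recovered motion."""
    w = conf.ravel()
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)   # N x 2 gradient matrix
    b = -It.ravel()
    AtWA = A.T @ (w[:, None] * A)
    AtWb = A.T @ (w * b)
    return np.linalg.solve(AtWA, AtWb)               # (u, v)

# Synthetic check: derivatives consistent with motion (0.5, -0.2).
rng = np.random.default_rng(1)
Ix, Iy = rng.normal(size=(2, 32, 32))
It = -(0.5 * Ix - 0.2 * Iy)
print(weighted_flow(Ix, Iy, It, conf=np.ones((32, 32))))   # ~ [0.5, -0.2]
```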

Tianfan Xue, Michael Rubinstein, Neal Wadhwa, Anat Levin, Fredo Durand, William T. Freeman
Blind Deblurring Using Internal Patch Recurrence

Recurrence of small image patches across different scales of a natural image has been previously used for solving ill-posed problems (e.g., super-resolution from a single image). In this paper we show how this multi-scale property can also be used for “blind deblurring”, namely, removal of an unknown blur from a blurry image. While patches repeat ‘as is’ across scales in a sharp natural image, this cross-scale recurrence significantly diminishes in blurry images. We exploit these deviations from ideal patch recurrence as a cue for recovering the underlying (unknown) blur kernel. More specifically, we look for the blur kernel k such that, if its effect is “undone” (if the blurry image is deconvolved with k), the patch similarity across scales of the image will be maximized. We report extensive experimental evaluations, which indicate that our approach compares favorably to state-of-the-art blind deblurring methods and, in particular, is more robust than them.
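
To make the cue concrete, here is a hedged sketch: deconvolve the blurry image with each candidate kernel (a simple Wiener filter stands in for whatever deconvolution the authors use) and keep the kernel that maximizes cross-scale patch recurrence. Patch size, search radius, and the sampling scheme are illustrative choices.

```python
import numpy as np
from scipy.ndimage import zoom

def wiener_deconv(img, k, eps=1e-2):
    """Deconvolve img with kernel k via a simple Wiener filter."""
    K = np.fft.fft2(k, s=img.shape)
    F = np.fft.fft2(img)
    return np.real(np.fft.ifft2(F * np.conj(K) / (np.abs(K) ** 2 + eps)))

def cross_scale_score(img, patch=5, scale=0.5, rad=5, n=100, seed=0):
    """Higher when patches of the downscaled image recur in the full image."""
    small = zoom(img, scale, order=1)
    rng = np.random.default_rng(seed)
    H, W = img.shape
    h, w = small.shape
    dists = []
    for _ in range(n):
        y = int(rng.integers(0, h - patch))
        x = int(rng.integers(0, w - patch))
        p = small[y:y + patch, x:x + patch]
        cy, cx = int(y / scale), int(x / scale)   # corresponding location
        best = np.inf                             # local nearest-neighbor search
        for yy in range(max(0, cy - rad), min(H - patch, cy + rad)):
            for xx in range(max(0, cx - rad), min(W - patch, cx + rad)):
                d = np.sum((img[yy:yy + patch, xx:xx + patch] - p) ** 2)
                best = min(best, d)
        dists.append(best)
    return -float(np.mean(dists))

def select_kernel(blurry, candidates):
    """Pick the candidate whose 'undoing' maximizes cross-scale recurrence."""
    return max(range(len(candidates)),
               key=lambda i: cross_scale_score(wiener_deconv(blurry, candidates[i])))

def gauss_kernel(sigma, size=9):
    ax = np.arange(size) - size // 2
    g = np.exp(-(ax[:, None] ** 2 + ax[None, :] ** 2) / (2 * sigma ** 2))
    return g / g.sum()

# Synthetic demo: blur a blocky image and score a few candidate kernels.
rng = np.random.default_rng(0)
sharp = np.kron(rng.random((8, 8)), np.ones((8, 8)))
blurry = np.real(np.fft.ifft2(
    np.fft.fft2(sharp) * np.fft.fft2(gauss_kernel(1.5), s=sharp.shape)))
kernels = [gauss_kernel(s) for s in (0.5, 1.5, 3.0)]
print(select_kernel(blurry, kernels))   # index of the best-scoring kernel
```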

Tomer Michaeli, Michal Irani
Crisp Boundary Detection Using Pointwise Mutual Information

Detecting boundaries between semantically meaningful objects in visual scenes is an important component of many vision algorithms. In this paper, we propose a novel method for detecting such boundaries based on a simple underlying principle: pixels belonging to the same object exhibit higher statistical dependencies than pixels belonging to different objects. We show how to derive an affinity measure based on this principle using pointwise mutual information, and we show that this measure is indeed a good predictor of whether or not two pixels reside on the same object. Using this affinity with spectral clustering, we can find object boundaries in the image – achieving state-of-the-art results on the BSDS500 dataset. Our method produces pixel-level accurate boundaries while requiring minimal feature engineering.
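
A minimal single-channel sketch of the underlying principle follows: estimate the joint distribution of neighboring pixel values and score value pairs by pointwise mutual information. The paper's actual affinity uses a smoothed, exponent-weighted PMI variant with richer features; this toy version only illustrates the statistic.

```python
import numpy as np

def pmi_affinity(img, bins=16, eps=1e-9):
    """PMI(a, b) = log P(a, b) / (P(a) P(b)) over neighboring pixel values.

    High PMI means the two values co-occur more often than chance, so
    the pixels likely lie on the same object; low PMI signals a boundary.
    """
    q = np.clip((img * bins).astype(int), 0, bins - 1)
    joint = np.zeros((bins, bins))
    # Accumulate co-occurrences of horizontally and vertically adjacent pixels.
    for a, b in [(q[:, :-1], q[:, 1:]), (q[:-1, :], q[1:, :])]:
        np.add.at(joint, (a.ravel(), b.ravel()), 1.0)
    joint = joint + joint.T                      # symmetrize
    joint /= joint.sum()
    marg = joint.sum(axis=1)
    return np.log((joint + eps) / (np.outer(marg, marg) + eps))

# Two flat regions separated by one vertical edge: same-value pairs are
# common (high PMI), cross-edge pairs are rare (low PMI, i.e. a boundary).
img = np.zeros((32, 32))
img[:, 16:] = 0.9
P = pmi_affinity(img)
print(P[0, 0] > P[0, 14])   # True: within-region beats cross-edge
```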

Phillip Isola, Daniel Zoran, Dilip Krishnan, Edward H. Adelson
Rolling Guidance Filter

Images contain important structures and edges at many scales. In contrast to the large body of research on edge-preserving filters, scale-aware local operations have seldom been addressed in a practical way, although they are similarly vital in image processing and computer vision. We propose a new framework to filter images with complete control of detail smoothing under a scale measure. It is based on a rolling guidance, implemented in an iterative manner that converges quickly. Our method is simple to implement, easy to understand, fully extensible to accommodate various data operations, and fast. Our implementation achieves real-time performance and produces artifact-free results in separating structures of different scales. The filter also exhibits several inspiring properties that differ from previous edge-preserving ones.
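
The described scheme is simple enough to sketch directly: Gaussian smoothing first removes small-scale structures, then iterated joint bilateral filtering against the rolling guidance recovers large-scale edges. This is a minimal single-channel version; parameter values are illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def joint_bilateral(img, guide, sigma_s=3.0, sigma_r=0.1, rad=6):
    """Joint/cross bilateral filter: smooth img with spatial weights and
    range weights computed from guide (single-channel, O(r^2) version)."""
    H, W = img.shape
    pad = np.pad(img, rad, mode='reflect')
    gpad = np.pad(guide, rad, mode='reflect')
    out = np.zeros_like(img)
    norm = np.zeros_like(img)
    for dy in range(-rad, rad + 1):
        for dx in range(-rad, rad + 1):
            w_s = np.exp(-(dy * dy + dx * dx) / (2 * sigma_s ** 2))
            shifted = pad[rad + dy: rad + dy + H, rad + dx: rad + dx + W]
            gshift = gpad[rad + dy: rad + dy + H, rad + dx: rad + dx + W]
            w = w_s * np.exp(-(gshift - guide) ** 2 / (2 * sigma_r ** 2))
            out += w * shifted
            norm += w
    return out / norm

def rolling_guidance(img, sigma_s=3.0, sigma_r=0.1, n_iter=4):
    """Rolling guidance filtering: small structures are removed by the
    initial Gaussian, then large-scale edges are iteratively recovered
    by joint bilateral filtering of the input against the current
    (rolling) guidance. Converges in a few iterations."""
    guide = gaussian_filter(img, sigma_s)        # step 1: kill small structures
    for _ in range(n_iter):                      # step 2: edge recovery
        guide = joint_bilateral(img, guide, sigma_s, sigma_r)
    return guide

# Demo: noise-like small-scale texture is removed, the large edge survives.
rng = np.random.default_rng(0)
img = np.zeros((64, 64))
img[:, 32:] = 1.0
img += 0.2 * rng.standard_normal(img.shape)
smooth = rolling_guidance(img)
```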

Qi Zhang, Xiaoyong Shen, Li Xu, Jiaya Jia
Physically Grounded Spatio-temporal Object Affordances

Objects in human environments support various functionalities which govern how people interact with their environments in order to perform tasks. In this work, we discuss how to represent and learn a functional understanding of an environment in terms of object affordances. Such an understanding is useful for many applications, such as activity detection and assistive robotics. Starting with a semantic notion of affordances, we present a generative model that takes a given environment and human intention into account, and grounds the affordances in the form of spatial locations on the object and temporal trajectories in the 3D environment. The probabilistic model also allows for uncertainties and variations in the grounded affordances. We apply our approach to RGB-D videos from the Cornell Activity Dataset, where we first show that we can successfully ground the affordances, and then show that learning such affordances improves performance in labeling tasks.

Hema S. Koppula, Ashutosh Saxena
Backmatter
Metadata
Title
Computer Vision – ECCV 2014
Editors
David Fleet
Tomas Pajdla
Bernt Schiele
Tinne Tuytelaars
Copyright Year
2014
Publisher
Springer International Publishing
Electronic ISBN
978-3-319-10578-9
Print ISBN
978-3-319-10577-2
DOI
https://doi.org/10.1007/978-3-319-10578-9
