2012 | Book

Computer Vision – ECCV 2012

12th European Conference on Computer Vision, Florence, Italy, October 7-13, 2012, Proceedings, Part II

Edited by: Andrew Fitzgibbon, Svetlana Lazebnik, Pietro Perona, Yoichi Sato, Cordelia Schmid

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science

About this book

The seven-volume set comprising LNCS volumes 7572-7578 constitutes the refereed proceedings of the 12th European Conference on Computer Vision, ECCV 2012, held in Florence, Italy, in October 2012. The 408 revised papers presented were carefully reviewed and selected from 1437 submissions. The papers are organized in topical sections on geometry, 2D and 3D shapes, 3D reconstruction, visual recognition and classification, visual features and image matching, visual monitoring: action and activities, models, optimisation, learning, visual tracking and image registration, photometry: lighting and colour, and image segmentation.

Table of Contents

Frontmatter

Poster Session 2

Object-Centric Spatial Pooling for Image Classification

Spatial pyramid matching (SPM) based pooling has been the dominant choice for state-of-the-art image classification systems. In contrast, we propose a novel object-centric spatial pooling (OCP) approach, following the intuition that knowing the location of the object of interest can be useful for image classification. OCP consists of two steps: (1) inferring the location of the objects, and (2) using the location information to pool foreground and background features separately to form the image-level representation. Step (1) is particularly challenging in a typical classification setting where precise object location annotations are not available during training. To address this challenge, we propose a framework that learns object detectors using only image-level class labels, or so-called weak labels. We validate our approach on the challenging PASCAL07 dataset. Our learned detectors are comparable in accuracy with state-of-the-art weakly supervised detection methods. More importantly, the resulting OCP approach significantly outperforms SPM-based pooling in image classification.

Olga Russakovsky, Yuanqing Lin, Kai Yu, Li Fei-Fei
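
To make the pooling step concrete, here is a minimal sketch of the foreground/background split that object-centric pooling performs once an object box has been inferred. The array shapes, the choice of max-pooling, and all names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def object_centric_pool(codes, box):
    """Pool coded local features separately inside and outside a box.

    codes: (H, W, K) array of per-location feature codes (e.g. sparse codes).
    box:   (x0, y0, x1, y1) inferred object location.
    Returns the concatenated foreground/background representation.
    """
    x0, y0, x1, y1 = box
    mask = np.zeros(codes.shape[:2], dtype=bool)
    mask[y0:y1, x0:x1] = True
    fg = codes[mask].max(axis=0)    # max-pool codes inside the box
    bg = codes[~mask].max(axis=0)   # max-pool codes outside the box
    return np.concatenate([fg, bg])

# Toy usage: an 8x8 grid of 16-dim codes, object in the centre.
rep = object_centric_pool(np.random.rand(8, 8, 16), (2, 2, 6, 6))
print(rep.shape)  # (32,)
```
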
Statistics of Patch Offsets for Image Completion

Image completion involves filling missing parts in images. In this paper we address this problem through the statistics of patch offsets. We observe that if we match similar patches in the image and obtain their offsets (relative positions), the statistics of these offsets are sparsely distributed. We further observe that a few dominant offsets provide reliable information for completing the image. With these offsets we fill the missing region by combining a stack of shifted images via optimization. A variety of experiments show that our method yields generally better results and is faster than existing state-of-the-art methods.

Kaiming He, Jian Sun
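
A toy illustration of the offset-statistics idea: match each patch to its most similar patch elsewhere in the image and histogram the relative displacements. This brute-force sketch stands in for the fast approximate nearest-neighbor search a real implementation would use, and it omits the rejection of trivially small offsets; all names are hypothetical.

```python
import numpy as np

def dominant_offsets(img, patch=8, stride=4, k=8):
    """Histogram offsets between similar patches; return the k dominant ones.

    img is a grayscale array; patches are compared by squared distance.
    """
    H, W = img.shape
    locs = [(y, x) for y in range(0, H - patch, stride)
                   for x in range(0, W - patch, stride)]
    feats = np.stack([img[y:y+patch, x:x+patch].ravel() for y, x in locs])
    counts = {}
    for i, f in enumerate(feats):
        d = ((feats - f) ** 2).sum(axis=1)
        d[i] = np.inf                      # exclude the trivial self-match
        j = int(np.argmin(d))
        off = (locs[j][0] - locs[i][0], locs[j][1] - locs[i][1])
        counts[off] = counts.get(off, 0) + 1
    return sorted(counts, key=counts.get, reverse=True)[:k]

print(dominant_offsets(np.random.rand(48, 48)))
```
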
Spectral Demons – Image Registration via Global Spectral Correspondence

Image registration is a building block for many applications in computer vision and medical imaging. However, the current methods are limited when large and highly non-local deformations are present. In this paper, we introduce a new direct feature matching technique for non-parametric image registration where efficient nearest-neighbor searches find global correspondences between intensity, spatial and geometric information. We exploit graph spectral representations that are invariant to isometry under complex deformations. Our direct feature matching technique is used within the established Demons framework for diffeomorphic image registration. Our method, called Spectral Demons, can capture very large, complex and highly non-local deformations between images. We evaluate the improvements of our method on 2D and 3D images and demonstrate substantial improvement over the conventional Demons algorithm for large deformations.

Herve Lombaert, Leo Grady, Xavier Pennec, Nicholas Ayache, Farida Cheriet
MatchMiner: Efficient Spanning Structure Mining in Large Image Collections

Many new computer vision applications are utilizing large-scale datasets of places derived from the many billions of photos on the Web. Such applications often require knowledge of the visual connectivity structure of these image collections—describing which images overlap or are otherwise related—and an important step in understanding this structure is to identify connected components of this underlying image graph. As the structure of this graph is often initially unknown, this problem can be posed as one of exploring the connectivity between images as quickly as possible, by intelligently selecting a subset of image pairs for feature matching and geometric verification, without having to test all O(n²) possible pairs. We propose a novel, scalable algorithm called MatchMiner that efficiently explores visual relations between images, incorporating ideas from relevance feedback to improve decision making over time, as well as a simple yet effective rank distance measure for detecting outlier images. Using these ideas, our algorithm automatically prioritizes image pairs that can potentially connect or contribute to large connected components, using an information-theoretic algorithm to decide which image pairs to test next. Our experimental results show that MatchMiner can efficiently find connected components in large image collections, significantly outperforming state-of-the-art image matching methods.

Yin Lou, Noah Snavely, Johannes Gehrke
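
The core bookkeeping behind such connectivity exploration can be sketched with a union-find structure: pairs already known to lie in the same component carry no information and are skipped, so the expensive verification budget is spent only on potentially connecting pairs. The random pair order below is a placeholder for MatchMiner's information-theoretic prioritization and rank-distance outlier handling.

```python
import itertools, random

class DSU:
    """Union-find for tracking connected components of the image graph."""
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, a):
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]  # path halving
            a = self.parent[a]
        return a
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb

def explore(n, verify, budget):
    """Grow components by spending the matching budget on informative pairs.

    verify(i, j) runs the expensive feature matching / geometric check.
    """
    dsu = DSU(n)
    pairs = list(itertools.combinations(range(n), 2))
    random.shuffle(pairs)                  # stand-in for learned prioritization
    for i, j in pairs:
        if budget == 0:
            break
        if dsu.find(i) == dsu.find(j):
            continue                       # no information gained: skip
        budget -= 1
        if verify(i, j):
            dsu.union(i, j)
    return dsu

# Toy usage: images "match" when they share the same hidden scene id.
scene = [0, 0, 1, 1, 0, 2]
dsu = explore(6, lambda i, j: scene[i] == scene[j], budget=10)
print({k: dsu.find(k) for k in range(6)})
```
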
V1-Inspired Features Induce a Weighted Margin in SVMs

Image representations derived from simplified models of the primary visual cortex (V1), such as HOG and SIFT, elicit good performance in a myriad of visual classification tasks including object recognition/detection, pedestrian detection and facial expression classification. A central question in the vision, learning and neuroscience communities regards why these architectures perform so well. In this paper, we offer a unique perspective on this question by subsuming the role of V1-inspired features directly within a linear support vector machine (SVM). We demonstrate that a specific class of such features in conjunction with a linear SVM can be reinterpreted as inducing a weighted margin on the Kronecker basis expansion of an image. This new viewpoint on the role of V1-inspired features allows us to answer fundamental questions on the uniqueness and redundancies of these features, and offer substantial improvements in terms of computational and storage efficiency.

Hilton Bristow, Simon Lucey
Unsupervised Discovery of Mid-Level Discriminative Patches

The goal of this paper is to discover a set of discriminative patches which can serve as a fully unsupervised mid-level visual representation. The desired patches need to satisfy two requirements: 1) to be representative, they need to occur frequently enough in the visual world; 2) to be discriminative, they need to be different enough from the rest of the visual world. The patches could correspond to parts, objects, “visual phrases”, etc. but are not restricted to be any one of them. We pose this as an unsupervised discriminative clustering problem on a huge dataset of image patches. We use an iterative procedure which alternates between clustering and training discriminative classifiers, while applying careful cross-validation at each step to prevent overfitting. The paper experimentally demonstrates the effectiveness of discriminative patches as an unsupervised mid-level visual representation, suggesting that it could be used in place of visual words for many tasks. Furthermore, discriminative patches can also be used in a supervised regime, such as scene classification, where they demonstrate state-of-the-art performance on the MIT Indoor-67 dataset.

Saurabh Singh, Abhinav Gupta, Alexei A. Efros
Self-similar Sketch

We introduce the self-similar sketch, a new method for the extraction of intermediate image features that combines three principles: detection of self-similarity structures, nonaccidental alignment, and instance-specific modelling. The method searches for self-similar image structures that form nonaccidental patterns, for example collinear arrangements. We demonstrate a simple implementation of this idea where self-similar structures are found by looking for SIFT descriptors that map to the same visual words in image-specific vocabularies. This results in a visual word map which is searched for elongated connected components. Finally, segments are fitted to these connected components, extracting linear image structures beyond the ones that can be captured by conventional edge detectors, as the latter implicitly assume a specific appearance for the edges (steps). The resulting collection of segments constitutes a “sketch” of the image. This is applied to the task of estimating vanishing points, horizon, and zenith in standard benchmark data, obtaining state-of-the-art results. We also propose a new vanishing point estimation algorithm based on recently introduced techniques for the continuous-discrete optimisation of energies arising from model selection priors.

Andrea Vedaldi, Andrew Zisserman
Depth Matters: Influence of Depth Cues on Visual Saliency

Most previous studies on visual saliency have only focused on static or dynamic 2D scenes. Since the human visual system has evolved predominantly in natural three dimensional environments, it is important to study whether and how depth information influences visual saliency. In this work, we first collect a large human eye fixation database compiled from a pool of 600 2D-vs-3D image pairs viewed by 80 subjects, where the depth information is directly provided by the Kinect camera and the eye tracking data are captured in both 2D and 3D free-viewing experiments. We then analyze the major discrepancies between 2D and 3D human fixation data of the same scenes, which are further abstracted and modeled as novel depth priors. Finally, we evaluate the performances of state-of-the-art saliency detection models over 3D images, and propose solutions to enhance their performances by integrating the depth priors.

Congyan Lang, Tam V. Nguyen, Harish Katti, Karthik Yadati, Mohan Kankanhalli, Shuicheng Yan
Quaternion-Based Spectral Saliency Detection for Eye Fixation Prediction

In recent years, several authors have reported that spectral saliency detection methods provide state-of-the-art performance in predicting human gaze in images (see, e.g., [1–3]). We systematically integrate and evaluate quaternion DCT- and FFT-based spectral saliency detection [3,4], weighted quaternion color space components [5], and the use of multiple resolutions [1]. Furthermore, we propose the use of the eigenaxes and eigenangles for spectral saliency models that are based on the quaternion Fourier transform. We demonstrate outstanding performance on the Bruce-Tsotsos (Toronto), Judd (MIT), and Kootstra-Schomacker eye-tracking datasets.

Boris Schauerte, Rainer Stiefelhagen
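
For intuition, here is a single-channel phase-spectrum saliency map of the kind these spectral methods generalize; the paper's quaternion FFT/DCT variants treat the color channels jointly. This simplified grayscale version is an assumption for illustration, not the evaluated model.

```python
import numpy as np

def spectral_saliency(img, sigma=3):
    """Phase-only spectral saliency for a grayscale image."""
    F = np.fft.fft2(img)
    phase_only = F / (np.abs(F) + 1e-8)      # keep phase, discard amplitude
    sal = np.abs(np.fft.ifft2(phase_only)) ** 2
    # Smooth with a separable box filter as a cheap Gaussian stand-in.
    k = np.ones(sigma) / sigma
    sal = np.apply_along_axis(lambda r: np.convolve(r, k, 'same'), 1, sal)
    sal = np.apply_along_axis(lambda c: np.convolve(c, k, 'same'), 0, sal)
    return sal / sal.max()

print(spectral_saliency(np.random.rand(64, 64)).shape)
```
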
Human Activities as Stochastic Kronecker Graphs

A human activity can be viewed as a space-time repetition of activity primitives. Both instances of the primitives and their repetition are stochastic. They can be modeled by a generative model-graph, where nodes correspond to the primitives, and the graph’s adjacency matrix encodes their affinities for probabilistic grouping into observable video features. When a video of the activity is represented by a graph capturing the space-time layout of video features, such a video graph can be viewed as probabilistically sampled from the activity’s model-graph. This sampling is formulated as a successive Kronecker multiplication of the model’s affinity matrix. The resulting Kronecker-power matrix is taken as a noisy permutation of the adjacency matrix of the video graph. The paper presents: 1) our model-graph; 2) a memory- and time-efficient, weakly supervised method for learning activity primitives and their affinities; and 3) an inference procedure aimed at finding the best expected correspondences between the primitives and observed video features. Our results demonstrate good scalability on UCF50, and superior performance to that of the state of the art on individual, structured, and collective activities of the UCF YouTube, Olympic, and Collective datasets.

Sinisa Todorovic
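
The central algebraic operation, successive Kronecker multiplication of the model's affinity matrix, is easy to state in code. The 2x2 initiator below is a toy assumption; the paper learns the primitives and their affinities from video.

```python
import numpy as np

def kronecker_power(A, k):
    """k-fold Kronecker power of an activity model-graph affinity matrix.

    An n x n initiator grows to n**k x n**k, modelling how a few activity
    primitives generate a large, self-similar video graph.
    """
    P = A.copy()
    for _ in range(k - 1):
        P = np.kron(P, A)
    return P

A = np.array([[0.9, 0.5],
              [0.5, 0.1]])           # toy 2-primitive affinity matrix
print(kronecker_power(A, 3).shape)   # (8, 8)
```
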
Facial Action Transfer with Personalized Bilinear Regression

Facial Action Transfer (FAT) has recently attracted much attention in computer vision due to its diverse applications in the movie industry, computer games, and privacy protection. The goal of FAT is to “clone” the facial actions from the videos of one person (source) to another person (target). In this paper, we will assume that we have a video of the source person but only one frontal image of the target person. Most successful methods for FAT require a training set with annotated correspondence between expressions of different subjects, sometimes including many images of the target subject. However, labeling expressions is time consuming and error prone (i.e., it is difficult to capture the same intensity of the expression across people). Moreover, in many applications it is not realistic to have many labeled images of the target. This paper proposes a method to learn a personalized facial model that can produce photo-realistic person-specific facial actions (e.g., synthesize wrinkles for smiling) from only a neutral image of the target person. More importantly, our learning method does not need an explicit correspondence of expressions across subjects. Experiments on the Cohn-Kanade and the RU-FACS databases show the effectiveness of our approach to generate video-realistic images of the target person driven by spontaneous facial actions of the source. Moreover, we illustrate applications of FAT to face de-identification.

Dong Huang, Fernando De La Torre
Point of Gaze Estimation through Corneal Surface Reflection in an Active Illumination Environment

Eye gaze tracking (EGT) is a common problem with many applications in various fields. While recent methods have achieved improvements in accuracy and usability, current techniques still share several limitations. A major issue is the need for external calibration between the gaze camera system and the scene, which commonly restricts application to static planar surfaces and leads to parallax errors. To overcome these issues, the paper proposes a novel scheme that uses the corneal imaging technique to directly analyze reflections from a scene illuminated with structured light. This comprises two major contributions: First, an analytic solution is developed for the forward projection problem to obtain the gaze reflection point (GRP), where light from the point of gaze (PoG) in the scene reflects at the corneal surface into an eye image. We also develop a method to compensate for the individual offset between the optical axis and the true visual axis. Second, introducing active coded illumination enables robust and accurate matching at the GRP to obtain the PoG in a scene image, which is the first use of this technique in EGT and corneal reflection analysis. For this purpose, we designed a special high-power IR LED-array projector. Experimental evaluation with a prototype system shows that the proposed scheme achieves considerable accuracy and successfully supports depth-varying environments.

Atsushi Nakazawa, Christian Nitschke
Order-Preserving Sparse Coding for Sequence Classification

In this paper, we investigate order-preserving sparse coding for classifying multi-dimensional sequence data. Such a problem is often tackled by first decomposing the input sequence into individual frames and extracting features, then performing sparse coding or other processing for each frame-based feature vector independently, and finally aggregating the individual responses to classify the input sequence. However, this heuristic approach ignores the underlying temporal order of the input sequence frames, which in turn results in suboptimal discriminative capability. In this work, we introduce a temporal-order-preserving regularizer which aims to preserve the temporal order of the reconstruction coefficients. An efficient Nesterov-type smooth approximation method is developed for optimization of the new regularization criterion, with guaranteed error bounds. Extensive experiments for time series classification on a synthetic dataset, several machine learning benchmarks, and a challenging real-world RGB-D human activity dataset, show that the proposed coding scheme is discriminative and robust, and it outperforms previous art for sequence classification.

Bingbing Ni, Pierre Moulin, Shuicheng Yan
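
As a rough sketch of the kind of objective being regularized, the code below adds a temporal penalty to a standard sparse coding loss. The specific penalty, an l1 norm on differences of consecutive codes, is a simplifying assumption; the paper's order-preserving regularizer and its Nesterov-type smooth approximation are more involved.

```python
import numpy as np

def objective(X, D, A, lam, gamma):
    """Sparse coding of a sequence with a temporal penalty (sketch).

    X: (d, T) frame features, D: (d, K) dictionary, A: (K, T) codes.
    The last term discourages codes of adjacent frames from diverging,
    a stand-in for the paper's order-preserving regularizer.
    """
    recon = 0.5 * np.linalg.norm(X - D @ A) ** 2
    sparse = lam * np.abs(A).sum()
    order = gamma * np.abs(np.diff(A, axis=1)).sum()
    return recon + sparse + order

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 30))
D = rng.normal(size=(16, 64))
A = rng.normal(size=(64, 30))
print(objective(X, D, A, lam=0.1, gamma=0.5))
```
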
Min-Space Integral Histogram

In this paper, we present a new approach for quickly computing the histograms of a set of axis-aligned rectangular regions. Although it is related to the well-known Integral Histogram (IH), our approach significantly outperforms it, both in terms of memory requirements and of response times. By preprocessing the region of interest (ROI), computing and storing a temporary histogram for each of its pixels, IH is effective only when a large number of histograms located in a small ROI need to be computed by the user. Unlike IH, our approach, called Min-Space Integral Histogram, only computes and stores those temporary histograms that are strictly necessary (fewer than 4 times the number of regions). Comparative tests highlight its efficiency, which can be up to 75 times faster than IH. In particular, we show that our approach is much less sensitive than IH to histogram quantization and to the size of the ROI.

Séverine Dubuisson, Christophe Gonzales
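
For reference, the baseline this paper improves on is the classic Integral Histogram, which precomputes a cumulative histogram per pixel so that any axis-aligned region histogram costs four lookups per bin. The sketch below shows that baseline and the (H+1, W+1, n_bins) table whose storage the Min-Space variant avoids; it is illustrative, not the authors' code.

```python
import numpy as np

def integral_histogram(labels, n_bins):
    """Classic integral histogram over quantized pixel values.

    ih[y, x, b] counts pixels of bin b in the rectangle [0, y) x [0, x).
    """
    H, W = labels.shape
    ih = np.zeros((H + 1, W + 1, n_bins))
    onehot = np.eye(n_bins)[labels]                  # (H, W, n_bins)
    ih[1:, 1:] = onehot.cumsum(axis=0).cumsum(axis=1)
    return ih

def region_hist(ih, y0, x0, y1, x1):
    """Histogram of [y0, y1) x [x0, x1): four lookups per bin."""
    return ih[y1, x1] - ih[y0, x1] - ih[y1, x0] + ih[y0, x0]

labels = np.random.randint(0, 4, (16, 16))
ih = integral_histogram(labels, 4)
print(region_hist(ih, 2, 2, 10, 10))   # histogram of an 8x8 region
```
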
On Learning Higher-Order Consistency Potentials for Multi-class Pixel Labeling

Pairwise Markov random fields are an effective framework for solving many pixel labeling problems in computer vision. However, their performance is limited by their inability to capture higher-order correlations. Recently proposed higher-order models are showing superior performance to their pairwise counterparts. In this paper, we derive two variants of the higher-order lower linear envelope model and show how to perform tractable move-making inference in these models. We propose a novel use of this model for encoding consistency constraints over large sets of pixels. Importantly, these pixel sets do not need to be contiguous. However, the consistency model has a large number of parameters that must be tuned for good performance. We exploit the structured SVM paradigm to learn optimal parameters and show some practical techniques to overcome huge computational requirements. We evaluate our model on the problems of image denoising and semantic segmentation.

Kyoungup Park, Stephen Gould
Sparse Coding and Dictionary Learning for Symmetric Positive Definite Matrices: A Kernel Approach

Recent advances suggest that a wide range of computer vision problems can be addressed more appropriately by considering non-Euclidean geometry. This paper tackles the problem of sparse coding and dictionary learning in the space of symmetric positive definite matrices, which form a Riemannian manifold. With the aid of the recently introduced Stein kernel (related to a symmetric version of Bregman matrix divergence), we propose to perform sparse coding by embedding Riemannian manifolds into reproducing kernel Hilbert spaces. This leads to a convex and kernel version of the Lasso problem, which can be solved efficiently. We furthermore propose an algorithm for learning a Riemannian dictionary (used for sparse coding), closely tied to the Stein kernel. Experiments on several classification tasks (face recognition, texture classification, person re-identification) show that the proposed sparse coding approach achieves notable improvements in discrimination accuracy, in comparison to state-of-the-art methods such as tensor sparse coding, Riemannian locality preserving projection, and symmetry-driven accumulation of local features.

Mehrtash T. Harandi, Conrad Sanderson, Richard Hartley, Brian C. Lovell
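
The Stein kernel at the heart of this approach has a compact closed form: k(X, Y) = exp(-beta * S(X, Y)) with the symmetric Stein divergence S(X, Y) = log det((X+Y)/2) - (1/2) log det(XY). A minimal sketch, noting that the kernel is positive definite only for suitable values of beta:

```python
import numpy as np

def stein_divergence(X, Y):
    """Symmetric Stein (S-)divergence between SPD matrices X and Y."""
    _, ld_mid = np.linalg.slogdet(0.5 * (X + Y))
    _, ld_x = np.linalg.slogdet(X)
    _, ld_y = np.linalg.slogdet(Y)
    return ld_mid - 0.5 * (ld_x + ld_y)

def stein_kernel(X, Y, beta=1.0):
    """k(X, Y) = exp(-beta * S(X, Y)); p.d. only for suitable beta."""
    return np.exp(-beta * stein_divergence(X, Y))

# Toy SPD matrices from random covariance estimates.
A = np.cov(np.random.randn(5, 50))
B = np.cov(np.random.randn(5, 50))
print(stein_kernel(A, B))
```
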
Learning Class-to-Image Distance via Large Margin and L1-Norm Regularization

Image-to-Class (I2C) distance has demonstrated its effectiveness for object recognition on several single-label datasets. However, for the multi-label problem, where an image may contain several regions belonging to different classes, this distance may not work well, since it cannot discriminate local features from different regions in the test image and all local features have to be counted in the I2C distance calculation. In this paper, we propose to use Class-to-Image (C2I) distance and show that this distance performs better than I2C distance for multi-label image classification. However, since the number of local features in a class is huge compared to that in an image, the calculation of C2I distance is much more expensive than that of I2C distance. Moreover, the label information of training images can be used to help select relevant local features for each class and further improve the recognition performance. Therefore, to make C2I distance faster and perform better, we propose an optimization algorithm using L1-norm regularization and a large margin constraint to learn the C2I distance, which not only reduces the number of local features in the class feature set, but also improves the performance of C2I distance due to the use of label information. Experiments on the MSRC, Pascal VOC and MirFlickr datasets show that our method can significantly speed up the C2I distance calculation while achieving better recognition performance than the original C2I distance and other related methods on multi-label datasets.

Zhengxiang Wang, Shenghua Gao, Liang-Tien Chia
Taxonomic Multi-class Prediction and Person Layout Using Efficient Structured Ranking

In computer vision, efficient multi-class classification is becoming a key problem as the field develops and the number of object classes to be identified increases. Often objects have some sort of structure, such as a taxonomy, in which the mis-classification score for object classes that are close by, in terms of tree distance within the taxonomy, should be less than for those far apart. This is an example of multi-class classification in which the loss function has a special structure. Another example in vision is the ubiquitous pictorial structure or parts-based model. In this case we would like the mis-classification score to be proportional to the number of parts misclassified.

It transpires both of these are examples of structured output ranking problems. However, so far no efficient large scale algorithm for this problem has been demonstrated. In this work we propose an algorithm for structured output ranking that can be trained in a time linear in the number of samples under a mild assumption common to many computer vision problems: that the loss function can be discretized into a small number of values.

We show the feasibility of structured ranking on these two core computer vision problems and demonstrate a consistent and substantial improvement over competing techniques. Aside from this, we also achieve state-of-the-art results for the PASCAL VOC human layout problem.

Arpit Mittal, Matthew B. Blaschko, Andrew Zisserman, Philip H. S. Torr
Robust Point Matching Revisited: A Concave Optimization Approach

The well-known robust point matching (RPM) method uses deterministic annealing for optimization, and it has two problems. First, it cannot guarantee the global optimality of the solution and tends to align the centers of two point sets. Second, deformation needs to be regularized to avoid the generation of undesirable results. To address these problems, in this paper we first show that the energy function of RPM can be reduced to a concave function with very few non-rigid terms after eliminating the transformation variables and applying a linear transformation; we then propose to use a concave optimization technique to minimize the resulting energy function. The proposed method scales well with problem size, achieves the globally optimal solution, and does not need regularization for simple transformations such as similarity transforms. Experiments on synthetic and real data validate the advantages of our method in comparison with state-of-the-art methods.

Wei Lian, Lei Zhang
Learning Discriminative Spatial Relations for Detector Dictionaries: An Application to Pedestrian Detection

The recent availability of large-scale training sets, in conjunction with accurate classifiers (e.g., SVMs), makes it possible to build large sets of “simple” object detectors and to develop new classification approaches in which dictionaries of visual features are substituted by dictionaries of object detectors. The responses of this collection of detectors can then be used as a high-level image representation. In this work, we propose to go a step further in this direction by modeling spatial relations among different detector responses. We use Random Forests in order to discriminatively select spatial relations which represent frequent co-occurrences of detector responses. We demonstrate our idea in the specific people detection framework, which is a challenging classification task due to the variability of human body articulations and appearance, and we use the recently proposed poselets as our basic object dictionary. Poselets are not the only possible choice; in fact, the proposed method can be applied more generally, since few assumptions are made on the basic object detector. The results obtained show sharp improvements with respect to both the original poselet-based people detection method and other state-of-the-art approaches on two difficult benchmark datasets.

Enver Sangineto, Marco Cristani, Alessio Del Bue, Vittorio Murino
Learning Deformations with Parallel Transport

Many vision problems, such as object recognition and image synthesis, are greatly impacted by deformation of objects. In this paper, we develop a deformation model based on Lie algebraic analysis. This work aims to provide a generative model that explicitly decouples deformation from appearance, which is fundamentally different from prior work that focuses on deformation-resilient features or metrics. Specifically, the deformation group for each object can be characterized by a set of Lie algebraic basis elements. Such bases for different objects are related via parallel transport. Exploiting the parallel transport relations, we formulate an optimization problem and derive an algorithm that jointly estimates the deformation basis for a class of objects, given a set of images resulting from the action of the deformations. We test the proposed model empirically on both character recognition and face synthesis.

Donglai Wei, Dahua Lin, John Fisher III
Multi-channel Shape-Flow Kernel Descriptors for Robust Video Event Detection and Retrieval

Despite the success of spatio-temporal visual features, they are hand-designed and aggregate image or flow gradients using a pre-specified, uniform set of orientation bins. Kernel descriptors [1] generalize such orientation histograms by defining match kernels over image patches, and have shown superior performance for visual object and scene recognition. In our work, we make two contributions: first, we extend kernel descriptors to the spatio-temporal domain to model salient flow, gradient and texture patterns in video. Further, we apply our kernel descriptors to extract features from different color channels. Second, we present a fast algorithm for kernel descriptor computation of O(1) complexity for each pixel in each video patch, producing two orders of magnitude speedup over conventional kernel descriptors and other popular motion features. Our evaluation results on the TRECVID MED 2011 dataset indicate that the proposed multi-channel shape-flow kernel descriptors outperform several other features including SIFT, SURF, STIP and Color SIFT.

Pradeep Natarajan, Shuang Wu, Shiv Vitaladevuni, Xiaodan Zhuang, Unsang Park, Rohit Prasad, Premkumar Natarajan
Tracking Using Motion Patterns for Very Crowded Scenes

This paper proposes the Motion Structure Tracker (MST) to solve the problem of tracking in very crowded structured scenes. It combines visual tracking, motion pattern learning and multi-target tracking. Tracking in crowded scenes is very challenging due to hundreds of similar objects, cluttered background, small object size, and occlusions. However, structured crowded scenes exhibit clear motion patterns, which provide rich prior information. In MST, tracking and detection are performed jointly, and motion pattern information is integrated in both steps to enforce scene structure constraints. MST is initially used to track a single target, and is further extended to solve a simplified version of the multi-target tracking problem. Experiments are performed on real-world challenging sequences, and MST gives promising results. Our method significantly outperforms several state-of-the-art methods both in terms of track ratio and accuracy.

Xuemei Zhao, Dian Gong, Gérard Medioni
Long-Range Spatio-Temporal Modeling of Video with Application to Fire Detection

We describe a methodology for modeling backgrounds subject to significant variability over time-scales ranging from days to years, where the events of interest exhibit subtle variability relative to the normal mode. The motivating application is fire monitoring from remote stations, where illumination changes spanning the day and the season, meteorological phenomena resembling smoke, and the absence of sufficient training data for the two classes make out-of-the-box classification algorithms ineffective. We exploit low-level descriptors, incorporate explicit modeling of nuisance variability, and learn the residual normal-model variability. Our algorithm achieves state-of-the-art performance not only compared to other anomaly detection schemes, but also compared to human performance, both for untrained and trained operators.

Avinash Ravichandran, Stefano Soatto
GMCP-Tracker: Global Multi-object Tracking Using Generalized Minimum Clique Graphs

Data association is an essential component of any human tracking system. The majority of current methods, such as bipartite matching, incorporate a limited temporal locality of the sequence into the data association problem, which makes them inherently prone to ID switches and to difficulties caused by long-term occlusion, cluttered background, and crowded scenes. We propose an approach to data association which incorporates both motion and appearance in a global manner. Unlike limited-temporal-locality methods which incorporate a few frames into the data association problem, we incorporate the whole temporal span and solve the data association problem for one object at a time, while implicitly incorporating the rest of the objects. In order to achieve this, we utilize Generalized Minimum Clique Graphs to solve the optimization problem of our data association method. Our proposed method yields a better-formulated approach to data association, which is supported by our superior results. Experiments show the proposed method makes significant improvements in tracking in the diverse sequences of Town Center [1], TUD-crossing [2], TUD-Stadtmitte [2], PETS2009 [3], and a new sequence called Parking Lot, compared to state-of-the-art methods.

Amir Roshan Zamir, Afshin Dehghan, Mubarak Shah
Heliometric Stereo: Shape from Sun Position

In this work, we present a method to uncover shape from webcams “in the wild.” We present a variant of photometric stereo which uses the sun as a distant light source, so that lighting direction can be computed from known GPS coordinates and timestamps. We propose an iterative, non-linear optimization process that minimizes the error in reproducing all images from an extended time-lapse with an image formation model that accounts for ambient lighting, shadows, changing light color, dense surface normal maps, radiometric calibration, and exposure. Unlike many approaches to uncalibrated outdoor image analysis, this procedure is automatic, and we report quantitative results by comparing extracted surface normals to Google Earth 3D models. We evaluate this procedure on data from a varied set of scenes and emphasize the advantages of including imagery from many months.

Austin Abrams, Christopher Hawley, Robert Pless
Shape from Single Scattering for Translucent Objects

Translucent objects strongly scatter incident light. Scattering makes the problem of estimating the shape of translucent objects difficult, because reflected or transmitted light cannot be reliably extracted from the scattering. In this paper, we propose a new shape estimation method that directly utilizes scattering measurements. Although volumetric scattering is a complex phenomenon, single scattering can be relatively easily modeled because it is a simple one-bounce collision of light with a particle in a medium. Based on this observation, our method determines the shape of objects from the observed intensities of the single scattering and its attenuation. We develop a solution method that simultaneously determines scattering parameters and the shape based on energy minimization. We demonstrate the effectiveness of the proposed approach by extensive experiments using synthetic and real data.

Chika Inoshita, Yasuhiro Mukaigawa, Yasuyuki Matsushita, Yasushi Yagi
Scale Invariant Optical Flow

Scale variation commonly arises in images/videos, which cannot be naturally dealt with by optical flow. Invariant feature matching, on the contrary, provides sparse matching and could fail for regions without conspicuous structures. We aim to establish dense correspondence between frames containing objects in different scales and contribute a new framework taking pixel-wise scales into consideration in optical flow estimation. We propose an effective numerical scheme, which iteratively optimizes discrete scale variables and continuous flow ones. This scheme notably expands the practicality of optical flow in natural scenes containing various types of object motion.

Li Xu, Zhenlong Dai, Jiaya Jia
Structured Image Segmentation Using Kernelized Features

Most state-of-the-art approaches to image segmentation formulate the problem using Conditional Random Fields. These models typically include a unary term and a pairwise term, whose parameters must be carefully chosen for optimal performance. Recently, structured learning approaches such as Structured SVMs (SSVM) have made it possible to jointly learn these model parameters. However, they have been limited to linear kernels, since more powerful non-linear kernels cause the learning to become prohibitively expensive. In this paper, we introduce an approach to “kernelize” the features so that a linear SSVM framework can leverage the power of non-linear kernels without incurring the high computational cost. We demonstrate the advantages of this approach in a series of image segmentation experiments on the MSRC data set as well as 2D and 3D datasets containing imagery of neural tissue acquired with electron microscopes.

Aurélien Lucchi, Yunpeng Li, Kevin Smith, Pascal Fua
Salient Object Detection: A Benchmark

Many salient object detection approaches have been published and assessed using different evaluation scores and datasets, resulting in discrepancies in model comparison. This calls for a methodological framework to compare existing models and evaluate their pros and cons. We analyze benchmark datasets and scoring techniques and, for the first time, provide a quantitative comparison of 35 state-of-the-art saliency detection models. We find that some models perform consistently better than others. Saliency models that are intended to predict eye fixations perform worse on segmentation datasets than salient object detection algorithms. Further, we propose combined models and show that integrating the few best models outperforms all individual models across datasets. By analyzing the consistency among the best models and among humans for each scene, we identify the scenes where models or humans fail to detect the most salient object. We highlight the current issues and propose future research directions.

Ali Borji, Dicky N. Sihite, Laurent Itti
Automatic Segmentation of Unknown Objects, with Application to Baggage Security

Computed tomography (CT) is used widely to image patients for medical diagnosis and to scan baggage for threatening materials. Automated reading of these images can be used to reduce the costs of a human operator, extract quantitative information from the images or support the judgements of a human operator. Object quantification requires an image segmentation to make measurements about object size, material composition and morphology. Medical applications mostly require the segmentation of prespecified objects, such as specific organs or lesions, which allows the use of customized algorithms that take advantage of training data to provide orientation and anatomical context of the segmentation targets. In contrast, baggage screening requires the segmentation algorithm to provide segmentation of an unspecified number of objects with enormous variability in size, shape, appearance and spatial context. Furthermore, security systems demand 3D segmentation algorithms that can quickly and reliably detect threats. To address this problem, we present a segmentation algorithm for 3D CT images that makes no assumptions on the number of objects in the image or on the composition of these objects. The algorithm features a new Automatic QUality Measure (AQUA) model that measures the segmentation confidence for any single object (from any segmentation method) and uses this confidence measure to both control splitting and to optimize the segmentation parameters at runtime for each dataset. The algorithm is tested on 27 bags that were packed with a large variety of different objects.

Leo Grady, Vivek Singh, Timo Kohlberger, Christopher Alvino, Claus Bahlmann
Multi-scale Clustering of Frame-to-Frame Correspondences for Motion Segmentation

We present an approach for motion segmentation using independently detected keypoints instead of the commonly used tracklets or trajectories. This allows us to establish correspondences over non-consecutive frames, and thus we are able to handle multiple object occlusions consistently. On a frame-to-frame level, we extend the classical split-and-merge algorithm for fast and precise motion segmentation. Globally, we cluster multiple such segmentations at different time scales with an accurate estimation of the number of motions. On the standard benchmarks, our approach performs best in comparison to all algorithms which are able to handle unconstrained missing data. We further show that it works on benchmark data with more than 98% of the input data missing. Finally, the performance is evaluated on a mobile-phone-recorded sequence with multiple objects occluded at the same time.

Ralf Dragon, Bodo Rosenhahn, Jörn Ostermann

Oral Session 2: Learning and Large-Scale Recognition

Fourier Kernel Learning

Approximations based on random Fourier embeddings have recently emerged as an efficient and formally consistent methodology to design large-scale kernel machines [23]. By expressing the kernel as a Fourier expansion, features are generated based on a finite set of random basis projections, sampled from the Fourier transform of the kernel, with inner products that are Monte Carlo approximations of the original non-linear model. Based on the observation that different kernel-induced Fourier sampling distributions correspond to different kernel parameters, we show that a scalable optimization process in the Fourier domain can be used to identify the different frequency bands that are useful for prediction on training data. This approach allows us to design a family of linear prediction models where we can learn the hyper-parameters of the kernel together with the weights of the feature vectors jointly. Under this methodology, we recover efficient and scalable linear reformulations for both single and multiple kernel learning. Experiments show that our linear models produce fast and accurate predictors for complex datasets such as the Visual Object Challenge 2011 and ImageNet ILSVRC 2011.

Eduard Gabriel Băzăvan, Fuxin Li, Cristian Sminchisescu
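
For background, the random Fourier embedding that the paper builds on approximates a shift-invariant kernel by sampling frequencies from its Fourier transform. The fixed-bandwidth Gaussian sketch below shows the embedding itself; the paper's contribution, learning the sampling distribution's hyper-parameters jointly with the linear model, is not reproduced here.

```python
import numpy as np

def random_fourier_features(X, n_feats, gamma, rng=np.random):
    """Monte Carlo feature map approximating the Gaussian kernel
    k(x, y) = exp(-gamma * ||x - y||^2).

    Frequencies are drawn from the kernel's Fourier transform, here a
    Gaussian with standard deviation sqrt(2 * gamma).
    """
    d = X.shape[1]
    W = rng.normal(0.0, np.sqrt(2 * gamma), size=(d, n_feats))
    b = rng.uniform(0.0, 2 * np.pi, size=n_feats)
    return np.sqrt(2.0 / n_feats) * np.cos(X @ W + b)

X = np.random.randn(4, 10)
Z = random_fourier_features(X, 500, gamma=0.5)
print((Z @ Z.T)[0, 1])   # approximates exp(-gamma * ||x0 - x1||^2)
```
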
Efficient Optimization for Low-Rank Integrated Bilinear Classifiers

In pattern classification, two-way data (feature matrices) must be treated efficiently while preserving two-way structure such as spatio-temporal relationships. The classifier for a feature matrix is generally formulated by multiple bilinear forms which result in a matrix. The rank of that matrix, i.e., the number of bilinear forms, should be low from the viewpoint of generalization performance and computational cost. For that purpose, we propose a low-rank bilinear classifier based on efficient optimization. In the proposed method, the classifier is optimized by minimizing the trace norm of the classifier (matrix), which contributes to rank reduction for an efficient classifier without any hard constraint on the rank. We formulate the optimization problem in a tractable convex form and propose a procedure to solve it efficiently to the global optimum. In addition, by considering a kernel-based extension of the bilinear method, we derive a novel multiple kernel learning (MKL) method, called heterogeneous MKL. The method combines both inter-kernels between heterogeneous types of features and ordinary kernels within homogeneous features into a new discriminative kernel in a unified manner using the bilinear model. In experiments on various classification problems using feature arrays, co-occurrence feature matrices, and multiple kernels, the proposed method exhibits favorable performance compared to other methods.

Takumi Kobayashi, Nobuyuki Otsu
Metric Learning for Large Scale Image Classification: Generalizing to New Classes at Near-Zero Cost

We are interested in large-scale image classification and especially in the setting where images corresponding to new or existing classes are continuously added to the training set. Our goal is to devise classifiers which can incorporate such images and classes on-the-fly at (near) zero cost. We cast this problem into one of learning a metric which is shared across all classes and explore k-nearest neighbor (k-NN) and nearest class mean (NCM) classifiers. We learn metrics on the ImageNet 2010 challenge data set, which contains more than 1.2M training images of 1K classes. Surprisingly, the NCM classifier compares favorably to the more flexible k-NN classifier and has comparable performance to linear SVMs. We also study the generalization performance, among other things by using the learned metric on the ImageNet-10K dataset, where we obtain competitive performance. Finally, we explore zero-shot classification and show how the zero-shot model can be combined very effectively with small training datasets.

Thomas Mensink, Jakob Verbeek, Florent Perronnin, Gabriela Csurka
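
The near-zero-cost class addition follows directly from the NCM form: a new class needs only its mean in the learned metric. In the sketch below, the projection W is fixed at random as a placeholder for the learned metric; all names are illustrative.

```python
import numpy as np

class NCMClassifier:
    """Nearest class mean classifier under a low-rank projection W."""
    def __init__(self, W):
        self.W = W
        self.means = {}
    def add_class(self, label, X):
        self.means[label] = X.mean(axis=0)   # near-zero cost per new class
    def predict(self, x):
        d = {c: np.sum((self.W @ (x - m)) ** 2)
             for c, m in self.means.items()}
        return min(d, key=d.get)

rng = np.random.default_rng(0)
clf = NCMClassifier(W=rng.normal(size=(32, 64)))
clf.add_class("cat", rng.normal(0.0, 1.0, (20, 64)))
clf.add_class("dog", rng.normal(0.5, 1.0, (20, 64)))
print(clf.predict(rng.normal(0.5, 1.0, 64)))
```
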
Leafsnap: A Computer Vision System for Automatic Plant Species Identification

We describe the first mobile app for identifying plant species using automatic visual recognition. The system – called Leafsnap – identifies tree species from photographs of their leaves. Key to this system are computer vision components for discarding non-leaf images, segmenting the leaf from an untextured background, extracting features representing the curvature of the leaf’s contour over multiple scales, and identifying the species from a dataset of the 184 trees in the Northeastern United States. Our system obtains state-of-the-art performance on the real-world images from the new Leafsnap Dataset – the largest of its kind. Throughout the paper, we document many of the practical steps needed to produce a computer vision system such as ours, which currently has nearly a million users.

Neeraj Kumar, Peter N. Belhumeur, Arijit Biswas, David W. Jacobs, W. John Kress, Ida C. Lopez, João V. B. Soares
Large Scale Visual Geo-Localization of Images in Mountainous Terrain

Given a picture taken somewhere in the world, automatic geo-localization of that image is a task that would be extremely useful, e.g., for historical and forensic sciences, documentation purposes, organization of the world’s photo material, and intelligence applications. While tremendous progress has been made over the last years in visual location recognition within a single city, localization in natural environments is much more difficult, since vegetation, illumination, and seasonal changes make appearance-only approaches impractical. In this work, we target mountainous terrain and use digital elevation models to extract representations for fast visual database lookup. We propose an automated approach for very large scale visual localization that can efficiently exploit visual information (contours) and geometric constraints (consistent orientation) at the same time. We validate the system on the scale of a whole country (Switzerland, 40,000 km²) using a new dataset of more than 200 landscape query pictures with ground truth.

Georges Baatz, Olivier Saurer, Kevin Köser, Marc Pollefeys

Poster Session 3

Manifold Statistics for Essential Matrices

Riemannian geometry allows for the generalization of statistics designed for Euclidean vector spaces to Riemannian manifolds. It has recently gained popularity within computer vision as many relevant parameter spaces have such a Riemannian manifold structure. Approaches which exploit this have been shown to exhibit improved efficiency and accuracy. The Riemannian logarithmic and exponential mappings are at the core of these approaches.

In this contribution we review recently proposed Riemannian mappings for essential matrices and prove that they lead to sub-optimal manifold statistics. We introduce correct Riemannian mappings by utilizing a multiple-geodesic approach and show experimentally that they provide optimal statistics.

Gijs Dubbelman, Leo Dorst, Henk Pijls
Covariance Propagation and Next Best View Planning for 3D Reconstruction

This paper examines the potential benefits of applying next best view planning to sequential 3D reconstruction from unordered image sequences. A standard sequential structure-and-motion pipeline is extended with active selection of the order in which cameras are resectioned. To this end, approximate covariance propagation is implemented throughout the system, providing running estimates of the uncertainties of the reconstruction, while also enhancing robustness and accuracy. Experiments show that the use of expensive global bundle adjustment can be reduced throughout the process, while the additional cost of propagation is essentially linear in the problem size.

Sebastian Haner, Anders Heyden
Dilated Divergence Based Scale-Space Representation for Curve Analysis

This study proposes the novel dilated divergence scale-space representation for multidimensional curve-like image structure analysis. In the proposed framework, image structures are modeled as curves with arbitrary thickness. The dilated divergence analyzes the structure boundaries along the curve normal space in a multi-scale fashion. The dilated divergence based detection is formulated so as to 1) withstand the disturbance introduced by neighboring objects, and 2) recognize the curve normal and tangent spaces. The latter enables the innovative formulation of structure eccentricity analysis and curve tangent space-based structure motion analysis, which have scarcely been investigated in the literature. The proposed method is validated using 2D, 3D and 4D images. The structure principal direction estimation accuracies, structure scale detection accuracies and detection stabilities are quantified and compared against two scale-space approaches, showing a competitive performance of the proposed approach under the disturbance introduced by image noise and neighboring objects. Moreover, as an application example employing the dilated divergence detection responses, an automated approach is tailored for spinal cord centerline extraction. The proposed method is shown to be versatile and well suited to a wide range of applications.

Max W. K. Law, KengYeow Tay, Andrew Leung, Gregory J. Garvin, Shuo Li
A Parameterless Line Segment and Elliptical Arc Detector with Enhanced Ellipse Fitting

We propose a combined line segment and elliptical arc detector, which formally guarantees the control of the number of false positives and requires no parameter tuning. The accuracy of the detected elliptical features is improved by using a novel non-iterative ellipse fitting technique, which merges the algebraic distance with the gradient orientation. The performance of the detector is evaluated on computer-generated images and on natural images.

Viorica Pătrăucean, Pierre Gurdjos, Rafael Grompone von Gioi
Detecting and Reconstructing 3D Mirror Symmetric Objects

We present a system that detects 3D mirror-symmetric objects in images and then reconstructs their visible symmetric parts. Our detection stage is based on matching mirror symmetric feature points and descriptors and then estimating the symmetry direction using RANSAC. We enhance this step by augmenting feature descriptors with their affine-deformed versions and matching these extended sets of descriptors. The reconstruction stage uses a novel edge matching algorithm that matches symmetric pairs of curves that are likely to be counterparts. This allows the algorithm to reconstruct lightly textured objects, which are problematic for traditional feature-based and intensity-based stereo matchers.

Sudipta N. Sinha, Krishnan Ramnath, Richard Szeliski
3D Reconstruction of Dynamic Scenes with Multiple Handheld Cameras

Accurate dense 3D reconstruction of dynamic scenes from natural images is still very challenging. Most previous methods rely on a large number of fixed cameras to obtain good results. Some of these methods further require separation of static and dynamic points, which are usually restricted to scenes with known background. We propose a novel dense depth estimation method which can automatically recover accurate and consistent depth maps from the synchronized video sequences taken by a few handheld cameras. Unlike fixed camera arrays, our data capturing setup is much more flexible and easier to use. Our algorithm simultaneously solves bilayer segmentation and depth estimation in a unified energy minimization framework, which combines different spatio-temporal constraints for effective depth optimization and segmentation of static and dynamic points. A variety of examples demonstrate the effectiveness of the proposed framework.

Hanqing Jiang, Haomin Liu, Ping Tan, Guofeng Zhang, Hujun Bao
Joint Face Alignment: Rescue Bad Alignments with Good Ones by Regularized Re-fitting

Nowadays, more and more applications need to jointly align a set of facial images from one specific person, which forms the so-called joint face alignment problem. To address this problem, in this paper, starting from initial face alignment results, we propose to enhance the alignments by a fundamentally novel idea: rescuing the bad alignments with their well-aligned neighbors. In our method, a discriminative alignment evaluator is designed to assess the initial face alignments and separate the well-aligned images from the badly-aligned ones. To correct the bad ones, a robust regularized re-fitting algorithm is proposed by exploiting the appearance consistency between the badly-aligned image and its k well-aligned nearest neighbors. Experiments conducted on faces in the wild demonstrate that our method greatly improves the initial face alignment results of an off-the-shelf facial landmark locator. In addition, the effectiveness of our method is validated through comparison with other state-of-the-art methods in joint face alignment under complex conditions.

Xiaowei Zhao, Xiujuan Chai, Shiguang Shan
Dynamic Facial Expression Recognition Using Longitudinal Facial Expression Atlases

In this paper, we propose a new scheme to formulate the dynamic facial expression recognition problem as a longitudinal atlases construction and deformable groupwise image registration problem. The main contributions of this method include: 1) We model human facial feature changes during the facial expression process by a diffeomorphic image registration framework; 2) The subject-specific longitudinal change information of each facial expression is captured by building an expression growth model; 3) Longitudinal atlases of each facial expression are constructed by performing groupwise registration among all the corresponding expression image sequences of different subjects. The constructed atlases can reflect overall facial feature changes of each expression among the population, and can suppress the bias due to inter-personal variations. The proposed method was extensively evaluated on the Cohn-Kanade, MMI, and Oulu-CASIA VIS dynamic facial expression databases and was compared with several state-of-the-art facial expression recognition approaches. Experimental results demonstrate that our method consistently achieves the highest recognition accuracies among other methods under comparison on all the databases.

Yimo Guo, Guoying Zhao, Matti Pietikäinen
Crosstalk Cascades for Frame-Rate Pedestrian Detection

Cascades help make sliding window object detection fast; nevertheless, computational demands remain prohibitive for numerous applications. Currently, evaluation of adjacent windows proceeds independently; this is suboptimal, as detector responses at nearby locations and scales are correlated. We propose to exploit these correlations by tightly coupling detector evaluation of nearby windows. We introduce two opposing mechanisms: detector excitation of promising neighbors and inhibition of inferior neighbors. By enabling neighboring detectors to communicate, crosstalk cascades achieve major gains (4-30× speedup) over cascades evaluated independently at each image location. Combined with recent advances in fast multi-scale feature computation, for which we provide an optimized implementation, our approach runs at 35-65 fps on 640×480 images while attaining state-of-the-art accuracy.

Piotr Dollár, Ron Appel, Wolf Kienzle
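
A possible reading of the excitation/inhibition mechanics, sketched for one cascade stage over a grid of partial scores: windows far below a strong neighbor are pruned, while windows beside a promising one are retained. The thresholds and comparison rule here are invented for illustration; the paper learns the couplings per cascade stage.

```python
import numpy as np

def crosstalk_sweep(weak_scores, theta_inhibit, theta_excite):
    """One cascade stage with neighbour inhibition/excitation (sketch).

    weak_scores: (H, W) partial cascade scores on a detection grid.
    Returns a boolean mask of windows that later stages keep evaluating.
    """
    H, W = weak_scores.shape
    # Max over the 8-neighbourhood via shifted views of a padded grid.
    pad = np.pad(weak_scores, 1, constant_values=-np.inf)
    neigh = np.max([pad[dy:dy + H, dx:dx + W]
                    for dy in range(3) for dx in range(3)
                    if (dy, dx) != (1, 1)], axis=0)
    inhibited = weak_scores < neigh - theta_inhibit  # far below best neighbour
    excited = neigh > theta_excite                   # strong neighbour nearby
    return (~inhibited) | excited

keep = crosstalk_sweep(np.random.randn(20, 20), 1.0, 2.0)
print(keep.mean())   # fraction of windows surviving this stage
```
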
Query Specific Fusion for Image Retrieval

Recent image retrieval algorithms based on local features indexed by a vocabulary tree and holistic features indexed by compact hashing codes both demonstrate excellent scalability. However, their retrieval precision may vary dramatically among queries. This motivates us to investigate how to fuse the ordered retrieval sets given by multiple retrieval methods, to further enhance the retrieval precision. Thus, we propose a graph-based query specific fusion approach where multiple retrieval sets are merged and reranked by conducting a link analysis on a fused graph. The retrieval quality of an individual method is measured by the consistency of the top candidates’ nearest neighborhoods. Hence, the proposed method is capable of adaptively integrating the strengths of the retrieval methods using local or holistic features for different queries without any supervision. Extensive experiments demonstrate competitive performance on 4 public datasets, i.e., the UKbench, Corel-5K, Holidays and San Francisco Landmarks datasets.

Shaoting Zhang, Ming Yang, Timothee Cour, Kai Yu, Dimitris N. Metaxas
Size Matters: Exhaustive Geometric Verification for Image Retrieval

The overarching goals in large-scale image retrieval are bigger, better and cheaper. For systems based on local features, we show how to get both efficient geometric verification of every match and unprecedented speed in the low-sparsity setting.

Large-scale systems based on quantized local features usually process the index one term at a time, forcing two separate scoring steps: First, a scoring step to find candidates with enough matches, and then a geometric verification step where a subset of the candidates are checked.

Our method searches through the index a document at a time, verifying the geometry of every candidate in a single pass. We study the behavior of several algorithms with respect to index density—a key element for large-scale databases. In order to further improve efficiency, we also introduce a new data structure, called the counting min-tree, which outperforms other approaches when working with low database density, a necessary condition for very large-scale systems.

We demonstrate the effectiveness of our approach with a proof of concept system that can match an image against a database of more than 90 billion images in just a few seconds.
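The counting min-tree itself is not specified in this abstract, but the document-at-a-time traversal it accelerates can be sketched with an ordinary heap. The postings layout, the min_matches threshold, and the doc_at_a_time helper are assumptions for illustration:

```python
import heapq

def doc_at_a_time(postings, min_matches=3):
    """Sketch of document-at-a-time index traversal.

    postings: dict term -> sorted list of document ids containing that term.
    Yields (doc_id, match_count) for docs with enough matching terms in one
    pass; per-document geometry checks could run here instead of in a
    separate second phase.
    """
    heap = []  # (current_doc_id, term, position_in_posting_list)
    for term, plist in postings.items():
        if plist:
            heapq.heappush(heap, (plist[0], term, 0))
    while heap:
        doc = heap[0][0]
        count = 0
        # Pop every posting list currently pointing at this document.
        while heap and heap[0][0] == doc:
            _, term, pos = heapq.heappop(heap)
            count += 1
            plist = postings[term]
            if pos + 1 < len(plist):
                heapq.heappush(heap, (plist[pos + 1], term, pos + 1))
        if count >= min_matches:
            yield doc, count

index = {"t1": [1, 4, 9], "t2": [4, 9], "t3": [2, 4]}
print(list(doc_at_a_time(index, min_matches=2)))   # [(4, 3), (9, 2)]
```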

Henrik Stewénius, Steinar H. Gunderson, Julien Pilet
Scene Aligned Pooling for Complex Video Recognition

Real-world videos often contain dynamic backgrounds and evolving human activities, especially web videos generated by users in unconstrained scenarios. This paper proposes a new visual representation, namely scene aligned pooling, for the task of event recognition in complex videos. Based on the observation that a video clip is often composed of shots of different scenes, the key idea of scene aligned pooling is to decompose any video features into concurrent scene components and to construct classification models adaptive to different scenes. Experiments on two large-scale real-world datasets, the TRECVID Multimedia Event Detection 2011 dataset and the Human Motion Recognition Database (HMDB), show that our new visual representation consistently improves various kinds of visual features, such as low-level color and texture features, mid-level histograms of local descriptors (e.g., SIFT or space-time interest points), and high-level semantic model features, by a significant margin. For example, we improve the state-of-the-art accuracy on the HMDB dataset by 20%.

Liangliang Cao, Yadong Mu, Apostol Natsev, Shih-Fu Chang, Gang Hua, John R. Smith
Discovering Latent Domains for Multisource Domain Adaptation

Recent domain adaptation methods successfully learn cross-domain transforms to map points between source and target domains. Yet, these methods are either restricted to a single training domain, or assume that the separation into source domains is known a priori. However, most available training data contains multiple unknown domains. In this paper, we present both a novel domain transform mixture model which outperforms a single transform model when multiple domains are present, and a novel constrained clustering method that successfully discovers latent domains. Our discovery method is based on a novel hierarchical clustering technique that uses available object category information to constrain the set of feasible domain separations. To illustrate the effectiveness of our approach we present experiments on two commonly available image datasets with and without known domain labels: in both cases our method outperforms baseline techniques which use no domain adaptation or domain adaptation methods that presume a single underlying domain shift.

Judy Hoffman, Brian Kulis, Trevor Darrell, Kate Saenko
Visual Recognition Using Local Quantized Patterns

Features such as Local Binary Patterns (LBP) and Local Ternary Patterns (LTP) have been very successful in a number of areas, including texture analysis, face recognition and object detection. They are based on the idea that small patterns of qualitative local gray-level differences contain a great deal of information about higher-level image content. Current local pattern features use hand-specified codings that are limited to small spatial supports and coarse gray-level comparisons. We introduce Local Quantized Patterns (LQP), a generalization that uses lookup-table-based vector quantization to code larger or deeper patterns. LQP inherits some of the flexibility and power of visual word representations without sacrificing the run-time speed and simplicity of local pattern features. We show that it outperforms well-established features, including HOG, LBP and LTP and their combinations, on a range of challenging object detection and texture classification problems.
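A minimal sketch of the general idea, under assumptions: binary patterns over a wider neighborhood than plain LBP, with the pattern-code-to-visual-word mapping learned by a small frequency-weighted k-means and frozen into a lookup table, so test-time quantization is a single table access per pixel. The neighborhood, the codebook size, and the binary_patterns/lqp_table helpers are illustrative, not the paper's coding.

```python
import numpy as np

def binary_patterns(img, offsets):
    """Per-pixel binary pattern code over an arbitrary neighborhood."""
    H, W = img.shape
    pad = max(max(abs(dy), abs(dx)) for dy, dx in offsets)
    p = np.pad(img, pad, mode="edge")
    code = np.zeros((H, W), dtype=np.int64)
    for bit, (dy, dx) in enumerate(offsets):
        nb = p[pad + dy: pad + dy + H, pad + dx: pad + dx + W]
        code |= (nb >= img).astype(np.int64) << bit   # neighbor vs. center
    return code

def lqp_table(codes, n_bits, n_words=64, iters=10, seed=0):
    """Learn a lookup table: pattern code -> visual word.

    Every possible code is unpacked into a {-1,+1} vector; a tiny k-means,
    weighted by how often each code occurs, clusters the vectors. At test
    time quantization is one table lookup per pixel.
    """
    rng = np.random.default_rng(seed)
    all_codes = np.arange(2 ** n_bits)
    vecs = ((all_codes[:, None] >> np.arange(n_bits)) & 1) * 2.0 - 1.0
    centers = vecs[rng.choice(len(vecs), n_words, replace=False)]
    hist = np.bincount(codes.ravel(), minlength=2 ** n_bits).astype(float)
    for _ in range(iters):
        assign = np.argmin(((vecs[:, None] - centers) ** 2).sum(-1), axis=1)
        for w in range(n_words):
            m = assign == w
            if hist[m].sum() > 0:          # weight clusters by code frequency
                centers[w] = (vecs[m] * hist[m, None]).sum(0) / hist[m].sum()
    return assign                           # lookup table of length 2**n_bits

img = np.random.default_rng(1).random((64, 64))
offs = [(dy, dx) for dy in (-2, 0, 2) for dx in (-2, 0, 2) if (dy, dx) != (0, 0)]
codes = binary_patterns(img, offs)          # "deeper" 8-bit patterns
table = lqp_table(codes, n_bits=len(offs))
words = table[codes]                        # quantization = table lookup
```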

Sibt ul Hussain, Bill Triggs
Randomized Spatial Partition for Scene Recognition

The spatial layout of images plays a critical role in natural scene analysis. Despite previous work, e.g., spatial pyramid matching, how to design an optimal spatial layout for scene classification remains an open problem due to the large variations among scene categories. This paper presents a novel image representation method whose objective is to characterize image layout by various patterns, in the form of randomized spatial partition (RSP). The RSP-based image representation makes it possible to mine the most descriptive image layout pattern for each category of scenes and then combine them by training a discriminative classifier, i.e., the proposed ORSP classifier. Besides the RSP image representation, another powerful classifier, called the BRSP classifier, is also proposed: by weighting and boosting a sequence of various partition patterns, the BRSP classifier is more robust to intra-class variations and hence yields more accurate classification. Both RSP-based classifiers are tested on three publicly available scene datasets, and the experimental results highlight the effectiveness of the proposed methods.
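The partition-and-pool idea can be sketched as follows; the recursive binary-split generator, the average pooling, and the assumed dense feature-map input are illustrative choices, not the paper's construction:

```python
import numpy as np

def random_partition(h, w, depth, rng):
    """Recursive random binary splits of the image plane into cells (boxes)."""
    cells = [(0, h, 0, w)]
    for _ in range(depth):
        new = []
        for (y0, y1, x0, x1) in cells:
            if rng.random() < 0.5 and y1 - y0 > 1:        # horizontal cut
                c = rng.integers(y0 + 1, y1)
                new += [(y0, c, x0, x1), (c, y1, x0, x1)]
            elif x1 - x0 > 1:                              # vertical cut
                c = rng.integers(x0 + 1, x1)
                new += [(y0, y1, x0, c), (y0, y1, c, x1)]
            else:
                new.append((y0, y1, x0, x1))
        cells = new
    return cells

def pooled_representation(feat, cells):
    """Concatenate average-pooled features over the partition cells.
    feat: H x W x C dense local feature map (an assumed input format)."""
    return np.concatenate([feat[y0:y1, x0:x1].mean(axis=(0, 1))
                           for (y0, y1, x0, x1) in cells])

rng = np.random.default_rng(3)
feat = rng.random((60, 80, 16))
reps = [pooled_representation(feat, random_partition(60, 80, 2, rng))
        for _ in range(5)]       # five candidate layout patterns to select from
```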

Yuning Jiang, Junsong Yuan, Gang Yu
Nested Sparse Quantization for Efficient Feature Coding

Many state-of-the-art methods in object recognition extract features from an image and encode them, followed by a pooling step and classification. Within this processing pipeline, the encoding step is often the bottleneck, for both computational efficiency and performance. We present a novel assignment-based encoding formulation that allows the fusion of assignment-based encoding and sparse coding into one formulation, and we use it to design a new, very efficient encoding. At the heart of our formulation lies a quantization into a set of k-sparse vectors, which we denote as sparse quantization. We design the new encoding as two nested sparse quantizations. Its efficiency stems from leveraging bit-wise representations. In a series of experiments on standard recognition benchmarks, namely Caltech 101, PASCAL VOC 07 and ImageNet, we demonstrate that our method achieves results competitive with the state of the art while requiring orders of magnitude less time and memory. Our method is able to encode one million images using 4 CPUs in a single day, while maintaining good performance.
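A minimal sketch of sparse quantization as described: map a vector to the nearest binary k-sparse vector by keeping its k largest entries, then nest two such quantizations, with the outer one operating bit-wise on similarities to a binary dictionary. The dictionary, the sizes, and the nested_encode helper are assumptions:

```python
import numpy as np

def sparse_quantize(x, k):
    """Quantize x onto a binary k-sparse vector:
    set the k largest entries to 1, the rest to 0."""
    b = np.zeros_like(x, dtype=np.uint8)
    b[np.argsort(x)[-k:]] = 1
    return b

def nested_encode(x, D, k_inner=4, k_outer=8):
    """Two nested sparse quantizations (illustrative sizes).

    Inner: binarize the descriptor itself. Outer: binarize its similarity to
    a dictionary D of binary codewords, computed bit-wise.
    """
    bx = sparse_quantize(x, k_inner)
    sims = (D & bx).sum(axis=1)             # shared active bits per codeword
    return sparse_quantize(sims.astype(float), k_outer)

rng = np.random.default_rng(0)
D = (rng.random((256, 128)) < 0.05).astype(np.uint8)  # assumed binary dictionary
x = rng.random(128)                                    # a SIFT-like descriptor
code = nested_encode(x, D)                             # 256-d, 8 active bits
```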

Xavier Boix, Gemma Roig, Christian Leistner, Luc Van Gool
Comparative Evaluation of Binary Features

Performance evaluation of salient features has a long-standing tradition in computer vision. In this paper, we fill the evaluation gap for the recent wave of binary feature descriptors, which aim to provide robustness while achieving high computational efficiency. We use established metrics to embed our assessment into the body of existing evaluations, allowing us to provide a novel taxonomy unifying both traditional and novel binary features. Moreover, we analyze the performance of different detector and descriptor pairings, which are often used in practice but have been infrequently analyzed. Additionally, we complement existing datasets with novel data testing for illumination change, pure camera rotation, pure scale change, and the variety present in photo-collections. Our performance analysis clearly demonstrates the power of the new class of features. To benefit the community, we also provide a website for the automatic testing of new description methods using our provided metrics and datasets (www.cs.unc.edu/feature-evaluation).

Jared Heinly, Enrique Dunn, Jan-Michael Frahm
Negative Evidences and Co-occurrences in Image Retrieval: The Benefit of PCA and Whitening

The paper addresses large scale image retrieval with short vector representations. We study dimensionality reduction by Principal Component Analysis (PCA) and propose improvements to its different phases. We show and explicitly exploit relations between i) mean subtraction and the negative evidence, i.e., a visual word that is mutually missing in two descriptions being compared, and ii) the axis de-correlation and the co-occurrences phenomenon. Finally, we propose an effective way to alleviate the quantization artifacts through a joint dimensionality reduction of multiple vocabularies. The proposed techniques are simple, yet significantly and consistently improve over the state of the art on compact image representations. Complementary experiments in image classification show that the methods are generally applicable.
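A minimal sketch of the baseline pipeline the paper improves: mean subtraction, PCA rotation, whitening, and re-normalization (the global scale is irrelevant after L2 normalization, so dividing by the singular values suffices). The joint reduction over multiple vocabularies is not shown; the dimensions and helper names are assumptions:

```python
import numpy as np

def learn_pca_whitening(X, d, eps=1e-8):
    """Fit mean subtraction + PCA rotation + whitening on vectors X (n x D)."""
    mu = X.mean(axis=0)
    Xc = X - mu                             # mean subtraction <-> negative evidence
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:d] / (S[:d, None] + eps)        # de-correlate axes, equalize energy
    return mu, P

def embed(x, mu, P):
    """Project a BoW/VLAD-style vector and re-normalize for cosine matching."""
    y = P @ (x - mu)
    return y / (np.linalg.norm(y) + 1e-12)

rng = np.random.default_rng(0)
X = rng.random((1000, 512))                 # stand-in image descriptors
mu, P = learn_pca_whitening(X, d=128)
q = embed(X[0], mu, P)
db = np.stack([embed(x, mu, P) for x in X])
ranking = np.argsort(-(db @ q))             # ranking[0] == 0 (the query itself)
```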

Hervé Jégou, Ondřej Chum
WαSH: Weighted α-Shapes for Local Feature Detection

Depending on the application, local feature detectors should comply with properties that are often contradictory, e.g., distinctiveness vs. robustness. Providing a good balance is a standing problem in the field. In this direction, we propose a novel approach for local feature detection that starts from sampled edges. The detector is based on shape stability measures across the weighted α-filtration, a computational geometry construction that captures the shape of a non-uniform set of points. The extracted features are blob-like and include non-extremal regions as well as regions determined by cavities of boundary shape. Overall, the approach provides distinctive regions while achieving high robustness in terms of repeatability and matching score, as well as competitive performance in a large-scale image retrieval application.

Christos Varytimidis, Konstantinos Rapantzikos, Yannis Avrithis
Sparselet Models for Efficient Multiclass Object Detection

We develop an intermediate representation for deformable part models and show that this representation has favorable performance characteristics for multi-class problems when the number of classes is high. Our model uses sparse coding of part filters to represent each filter as a sparse linear combination of shared dictionary elements. This leads to a universal set of parts that are shared among all object classes. Reconstruction of the original part filter responses via sparse matrix-vector product reduces computation relative to conventional part filter convolutions. Our model is well suited to a parallel implementation, and we report a new GPU DPM implementation that takes advantage of sparse coding of part filters. The speed-up offered by our intermediate representation and parallel computation enable real-time DPM detection of 20 different object classes on a laptop computer.
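The computational trick can be sketched directly: pay for one convolution per shared dictionary atom, then recover every part filter's response map as a sparse linear combination of those few maps. The atom sizes, the sparsity level, and the sparselet_responses helper are illustrative assumptions:

```python
import numpy as np
from scipy.signal import correlate2d

def sparselet_responses(image, dictionary, alphas):
    """Approximate many part-filter responses from few dictionary correlations.

    dictionary: list of small 2-D "sparselet" atoms shared across all classes.
    alphas: (n_filters x n_atoms) sparse coefficients, w_f ~ sum_i a_fi * d_i.
    """
    # Paid once: one correlation per dictionary atom.
    R = np.stack([correlate2d(image, d, mode="valid") for d in dictionary])
    # Each filter response is then a cheap sparse combination of the R maps.
    out = []
    for a in alphas:
        nz = np.nonzero(a)[0]
        out.append(np.tensordot(a[nz], R[nz], axes=1))
    return out

rng = np.random.default_rng(0)
atoms = [rng.normal(size=(6, 6)) for _ in range(32)]
A = rng.normal(size=(200, 32)) * (rng.random((200, 32)) < 0.1)  # ~3 atoms/filter
maps = sparselet_responses(rng.normal(size=(80, 80)), atoms, A)  # 200 responses
```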

Hyun Oh Song, Stefan Zickler, Tim Althoff, Ross Girshick, Mario Fritz, Christopher Geyer, Pedro Felzenszwalb, Trevor Darrell
Nested Pictorial Structures

We propose a theoretical construct coined the nested pictorial structure to represent an object by parts that are recursively nested. Three innovative ideas are proposed. First, the nested pictorial structure finds a part configuration that is allowed to deform in geometric arrangement while being confined to be topologically nested. Second, we define nested features, which lend themselves to a better, more detailed accounting of the pixel data cost and describe occlusion in a principled way. Third, we develop the constrained distance transform, a variation of the generalized distance transform, to guarantee the topological nesting relations and to further enforce that parts do not overlap each other. We show that matching an optimal nested pictorial structure of K parts on an image of N pixels takes O(NK) time using dynamic programming and the constrained distance transform. In our MATLAB/C++ implementation, computing the globally optimal matching takes less than 0.1 seconds when K = 10 and N = 400 × 400. We demonstrate the usefulness of nested pictorial structures in the matching of objects with nested patterns, objects in occlusion, and objects that live in a context.
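The constrained distance transform is the paper's contribution, but the generalized distance transform it modifies is standard and worth seeing concretely: in 1-D it computes d[q] = min_v (q - v)^2 + f[v] in linear time via a lower envelope of parabolas. The paper's variant would additionally restrict the admissible v to enforce nesting and non-overlap; that restriction is not shown in this sketch.

```python
import numpy as np

def dt1d(f):
    """Lower-envelope generalized distance transform:
    d[q] = min_v (q - v)^2 + f[v], computed in O(n)."""
    n = len(f)
    d = np.empty(n)
    v = np.zeros(n, dtype=int)        # parabola locations in the envelope
    z = np.empty(n + 1)               # boundaries between parabolas
    k = 0
    z[0], z[1] = -np.inf, np.inf
    for q in range(1, n):
        # Intersection of the parabola rooted at q with the rightmost one.
        s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2 * q - 2 * v[k])
        while s <= z[k]:
            k -= 1
            s = ((f[q] + q * q) - (f[v[k]] + v[k] * v[k])) / (2 * q - 2 * v[k])
        k += 1
        v[k], z[k], z[k + 1] = q, s, np.inf
    k = 0
    for q in range(n):
        while z[k + 1] < q:
            k += 1
        d[q] = (q - v[k]) ** 2 + f[v[k]]
    return d

f = np.array([4.0, 0.0, 9.0, 1.0, 25.0])
brute = [min((q - v) ** 2 + f[v] for v in range(len(f))) for q in range(len(f))]
assert np.allclose(dt1d(f), brute)
```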

Steve Gu, Ying Zheng, Carlo Tomasi
Performance Capture of Interacting Characters with Handheld Kinects

We present an algorithm for marker-less performance capture of interacting humans using only three hand-held Kinect cameras. Our method reconstructs human skeletal poses, deforming surface geometry and camera poses for every time step of the depth video. Skeletal configurations and camera poses are found by solving a joint energy minimization problem which optimizes the alignment of RGBZ data from all cameras, as well as the alignment of human shape templates to the Kinect data. The energy function is based on a combination of geometric correspondence finding, implicit scene segmentation, and correspondence finding using image features. Only the combination of geometric and photometric correspondences and the integration of human pose and camera pose estimation enables reliable performance capture with only three sensors. As opposed to previous performance capture methods, our algorithm succeeds on general uncontrolled indoor scenes with potentially dynamic background, and it succeeds even if the cameras are moving.

Genzhi Ye, Yebin Liu, Nils Hasler, Xiangyang Ji, Qionghai Dai, Christian Theobalt
Dynamic Eye Movement Datasets and Learnt Saliency Models for Visual Action Recognition

Systems based on bag-of-words models, operating on image features collected at maxima of sparse interest point operators, have been extremely successful for both computer-based visual object and action recognition tasks. While the sparse, interest-point based approach to recognition is not inconsistent with visual processing in biological systems that operate in "saccade and fixate" regimes, the knowledge, methodology, and emphasis in the human and the computer vision communities remain sharply distinct. Here, we make three contributions aiming to bridge this gap. First, we complement existing state-of-the-art large-scale dynamic computer vision datasets like Hollywood-2 [1] and UCF Sports [2] with human eye movements collected under the ecological constraints of the visual action recognition task. To our knowledge, these are the first human eye tracking datasets of significant size to be collected for video (497,107 frames, each viewed by 16 subjects), unique in terms of their (a) large scale and computer vision relevance, (b) dynamic, video stimuli, and (c) task control, as opposed to free-viewing. Second, we introduce novel dynamic consistency and alignment models, which underline the remarkable stability of patterns of visual search among subjects. Third, we leverage the massive amounts of collected data to pursue studies and build automatic, end-to-end trainable computer vision systems based on human eye movements. Our studies not only shed light on the differences between computer vision spatio-temporal interest point sampling strategies and human fixations, and on their impact on visual recognition performance, but also demonstrate that human fixations can be accurately predicted and, when used in an end-to-end automatic system leveraging some of the most advanced computer vision practice, can lead to state-of-the-art results.

Stefan Mathe, Cristian Sminchisescu
Coherent Filtering: Detecting Coherent Motions from Crowd Clutters

Coherent motions, which describe the collective movements of individuals in a crowd, widely exist in physical and biological systems. Understanding their underlying priors and detecting various coherent motion patterns from background clutter have both scientific value and a wide range of practical applications, especially for crowd motion analysis. In this paper, we propose and study a prior of coherent motion called Coherent Neighbor Invariance, which characterizes the local spatiotemporal relationships of individuals in coherent motion. Based on the coherent neighbor invariance, we propose Coherent Filtering, a general technique for detecting coherent motion patterns in noisy time-series data. It can be effectively applied to data with different distributions at different scales in various real-world problems, where the environments may be sparse or extremely crowded with heavy noise. Experimental evaluation and comparison on synthetic and real data show the existence of Coherent Neighbor Invariance and the effectiveness of our Coherent Filtering.
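One way to picture the invariance prior, under assumptions: a point is coherent if most of its k nearest neighbors remain its neighbors in the next frame and those surviving neighbors move in a similar direction. The thresholds and the coherent_points scoring below are illustrative, not the paper's definition:

```python
import numpy as np

def knn_sets(P, k):
    """k-nearest-neighbor index sets for points P (n x 2)."""
    D = np.linalg.norm(P[:, None] - P[None, :], axis=-1)
    np.fill_diagonal(D, np.inf)
    return [set(np.argsort(D[i])[:k]) for i in range(len(P))]

def coherent_points(P_t, P_t1, k=8, keep=0.6, vel_cos=0.9):
    """Indices whose neighborhoods persist across frames and whose surviving
    neighbors move in a similar direction (illustrative thresholds)."""
    N0, N1 = knn_sets(P_t, k), knn_sets(P_t1, k)
    vel = P_t1 - P_t
    coherent = []
    for i in range(len(P_t)):
        inv = N0[i] & N1[i]                  # neighbors that remain neighbors
        if len(inv) < keep * k:
            continue
        vi = vel[i] / (np.linalg.norm(vel[i]) + 1e-12)
        vn = vel[list(inv)]
        vn = vn / (np.linalg.norm(vn, axis=1, keepdims=True) + 1e-12)
        if (vn @ vi).mean() >= vel_cos:      # velocity direction agreement
            coherent.append(i)
    return coherent

rng = np.random.default_rng(0)
group = rng.random((30, 2)) * 5
noise = rng.random((20, 2)) * 5 + [6.0, 0.0]
P_t = np.vstack([group, noise])
P_t1 = np.vstack([group + [0.5, 0.0],                    # coherent translation
                  noise + rng.normal(0, 0.5, (20, 2))])  # incoherent jitter
print(coherent_points(P_t, P_t1))                        # mostly indices < 30
```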

Bolei Zhou, Xiaoou Tang, Xiaogang Wang
Robust 3D Action Recognition with Random Occupancy Patterns

We study the problem of action recognition from depth sequences captured by depth cameras, where noise and occlusion are common because the sequences are captured with a single commodity camera. To deal with these issues, we extract semi-local features called random occupancy pattern (ROP) features, which employ a novel sampling scheme that effectively explores an extremely large sampling space. We also utilize a sparse coding approach to robustly encode these features. The proposed approach does not require careful parameter tuning. Its training is very fast due to the use of high-dimensional integral images, and it is robust to occlusions. Our technique is evaluated on two datasets captured by commodity depth cameras: an action dataset and a hand gesture dataset. Our classification results are superior to those obtained by state-of-the-art approaches on both datasets.
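The fast training rests on integral images in high dimensions; for a depth sequence, the natural instance is a 3-D integral volume, which answers any subvolume occupancy query in constant time. The toy occupancy volume and box coordinates below are assumptions:

```python
import numpy as np

def integral_volume(occ):
    """3-D integral image of a binary occupancy volume (x, y, t)."""
    I = occ.astype(np.int64)
    for ax in range(3):
        I = I.cumsum(axis=ax)
    return np.pad(I, ((1, 0), (1, 0), (1, 0)))  # zero border simplifies queries

def box_sum(I, x0, x1, y0, y1, t0, t1):
    """Occupancy count in [x0,x1) x [y0,y1) x [t0,t1) by inclusion-exclusion."""
    return (I[x1, y1, t1] - I[x0, y1, t1] - I[x1, y0, t1] - I[x1, y1, t0]
            + I[x0, y0, t1] + I[x0, y1, t0] + I[x1, y0, t0] - I[x0, y0, t0])

rng = np.random.default_rng(0)
occ = rng.random((32, 32, 20)) < 0.1             # toy occupancy volume
I = integral_volume(occ)
# One random occupancy pattern feature: occupancy of a random subvolume.
x0, y0, t0 = rng.integers(0, 16, size=3)
print(box_sum(I, x0, x0 + 8, y0, y0 + 8, t0, t0 + 4))
```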

Jiang Wang, Zicheng Liu, Jan Chorowski, Zhuoyuan Chen, Ying Wu
Backmatter
Metadata
Title: Computer Vision – ECCV 2012
Edited by: Andrew Fitzgibbon, Svetlana Lazebnik, Pietro Perona, Yoichi Sato, Cordelia Schmid
Copyright Year: 2012
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-642-33709-3
Print ISBN: 978-3-642-33708-6
DOI: https://doi.org/10.1007/978-3-642-33709-3