2010 | Book

Computer Vision – ECCV 2010

11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part IV

Editors: Kostas Daniilidis, Petros Maragos, Nikos Paragios

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


About this book

The 2010 edition of the European Conference on Computer Vision was held in Heraklion, Crete. The call for papers attracted an absolute record of 1,174 submissions. We describe here the selection of the accepted papers: Thirty-eight Area Chairs were selected, coming from Europe (18), USA and Canada (16), and Asia (4). Their selection was based on the following criteria: (1) Researchers who had served at least two times as Area Chairs within the past two years at major vision conferences were excluded; (2) Researchers who served as Area Chairs at the 2010 Computer Vision and Pattern Recognition were also excluded (exception: ECCV 2012 Program Chairs); (3) Minimization of overlap introduced by Area Chairs being former students and advisors; (4) 20% of the Area Chairs had never served before in a major conference; (5) The Area Chair selection process made all possible efforts to achieve a reasonable geographic distribution between countries, thematic areas and trends in computer vision. The Program Chairs assigned each Area Chair between 28 and 32 papers. Based on paper content, the Area Chair recommended up to seven potential reviewers per paper. Such assignment was made using all reviewers in the database including the conflicting ones. The Program Chairs manually entered the missing conflict domains of approximately 300 reviewers. Based on the recommendation of the Area Chairs, three reviewers were selected per paper (with at least one being among the top three suggestions), with 99.

Table of Contents


Spotlights and Posters W1

Kernel Sparse Representation for Image Classification and Face Recognition

Recent research has shown the effectiveness of using sparse coding (Sc) to solve many computer vision problems. Motivated by the fact that the kernel trick can capture the nonlinear similarity of features, which may reduce the feature quantization error and boost the sparse coding performance, we propose Kernel Sparse Representation (KSR). KSR is essentially the sparse coding technique in a high-dimensional feature space mapped by an implicit mapping function. We apply KSR to both image classification and face recognition. By incorporating KSR into Spatial Pyramid Matching (SPM), we propose KSRSPM for image classification. KSRSPM can further reduce the information loss in the feature quantization step compared with Spatial Pyramid Matching using Sparse Coding (ScSPM). KSRSPM can be regarded both as a generalization of Efficient Match Kernel (EMK) and as an extension of ScSPM. Compared with sparse coding, KSR can learn more discriminative sparse codes for face recognition. Extensive experimental results show that KSR outperforms sparse coding and EMK, and achieves state-of-the-art performance for image classification and face recognition on publicly available datasets.

Shenghua Gao, Ivor Wai-Hung Tsang, Liang-Tien Chia
Every Picture Tells a Story: Generating Sentences from Images

Humans can prepare concise descriptions of pictures, focusing on what they find important. We demonstrate that automatic methods can do so too. We describe a system that can compute a score linking an image to a sentence. This score can be used to attach a descriptive sentence to a given image, or to obtain images that illustrate a given sentence. The score is obtained by comparing an estimate of meaning obtained from the image to one obtained from the sentence. Each estimate of meaning comes from a discriminative procedure that is learned using data. We evaluate on a novel dataset consisting of human-annotated images. While our underlying estimate of meaning is impoverished, it is sufficient to produce very good quantitative results, evaluated with a novel score that can account for synecdoche.

Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, David Forsyth
An Eye Fixation Database for Saliency Detection in Images

To learn the preferential visual attention given by humans to specific image content, we present NUSEF, an eye fixation database compiled from a pool of 758 images and 75 subjects. Eye fixations are an excellent modality to learn semantics-driven human understanding of images, which is vastly different from feature-driven approaches employed by saliency computation algorithms. The database comprises fixation patterns acquired using an eye-tracker, as subjects free-viewed images corresponding to many semantic categories such as [...] (human and mammal), [...]). The consistent presence of fixation clusters around specific image regions confirms that visual attention is not subjective, but is directed towards objects and object-interactions.

We then show how the fixation clusters can be exploited for enhancing image understanding, by using our eye fixation database in an active image segmentation application. Apart from proposing a mechanism to automatically determine characteristic fixation seeds for segmentation, we show that the use of fixation seeds generated from multiple fixation clusters on the object can lead to a 10% improvement in segmentation performance over the state-of-the-art.

Subramanian Ramanathan, Harish Katti, Nicu Sebe, Mohan Kankanhalli, Tat-Seng Chua
Face Image Relighting using Locally Constrained Global Optimization

A face image relighting method using locally constrained global optimization is presented in this paper. Based on the empirical fact that common radiance environments are locally homogeneous, we propose to use an optimization-based solution in which local linear adjustments are performed on overlapping windows throughout the input image. As such, local textures and global smoothness of the input image can be preserved simultaneously when applying the illumination transformation. Experimental results demonstrate the effectiveness of the proposed method compared with some previous approaches.

Jiansheng Chen, Guangda Su, Jinping He, Shenglan Ben
Correlation-Based Intrinsic Image Extraction from a Single Image

Intrinsic images represent the underlying properties of a scene such as illumination (shading) and surface reflectance. Extracting intrinsic images is a challenging, ill-posed problem. Human performance on tasks such as shadow detection and shape-from-shading is improved by adding colour and texture to surfaces. In particular, when a surface is painted with a textured pattern, correlations between local mean luminance and local luminance amplitude promote the interpretation of luminance variations as illumination changes. Based on this finding, we propose a novel feature, local luminance amplitude, to separate illumination and reflectance, and a framework to integrate this cue with hue and texture to extract intrinsic images. The algorithm uses steerable filters to separate images into frequency and orientation components and constructs shading and reflectance images from weighted combinations of these components. Weights are determined by correlations between corresponding variations in local luminance, local amplitude, colour and texture. The intrinsic images are further refined by ensuring the consistency of local texture elements. We test this method on surfaces photographed under different lighting conditions. The effectiveness of the algorithm is demonstrated by the correlation between our intrinsic images and ground truth shading and reflectance data. Luminance amplitude was found to be a useful cue. Results are also presented for natural images.

Xiaoyue Jiang, Andrew J. Schofield, Jeremy L. Wyatt
ADICT: Accurate Direct and Inverse Color Transformation

A color transfer function describes the relationship between the input and the output colors of a device. Computing this function is difficult when devices do not follow traditionally coveted properties like channel independency or color constancy, as is the case with most commodity capture and display devices (like projectors, cameras and printers). In this paper we present a novel representation for the color transfer function of any device, using higher-dimensional Bézier patches, that does not rely on any restrictive assumptions and hence can handle devices that do not behave in an ideal manner. Using this representation and a novel reparametrization technique, we design a color transformation method that is more accurate and free of local artifacts compared to existing color transformation methods. We demonstrate this method’s generality by using it for color management on a variety of input and output devices. Our method shows significant improvement in the appearance of seamlessness when used in the particularly demanding application of color matching across multi-projector displays or multi-camera systems. Finally we demonstrate that our color transformation method can be performed efficiently using a real-time GPU implementation.

Behzad Sajadi, Maxim Lazarov, Aditi Majumder
Real-Time Specular Highlight Removal Using Bilateral Filtering

In this paper, we propose a simple but effective specular highlight removal method using a single input image. Our method is based on a key observation: the maximum fraction of the diffuse color component (the so-called maximum diffuse chromaticity in the literature) in local patches in color images changes smoothly. Using this property, we can estimate the maximum diffuse chromaticity values of the specular pixels by directly applying a low-pass filter to the maximum fraction of the color components of the original image, such that the maximum diffuse chromaticity values can be propagated from the diffuse pixels to the specular pixels. The diffuse color at each pixel can then be computed as a nonlinear function of the estimated maximum diffuse chromaticity. Our method can be directly extended to multi-color surfaces if edge-preserving filters (e.g., the bilateral filter) are used such that the smoothing can be guided by the maximum diffuse chromaticity. However, the maximum diffuse chromaticity is itself the quantity to be estimated, so we present an approximation and demonstrate its effectiveness. Recent developments in fast bilateral filtering techniques enable our method to run over [...]× faster than the state-of-the-art on a standard CPU and differentiate our method from previous work.

Qingxiong Yang, Shengnan Wang, Narendra Ahuja
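
The quantity at the centre of the abstract above, the maximum fraction of the color components of a pixel, can be illustrated with a minimal sketch. The function names, the box filter standing in for the low-pass filter, and the convention for black pixels are all assumptions made for illustration; this is not the authors' implementation, which relies on fast bilateral filtering.

```python
def max_chromaticity(pixel):
    """Largest fraction that any single channel contributes to an (r, g, b) pixel."""
    total = sum(pixel)
    if total == 0:
        return 1.0 / 3.0  # assumed convention for black pixels
    return max(pixel) / total


def box_filter_1d(values, radius=1):
    """Moving average over a row; a stand-in (assumption) for the low-pass filter."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - radius)
        hi = min(len(values), i + radius + 1)
        window = values[lo:hi]
        out.append(sum(window) / len(window))
    return out


# One image row: a diffuse red pixel, a pixel washed out by a highlight, a diffuse red pixel.
row = [(200, 50, 50), (255, 230, 230), (200, 50, 50)]
sigma = [max_chromaticity(p) for p in row]
smoothed = box_filter_1d(sigma, radius=1)
```

In this toy row the highlight pixel has a lower maximum chromaticity than its diffuse neighbours, and smoothing pulls its value back towards theirs, which is the propagation effect the abstract describes.
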
Learning Artistic Lighting Template from Portrait Photographs

This paper presents a method for learning an artistic portrait lighting template from a dataset of artistic and daily portrait photographs. The learned template can be used for (1) classification of artistic and daily portrait photographs, and (2) numerical aesthetic quality assessment of these photographs in lighting usage. For learning the template, we adopt Haar-like local lighting contrast features, which are then extracted from pre-defined areas on frontal faces, and selected to form a log-linear model using a stepwise feature pursuit algorithm. Our learned template corresponds well to some typical studio styles of portrait photography. With the template, the classification and assessment tasks are achieved under probability ratio test formulations. On our dataset composed of 350 artistic and 500 daily photographs, we achieve an 89.5% classification accuracy in cross-validated tests, and the assessment model assigns reasonable numerical scores based on portraits’ aesthetic quality in lighting.

Xin Jin, Mingtian Zhao, Xiaowu Chen, Qinping Zhao, Song-Chun Zhu
Photometric Stereo from Maximum Feasible Lambertian Reflections

We present a Lambertian photometric stereo algorithm that is robust to specularities and shadows and is based on a maximum feasible subsystem (Max FS) framework. A Big-M method is developed to obtain the maximum subset of images that satisfy the Lambertian constraint among the whole set of captured photometric stereo images, which include non-Lambertian reflections such as specularities and shadows. Our algorithm employs purely algebraic pixel-wise optimization without relying on probabilistic/physical reasoning or initialization, and it guarantees global optimality. It can be applied to image sets with the number of images ranging from four to hundreds, and we show that the computation time is reasonably short for a medium number of images (10~100). Experiments are carried out with various objects to demonstrate the effectiveness of the algorithm.

Chanki Yu, Yongduek Seo, Sang Wook Lee
Part-Based Feature Synthesis for Human Detection

We introduce a new approach for learning part-based object detection through feature synthesis. Our method consists of an iterative process of feature generation and pruning. A feature generation procedure is presented in which basic part-based features are developed into a feature hierarchy using operators for part localization, part refining and part combination. Feature pruning is done using a new feature selection algorithm for linear SVM, termed Predictive Feature Selection (PFS), which is governed by weight prediction. The algorithm makes it possible to choose from [...] features in an efficient but accurate manner. We analyze the validity and behavior of PFS and empirically demonstrate its speed and accuracy advantages over relevant competitors. We present an empirical evaluation of our method on three human detection datasets including the current de-facto benchmarks (the INRIA and Caltech pedestrian datasets) and a new challenging dataset of children images in difficult poses. The evaluation suggests that our approach is on a par with the best current methods and advances the state-of-the-art on the Caltech pedestrian training dataset.

Aharon Bar-Hillel, Dan Levi, Eyal Krupka, Chen Goldberg
Improving the Fisher Kernel for Large-Scale Image Classification

The Fisher kernel (FK) is a generic framework which combines the benefits of generative and discriminative approaches. In the context of image classification the FK was shown to extend the popular bag-of-visual-words (BOV) by going beyond count statistics. However, in practice, this enriched representation has not yet shown its superiority over the BOV. In the first part we show that with several well-motivated modifications over the original framework we can boost the accuracy of the FK. On PASCAL VOC 2007 we increase the Average Precision (AP) from 47.9% to 58.3%. Similarly, we demonstrate state-of-the-art accuracy on CalTech 256. A major advantage is that these results are obtained using only SIFT descriptors and costless linear classifiers. Equipped with this representation, we can now explore image classification on a larger scale. In the second part, as an application, we compare two abundant resources of labeled images to learn classifiers: ImageNet and Flickr groups. In an evaluation involving hundreds of thousands of training images we show that classifiers learned on Flickr groups perform surprisingly well (although they were not intended for this purpose) and that they can complement classifiers learned on more carefully annotated datasets.

Florent Perronnin, Jorge Sánchez, Thomas Mensink
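
The abstract above does not name its modifications. Two normalizations that are commonly applied to Fisher vectors before a linear classifier are a signed power normalization followed by L2 normalization, and the sketch below shows only those two steps under that assumption. The function names and the exponent `alpha=0.5` are illustrative choices, not values taken from this listing.

```python
import math


def power_normalize(vec, alpha=0.5):
    """Signed power normalization: sign(v) * |v| ** alpha for each component."""
    return [math.copysign(abs(v) ** alpha, v) for v in vec]


def l2_normalize(vec):
    """Scale the vector to unit Euclidean norm (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def normalize_fisher_vector(vec, alpha=0.5):
    """Apply power normalization, then L2 normalization."""
    return l2_normalize(power_normalize(vec, alpha))


# Toy stand-in for a Fisher vector; real ones have thousands of dimensions.
fv = [4.0, -9.0, 0.0]
out = normalize_fisher_vector(fv)
```

The power step compresses large components while keeping their sign, and the L2 step makes dot products between images comparable, which is what a linear classifier operates on.
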
Max-Margin Dictionary Learning for Multiclass Image Categorization

Visual dictionary learning and base (binary) classifier training are two basic problems for the recently most popular image categorization framework, which is based on the bag-of-visual-terms (BOV) models and multiclass SVM classifiers. In this paper, we study new algorithms to improve performance of this framework from these two aspects. Typically SVM classifiers are trained with dictionaries fixed, and as a result the traditional loss function can only be minimized with respect to hyperplane parameters ([...]). We propose a novel loss function for a binary classifier, which links the hinge-loss term with dictionary learning. By doing so, we can further optimize the loss function with respect to the dictionary parameters. Thus, this framework is able to further increase margins of binary classifiers, and consequently decrease the error bound of the aggregated classifier. On two benchmark datasets, Graz [1] and the fifteen scene category dataset [2], our experimental results significantly outperform state-of-the-art methods.

Xiao-Chen Lian, Zhiwei Li, Bao-Liang Lu, Lei Zhang
Towards Optimal Naive Bayes Nearest Neighbor

Naive Bayes Nearest Neighbor (NBNN) is a feature-based image classifier that achieves an impressive degree of accuracy [1] by exploiting ‘Image-to-Class’ distances and by avoiding quantization of local image descriptors. It is based on the hypothesis that each local descriptor is drawn from a class-dependent probability measure. The density of the latter is estimated by the non-parametric kernel estimator, which is further simplified under the assumption that the normalization factor is class-independent. While leading to significant simplification, the assumption underlying the original NBNN is too restrictive and considerably degrades its generalization ability. The goal of this paper is to address this issue.

As we relax the incriminated assumption we are faced with a parameter selection problem that we solve by hinge-loss minimization. We also show that our modified formulation naturally generalizes to optimal combinations of feature types. Experiments conducted on several datasets show that the gain over the original NBNN may attain up to 20 percentage points. We also take advantage of the linearity of optimal NBNN to perform classification by detection through efficient sub-window search [2], with yet another performance gain. As a result, our classifier outperforms — in terms of misclassification error — methods based on support vector machine and bags of quantized features on some datasets.

Régis Behmo, Paul Marcombes, Arnak Dalalyan, Véronique Prinet
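
The 'Image-to-Class' distance mentioned in the abstract above can be sketched as follows. This is a minimal illustration of the basic NBNN decision rule only (sum of nearest-neighbor squared distances per class, no quantization); it does not include the class-dependent parameters that the paper learns by hinge-loss minimization. All names and the toy 2D descriptors are assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def image_to_class_distance(descriptors, class_descriptors):
    """Sum, over the query descriptors, of the distance to the nearest class descriptor."""
    return sum(min(sq_dist(d, c) for c in class_descriptors) for d in descriptors)


def nbnn_classify(descriptors, classes):
    """Return the class label with the smallest image-to-class distance."""
    return min(classes, key=lambda name: image_to_class_distance(descriptors, classes[name]))


# Toy 2D "descriptors"; real local descriptors (e.g. SIFT) are 128-dimensional.
classes = {
    "cat": [(0.0, 0.0), (0.1, 0.2)],
    "dog": [(1.0, 1.0), (0.9, 1.1)],
}
query = [(0.05, 0.1), (0.2, 0.1)]
label = nbnn_classify(query, classes)
```

Each query descriptor is compared against the pooled descriptors of a whole class rather than against individual training images, which is what distinguishes image-to-class from image-to-image matching.
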
Weakly Supervised Classification of Objects in Images Using Soft Random Forests

The development of robust classification models is among the important issues in computer vision. This paper deals with weakly supervised learning that generalizes the supervised and semi-supervised learning. In weakly supervised learning training data are given as the priors of each class for each sample. We first propose a weakly supervised strategy for learning soft decision trees. Besides, the introduction of class priors for training samples instead of hard class labels makes natural the formulation of an iterative learning procedure. We report experiments for UCI object recognition datasets. These experiments show that recognition performance close to the supervised learning can be expected using the proposed framework. Besides, an application to semi-supervised learning, which can be regarded as a particular case of weakly supervised learning, further demonstrates the pertinence of the contribution. We further discuss the relevance of weakly supervised learning for computer vision applications.

Riwal Lefort, Ronan Fablet, Jean-Marc Boucher
Learning What and How of Contextual Models for Scene Labeling

We present a data-driven approach to predict the importance of edges and construct a Markov network for image analysis based on statistical models of global and local image features. We also address the coupled problem of predicting the feature weights associated with each edge of a Markov network for evaluation of context. Experimental results indicate that this scene-dependent structure construction model eliminates spurious edges and improves performance over fully-connected and neighborhood-connected Markov networks.

Arpit Jain, Abhinav Gupta, Larry S. Davis
Adapting Visual Category Models to New Domains

Domain adaptation is an important emerging topic in computer vision. In this paper, we present one of the first studies of domain shift in the context of object recognition. We introduce a method that adapts object models acquired in a particular visual domain to new imaging conditions by learning a transformation that minimizes the effect of domain-induced changes in the feature distribution. The transformation is learned in a supervised manner and can be applied to categories for which there are no labeled examples in the new domain. While we focus our evaluation on object recognition tasks, the transform-based adaptation technique we develop is general and could be applied to non-image data. Another contribution is a new multi-domain object database, freely available for download. We experimentally demonstrate the ability of our method to improve recognition on categories with few or no target domain labels and moderate to large changes in the imaging conditions.

Kate Saenko, Brian Kulis, Mario Fritz, Trevor Darrell
Improved Human Parsing with a Full Relational Model

We show quantitative evidence that a full relational model of the body performs better at upper body parsing than the standard tree model, despite the need to adopt approximate inference and learning procedures. Our method uses an approximate search for inference, and an approximate structure learning method to learn. We compare our method to state-of-the-art methods on our dataset (which depicts a wide range of poses), on the standard Buffy dataset, and on the reduced PASCAL dataset published recently. Our results suggest that the Buffy dataset overemphasizes poses where the arms hang down, and that leads to generalization problems.

Duan Tran, David Forsyth
Multiresolution Models for Object Detection

Most current approaches to recognition aim to be scale-invariant. However, the cues available for recognizing a 300 pixel tall object are qualitatively different from those for recognizing a 3 pixel tall object. We argue that for sensors with finite resolution, one should instead use scale-variant, or multiresolution representations that adapt in complexity to the size of a putative detection window. We describe a multiresolution model that acts as a deformable part-based model when scoring large instances and a rigid template when scoring small instances. We also examine the interplay of resolution and context, and demonstrate that context is most helpful for detecting low-resolution instances when local models are limited in discriminative power. We demonstrate impressive results on the Caltech Pedestrian benchmark, which contains object instances at a wide range of scales. Whereas recent state-of-the-art methods demonstrate missed detection rates of 86%-37% at 1 false-positive-per-image, our multiresolution model reduces the rate to 29%.

Dennis Park, Deva Ramanan, Charless Fowlkes
Accurate Image Localization Based on Google Maps Street View

Finding an image’s exact GPS location is a challenging computer vision problem that has many real-world applications. In this paper, we address the problem of finding the GPS location of images with an accuracy which is comparable to hand-held GPS devices. We leverage a structured data set of about 100,000 images built from Google Maps Street View as the reference images. We propose a localization method in which the SIFT descriptors of the detected SIFT interest points in the reference images are indexed using a tree. In order to localize a query image, the tree is queried using the detected SIFT descriptors in the query image. A novel GPS-tag-based pruning method removes the less reliable descriptors. Then, a smoothing step with an associated voting scheme is utilized; this allows each query descriptor to vote for the location its nearest neighbor belongs to, in order to accurately localize the query image. A parameter called Confidence of Localization, which is based on the Kurtosis of the distribution of votes, is defined to determine how reliable the localization of a particular image is. In addition, we propose a novel approach to localize groups of images accurately in a hierarchical manner. First, each image is localized individually; then, the rest of the images in the group are matched against images in the neighboring area of the found first match. The final location is determined based on the Confidence of Localization parameter. The proposed image group localization method can deal with very unclear queries which are not capable of being geolocated individually.

Amir Roshan Zamir, Mubarak Shah
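
The abstract above says the Confidence of Localization is based on the kurtosis of the vote distribution but gives no formula. The sketch below computes the standard (non-excess) sample kurtosis of a list of vote counts, one count per candidate location, to show why a sharply peaked vote distribution scores higher than a flat one. The function name, the toy vote counts, and the handling of a constant distribution are assumptions.

```python
def kurtosis(values):
    """Fourth central moment divided by the squared variance (non-excess kurtosis)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    if m2 == 0:
        return 0.0  # assumed convention when every location gets the same votes
    return m4 / (m2 * m2)


# Votes per candidate location: one clear winner versus no clear winner.
peaked_votes = [0, 1, 0, 0, 20, 1, 0, 0, 1, 0]
flat_votes = [2, 3, 2, 3, 2, 3, 2, 3, 2, 1]

peaked_score = kurtosis(peaked_votes)
flat_score = kurtosis(flat_votes)
```

A single dominant location produces a heavy-tailed vote histogram and therefore a high kurtosis, so a threshold on this value is one plausible way to flag unreliable localizations.
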
A Minimal Case Solution to the Calibrated Relative Pose Problem for the Case of Two Known Orientation Angles

In this paper we present a novel minimal case solution to the calibrated relative pose problem using 3 point correspondences for the case of two known orientation angles. This case is relevant when a camera is coupled with an inertial measurement unit (IMU) and it recently gained importance with the omnipresence of Smartphones (iPhone, Nokia N900) that are equipped with accelerometers to measure the gravity normal. Similar to the 5-point (6-point), 7-point, and 8-point algorithm for computing the essential matrix in the unconstrained case, we derive a 3-point, 4-point, and 5-point algorithm for the special case of two known orientation angles. We investigate degenerate conditions and show that the new 3-point algorithm can cope with planes and even collinear points. We will show a detailed analysis and comparison on synthetic data and present results on cell phone images. As an additional application we demonstrate the algorithm on relative pose estimation for a micro aerial vehicle’s (MAV) camera-IMU system.

Friedrich Fraundorfer, Petri Tanskanen, Marc Pollefeys
Bilinear Factorization via Augmented Lagrange Multipliers

This paper presents a unified approach to solve different bilinear factorization problems in Computer Vision in the presence of missing data in the measurements. The problem is formulated as a constrained optimization problem where one of the factors is constrained to lie on a specific manifold. To achieve this, we introduce an equivalent reformulation of the bilinear factorization problem. This reformulation decouples the core bilinear aspect from the manifold specificity. We then tackle the resulting constrained optimization problem with Bilinear factorization via Augmented Lagrange Multipliers (BALM). The mechanics of our algorithm are such that only a projector onto the manifold constraint is needed. That is the strength and the novelty of our approach: it can handle seamlessly different Computer Vision problems. We present experiments and results for two popular factorization problems: Non-rigid Structure from Motion and Photometric Stereo.

Alessio Del Bue, João Xavier, Lourdes Agapito, Marco Paladini
Piecewise Quadratic Reconstruction of Non-Rigid Surfaces from Monocular Sequences

In this paper we present a new method for the 3D reconstruction of highly deforming surfaces (for instance a flag waving in the wind) viewed by a single orthographic camera. We assume that the surface is described by a set of feature points which are tracked along an image sequence. Most non-rigid structure from motion algorithms assume a deformation model where a rigid mean shape component accounts for most of the motion and the deformation modes are small deviations from it. However, in the case of strongly deforming objects, the deformations become more complex and a global model will often fail to explain the intricate deformations which are no longer small linear deviations from a strong mean component. Our proposed algorithm divides the surface into overlapping patches, reconstructs each of these patches individually using a quadratic deformation model and finally registers them imposing the constraint that points shared by patches must correspond to the same 3D points in space. We show good results on challenging motion capture and real video sequences with strong deformations where global methods fail to achieve good reconstructions.

João Fayad, Lourdes Agapito, Alessio Del Bue
Extrinsic Camera Calibration Using Multiple Reflections

This paper presents a method for determining the six-degree-of-freedom (DOF) transformation between a camera and a base frame of interest, while concurrently estimating the 3D base-frame coordinates of unknown point features in the scene. The camera observes the reflections of fiducial points, whose base-frame coordinates are known, and reconstruction points, whose base-frame coordinates are unknown. In this paper, we examine the case in which, due to visibility constraints, none of the points are directly viewed by the camera, but instead are seen via reflection in multiple planar mirrors. Exploiting these measurements, we compute the camera-to-base transformation and the 3D base-frame coordinates of the unknown reconstruction points, without a priori knowledge of the mirror sizes, motions, or placements with respect to the camera. Subsequently, we refine the analytical solution using a maximum-likelihood estimator (MLE), to obtain high-accuracy estimates of the camera-to-base transformation, the mirror configurations for each image, and the 3D coordinates of the reconstruction points in the base frame. We validate the accuracy and correctness of our method with simulations and real-world experiments.

Joel A. Hesch, Anastasios I. Mourikis, Stergios I. Roumeliotis
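
The geometric building block behind the abstract above is the reflection of a 3D point in a planar mirror, applied once per mirror. The sketch below shows only that forward model, with each mirror given as a unit normal `n` and offset `d` describing the plane n·x = d. The representation, function names, and example mirrors are assumptions; the paper's actual problem is the inverse one of recovering the camera pose without knowing the mirrors.

```python
def reflect_point(p, normal, d):
    """Reflect point p across the plane {x : normal . x = d}; normal must be unit length."""
    signed_dist = sum(n * x for n, x in zip(normal, p)) - d
    return tuple(x - 2.0 * signed_dist * n for x, n in zip(p, normal))


def reflect_through(p, mirrors):
    """Apply successive reflections; mirrors is a sequence of (normal, d) pairs."""
    for normal, d in mirrors:
        p = reflect_point(p, normal, d)
    return p


# Two hypothetical mirrors: the plane x = 2 and the plane y = -1.
mirrors = [((1.0, 0.0, 0.0), 2.0), ((0.0, 1.0, 0.0), -1.0)]
point = (0.5, 3.0, 4.0)
virtual_point = reflect_through(point, mirrors)
```

Reflecting twice in the same mirror returns the original point, which is a convenient sanity check for any implementation of this forward model.
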
Probabilistic Deformable Surface Tracking from Multiple Videos

In this paper, we address the problem of tracking the temporal evolution of arbitrary shapes observed in multi-camera setups. This is motivated by the ever growing number of applications that require consistent shape information along temporal sequences. The approach we propose considers a temporal sequence of independently reconstructed surfaces and iteratively deforms a reference mesh to fit these observations. To effectively cope with outlying and missing geometry, we introduce a novel probabilistic mesh deformation framework. Using generic local rigidity priors and accounting for the uncertainty in the data acquisition process, this framework effectively handles missing data, relatively large reconstruction artefacts and multiple objects. Extensive experiments demonstrate the effectiveness and robustness of the method on various 4D datasets.

Cedric Cagniart, Edmond Boyer, Slobodan Ilic
Theory of Optimal View Interpolation with Depth Inaccuracy

Depth inaccuracy greatly affects the quality of free-viewpoint image synthesis. A theoretical framework for a simplified view interpolation setup to quantitatively analyze the effect of depth inaccuracy and provide a principled optimization scheme based on the mean squared error metric is proposed. The theory clarifies that if the probabilistic distribution of disparity errors is available, optimal view interpolation that outperforms conventional linear interpolation can be achieved. It is also revealed that under specific conditions, the optimal interpolation converges to linear interpolation. Furthermore, appropriate band-limitation combined with linear interpolation is also discussed, leading to an easy algorithm that achieves near-optimal quality. Experimental results using real scenes are also presented to confirm this theory.

Keita Takahashi
Practical Methods for Convex Multi-view Reconstruction

Globally optimal formulations of geometric computer vision problems comprise an exciting topic in multiple view geometry. These approaches are unaffected by the quality of a provided initial solution, can directly identify outliers in the given data, and provide a better theoretical understanding of geometric vision problems. The disadvantages of these methods are the substantial computational costs, which limit the tractable problem size significantly, and the tendency of reducing a particular geometric problem to one of the standard programs well-understood in convex optimization. We select a view on these geometric vision tasks inspired by recent progress made on other low-level vision problems using very simple (and easy to parallelize) methods. Our view also enables the utilization of geometrically more meaningful cost functions, which cannot be represented by one of the standard optimization problems. We also demonstrate in numerical experiments that our proposed method scales better with respect to the problem size than standard optimization codes.

Christopher Zach, Marc Pollefeys
Building Rome on a Cloudless Day

This paper introduces an approach for dense 3D reconstruction from unregistered Internet-scale photo collections with about 3 million images within the span of a day on a single PC (“cloudless”). Our method advances image clustering, stereo, stereo fusion and structure from motion to achieve high computational performance. We leverage geometric and appearance constraints to obtain a highly parallel implementation on modern graphics processors and multi-core architectures. This leads to two orders of magnitude higher performance on an order of magnitude larger dataset than competing state-of-the-art approaches.

Jan-Michael Frahm, Pierre Fite-Georgel, David Gallup, Tim Johnson, Rahul Raguram, Changchang Wu, Yi-Hung Jen, Enrique Dunn, Brian Clipp, Svetlana Lazebnik, Marc Pollefeys
Camera Pose Estimation Using Images of Planar Mirror Reflections

The image of a planar mirror reflection (IPMR) can be interpreted as a virtual view of the scene, acquired by a camera with a pose symmetric to the pose of the real camera with respect to the mirror plane. The epipolar geometry of virtual views associated with different IPMRs is well understood, and it is possible to recover the camera motion and perform 3D scene reconstruction by applying standard structure-from-motion methods that use image correspondences as input. In this article we address the problem of estimating the pose of the real camera, as well as the positions of the mirror plane, by assuming that the rigid motion between N virtual views induced by planar mirror reflections is known. The solution of this problem enables the registration of objects lying outside the camera field-of-view, which can have important applications in domains like non-overlapping camera network calibration and robot vision. We show that the positions of the mirror planes can be uniquely determined by solving a system of linear equations. This makes it possible to estimate the pose of the real camera in a straightforward closed-form manner using a minimum of N = 3 virtual views. Both synthetic tests and real experiments show the superiority of our approach with respect to current state-of-the-art methods.

Rui Rodrigues, João P. Barreto, Urbano Nunes
Element-Wise Factorization for N-View Projective Reconstruction

Sturm-Triggs iteration is a standard method for solving the projective factorization problem. Like other iterative algorithms, this method suffers from some common drawbacks such as requiring a good initialization, the iteration may not converge or only converge to a local minimum, etc. None of the published works can offer any sort of global optimality guarantee to the problem. In this paper, an optimal solution to projective factorization for structure and motion is presented, based on the same principle of low-rank factorization. Instead of formulating the problem as matrix factorization, we recast it as element-wise factorization, leading to a convenient and efficient semi-definite program formulation. Our method is thus global, where no initial point is needed, and a globally-optimal solution can be found (up to some relaxation gap). Unlike traditional projective factorization, our method can handle real-world difficult cases like missing data or outliers easily, and all in a unified manner. Extensive experiments on both synthetic and real image data show comparable or superior results compared with existing methods.

Yuchao Dai, Hongdong Li, Mingyi He
Learning Relations among Movie Characters: A Social Network Perspective

If you have ever watched movies or television shows, you know how easy it is to tell the good characters from the bad ones. Little, however, is known “whether” or “how” computers can achieve such high-level understanding of movies. In this paper, we take the first step towards learning the relations among movie characters using visual and auditory cues. Specifically, we use support vector regression to estimate local characterization of adverseness at the scene level. Such local properties are then synthesized via statistical learning based on Gaussian processes to derive the affinity between the movie characters. Once the affinity is learned, we perform social network analysis to find communities of characters and identify the leader of each community. We experimentally demonstrate that the relations among characters can be determined with reasonable accuracy from the movie content.

Lei Ding, Alper Yilmaz

Scene and Object Recognition

What, Where and How Many? Combining Object Detectors and CRFs

Computer vision algorithms for individual tasks such as object recognition, detection and segmentation have shown impressive results in the recent past. The next challenge is to integrate all these algorithms and address the problem of scene understanding. This paper is a step towards this goal. We present a probabilistic framework for reasoning about regions, objects, and their attributes such as object class, location, and spatial extent. Our model is a Conditional Random Field defined on pixels, segments and objects. We define a global energy function for the model, which combines results from sliding window detectors, and low-level pixel-based unary and pairwise relations. One of our primary contributions is to show that this energy function can be solved efficiently. Experimental results show that our model achieves significant improvement over the baseline methods on CamVid and PASCAL VOC.

Ľubor Ladický, Paul Sturgess, Karteek Alahari, Chris Russell, Philip H. S. Torr
Visual Recognition with Humans in the Loop

We present an interactive, hybrid human-computer method for object classification. The method applies to classes of objects that are recognizable by people with appropriate expertise (e.g., animal species or airplane model), but not (in general) by people without such expertise. It can be seen as a visual version of the 20 questions game, where questions based on simple visual attributes are posed interactively. The goal is to identify the true class while minimizing the number of questions asked, using the visual content of the image. We introduce a general framework for incorporating almost any off-the-shelf multi-class object recognition algorithm into the visual 20 questions game, and provide methodologies to account for imperfect user responses and unreliable computer vision algorithms. We evaluate our methods on Birds-200, a difficult dataset of 200 tightly-related bird species, and on the Animals With Attributes dataset. Our results demonstrate that incorporating user input drives up recognition accuracy to levels that are good enough for practical applications, while at the same time, computer vision reduces the amount of human interaction required.

Steve Branson, Catherine Wah, Florian Schroff, Boris Babenko, Peter Welinder, Pietro Perona, Serge Belongie
Localizing Objects While Learning Their Appearance

Learning a new object class from cluttered training images is very challenging when the location of object instances is unknown. Previous works generally require objects covering a large portion of the images. We present a novel approach that can cope with extensive clutter as well as large scale and appearance variations between object instances. To make this possible we propose a conditional random field that starts from generic knowledge and then progressively adapts to the new class. Our approach simultaneously localizes object instances while learning an appearance model specific for the class. We demonstrate this on the challenging PASCAL VOC 2007 dataset. Furthermore, our method makes it possible to train any state-of-the-art object detector in a weakly supervised fashion, although it would normally require object location annotations.

Thomas Deselaers, Bogdan Alexe, Vittorio Ferrari
Monocular 3D Scene Modeling and Inference: Understanding Multi-Object Traffic Scenes

Scene understanding has (again) become a focus of computer vision research, leveraging advances in detection, context modeling, and tracking. In this paper, we present a novel probabilistic 3D scene model that encompasses multi-class object detection, object tracking, scene labeling, and 3D geometric relations. This integrated 3D model is able to represent complex interactions like inter-object occlusion, physical exclusion between objects, and geometric context. Inference allows us to recover 3D scene context and perform 3D multi-object tracking from a mobile observer, for objects of multiple categories, using only monocular video as input. In particular, we show that a joint scene tracklet model for the evidence collected over multiple frames substantially improves performance. The approach is evaluated for two different types of challenging onboard sequences. We first show a substantial improvement to the state-of-the-art in 3D multi-people tracking. Moreover, a similar performance gain is achieved for multi-class 3D tracking of cars and trucks on a new, challenging dataset.

Christian Wojek, Stefan Roth, Konrad Schindler, Bernt Schiele
Blocks World Revisited: Image Understanding Using Qualitative Geometry and Mechanics

Since most current scene understanding approaches operate either on the 2D image or using a surface-based representation, they do not allow reasoning about the physical constraints within the 3D scene. Inspired by the “Blocks World” work in the 1960’s, we present a qualitative physical representation of an outdoor scene where objects have volume and mass, and relationships describe 3D structure and mechanical configurations. Our representation allows us to apply powerful global geometric constraints between 3D volumes as well as the laws of statics in a qualitative manner. We also present a novel iterative “interpretation-by-synthesis” approach where, starting from an empty ground plane, we progressively “build up” a physically-plausible 3D interpretation of the image. For surface layout estimation, our method demonstrates an improvement in performance over the state-of-the-art [9]. But more importantly, our approach automatically generates 3D parse graphs which describe qualitative geometric and mechanical properties of objects and relationships between objects within an image.

Abhinav Gupta, Alexei A. Efros, Martial Hebert
Discriminative Learning with Latent Variables for Cluttered Indoor Scene Understanding

We address the problem of understanding an indoor scene from a single image in terms of recovering the layouts of the faces (floor, ceiling, walls) and furniture. A major challenge of this task arises from the fact that most indoor scenes are cluttered by furniture and decorations, whose appearances vary drastically across scenes, and can hardly be modeled (or even hand-labeled) consistently. In this paper we tackle this problem by introducing latent variables to account for clutters, so that the observed image is jointly explained by the face and clutter layouts. Model parameters are learned in the maximum margin formulation, which is constrained by extra prior energy terms that define the role of the latent variables. Our approach enables taking into account and inferring indoor clutter layouts without hand-labeling of the clutters in the training set. Yet it outperforms the state-of-the-art method of Hedau et al. [4] that requires clutter labels.

Huayan Wang, Stephen Gould, Daphne Koller

Spotlights and Posters W2

Visual Tracking Using a Pixelwise Spatiotemporal Oriented Energy Representation

This paper presents a novel pixelwise representation for visual tracking that models both the spatial structure and dynamics of a target in a unified fashion. The representation is derived from spatiotemporal energy measurements that capture underlying local spacetime orientation structure at multiple scales. For interframe motion estimation, the feature representation is instantiated within a pixelwise template warping framework; thus, the spatial arrangement of the pixelwise energy measurements remains intact. The proposed target representation is extremely rich, including appearance and motion information as well as information about how these descriptors are spatially arranged. Qualitative and quantitative empirical evaluation on challenging sequences demonstrates that the resulting tracker outperforms several alternative state-of-the-art systems.

Kevin J. Cannons, Jacob M. Gryn, Richard P. Wildes
A Globally Optimal Approach for 3D Elastic Motion Estimation from Stereo Sequences

Dense and markerless elastic 3D motion estimation based on stereo sequences is a challenge in computer vision. Solutions based on scene flow and 3D registration are mostly restricted to simple non-rigid motions and suffer from error accumulation. To address this problem, this paper proposes a globally optimal approach to non-rigid motion estimation which simultaneously recovers the 3D surface as well as its non-rigid motion over time. The instantaneous surface of the object is represented as a set of points reconstructed from the matched stereo images, while its deformation is captured by registering the points over time under spatio-temporal constraints. A global energy is defined on the constraints of stereo, spatial smoothness and temporal continuity, and is optimized via an iterative algorithm to approximate the minimum. Our extensive experiments on real video sequences, including different facial expressions, cloth flapping, flag waves, etc., demonstrate the robustness of our method and show that it effectively handles complex non-rigid motions.

Qifan Wang, Linmi Tao, Huijun Di
Occlusion Boundary Detection Using Pseudo-depth

We address the problem of detecting occlusion boundaries from motion sequences, which is important for motion segmentation, estimating depth order, and related tasks. Previous work by Stein and Hebert has addressed this problem and obtained good results on a benchmarked dataset using two-dimensional image cues, motion estimation, and a global boundary model [1]. In this paper we describe a method for detecting occlusion boundaries which uses depth cues and local segmentation cues. More specifically, we show that crude scaled estimates of depth, which we call pseudo-depth, can be extracted from motion sequences containing a small number of image frames using standard SVD factorization methods followed by weak smoothing using a Markov Random Field defined over super-pixels. We then train a classifier for occlusion boundaries using pseudo-depth and local static boundary cues (adding motion cues only gives slightly better results). We evaluate performance on Stein and Hebert’s dataset and obtain results of similar average quality which are better in the low recall/high precision range. Note that our cues and methods are different from [1] (in particular, we did not use their sophisticated global boundary model), and so we conjecture that a unified approach would yield even better results.

Xuming He, Alan Yuille
Multiple Target Tracking in World Coordinate with Single, Minimally Calibrated Camera

Tracking multiple objects is important in many application domains. We propose a novel algorithm for multi-object tracking that is capable of working under very challenging conditions such as minimal hardware equipment, uncalibrated monocular camera, occlusions and severe background clutter. To address this problem we propose a new method that jointly estimates object tracks, the corresponding 2D/3D temporal trajectories in the camera reference system, and the model parameters (pose, focal length, etc.) within a coherent probabilistic formulation. Since our goal is to estimate stable and robust tracks that can be univocally associated to the object IDs, we propose to include in our formulation an interaction (attraction and repulsion) model that is able to model multiple 2D/3D trajectories in space-time and handle situations where objects occlude each other. We use an MCMC particle filtering algorithm for parameter inference and propose a solution that enables accurate and efficient tracking and camera model estimation. Qualitative and quantitative experimental results obtained using our own dataset and the publicly available ETH dataset show very promising tracking and camera estimation results.

Wongun Choi, Silvio Savarese
Joint Estimation of Motion, Structure and Geometry from Stereo Sequences

We present a novel variational method for the simultaneous estimation of dense scene flow and structure from stereo sequences. In contrast to existing approaches that rely on a fully calibrated camera setup, we assume that only the intrinsic camera parameters are known. To couple the estimation of motion, structure and geometry, we propose a joint energy functional that integrates spatial and temporal information from two subsequent image pairs subject to an unknown stereo setup. We further introduce a normalisation of image and stereo constraints such that deviations from model assumptions can be interpreted in a geometrical way. Finally, we suggest a separate discontinuity-preserving regularisation to improve the accuracy. Experiments on calibrated and uncalibrated data demonstrate the excellent performance of our approach. We even outperform recent techniques for the rectified case that make explicit use of the simplified geometry.

Levi Valgaerts, Andrés Bruhn, Henning Zimmer, Joachim Weickert, Carsten Stoll, Christian Theobalt
Dense, Robust, and Accurate Motion Field Estimation from Stereo Image Sequences in Real-Time

In this paper a novel approach for estimating the three dimensional motion field of the visible world from stereo image sequences is proposed. This approach combines dense variational optical flow estimation, including spatial regularization, with Kalman filtering for temporal smoothness and robustness. The result is a dense, robust, and accurate reconstruction of the three-dimensional motion field of the current scene that is computed in real-time. Parallel implementation on a GPU and an FPGA yields a vision-system which is directly applicable in real-world scenarios, like automotive driver assistance systems or in the field of surveillance. Within this paper we systematically show that the proposed algorithm is physically motivated and that it outperforms existing approaches with respect to computation time and accuracy.

Clemens Rabe, Thomas Müller, Andreas Wedel, Uwe Franke
Estimation of 3D Object Structure, Motion and Rotation Based on 4D Affine Optical Flow Using a Multi-camera Array

In this paper we extend a standard affine optical flow model to 4D and present how affine parameters can be used for estimation of 3D object structure, 3D motion and rotation using a 1D camera grid. Local changes of the projected motion vector field are modelled not only on the image plane as usual for affine optical flow, but also in camera displacement direction, and in time. We identify all parameters of this 4D fully affine model with terms depending on scene structure, scene motion, and camera displacement. We model the scene by planar, translating, and rotating surface patches and project them with a pinhole camera grid model. Imaged intensities of the projected surface points are then modelled by a brightness change model handling illumination changes. Experiments demonstrate the accuracy of the new model. It outperforms not only 2D affine optical flow models but also range flow for varying illumination. Moreover we are able to estimate surface normals and rotation parameters. Experiments on real data of a plant physiology experiment confirm the applicability of our model.

Tobias Schuchert, Hanno Scharr
Efficiently Scaling Up Video Annotation with Crowdsourced Marketplaces

Accurately annotating entities in video is labor intensive and expensive. As the quantity of online video grows, traditional solutions to this task are unable to scale to meet the needs of researchers with limited budgets. Current practice provides a temporary solution by paying dedicated workers to label a fraction of the total frames and otherwise settling for linear interpolation. As budgets and scale require sparser key frames, the assumption of linearity fails and labels become inaccurate. To address this problem we have created a public framework for dividing the work of labeling video data into micro-tasks that can be completed by huge labor pools available through crowdsourced marketplaces. By extracting pixel-based features from manually labeled entities, we are able to leverage more sophisticated interpolation between key frames to maximize performance given a budget. Finally, by validating the power of our framework on difficult, real-world data sets we demonstrate an inherent trade-off between the mix of human and cloud computing used vs. the accuracy and cost of the labeling.

Carl Vondrick, Deva Ramanan, Donald Patterson
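
The abstract above describes the common baseline of labeling only some key frames and filling the frames in between by linear interpolation. A minimal sketch of that baseline follows, assuming boxes are stored as `(x, y, w, h)` tuples keyed by frame index; the function names and the data layout are hypothetical and not taken from the authors' framework, and the feature-based interpolation the paper proposes is not shown.

```python
def interpolate_box(box_a, box_b, frame_a, frame_b, frame):
    """Linearly interpolate a box (x, y, w, h) between two key frames."""
    if frame_a >= frame_b or not frame_a <= frame <= frame_b:
        raise ValueError("frame must lie between two distinct, ordered key frames")
    t = (frame - frame_a) / (frame_b - frame_a)  # 0 at frame_a, 1 at frame_b
    return tuple((1 - t) * a + t * b for a, b in zip(box_a, box_b))


def fill_track(keyframes):
    """Fill every frame between consecutive labeled key frames.

    keyframes maps a frame index to a manually labeled box.
    """
    frames = sorted(keyframes)
    track = {}
    for fa, fb in zip(frames, frames[1:]):
        for f in range(fa, fb):
            track[f] = interpolate_box(keyframes[fa], keyframes[fb], fa, fb, f)
    # The last key frame is copied as-is (converted to floats for consistency).
    track[frames[-1]] = tuple(float(v) for v in keyframes[frames[-1]])
    return track


# Two key frames ten frames apart; the object moves right and grows taller.
track = fill_track({0: (10, 20, 50, 40), 10: (30, 20, 50, 60)})
print(track[5])
```

With key frames this far apart, any non-linear motion between them is lost, which is the failure mode the abstract points to when budgets force sparser key frames.
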
Robust and Fast Collaborative Tracking with Two Stage Sparse Optimization

The sparse representation has been widely used in many areas and utilized for visual tracking. Tracking with sparse representation is formulated as searching for samples with minimal reconstruction errors from learned template subspace. However, the computational cost makes it unsuitable to utilize high dimensional advanced features which are often important for robust tracking under dynamic environment. Based on the observations that a target can be reconstructed from several templates, and only some of the features with discriminative power are significant to separate the target from the background, we propose a novel online tracking algorithm with two stage sparse optimization to jointly minimize the target reconstruction error and maximize the discriminative power. As the target template and discriminative features usually have temporal and spatial relationship, dynamic group sparsity (DGS) is utilized in our algorithm. The proposed method is compared with three state-of-art trackers using five public challenging sequences, which exhibit appearance changes, heavy occlusions, and pose variations. Our algorithm is shown to outperform these methods.

Baiyang Liu, Lin Yang, Junzhou Huang, Peter Meer, Leiguang Gong, Casimir Kulikowski
Nonlocal Multiscale Hierarchical Decomposition on Graphs

The decomposition of images into their meaningful components is one of the major tasks in computer vision. Tadmor, Nezzar and Vese [1] have proposed a general approach for multiscale hierarchical decomposition of images. On the basis of this work, we propose a multiscale hierarchical decomposition of functions on graphs. The decomposition is based on a discrete variational framework that makes it possible to process arbitrary discrete data sets with the natural introduction of nonlocal interactions. This leads to an approach that can be used for the decomposition of images, meshes, or arbitrary data sets by taking advantage of the graph structure. To have a fully automatic decomposition, the issue of parameter selection is fully addressed. We illustrate our approach with numerous decomposition results on images, meshes, and point clouds and show the benefits.

Moncef Hidane, Olivier Lézoray, Vinh-Thong Ta, Abderrahim Elmoataz
Adaptive Regularization for Image Segmentation Using Local Image Curvature Cues

Image segmentation techniques typically require proper weighting of competing data fidelity and regularization terms. Conventionally, the associated parameters are set through tedious trial and error procedures and kept constant over the image. However, spatially varying structural characteristics, such as object curvature, combined with varying noise and imaging artifacts, significantly complicate the selection process of segmentation parameters. In this work, we propose a novel approach for automating the parameter selection by employing a robust structural cue to prevent excessive regularization of trusted (i.e. low noise) high curvature image regions. Our approach autonomously adapts local regularization weights by combining local measures of image curvature and edge evidence that are gated by a signal reliability measure. We demonstrate the utility and favorable performance of our approach within two major segmentation frameworks, graph cuts and active contours, and present quantitative and qualitative results on a variety of natural and medical images.

Josna Rao, Rafeef Abugharbieh, Ghassan Hamarneh
A Static SMC Sampler on Shapes for the Automated Segmentation of Aortic Calcifications

In this paper, we propose a sampling-based shape segmentation method that builds upon a global shape and a local appearance model. It is suited for challenging problems where there is high uncertainty about the correct solution due to a low signal-to-noise ratio, clutter, occlusions or an erroneous model. It also applies to segmentation tasks where the number of objects is not known a priori, or where the object of interest is invisible and can only be inferred from other objects in the image. The method was inspired by shape particle filtering from de Bruijne and Nielsen, but shows substantial improvements to it. The principal contributions of this paper are as follows: (i) We introduce statistically motivated importance weights that lead to better performance and facilitate the application to new problems. (ii) We adapt the static sequential Monte Carlo (SMC) algorithm to the problem of image segmentation, where the algorithm proves to sample efficiently from high-dimensional static spaces. (iii) We evaluate the static SMC sampler on shapes on a medical problem of high relevance: the automated quantification of aortic calcifications on X-ray radiographs for the prognosis and diagnosis of cardiovascular disease and mortality. Our results suggest that the static SMC sampler on shapes is more generic, robust, and accurate than shape particle filtering, while being computationally equally costly.

Kersten Petersen, Mads Nielsen, Sami S. Brandt
Fast Dynamic Texture Detection

Dynamic textures can be considered to be spatio-temporally varying visual patterns in image sequences with certain temporal regularity. We propose a novel and efficient approach to explore the violation of the brightness constancy assumption, as an indication of presence of dynamic texture, using simple optical flow techniques. We assume that dynamic texture regions are those that have poor spatio-temporal optical flow coherence. Further, we propose a second approach that uses robust global parametric motion estimators that effectively and efficiently detect motion outliers, and which we exploit as powerful cues to localize dynamic textures. Experimental and comparative studies on a range of synthetic and real-world dynamic texture sequences show the feasibility of the proposed approaches, with results which are competitive to or better than recent state-of-art approaches and significantly faster.

V. Javier Traver, Majid Mirmehdi, Xianghua Xie, Raúl Montoliu
Finding Semantic Structures in Image Hierarchies Using Laplacian Graph Energy

Many segmentation algorithms describe images in terms of a hierarchy of regions. Although such hierarchies can produce state of the art segmentations and have many applications, they often contain more data than is required for an efficient description. This paper shows Laplacian graph energy is a generic measure that can be used to identify semantic structures within hierarchies, independently of the algorithm that produces them. Quantitative experimental validation using hierarchies from two state of art algorithms show we can reduce the number of levels and regions in a hierarchy by an order of magnitude with little or no loss in performance when compared against human produced ground truth. We provide a tracking application that illustrates the value of reduced hierarchies.

Yi-Zhe Song, Pablo Arbelaez, Peter Hall, Chuan Li, Anupriya Balikai
Semantic Segmentation of Urban Scenes Using Dense Depth Maps

In this paper we present a framework for semantic scene parsing and object recognition based on dense depth maps. Five view-independent 3D features that vary with object class are extracted from dense depth maps at a superpixel level for training a classifier using a randomized decision forest technique. Our formulation integrates multiple features in a Markov Random Field (MRF) framework to segment and recognize different object classes in query street scene images. We evaluate our method both quantitatively and qualitatively on the challenging Cambridge-driving Labeled Video Database (CamVid). The results show that, using only dense depth information, we achieve more accurate segmentation and recognition overall than with sparse 3D features or appearance, or even the combination of sparse 3D features and appearance, advancing state-of-the-art performance. Furthermore, by aligning 3D dense depth based features into a unified coordinate frame, our algorithm can handle the special case of view changes between training and testing scenarios. Preliminary evaluation in cross training and testing shows promising results.

Chenxi Zhang, Liang Wang, Ruigang Yang
Tensor Sparse Coding for Region Covariances

Sparse representation of signals has been the focus of much research in the recent years. A vast majority of existing algorithms deal with vectors, and higher-order data like images are usually vectorized before processing. However, the structure of the data may be lost in the process, leading to poor representation and overall performance degradation. In this paper we propose a novel approach for sparse representation of positive definite matrices, where vectorization would have destroyed the inherent structure of the data. The sparse decomposition of a positive definite matrix is formulated as a convex optimization problem, which falls under the category of determinant maximization (MAXDET) problems [1], for which efficient interior point algorithms exist. Experimental results are shown with simulated examples as well as in real-world computer vision applications, demonstrating the suitability of the new model. This forms the first step toward extending the cornucopia of sparsity-based algorithms to positive definite matrices.

Ravishankar Sivalingam, Daniel Boley, Vassilios Morellas, Nikolaos Papanikolopoulos
Improving Local Descriptors by Embedding Global and Local Spatial Information

In this paper, we present a novel problem: “Given local descriptors, how can we incorporate both local and global spatial information into the descriptors, and obtain compact and discriminative features?” To address this problem, we propose a general framework to improve any local descriptors by embedding both local and global spatial information. In addition, we propose a simple and powerful combination method for different types of features. We evaluate the proposed method on the most standard scene and object recognition dataset, and confirm the effectiveness of the proposed method from the viewpoint of speed and accuracy.

Tatsuya Harada, Hideki Nakayama, Yasuo Kuniyoshi
Detecting Faint Curved Edges in Noisy Images

A fundamental question for edge detection is how faint an edge can be and still be detected. In this paper we offer a formalism to study this question and subsequently introduce a hierarchical edge detection algorithm designed to detect faint curved edges in noisy images. In our formalism we view edge detection as a search in a space of feasible curves, and derive expressions to characterize the behavior of the optimal detection threshold as a function of curve length and the combinatorics of the search space. We then present an algorithm that efficiently searches for edges through a very large set of curves by hierarchically constructing difference filters that match the curves traced by the sought edges. We demonstrate the utility of our algorithm in simulations and in applications to challenging real images.

Sharon Alpert, Meirav Galun, Boaz Nadler, Ronen Basri
Spatial Statistics of Visual Keypoints for Texture Recognition

In this paper, we propose a new descriptor of texture images based on the characterization of the spatial patterns of image keypoints. Regarding the set of visual keypoints of a given texture sample as the realization of a marked point process, we define texture features from multivariate spatial statistics. Our approach initially relies on the construction of a codebook of the visual signatures of the keypoints. Here these visual signatures are given by SIFT feature vectors, and the codebooks are obtained from a hierarchical clustering algorithm suitable for processing large high-dimensional datasets. The texture descriptor is formed by co-occurrence statistics of neighboring keypoint pairs for different neighborhood radii. The proposed descriptor inherits the invariance properties of SIFT w.r.t. contrast change and geometric image transformation (rotation, scaling). An application to texture recognition using discriminative classifiers, namely k-NN, SVM and random forest, is considered, and a quantitative evaluation is reported for two case studies: the UIUC texture database and real sonar textures. The proposed approach compares favourably to previous work. We further discuss the properties of the proposed descriptor, including dimensionality aspects.

Huu-Giao Nguyen, Ronan Fablet, Jean-Marc Boucher
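
The co-occurrence statistics of neighboring keypoint pairs mentioned in the abstract above can be illustrated with a minimal sketch. It assumes each keypoint already carries a codebook label, and it uses plain pair counts per radius; the function name `cooccurrence_descriptor`, its parameters, and the lack of any normalization or edge correction are assumptions made for illustration, not the authors' exact statistic.

```python
import numpy as np

def cooccurrence_descriptor(points, labels, n_words, radii):
    """Count ordered keypoint pairs (i, j), i != j, lying within each radius,
    binned by the codeword labels of the two keypoints.

    Hypothetical helper: raw counts only, no normalization or edge correction.
    Returns an array of shape (len(radii), n_words, n_words).
    """
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distances between all keypoints.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)  # a keypoint is not its own neighbor
    out = np.zeros((len(radii), n_words, n_words))
    for k, r in enumerate(radii):
        i, j = np.nonzero(dist <= r)
        # Accumulate one count per neighboring pair into the label-pair bin.
        np.add.at(out[k], (labels[i], labels[j]), 1)
    return out
```

Flattening the returned array gives one feature vector per texture sample, whose length grows with the square of the codebook size times the number of radii; this is one way to see why dimensionality is a concern for such descriptors.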
BRIEF: Binary Robust Independent Elementary Features

We propose to use binary strings as an efficient feature point descriptor, which we call BRIEF. We show that it is highly discriminative even when using relatively few bits and can be computed using simple intensity difference tests. Furthermore, the descriptor similarity can be evaluated using the Hamming distance, which is very efficient to compute, instead of the L2 norm as is usually done. As a result, BRIEF is very fast both to build and to match. We compare it against SURF and U-SURF on standard benchmarks and show that it yields a similar or better recognition performance, while running in a fraction of the time required by either.

Michael Calonder, Vincent Lepetit, Christoph Strecha, Pascal Fua
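
The two ingredients named in the BRIEF abstract above, intensity difference tests and Hamming-distance matching, can be sketched as follows. This is a minimal illustration under stated assumptions: test locations are drawn uniformly at random inside the patch and no smoothing is applied, and the names `make_test_pairs`, `brief_descriptor`, and `hamming_distance` are hypothetical, not taken from the paper.

```python
import numpy as np

def make_test_pairs(n_bits, patch_size, seed=0):
    """Sample n_bits pairs of (row, col) locations inside a square patch.

    Assumption: uniform sampling; the abstract does not specify the layout.
    Returns an integer array of shape (n_bits, 2, 2).
    """
    rng = np.random.default_rng(seed)
    return rng.integers(0, patch_size, size=(n_bits, 2, 2))

def brief_descriptor(patch, pairs):
    """Build a binary descriptor: bit k is 1 if the intensity at the first
    location of pair k is smaller than the intensity at the second."""
    p = patch[pairs[:, 0, 0], pairs[:, 0, 1]]
    q = patch[pairs[:, 1, 0], pairs[:, 1, 1]]
    return (p < q).astype(np.uint8)

def hamming_distance(d1, d2):
    """Number of bit positions in which two binary descriptors differ."""
    return int(np.count_nonzero(d1 != d2))
```

The same set of test pairs must be reused for every patch so that bit positions are comparable across descriptors. In a packed-bit implementation the Hamming distance reduces to an XOR followed by a bit count, which is where the matching speed comes from.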
Multi-label Feature Transform for Image Classifications

Image and video annotation are challenging but important tasks for understanding digital multimedia content in computer vision. Annotation is by nature a multi-label multi-class classification problem, because every image is usually associated with more than one semantic keyword. As a result, label assignments are no longer confined to class membership indications as in traditional single-label multi-class classification; they also convey important characteristic information for assessing object similarity from a knowledge perspective. Therefore, besides implicitly making use of label assignments to formulate label correlations as in many existing multi-label classification algorithms, we propose a novel Multi-Label Feature Transform (MLFT) approach to also explicitly use them as part of the data features. Through two transformations on attributes and label assignments respectively, the MLFT approach uses a kernel to implicitly construct a label-augmented feature vector that integrates attributes and labels of a data set in a balanced manner, such that data discriminability is enhanced by taking advantage of information from both the data and label perspectives. Promising experimental results on four standard multi-label data sets from image annotation and other applications demonstrate the effectiveness of our approach.

Hua Wang, Heng Huang, Chris Ding