2014 | Book

Computer Vision – ECCV 2014

13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part II

Edited by: David Fleet, Tomas Pajdla, Bernt Schiele, Tinne Tuytelaars

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science

About this book

The seven-volume set comprising LNCS volumes 8689-8695 constitutes the refereed proceedings of the 13th European Conference on Computer Vision, ECCV 2014, held in Zurich, Switzerland, in September 2014.

The 363 revised papers presented were carefully reviewed and selected from 1444 submissions. The papers are organized in topical sections on tracking and activity recognition; recognition; learning and inference; structure from motion and feature matching; computational photography and low-level vision; vision; segmentation and saliency; context and 3D scenes; motion and 3D scene analysis; and poster sessions.

Table of Contents

Frontmatter

Learning and Inference (continued)

Coarse-to-Fine Auto-Encoder Networks (CFAN) for Real-Time Face Alignment

Accurate face alignment is a vital prerequisite for most face perception tasks such as face recognition, facial expression analysis and non-realistic face re-rendering. It can be formulated as the nonlinear inference of the facial landmarks from the detected face region. A deep network seems a good choice for modeling this nonlinearity, but it is nontrivial to apply one directly. In this paper, instead of a straightforward application of a deep network, we propose a Coarse-to-Fine Auto-encoder Networks (CFAN) approach, which cascades a few successive Stacked Auto-encoder Networks (SANs). Specifically, the first SAN quickly predicts a preliminary but sufficiently accurate landmark estimate by taking as input a low-resolution version of the whole detected face. The following SANs then progressively refine the landmarks by taking as input local features extracted around the current landmarks (the output of the previous SAN) at higher and higher resolution. Extensive experiments conducted on three challenging datasets demonstrate that our CFAN outperforms the state-of-the-art methods and runs in real time (40+ fps on a desktop, excluding face detection).
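
For intuition, here is a minimal sketch of the cascade at test time, assuming the trained SANs are exposed as generic regressors with a predict method and that extract_local_features is a hypothetical helper sampling patch descriptors around the current landmarks; the 4x downscaling and array shapes are illustrative assumptions, not details from the paper.

```python
import numpy as np

def cfan_predict(face_image, global_san, refinement_sans, extract_local_features):
    # Stage 1: the first SAN regresses a rough full shape from a
    # low-resolution view of the whole detected face.
    coarse = face_image[::4, ::4].astype(np.float64).ravel()[None, :]
    shape = global_san.predict(coarse).reshape(-1, 2)      # (n_landmarks, 2)
    # Later stages: each SAN predicts a shape update from local features
    # sampled around the current landmarks at increasing resolution.
    for level, san in enumerate(refinement_sans):
        feats = extract_local_features(face_image, shape, level)
        shape = shape + san.predict(feats[None, :]).reshape(-1, 2)
    return shape
```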

Jie Zhang, Shiguang Shan, Meina Kan, Xilin Chen
From Manifold to Manifold: Geometry-Aware Dimensionality Reduction for SPD Matrices

Representing images and videos with Symmetric Positive Definite (SPD) matrices and considering the Riemannian geometry of the resulting space has proven beneficial for many recognition tasks. Unfortunately, computation on the Riemannian manifold of SPD matrices – especially of high-dimensional ones – comes at a high cost that limits the applicability of existing techniques. In this paper we introduce an approach that lets us handle high-dimensional SPD matrices by constructing a lower-dimensional, more discriminative SPD manifold. To this end, we model the mapping from the high-dimensional SPD manifold to the low-dimensional one with an orthonormal projection. In particular, we search for a projection that yields a low-dimensional manifold with maximum discriminative power, encoded via an affinity-weighted similarity measure based on metrics on the manifold. Learning can then be expressed as an optimization problem on a Grassmann manifold. Our evaluation on several classification tasks shows that our approach leads to a significant accuracy gain over state-of-the-art methods.

Mehrtash T. Harandi, Mathieu Salzmann, Richard Hartley
Pose Machines: Articulated Pose Estimation via Inference Machines

State-of-the-art approaches for articulated human pose estimation are rooted in parts-based graphical models. These models are often restricted to tree-structured representations and simple parametric potentials in order to enable tractable inference. However, these simple dependencies fail to capture all the interactions between body parts. While models with more complex interactions can be defined, learning the parameters of these models remains challenging with intractable or approximate inference. In this paper, instead of performing inference on a learned graphical model, we build upon the inference machine framework and present a method for articulated human pose estimation. Our approach incorporates rich spatial interactions among multiple parts and information across parts of different scales. Additionally, the modular framework of our approach enables both ease of implementation without specialized optimization solvers, and efficient inference. We analyze our approach on two challenging datasets with large pose variation and outperform the state-of-the-art on these benchmarks.

Varun Ramakrishna, Daniel Munoz, Martial Hebert, James Andrew Bagnell, Yaser Sheikh

Poster Session 2

Piecewise-Planar StereoScan: Structure and Motion from Plane Primitives

This article describes a pipeline that receives as input a sequence of images acquired by a calibrated stereo rig and outputs the camera motion and a Piecewise-Planar Reconstruction (PPR) of the scene. It first detects the 3D planes viewed by each stereo pair from semi-dense depth estimation. This is followed by estimating the pose between consecutive views using a new closed-form minimal algorithm that relies on point correspondences only when plane correspondences are insufficient to fully constrain the motion. Finally, the camera motion and the PPR are jointly refined, alternating between discrete optimization for generating plane hypotheses and continuous bundle adjustment. The approach differs from previous work in PPR by determining the poses from plane primitives, by jointly estimating motion and piecewise-planar structure, and by operating sequentially, making it suitable for SLAM and visual odometry applications. Experiments are carried out on challenging wide-baseline datasets where conventional point-based SfM usually fails.

Carolina Raposo, Michel Antunes, Joao P. Barreto
Nonrigid Surface Registration and Completion from RGBD Images

Nonrigid surface registration is a challenging problem that suffers from many ambiguities. Existing methods typically assume the availability of full volumetric data, or require a global model of the surface of interest. In this paper, we introduce an approach to nonrigid registration that operates on relatively low-quality RGBD images and does not assume prior knowledge of the global surface shape. To this end, we model the surface as a collection of patches, and infer the patch deformations by performing inference in a graphical model. Our representation lets us fill in the holes in the input depth maps, thus essentially achieving surface completion. Our experimental evaluation demonstrates the effectiveness of our approach on several sequences, as well as its robustness to missing data and occlusions.

Weipeng Xu, Mathieu Salzmann, Yongtian Wang, Yue Liu
Unsupervised Dense Object Discovery, Detection, Tracking and Reconstruction

In this paper, we present an unsupervised framework for discovering, detecting, tracking, and reconstructing dense objects from a video sequence. The system simultaneously localizes a moving camera and discovers a set of shape and appearance models for multiple objects, including the scene background. Each object model is represented by both a 2D and a 3D level-set. This representation is used to improve detection, 2D tracking, 3D registration and, importantly, subsequent updates to the level-set itself. This single framework performs dense simultaneous localization and mapping as well as unsupervised object discovery. At each iteration, portions of the scene that fail to track, such as bulk outliers on moving rigid bodies, are used either to seed models for new objects or to update models of known objects. For the latter, once an object is successfully tracked in 2D with aid from a 2D level-set segmentation, the level-set is updated and then used to aid registration and evolution of a 3D level-set that captures shape information. For a known object, either learned by our system or introduced from a third-party library, our framework can detect similar appearances and geometries in the scene. The system is tested using single- and multiple-object datasets. Results demonstrate an improved method for discovering and reconstructing 2D and 3D object models, which aids tracking even under significant occlusion or rapid motion.

Lu Ma, Gabe Sibley
Know Your Limits: Accuracy of Long Range Stereoscopic Object Measurements in Practice

Modern applications of stereo vision, such as advanced driver assistance systems and autonomous vehicles, require the highest precision when determining the location and velocity of potential obstacles. Sub-pixel disparity accuracy in selected image regions is therefore essential. Evaluation benchmarks for stereo correspondence algorithms, such as the popular Middlebury and KITTI frameworks, provide important reference values regarding dense matching performance, but do not sufficiently treat local sub-pixel matching accuracy. In this paper, we explore this important aspect in detail. We present a comprehensive statistical evaluation of selected state-of-the-art stereo matching approaches on an extensive dataset and establish reference values for the precision limits actually achievable in practice. For a carefully calibrated camera setup under real-world imaging conditions, a consistent error limit of 1/10 pixel is determined. We present guidelines on algorithmic choices, derived from theory, which turn out to be relevant to achieving this limit in practice.

Peter Pinggera, David Pfeiffer, Uwe Franke, Rudolf Mester
As-Rigid-As-Possible Stereo under Second Order Smoothness Priors

Imposing smoothness priors is a key idea of the top-ranked global stereo models. Recent progress has demonstrated the power of second-order priors, which are usually defined by either explicitly considering three-pixel neighborhoods, or implicitly using a so-called 3D-label for each pixel. In contrast to traditional first-order priors, which only prefer fronto-parallel surfaces, second-order priors encourage arbitrary collinear structures. However, we can still find defective regions in matching results even under such powerful priors, e.g., in large textureless regions. One reason is that most stereo models are non-convex, where pixel-wise smoothness priors, i.e., local constraints, are too flexible to prevent the solution from being trapped in bad local minima. On the other hand, long-range spatial constraints, especially segment-based priors, have advantages on this problem. However, segment-based priors are too rigid to handle curved surfaces. We present a mixture model that combines the benefits of these two kinds of priors, whose energy function consists of two terms: 1) a Laplacian operator on the disparity map, which imposes pixel-wise second-order smoothness; and 2) a segment-wise matching cost as a function of a quadratic surface, which encourages “as-rigid-as-possible” smoothness. To solve the problem effectively, we introduce an intermediate term to decouple the two sub-energies, which enables an alternated optimization algorithm that is about an order of magnitude faster than PatchMatch [1]. Our approach is one of the top-ranked models on the Middlebury benchmark at sub-pixel accuracy.
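
For concreteness, here is a tiny sketch of the first (pixel-wise) term: the discrete Laplacian of the disparity map vanishes wherever disparity is locally affine, so its magnitude penalizes deviations from second-order smoothness. The wrap-around border handling via np.roll is a simplification of this sketch, not the paper's choice.

```python
import numpy as np

def second_order_smoothness(disparity):
    # 5-point discrete Laplacian; zero wherever disparity is locally affine,
    # i.e. on collinear (planar) structure.
    lap = (-4.0 * disparity
           + np.roll(disparity, 1, axis=0) + np.roll(disparity, -1, axis=0)
           + np.roll(disparity, 1, axis=1) + np.roll(disparity, -1, axis=1))
    return np.abs(lap).sum()    # energy contribution of the Laplacian term
```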

Chi Zhang, Zhiwei Li, Rui Cai, Hongyang Chao, Yong Rui
Real-Time Minimization of the Piecewise Smooth Mumford-Shah Functional

We propose an algorithm for efficiently minimizing the piecewise smooth Mumford-Shah functional. The algorithm is based on an extension of a recent primal-dual algorithm from convex to non-convex optimization problems. The key idea is to rewrite the proximal operator in the primal-dual algorithm using Moreau’s identity. The resulting algorithm computes piecewise smooth approximations of color images at 15-20 frames per second at VGA resolution using GPU acceleration. Compared to convex relaxation approaches [18], it is orders of magnitude faster and does not require a discretization of color values. In contrast to the popular Ambrosio-Tortorelli approach [2], it naturally combines piecewise smooth and piecewise constant approximations, it does not require an epsilon-approximation and it is not based on an alternation scheme. The achieved energies are in practice at most 5% off the optimal value for one-dimensional problems. Numerous experiments demonstrate that the proposed algorithm is well-suited to perform discontinuity-preserving smoothing and real-time video cartooning.

Evgeny Strekalovskiy, Daniel Cremers
A MAP-Estimation Framework for Blind Deblurring Using High-Level Edge Priors

In this paper we propose a general MAP-estimation framework for blind image deconvolution that allows the incorporation of powerful priors for predicting the edges of the latent image, which is known to be a crucial factor for the success of blind deblurring. This is achieved in a principled, robust and unified manner through the use of a global energy function that can take multiple constraints into account. Based on this framework, we show how to successfully make use of a particular prior of this type that is quite strong and also applicable to a wide variety of cases. It relates to the strong structural regularity exhibited by many scenes, which affects the location and distribution of the corresponding image edges. We validate the excellent performance of our approach through an extensive set of experimental results and comparisons to the state-of-the-art.

Yipin Zhou, Nikos Komodakis
Efficient Color Constancy with Local Surface Reflectance Statistics

The aim of computational color constancy is to estimate the actual surface color in an acquired scene disregarding its illuminant. Many solutions first estimate the illuminant and then correct the image with that estimate. Based on the linear image formation model, we propose in this work a new strategy to estimate the illuminant. Inspired by the feedback modulation from horizontal cells to the cones in the retina, we first normalize each local patch with its local maximum to obtain the so-called locally normalized reflectance estimate (LNRE). We then found experimentally that the ratio of the global summation of true surface reflectance to the global summation of LNRE in a scene is approximately achromatic for both indoor and outdoor scenes. Based on this observation, we estimate the illuminant by computing the ratio of the global summation of the intensities to the global summation of the locally normalized intensities of the color-biased image. The proposed model has only one free parameter and requires no explicit training, unlike learning-based approaches. Experimental results on four commonly used datasets show that our model can produce competitive or even better results compared to state-of-the-art approaches, at low computational cost.
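
As a rough illustration of that global-sum ratio, the sketch below normalizes each patch by its per-channel local maximum and divides the two global sums; the patch size, the unit-norm convention and the diagonal correction step are assumptions of this sketch, not specifics from the paper.

```python
import numpy as np

def estimate_illuminant(image, patch=16, eps=1e-8):
    # Normalize each local patch by its per-channel maximum (LNRE-style),
    # then take the ratio of global sums of raw vs. locally normalized
    # intensities as the illuminant direction.
    h, w, _ = image.shape
    img = image.astype(np.float64)
    normalized = np.empty_like(img)
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            block = img[y:y+patch, x:x+patch]
            normalized[y:y+patch, x:x+patch] = block / (block.max(axis=(0, 1)) + eps)
    ratio = img.sum(axis=(0, 1)) / (normalized.sum(axis=(0, 1)) + eps)
    return ratio / np.linalg.norm(ratio)          # unit-norm RGB illuminant estimate

def correct_image(image, illuminant):
    # Von Kries-style diagonal correction; the sqrt(3) factor keeps a
    # neutral (gray) illuminant from changing the image.
    balanced = image.astype(np.float64) / (illuminant * np.sqrt(3.0))
    return np.clip(balanced, 0, 255).astype(np.uint8)
```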

Shaobing Gao, Wangwang Han, Kaifu Yang, Chaoyi Li, Yongjie Li
A Contrast Enhancement Framework with JPEG Artifacts Suppression

Contrast enhancement is used in many computer vision algorithms. It is applied either explicitly, as in histogram equalization and tone-curve manipulation, or implicitly via methods that deal with degradation from physical phenomena such as haze, fog or underwater imaging. While contrast enhancement boosts the image appearance, it can unintentionally boost unsightly image artifacts, especially artifacts from JPEG compression. Most JPEG implementations optimize the compression in a scene-dependent manner such that low-contrast images exhibit few perceivable artifacts even at relatively high compression factors. After contrast enhancement, however, these artifacts become clearly visible. Although there are numerous approaches targeting JPEG artifact reduction, these are generic in nature and are applied either as pre- or post-processing steps. When applied as pre-processing, existing methods tend to over-smooth the image. When applied as post-processing, they are often ineffective at removing the boosted artifacts. To resolve this problem, we propose a framework that suppresses compression artifacts as an integral part of the contrast enhancement procedure. We show that this approach can produce compelling results superior to those obtained by existing JPEG artifact removal methods for several types of contrast enhancement problems.

Yu Li, Fangfang Guo, Robby T. Tan, Michael S. Brown
Radial Bright Channel Prior for Single Image Vignetting Correction

This paper presents a novel prior, the radial bright channel (RBC) prior, for single image vignetting correction. The RBC prior is derived from a statistical property of vignetting-free images: among the pixels sharing the same radius in polar coordinates of an image, at least one pixel has a high intensity value in some color channel. Exploiting the prior, we can effectively estimate and correct the vignetting effect of a given image. We represent the vignetting effect as a 1D function of the distance from the optical center, and estimate the function using the RBC prior. As it works entirely in 1D, our method provides high efficiency in terms of computation and storage costs. Experimental results demonstrate that our method runs an order of magnitude faster than previous work, while producing higher-quality vignetting correction results.
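
A minimal sketch of how such a prior could drive a correction, assuming the optical center is the image midpoint and using a crude monotone profile in place of the paper's 1D function estimation:

```python
import numpy as np

def radial_bright_channel(image, center=None):
    # For every integer radius from the optical center, take the maximum
    # intensity over all pixels at that radius and over all color channels.
    h, w = image.shape[:2]
    cy, cx = center if center is not None else (h / 2.0, w / 2.0)
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2).astype(int)
    bright = image.max(axis=2).astype(np.float64)     # per-pixel channel maximum
    rbc = np.zeros(r.max() + 1)
    np.maximum.at(rbc, r.ravel(), bright.ravel())     # per-radius maximum
    return r, rbc

def devignette(image):
    # Divide out a monotonically decreasing 1D attenuation profile derived
    # from the RBC (a crude stand-in for the paper's estimation step).
    r, rbc = radial_bright_channel(image)
    profile = np.minimum.accumulate(np.maximum(rbc, 1e-3))
    gain = profile[0] / profile                        # 1D correction function
    out = image.astype(np.float64) * gain[r][..., None]
    return np.clip(out, 0, 255).astype(np.uint8)
```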

Hojin Cho, Hyunjoon Lee, Seungyong Lee
Tubular Structure Filtering by Ranking Orientation Responses of Path Operators

Thin objects in 3D volumes, for instance vascular networks in medical imaging or various kinds of fibres in materials science, have long been of interest in computer vision. In particular, tubular objects are elongated everywhere in one principal direction – which varies spatially – and are thin in the other two perpendicular directions. Filters for detecting such structures use, for instance, an analysis of the three principal directions of the Hessian, which is a local feature. In this article, we present a low-level tubular structure detection filter. This filter relies on paths, which are semi-global features that avoid any blurring effect induced by scale-space convolution. More precisely, our filter is based on recently developed morphological path operators. These require sampling only in a few principal directions, are robust to noise and do not assume feature regularity. We show that by ranking the directional responses of this operator, we are further able to efficiently distinguish between blob, thin planar and tubular structures. We validate this approach on several applications, both from a qualitative and a quantitative point of view, demonstrating noise robustness and an efficient response on tubular structures.

Odyssée Merveille, Hugues Talbot, Laurent Najman, Nicolas Passat
Optimization-Based Artifact Correction for Electron Microscopy Image Stacks

Investigations of biological ultrastructure, such as comprehensive mapping of connections within a nervous system, increasingly rely on large, high-resolution electron microscopy (EM) image volumes. However, discontinuities between the registered section images from which these volumes are assembled, due to variations in imaging conditions and section thickness, among other artifacts, impede truly 3-D analysis of these volumes. We propose an optimization procedure, called EMISAC (EM Image Stack Artifact Correction), to correct these discontinuities. EMISAC optimizes the parameters of spatially varying linear transformations of the data in order to minimize the squared norm of the gradient along the section axis, subject to detail-preserving regularization.
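
As a toy illustration of this kind of objective, the sketch below scores per-section affine intensity transforms of a grayscale stack by the squared gradient along the section axis plus a simple quadratic regularizer; the real method uses spatially varying transforms and a detail-preserving regularizer, so both simplifications here are assumptions.

```python
import numpy as np

def emisac_style_objective(stack, gains, offsets, lam=0.1):
    # stack: (n_sections, h, w) volume; gains/offsets: (n_sections,) per-section
    # affine parameters (a big simplification of spatially varying transforms).
    t = gains[:, None, None] * stack + offsets[:, None, None]
    smoothness = np.sum(np.diff(t, axis=0) ** 2)     # squared gradient along z
    # Quadratic penalty keeping transforms near identity (stand-in regularizer).
    regularizer = lam * (np.sum((gains - 1.0) ** 2) + np.sum(offsets ** 2))
    return smoothness + regularizer
```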

Assessment on a mouse cortex dataset demonstrates the effectiveness of our approach. Relative to the original data, EMISAC produces a large improvement both in NIQE score, a measure of statistical similarity between orthogonal cross-sections and the original image sections, and in the accuracy of neurite segmentation, a critical task for this type of data. Compared to a recent independently-developed gradient-domain algorithm, EMISAC achieves significantly better NIQE image quality scores and equivalent segmentation accuracy; future segmentation algorithms may be able to take advantage of the higher image quality.

In addition, on several time-lapse videos, EMISAC significantly reduces lighting artifacts, resulting in greatly improved video quality.

A software release is available at http://rll.berkeley.edu/2014_ECCV_EMISAC.

Samaneh Azadi, Jeremy Maitin-Shepard, Pieter Abbeel
Metric-Based Pairwise and Multiple Image Registration

Registering pairs or groups of images is a widely-studied problem that has seen a variety of solutions in recent years. Most of these solutions are variational, using objective functions that should satisfy several basic and desired properties. In this paper, we pursue two additional properties – (1) invariance of the objective function under identical warping of the input images and (2) the objective function induces a proper metric on the set of equivalence classes of images – and motivate their importance. Then, a registration framework that satisfies these properties, using the L2-norm between a novel representation of images, is introduced. Additionally, for multiple images, the induced metric enables us to compute a mean image, or a template, and perform joint registration. We demonstrate this framework using examples from a variety of image types and compare performances with some recent methods.

Qian Xie, Sebastian Kurtek, Eric Klassen, Gary E. Christensen, Anuj Srivastava
Canonical Correlation Analysis on Riemannian Manifolds and Its Applications

Canonical correlation analysis (CCA) is a widely used statistical technique to capture correlations between two sets of multi-variate random variables and has found a multitude of applications in computer vision, medical imaging and machine learning. The classical formulation assumes that the data live in a pair of vector spaces, which makes its use in certain important scientific domains problematic. For instance, the set of symmetric positive definite (SPD) matrices, rotations and probability distributions all belong to certain curved Riemannian manifolds where vector-space operations are in general not applicable. Analyzing the space of such data via the classical versions of inference models is rather sub-optimal. But perhaps more importantly, since the algorithms do not respect the underlying geometry of the data space, it is hard to provide statistical guarantees (if any) on the results. Using the space of SPD matrices as a concrete example, this paper gives a principled generalization of the well-known CCA to the Riemannian setting. Our CCA algorithm operates on the product Riemannian manifold representing SPD matrix-valued fields to identify meaningful statistical relationships on the product Riemannian manifold. As a proof of principle, we present results on an Alzheimer’s disease (AD) study where the analysis task involves identifying correlations across diffusion tensor images (DTI) and Cauchy deformation tensor fields derived from T1-weighted magnetic resonance (MR) images.

Hyunwoo J. Kim, Nagesh Adluru, Barbara B. Bendlin, Sterling C. Johnson, Baba C. Vemuri, Vikas Singh
Scalable 6-DOF Localization on Mobile Devices

Recent improvements in image-based localization have produced powerful methods that scale up to the massive 3D models emerging from modern Structure-from-Motion techniques. However, these approaches are too resource-intensive to run in real time, let alone to be implemented on mobile devices. In this paper, we propose to combine the scalability of such a global localization system running on a server with the speed and precision of a local pose tracker on a mobile device. Our approach is both scalable and drift-free by design and eliminates the need for loop closure. We propose two strategies to combine the information provided by local tracking and global localization. We evaluate our system on a large-scale dataset of the historic inner city of Aachen, where it achieves interactive frame rates at a localization error of less than 50 cm while using less than 5 MB of memory on the mobile device.

Sven Middelberg, Torsten Sattler, Ole Untzelmann, Leif Kobbelt
On Mean Pose and Variability of 3D Deformable Models

We present a novel methodology for the analysis of complex object shapes in motion observed by multiple video cameras. In particular, we propose to learn local surface rigidity probabilities (i.e., deformations) and to estimate a mean pose over a temporal sequence. Local deformations can be used for rigidity-based dynamic surface segmentation, while a mean pose can be used as a sequence keyframe or a cluster prototype and therefore has numerous applications, such as motion synthesis or sequential alignment for compression or morphing. We take advantage of recent advances in surface tracking techniques to formulate a generative model of 3D temporal sequences using a probabilistic framework, which conditions shape fitting over all frames on a simple set of intrinsic surface rigidity properties. Surface tracking and rigidity variable estimation can then be formulated as an Expectation-Maximization inference problem and solved by alternately minimizing two nested fixed-point iterations. We show that this framework provides a new fundamental building block for various applications of shape analysis, and achieves tracking performance comparable to state-of-the-art surface tracking techniques on real datasets, even against approaches using strong kinematic priors such as rigid skeletons.

Benjamin Allain, Jean-Sébastien Franco, Edmond Boyer, Tony Tung
Hybrid Stochastic / Deterministic Optimization for Tracking Sports Players and Pedestrians

Although ‘tracking-by-detection’ is a popular approach when reliable object detectors are available, missed detections remain a difficult hurdle to overcome. We present a hybrid stochastic/deterministic optimization scheme that uses RJMCMC to perform stochastic search over the space of detection configurations, interleaved with deterministic computation of the optimal multi-frame data association for each proposed detection hypothesis. Since object trajectories do not need to be estimated directly by the sampler, our approach is more efficient than traditional MCMCDA techniques. Moreover, our holistic formulation is able to generate longer, more reliable trajectories than baseline tracking-by-detection approaches in challenging multi-target scenarios.

Robert T. Collins, Peter Carr
What Do I See? Modeling Human Visual Perception for Multi-person Tracking

This paper presents a novel approach for multi-person tracking utilizing a model motivated by the human vision system. The model predicts human motion based on modeling of perceived information. An attention map is designed to mimic human reasoning that integrates both spatial and temporal information. The spatial component addresses human attention allocation to different areas in a scene and is represented using a retinal mapping based on the log-polar transformation, while the temporal component denotes human attention allocation to subjects with different motion velocities and is modeled as a static-dynamic attention map. With the static-dynamic attention map and retinal mapping, attention-driven motion of the tracked target is estimated with a center-surround search mechanism. This perception-based motion model is integrated into a data association tracking framework with appearance and motion features. The proposed algorithm tracks a large number of subjects in complex scenes, and evaluations on public datasets show promising improvements over state-of-the-art methods.
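
To give a feel for the retinal-mapping component, here is a small sketch of a log-polar style spatial weighting around a fixation point; the Gaussian falloff form and the sigma parameter are assumptions of this sketch, not the paper's exact attention map.

```python
import numpy as np

def logpolar_attention(h, w, fixation, sigma=0.35):
    # Weight decays with the normalized log-distance from the fixation point,
    # mimicking the foveated resolution of a log-polar retinal mapping.
    yy, xx = np.mgrid[0:h, 0:w]
    r = np.hypot(yy - fixation[0], xx - fixation[1])
    rho = np.log1p(r) / np.log1p(np.hypot(h, w))   # log-radius, roughly in [0, 1]
    return np.exp(-(rho ** 2) / (2.0 * sigma ** 2))
```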

Xu Yan, Ioannis A. Kakadiaris, Shishir K. Shah
Consistent Re-identification in a Camera Network

Most existing person re-identification methods focus on finding similarities between persons across pairs of cameras (camera-pairwise re-identification) without explicitly maintaining consistency of the results across the network. This may lead to infeasible associations when results from different camera pairs are combined. In this paper, we propose a network consistent re-identification (NCR) framework, which is formulated as an optimization problem that not only maintains consistency in re-identification results across the network, but also improves the camera-pairwise re-identification performance between all the individual camera pairs. This can be solved as a binary integer programming problem, leading to a globally optimal solution. We also extend the proposed approach to the more general case where all persons may not be present in every camera. Using two benchmark datasets, we validate our approach and compare against state-of-the-art methods.

Abir Das, Anirban Chakraborty, Amit K. Roy-Chowdhury
Surface Normal Deconvolution: Photometric Stereo for Optically Thick Translucent Objects

This paper presents a photometric stereo method that works for optically thick translucent objects exhibiting subsurface scattering. Our method is built upon previous studies showing that subsurface scattering can be approximated as convolution with a blurring kernel. We extend this observation and show that the original surface normal convolved with the scattering kernel corresponds to the blurred surface normal obtained by a conventional photometric stereo technique. Based on this observation, we cast the photometric stereo problem for optically thick translucent objects as a deconvolution problem, and develop a method to recover accurate surface normals. Experimental results on both synthetic and real-world scenes show the effectiveness of the proposed method.
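
The observation above suggests recovering normals by deconvolution. Below is a generic frequency-domain Wiener deconvolution applied per normal component, offered as a stand-in for the paper's method: the known scattering kernel, the SNR constant and the final renormalization are assumptions of this sketch.

```python
import numpy as np

def wiener_deconvolve(blurred, kernel, snr=1e-2):
    # Zero-pad the kernel to image size and shift its center to the origin,
    # then apply the standard Wiener filter in the Fourier domain.
    pad = np.zeros(blurred.shape)
    kh, kw = kernel.shape
    pad[:kh, :kw] = kernel
    pad = np.roll(pad, (-(kh // 2), -(kw // 2)), axis=(0, 1))
    H = np.fft.fft2(pad)
    G = np.fft.fft2(blurred)
    W = np.conj(H) / (np.abs(H) ** 2 + snr)
    return np.real(np.fft.ifft2(W * G))

def deconvolve_normals(blurred_normals, kernel):
    # Deconvolve each normal component independently, then renormalize
    # to unit length so the result is a valid normal map.
    n = np.stack([wiener_deconvolve(blurred_normals[..., c], kernel)
                  for c in range(3)], axis=-1)
    return n / (np.linalg.norm(n, axis=-1, keepdims=True) + 1e-8)
```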

Chika Inoshita, Yasuhiro Mukaigawa, Yasuyuki Matsushita, Yasushi Yagi
Intrinsic Video

Intrinsic images such as albedo and shading are valuable for later stages of visual processing. Previous methods for extracting albedo and shading use either single images or images together with depth data. Instead, we define intrinsic video estimation as the problem of extracting temporally coherent albedo and shading from video alone. Our approach exploits the assumption that albedo is constant over time while shading changes slowly. Optical flow aids the accurate estimation of intrinsic video by providing temporal continuity as well as putative surface boundaries. Additionally, we find that the estimated albedo sequence can be used to improve optical flow accuracy in sequences with changing illumination. The approach makes only weak assumptions about the scene, and we show that it substantially outperforms existing single-frame intrinsic image methods. We evaluate this quantitatively on synthetic sequences as well as on challenging natural sequences with complex geometry, motion, and illumination.

Naejin Kong, Peter V. Gehler, Michael J. Black
Robust and Accurate Non-parametric Estimation of Reflectance Using Basis Decomposition and Correction Functions

A common approach to non-parametric BRDF estimation is the approximation of the sparsely measured input using basis decomposition. In this paper we greatly improve the fitting accuracy of such methods by iteratively applying a novel correction function to an initial estimate. We also introduce a basis to efficiently represent such a function. Based on this general concept we propose an iterative algorithm that is able to explicitly identify and treat outliers in the input data. Our method is invariant to different error metrics which alleviates the error-prone choice of an appropriate one for the given input. We evaluate our method based on a large set of experiments generated from 100 real-world BRDFs and 16 newly measured materials. The experiments show that our method outperforms other evaluated state-of-the-art basis decomposition methods by an order of magnitude in the perceptual sense for outlier ratios up to 40%.

Tobias Nöll, Johannes Köhler, Didier Stricker
Intrinsic Textures for Relightable Free-Viewpoint Video

This paper presents an approach to estimate the intrinsic texture properties (albedo, shading, normal) of scenes from multiple-view acquisition under unknown illumination conditions. We introduce the concept of intrinsic textures, which are pixel-resolution surface textures representing the intrinsic appearance parameters of a scene. Unlike previous video relighting methods, the approach does not assume regions of uniform albedo, which makes it applicable to richly textured scenes. We show that intrinsic image methods can be used to refine an initial, low-frequency shading estimate based on a global lighting reconstruction from an original texture and coarse scene geometry in order to resolve the inherent global ambiguity in shading. The method is applied to relighting of free-viewpoint rendering from multiple-view video capture. This demonstrates relighting with reproduction of fine surface detail. Quantitative evaluation on synthetic models with textured appearance shows accurate estimation of intrinsic surface reflectance properties.

James Imber, Jean-Yves Guillemaut, Adrian Hilton
Reasoning about Object Affordances in a Knowledge Base Representation

Reasoning about objects and their affordances is a fundamental problem for visual intelligence. Most previous work casts this problem as a classification task where separate classifiers are trained to label objects, recognize attributes, or assign affordances. In this work, we consider the problem of object affordance reasoning using a knowledge base representation. Diverse information about objects is first harvested from images and other meta-data sources. We then learn a knowledge base (KB) using a Markov Logic Network (MLN). Given the learned KB, we show that a diverse set of visual inference tasks can be performed in this unified framework without training separate classifiers, including zero-shot affordance prediction and object recognition given human poses.

Yuke Zhu, Alireza Fathi, Li Fei-Fei
Binary Codes Embedding for Fast Image Tagging with Incomplete Labels

Tags have been widely utilized for better annotating, organizing and searching for desirable images. Image tagging is the problem of automatically assigning tags to images. One major challenge for image tagging is that the existing/training labels associated with image examples might be incomplete and noisy. Valuable prior work has focused on improving the accuracy of the assigned tags, but very limited work tackles the efficiency issue in image tagging, which is a critical problem in many large-scale real-world applications. This paper proposes a novel Binary Codes Embedding approach for Fast Image Tagging (BCE-FIT) with incomplete labels. In particular, we construct compact binary codes for both image examples and tags such that the observed tags are consistent with the constructed binary codes. We then formulate the problem of learning binary codes as a discrete optimization problem. An efficient iterative method is developed to solve the relaxation problem, followed by a novel binarization method based on orthogonal transformation to obtain the binary codes from the relaxed solution. Experimental results on two large-scale datasets demonstrate that the proposed approach can achieve accuracy similar to state-of-the-art methods while requiring much less time, which is important for large-scale applications.

Qifan Wang, Bin Shen, Shumiao Wang, Liang Li, Luo Si
Recognizing Products: A Per-exemplar Multi-label Image Classification Approach

Large-scale instance-level image retrieval aims at retrieving specific instances of objects or scenes. Simultaneously retrieving multiple objects in a test image adds to the difficulty of the problem, especially if the objects are visually similar. This paper presents an efficient approach for per-exemplar multi-label image classification, which targets the recognition and localization of products in retail store images. We achieve runtime efficiency through the use of discriminative random forests, deformable dense pixel matching and genetic algorithm optimization. Cross-dataset recognition is performed, where our training images are taken in ideal conditions with only a single training image per product label, while the evaluation set is taken using a mobile phone in real-life scenarios in completely different conditions. In addition, we provide a large novel dataset and labeling tools for product image search, to motivate further research efforts on multi-label retail product image classification. The proposed approach achieves promising results in terms of both accuracy and runtime efficiency on 680 annotated images of our dataset, and 885 test images of the GroZi-120 dataset. We make our dataset of 8350 different product images and the 680 test images from retail stores with complete annotations available to the wider community.

Marian George, Christian Floerkemeier
Part-Pair Representation for Part Localization

In this paper, we propose a novel part-pair representation for part localization. In this representation, an object is treated as a collection of part pairs to model its shape and appearance. By changing the set of pairs to be used, we are able to impose either stronger or weaker geometric constraints on the part configuration. As for the appearance, we build pair detectors for each part pair, which model the appearance of an object at different levels of granularities. Our method of part localization exploits the part-pair representation, featuring the combination of non-parametric exemplars and parametric regression models. Non-parametric exemplars help generate reliable part hypotheses from very noisy pair detections. Then, the regression models are used to group the part hypotheses in a flexible way to predict the part locations. We evaluate our method extensively on the dataset CUB-200-2011 [32], where we achieve significant improvement over the state-of-the-art method on bird part localization. We also experiment with human pose estimation, where our method produces comparable results to existing works.

Jiongxin Liu, Yinxiao Li, Peter N. Belhumeur
Weakly Supervised Learning of Objects, Attributes and Their Associations

When humans describe images, they tend to use combinations of nouns and adjectives, corresponding to objects and their associated attributes respectively. To generate such a description automatically, one needs to model objects, attributes and their associations. Conventional methods require strong annotation of object and attribute locations, making them less scalable. In this paper, we model object-attribute associations from weakly labelled images, such as those widely available on media-sharing sites (e.g. Flickr), where only image-level labels (either objects or attributes) are given, without their locations and associations. This is achieved by introducing a novel weakly supervised non-parametric Bayesian model. Once learned, given a new image, our model can describe the image, including objects, attributes and their associations, as well as their locations and segmentation. Extensive experiments on benchmark datasets demonstrate that our weakly supervised model performs on par with strongly supervised models on tasks such as image description and retrieval based on object-attribute associations.

Zhiyuan Shi, Yongxin Yang, Timothy M. Hospedales, Tao Xiang
Interestingness Prediction by Robust Learning to Rank

The problem of predicting image or video interestingness from their low-level feature representations has received increasing interest. As a highly subjective visual attribute, annotating the interestingness value of training data for learning a prediction model is challenging. To make the annotation less subjective and more reliable, recent studies employ crowdsourcing tools to collect pairwise comparisons – relying on majority voting to prune the annotation outliers/errors. In this paper, we propose a more principled way to identify annotation outliers by formulating the interestingness prediction task as a unified robust learning to rank problem, tackling both the outlier detection and interestingness prediction tasks jointly. Extensive experiments on both image and video interestingness benchmark datasets demonstrate that our new approach significantly outperforms state-of-the-art alternatives.

Yanwei Fu, Timothy M. Hospedales, Tao Xiang, Shaogang Gong, Yuan Yao
Pairwise Probabilistic Voting: Fast Place Recognition without RANSAC

Place recognition currently suffers from a lack of scalability due to the need for strong geometric constraints, which as of yet are typically limited to RANSAC implementations. In this paper, we present a method to successfully achieve state-of-the-art performance, in both recognition accuracy and speed, without the need for RANSAC. We propose to discretise each feature pair in an image, in both appearance and 2D geometry, to create a triplet of words: one each for the appearance of the two features, and one for the pairwise geometry. This triplet is then passed through an inverted index to find examples of such pairwise configurations in the database. Finally, a global geometry constraint is enforced by considering the maximum-clique in an adjacency graph of pairwise correspondences. The discrete nature of the problem allows for tractable probabilistic scores to be assigned to each correspondence, and the least informative feature pairs can be eliminated from the database for memory and time efficiency. We demonstrate the performance of our method on several large-scale datasets, and show improvements over several baselines.
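
A bare-bones sketch of the triplet-of-words indexing idea follows, with hypothetical quantization (fixed distance and angle bins) and uniform votes standing in for the paper's probabilistic scores and maximum-clique verification:

```python
import math
from collections import defaultdict
from itertools import combinations

def geometry_word(f1, f2, dist_bins=8, angle_bins=8, dist_scale=32.0):
    # Quantize the pairwise geometry (distance and orientation) into one word.
    dx, dy = f2[0] - f1[0], f2[1] - f1[1]
    d = min(int(math.hypot(dx, dy) / dist_scale), dist_bins - 1)
    a = int((math.atan2(dy, dx) % (2 * math.pi)) / (2 * math.pi) * angle_bins)
    return d * angle_bins + a

def triplet(f1, f2):
    # One word for each feature's appearance plus one for their geometry.
    return (min(f1[2], f2[2]), max(f1[2], f2[2]), geometry_word(f1, f2))

def build_index(database):
    # database: {image_id: [(x, y, visual_word), ...]} -> inverted index.
    index = defaultdict(list)
    for image_id, feats in database.items():
        for f1, f2 in combinations(feats, 2):
            index[triplet(f1, f2)].append(image_id)
    return index

def query(index, feats):
    # Each matching pairwise configuration casts one vote for its image.
    votes = defaultdict(int)
    for f1, f2 in combinations(feats, 2):
        for image_id in index.get(triplet(f1, f2), ()):
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: -kv[1])
```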

Edward David Johns, Guang-Zhong Yang
Robust Instance Recognition in Presence of Occlusion and Clutter

We present a robust learning-based instance recognition framework from single-view point clouds. Our framework is able to handle real-world instance recognition challenges, i.e., clutter, similar-looking distractors and occlusion. Recent algorithms have separately tried to address the problem of clutter [9] and occlusion [16] but fail when these challenges are combined. In comparison, we handle all challenges within a single framework. Our framework uses a soft-label Random Forest [5] to learn discriminative shape features of an object and uses them to classify both its location and pose. We propose a novel iterative training scheme for forests which maximizes the margin between classes to improve recognition accuracy, as compared to a conventional training procedure. The learnt forest outperforms template matching and DPM [7] in the presence of similar-looking distractors. Using occlusion information computed from the depth data, the forest learns to emphasize the shape features from the visible regions, thus making it robust to occlusion. We benchmark our system with the state-of-the-art recognition systems [9,7] in challenging scenes drawn from the largest publicly available dataset. To complement the lack of occlusion tests in this dataset, we introduce our Desk3D dataset and demonstrate that our algorithm outperforms other methods in all settings.

Ujwal Bonde, Vijay Badrinarayanan, Roberto Cipolla
Learning 6D Object Pose Estimation Using 3D Object Coordinates

This work addresses the problem of estimating the 6D pose of specific objects from a single RGB-D image. We present a flexible approach that can deal with generic objects, both textured and texture-less. The key new concept is a learned, intermediate representation in the form of a dense 3D object coordinate labelling paired with a dense class labelling. We show that for a common dataset with texture-less objects, where template-based techniques are suitable and state of the art, our approach is slightly superior in terms of accuracy. We also demonstrate the benefits of our approach, compared to template-based techniques, in terms of robustness to varying lighting conditions. Towards this end, we contribute a new ground truth dataset with 10k images of 20 objects, each captured under three different lighting conditions. We demonstrate that our approach scales well with the number of objects and is capable of running fast.

Eric Brachmann, Alexander Krull, Frank Michel, Stefan Gumhold, Jamie Shotton, Carsten Rother
Growing Regression Forests by Classification: Applications to Object Pose Estimation

In this work, we propose a novel node splitting method for regression trees and incorporate it into the regression forest framework. Unlike traditional binary splitting, where the splitting rule is selected from a predefined set of binary splitting rules via trial-and-error, the proposed node splitting method first finds clusters of the training data which at least locally minimize the empirical loss without considering the input space. Then splitting rules which preserve the found clusters as much as possible are determined by casting the problem into a classification problem. Consequently, our new node splitting method enjoys more freedom in choosing the splitting rules, resulting in more efficient tree structures. In addition to the Euclidean target space, we present a variant which can naturally deal with a circular target space by the proper use of circular statistics. We apply the regression forest employing our node splitting to head pose estimation (Euclidean target space) and car direction estimation (circular target space) and demonstrate that the proposed method significantly outperforms state-of-the-art methods (38.5% and 22.5% error reduction respectively).

Kota Hara, Rama Chellappa
Stacked Deformable Part Model with Shape Regression for Object Part Localization

This paper explores the localization of pre-defined semantic object parts, which is much more challenging than traditional object detection and very important for applications such as face recognition, HCI and fine-grained object recognition. To address this problem, we make two critical improvements over the widely used deformable part model (DPM). The first is that we use appearance-based shape regression to globally estimate the anchor location of each part and then locally refine each part according to the estimated anchor location under the constraint of DPM. The DPM with shape regression (SR-DPM) is more flexible than the traditional DPM in that it relaxes the fixed anchor location of each part. It enjoys the same efficient dynamic-programming inference as the traditional DPM and can be discriminatively trained via a coordinate descent procedure. The second is that we propose to stack multiple SR-DPMs, where each layer uses the output of the previous SR-DPM as input to progressively refine the result. This provides an analogy to deep neural networks while benefiting from hand-crafted features and models. The proposed methods are applied to human pose estimation, face alignment and general object part localization tasks and achieve state-of-the-art performance.

Junjie Yan, Zhen Lei, Yang Yang, Stan Z. Li
Transductive Multi-view Embedding for Zero-Shot Recognition and Annotation

Most existing zero-shot learning approaches exploit transfer learning via an intermediate-level semantic representation such as visual attributes or semantic word vectors. Such a semantic representation is shared between an annotated auxiliary dataset and a target dataset with no annotation. A projection from a low-level feature space to the semantic space is learned from the auxiliary dataset and is applied without adaptation to the target dataset. In this paper we identify an inherent limitation with this approach. That is, due to having disjoint and potentially unrelated classes, the projection functions learned from the auxiliary dataset/domain are biased when applied directly to the target dataset/domain. We call this problem the projection domain shift problem and propose a novel framework, transductive multi-view embedding, to solve it. It is ‘transductive’ in that unlabelled target data points are explored for projection adaptation, and ‘multi-view’ in that both the low-level feature (view) and multiple semantic representations (views) are embedded to rectify the projection shift. We demonstrate through extensive experiments that our framework (1) rectifies the projection shift between the auxiliary and target domains, (2) exploits the complementarity of multiple semantic representations, (3) achieves state-of-the-art recognition results on image and video benchmark datasets, and (4) enables novel cross-view annotation tasks.

Yanwei Fu, Timothy M. Hospedales, Tao Xiang, Zhenyong Fu, Shaogang Gong
Self-explanatory Sparse Representation for Image Classification

Traditional sparse representation algorithms usually operate in a single Euclidean space. This paper leverages a self-explanatory reformulation of sparse representation, i.e., linking the learned dictionary atoms with the original feature spaces explicitly, to extend simultaneous dictionary learning and sparse coding into reproducing kernel Hilbert spaces (RKHS). The resulting single-view self-explanatory sparse representation (SSSR) is applicable to an arbitrary kernel space and has the nice property that the derivatives with respect to parameters of the coding are independent of the chosen kernel. With SSSR, multiple-view self-explanatory sparse representation (MSSR) is proposed to capture and combine various salient regions and structures from different kernel spaces. This is equivalent to learning a nonlinear structured dictionary, whose complexity is reduced by learning a set of smaller dictionary blocks via SSSR. SSSR and MSSR are then incorporated into a spatial pyramid matching framework and developed for image classification. Extensive experimental results on four benchmark datasets, including UIUC-Sports, Scene 15, Caltech-101, and Caltech-256, demonstrate the effectiveness of our proposed algorithm.

Bao-Di Liu, Yu-Xiong Wang, Bin Shen, Yu-Jin Zhang, Martial Hebert
Efficient k-Support Matrix Pursuit

In this paper, we study the k-support norm regularized matrix pursuit problem, which is regarded as the core formulation for several popular computer vision tasks. The k-support matrix norm, a convex relaxation of the matrix sparsity combined with the ℓ2-norm penalty, generalizes the recently proposed k-support vector norm. The contributions of this work are two-fold. First, the proposed k-support matrix norm does not suffer from the disadvantages of existing matrix norms towards sparsity and/or low-rankness: 1) too sparse/dense, and/or 2) column independent. Second, we present an efficient procedure for k-support norm optimization, in which the computation of the key proximity operator is substantially accelerated by binary search. Extensive experiments on subspace segmentation, semi-supervised classification and sparse coding well demonstrate the superiority of the new regularizer over existing matrix-norm regularizers, and also the orders-of-magnitude speedup compared with the existing optimization procedure for the k-support norm.

Hanjiang Lai, Yan Pan, Canyi Lu, Yong Tang, Shuicheng Yan
Geodesic Regression on the Grassmannian

This paper considers the problem of regressing data points on the Grassmann manifold over a scalar-valued variable. The Grassmannian has recently gained considerable attention in the vision community with applications in domain adaptation, face recognition, shape analysis, or the classification of linear dynamical systems. Motivated by the success of these approaches, we introduce a principled formulation for regression tasks on that manifold. We propose an intrinsic geodesic regression model generalizing classical linear least-squares regression. Since geodesics are parametrized by a starting point and a velocity vector, the model enables the synthesis of new observations on the manifold. To exemplify our approach, we demonstrate its applicability on three vision problems where data objects can be represented as points on the Grassmannian: the prediction of traffic speed and crowd counts from dynamical system models of surveillance videos, and the modeling of aging trends in human brain structures using an affine-invariant shape representation.

Yi Hong, Roland Kwitt, Nikhil Singh, Brad Davis, Nuno Vasconcelos, Marc Niethammer
Model Selection by Linear Programming

Budget constraints arise in many computer vision problems. Computational costs limit many automated recognition systems, while crowdsourced systems are hindered by monetary costs. We leverage the wide variability in image complexity and learn adaptive model selection policies. Our learnt policy maximizes performance under average budget constraints by selecting “cheap” models for low-complexity instances and utilizing descriptive models only for complex ones. During training, we assume access to a set of models that utilize features of different costs and types. We consider a binary tree architecture where each leaf corresponds to a different model. Internal decision nodes adaptively guide the model-selection process along paths in the tree. The learning problem can be posed as an empirical risk minimization over training data with a non-convex objective function. Using hinge loss surrogates, we show that adaptive model selection reduces to a linear program, thus realizing substantial computational efficiencies and guaranteed convergence properties.

Joseph Wang, Tolga Bolukbasi, Kirill Trapeznikov, Venkatesh Saligrama
Perceptually Inspired Layout-Aware Losses for Image Segmentation

Interactive image segmentation is an important computer vision problem that has numerous real world applications. Models for image segmentation are generally trained to minimize the Hamming error in pixel labeling. The Hamming loss does not ensure that the topology/structure of the object being segmented is preserved and therefore is not a strong indicator of the quality of the segmentation as perceived by users. However, it is still ubiquitously used for training models because it decomposes over pixels and thus enables efficient learning. In this paper, we propose the use of a novel family of higher-order loss functions that encourage segmentations whose layout is similar to the ground-truth segmentation. Unlike the Hamming loss, these loss functions do not decompose over pixels and therefore cannot be directly used for loss-augmented inference. We show how our loss functions can be transformed to allow efficient learning and demonstrate the effectiveness of our method on a challenging segmentation dataset and validate the results using a user study. Our experimental results reveal that training with our layout-aware loss functions results in better segmentations that are preferred by users over segmentations obtained using conventional loss functions.

Anton Osokin, Pushmeet Kohli
Large Margin Local Metric Learning

Linear metric learning is a widely used methodology for learning a dissimilarity function from a set of similar/dissimilar example pairs. Using a single metric may be too restrictive an assumption when handling heterogeneous datasets. Recently, local metric learning methods have been introduced to overcome this limitation. However, they are subject to constraints preventing their usage in many applications; for example, they require knowledge of the class label of the training points. In this paper, we present a novel local metric learning method which overcomes some limitations of previous approaches. The method first computes a Gaussian Mixture Model from a low-dimensional embedding of the training data. It then estimates a set of local metrics by solving a convex optimization problem; finally, a dissimilarity function is obtained by aggregating the local metrics. Our experiments show that the proposed method achieves state-of-the-art results on four datasets.

Julien Bohné, Yiming Ying, Stéphane Gentric, Massimiliano Pontil
Movement Pattern Histogram for Action Recognition and Retrieval

We present a novel action representation based on encoding the global temporal movement of an action. We represent an action as a set of movement pattern histograms that encode the global temporal dynamics of an action. Our key observation is that temporal dynamics of an action are robust to variations in appearance and viewpoint changes, making it useful for action recognition and retrieval. We pose the problem of computing similarity between action representations as a maximum matching problem in a bipartite graph. We demonstrate the effectiveness of our method for cross-view action recognition on the IXMAS dataset. We also show how our representation complements existing bag-of-features representations on the UCF50 dataset. Finally we show the power of our representation for action retrieval on a new real-world dataset containing repetitive motor movements emitted by children with autism in an unconstrained classroom setting.

Arridhana Ciptadi, Matthew S. Goodwin, James M. Rehg
Pose Filter Based Hidden-CRF Models for Activity Detection

Detecting activities that involve a sequence of complex pose and motion changes in unsegmented videos is a challenging task, and common approaches use sequential graphical models to infer the human pose-state in every frame. We propose an alternative model based on detecting the key-poses in a video, where only the temporal positions of a few key-poses are inferred. We also introduce a novel pose summarization algorithm to automatically discover the key-poses of an activity. We learn a detection filter for each key-pose; these filters, along with a bag-of-words root filter, are combined in an HCRF model whose parameters are learned using latent-SVM optimization. We evaluate the detection performance of our model on unsegmented videos from four human action datasets, which include challenging crowded scenes with dynamic backgrounds, inter-person occlusions, multi-human interactions and hard-to-detect daily-use objects.
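As a simplified stand-in for the inference (assuming additive per-frame filter responses, which is an assumption of this sketch rather than the paper's full HCRF), inferring only the temporal positions of a few ordered key-poses is a small dynamic program: find t_1 < ... < t_K maximizing the summed responses.

```python
# Minimal sketch: given responses R[t, k] of K key-pose filters over T
# frames, find ordered positions t_1 < ... < t_K of maximum total response.
import numpy as np

def best_keypose_positions(R):
    T, K = R.shape
    dp = np.full((T, K), -np.inf)      # dp[t, k]: best score, pose k at frame t
    back = np.zeros((T, K), dtype=int)
    dp[:, 0] = R[:, 0]
    for k in range(1, K):
        best_val, best_t = -np.inf, -1
        for t in range(1, T):          # track the best predecessor so far
            if dp[t - 1, k - 1] > best_val:
                best_val, best_t = dp[t - 1, k - 1], t - 1
            dp[t, k] = best_val + R[t, k]
            back[t, k] = best_t
    t = int(np.argmax(dp[:, K - 1]))
    path = [t]
    for k in range(K - 1, 0, -1):      # backtrack the key-pose positions
        t = back[t, k]
        path.append(t)
    return path[::-1]

rng = np.random.default_rng(0)
R = rng.normal(size=(50, 3))           # toy responses: 50 frames, 3 key-poses
print(best_keypose_positions(R))
```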

Prithviraj Banerjee, Ram Nevatia
Action Recognition Using Super Sparse Coding Vector with Spatio-temporal Awareness

This paper presents a novel framework for human action recognition based on sparse coding. We introduce an effective coding scheme to aggregate low-level descriptors into a super descriptor vector (SDV). To incorporate spatio-temporal information, we propose the super location vector (SLV), which models the space-time locations of local interest points far more compactly than spatio-temporal pyramid representations. SDV and SLV are finally combined into the super sparse coding vector (SSCV), which jointly models motion, appearance, and location cues. This representation is computationally efficient and yields superior performance with linear classifiers. In extensive experiments, our approach significantly outperforms the state-of-the-art results on two public benchmark datasets, HMDB51 and YouTube.
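The aggregation idea can be sketched in a VLAD-style form: sparse-code local descriptors against a dictionary, then pool code-weighted residuals per atom into one global vector. This is a hedged sketch of that general pattern, not the paper's exact SSCV pipeline; the dictionary size and OMP sparsity are arbitrary choices.

```python
# Minimal sketch: sparse-code toy descriptors against a K-means dictionary
# and pool code-weighted residuals per atom into a single "super" vector.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import sparse_encode

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 32))                  # toy local descriptors
D = KMeans(n_clusters=16, n_init=4, random_state=0).fit(X).cluster_centers_
D /= np.linalg.norm(D, axis=1, keepdims=True)   # unit-norm dictionary atoms

codes = sparse_encode(X, D, algorithm='omp', n_nonzero_coefs=3)

# pool code-weighted residuals per dictionary atom, flatten, normalize
sdv = np.einsum('ik,ikd->kd', codes, X[:, None, :] - D[None, :, :]).ravel()
sdv /= np.linalg.norm(sdv) + 1e-12
print(sdv.shape)                                # (16 * 32,) = (512,)
```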

Xiaodong Yang, YingLi Tian
HOPC: Histogram of Oriented Principal Components of 3D Pointclouds for Action Recognition

Existing techniques for 3D action recognition are sensitive to viewpoint variations because they extract features from depth images, which change significantly with viewpoint. In contrast, we directly process the pointclouds and propose a new technique for action recognition which is more robust to noise, action speed and viewpoint variations. Our technique consists of a novel descriptor and a keypoint detection algorithm. The proposed descriptor is extracted at a point by encoding the Histogram of Oriented Principal Components (HOPC) within an adaptive spatio-temporal support volume around that point. Based on this descriptor, we present a novel method to detect Spatio-Temporal Key-Points (STKPs) in 3D pointcloud sequences. Experimental results show that the proposed descriptor and STKP detector outperform state-of-the-art algorithms on three benchmark human activity datasets. We also introduce a new multiview public dataset and show the robustness of our proposed method to viewpoint variations.
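The core of the descriptor is classical: PCA of a point's local neighborhood yields principal axes and eigenvalues, and each eigenvalue-weighted axis votes into a set of quantized 3D directions. The sketch below follows that recipe on a toy cloud; the support radius and the direction set are illustrative assumptions, not the paper's parameters.

```python
# Minimal sketch: eigenvalue-weighted votes of local principal axes into
# a set of quantized 3D directions (a simplified HOPC-style histogram).
import numpy as np

def uniform_dirs(n, seed=0):
    # quasi-uniform unit directions (an illustrative choice)
    g = np.random.default_rng(seed).normal(size=(n, 3))
    return g / np.linalg.norm(g, axis=1, keepdims=True)

def hopc_sketch(points, center, radius=0.5, n_dirs=20):
    nbrs = points[np.linalg.norm(points - center, axis=1) < radius]
    evals, evecs = np.linalg.eigh(np.cov(nbrs.T))   # ascending eigenvalues
    dirs = uniform_dirs(n_dirs)
    hist = np.zeros(n_dirs)
    for lam, v in zip(evals, evecs.T):              # each principal axis votes
        hist[np.argmax(np.abs(dirs @ v))] += lam    # ...into its closest bin
    return hist / (hist.sum() + 1e-12)

pts = np.random.default_rng(1).normal(size=(2000, 3)) * [1.0, 0.5, 0.1]
print(hopc_sketch(pts, np.zeros(3)))
```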

Hossein Rahmani, Arif Mahmood, Du Q. Huynh, Ajmal Mian
Natural Action Recognition Using Invariant 3D Motion Encoding

We investigate the recognition of actions “in the wild” using 3D motion information. The lack of control over (and knowledge of) the camera configuration exacerbates this already challenging task by introducing systematic projective inconsistencies between 3D motion fields, hugely increasing intra-class variance. By introducing a robust, sequence-based stereo calibration technique, we reduce these inconsistencies from fully projective to a simple similarity transform. We then introduce motion encoding techniques which provide the necessary scale invariance, along with additional invariances to changes in camera viewpoint.

On the recent Hollywood 3D natural action recognition dataset, we show improvements of 40% over previous state-of-the-art techniques based on implicit motion encoding. We also demonstrate that our robust sequence calibration simplifies the task of recognising actions, leading to recognition rates 2.5 times those for the same technique without calibration. In addition, the sequence calibrations are made available.
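Once the ambiguity is reduced to a similarity transform, scale invariance becomes cheap to obtain. As a hedged illustration (not the paper's encoding), the sketch below normalizes 3D motion vectors by a robust magnitude statistic and histograms directions instead of raw vectors, so a global rescaling of the motion field leaves the encoding unchanged.

```python
# Minimal sketch: a scale-invariant motion encoding -- normalize by the
# median magnitude and histogram vector directions (azimuth only, for brevity).
import numpy as np

def encode_motion(flows, n_bins=8):
    flows = flows / (np.median(np.linalg.norm(flows, axis=1)) + 1e-12)
    az = np.arctan2(flows[:, 1], flows[:, 0])
    hist, _ = np.histogram(az, bins=n_bins, range=(-np.pi, np.pi),
                           weights=np.linalg.norm(flows, axis=1))
    return hist / (hist.sum() + 1e-12)

f = np.random.default_rng(0).normal(size=(1000, 3))
print(np.allclose(encode_motion(f), encode_motion(3.7 * f)))  # True: scale-invariant
```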

Simon Hadfield, Karel Lebeda, Richard Bowden
Detecting Social Actions of Fruit Flies

We describe a system that tracks pairs of fruit flies and automatically detects and classifies their actions. We experimentally compare the value of a frame-level feature representation with the more elaborate notion of ‘bout features’ that capture the structure within actions. Similarly, we compare a simple sliding-window classifier architecture with a more sophisticated structured output architecture, and find that window-based detectors outperform their much slower structured counterparts and approach human performance. In addition, we test our top-performing detector on the CRIM13 mouse dataset, finding that it matches the performance of the best published method. Our Fly-vs-Fly dataset contains 22 hours of video showing pairs of fruit flies engaging in 10 social interactions in three different contexts; it is fully annotated by experts and published with articulated pose trajectory features.
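A window-based detector of this kind is simple enough to sketch end to end: score every fixed-length window of a per-frame feature track with a linear classifier, threshold, and merge overlapping hits into bouts. The features (window mean/std) and classifier weights below are toy assumptions, not the paper's.

```python
# Minimal sketch: sliding-window action detection with toy window
# statistics, a linear scorer, and greedy merging of overlapping hits.
import numpy as np

def detect_bouts(frames, w, weights, bias, thresh=0.0):
    scores = []
    for t in range(len(frames) - w + 1):
        win = frames[t:t + w]
        feat = np.concatenate([win.mean(0), win.std(0)])  # window statistics
        scores.append(feat @ weights + bias)
    hits = []
    for t in np.flatnonzero(np.array(scores) > thresh):   # greedy merge
        if hits and t <= hits[-1][1]:
            hits[-1] = (hits[-1][0], t + w)
        else:
            hits.append((t, t + w))
    return hits

rng = np.random.default_rng(0)
frames = rng.normal(size=(300, 4)); frames[100:140] += 2.0  # injected "bout"
wgt = np.zeros(8); wgt[:4] = 1.0                            # responds to mean shift
print(detect_bouts(frames, w=20, weights=wgt, bias=-4.0))
```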

Eyrun Eyjolfsdottir, Steve Branson, Xavier P. Burgos-Artizzu, Eric D. Hoopfer, Jonathan Schor, David J. Anderson, Pietro Perona

Structure from Motion and Feature Matching

Progressive Mode-Seeking on Graphs for Sparse Feature Matching

Sparse feature matching poses three challenges to graph-based methods: (1) the combinatorial nature of the problem makes the number of possible matches huge; (2) most possible matches may be outliers; (3) high computational complexity is often incurred. In this paper, to resolve these issues, we propose a simple yet surprisingly effective approach to exploring the huge matching space that significantly boosts true matches while avoiding outliers. The key idea is to perform mode-seeking on graphs progressively based on our proposed guided graph density. We further design a density-aware sampling technique to considerably accelerate mode-seeking. Experimental studies on various benchmark datasets demonstrate that our method is several orders of magnitude faster than the state-of-the-art methods while achieving much higher precision and recall.
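To ground the term "mode-seeking on graphs", here is a minimal medoid-shift-style iteration (an illustration of the general idea, not the paper's guided graph density): every node hops to its highest-density neighbor until it reaches a fixed point; the fixed points are the modes, and nodes sharing a mode form one cluster.

```python
# Minimal sketch: mode-seeking on an affinity graph. Each node points to
# its highest-density neighbor; pointer chains ascend density and stop at
# local maxima (the modes).
import numpy as np

def graph_modes(W):
    density = W.sum(axis=1)
    nxt = np.where(W > 0, density[None, :], -np.inf).argmax(axis=1)
    nxt = np.where(density[nxt] > density, nxt, np.arange(len(W)))
    for _ in range(len(W)):          # iterate pointers to their fixed points
        nxt = nxt[nxt]
    return nxt                       # mode index for every node

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, .3, (20, 2)), rng.normal(3, .3, (20, 2))])
d2 = ((pts[:, None] - pts[None]) ** 2).sum(-1)
W = np.exp(-d2) * (d2 < 1.0)         # local affinity graph
print(np.unique(graph_modes(W)))     # one mode per cluster expected
```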

Chao Wang, Lei Wang, Lingqiao Liu
Globally Optimal Inlier Set Maximization with Unknown Rotation and Focal Length

Identifying inliers and outliers among data is a fundamental problem for model estimation. This paper considers models composed of a rotation and a focal length, which typically occur in the context of panoramic imaging. An efficient approach consists of computing the underlying model such that the number of inliers is maximized. The most popular tools for inlier set maximization are RANSAC and its numerous variants. While they can provide interesting results, they are not guaranteed to return the globally optimal solution, i.e. the model leading to the highest number of inliers. We propose a novel globally optimal approach based on branch-and-bound. It computes the rotation and the focal length maximizing the number of inlier correspondences and considers the reprojection error in the image space. Our approach has been successfully applied to synthesized data and real images.
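The branch-and-bound principle is easiest to see in a toy 1-DoF analogue (a single unknown 2D rotation angle rather than the paper's 3D rotation plus focal length): split the parameter interval, use an upper bound that counts every correspondence that *could* be an inlier somewhere in the interval, and prune until the best count is certified globally optimal.

```python
# Minimal 1-DoF branch-and-bound sketch: globally maximize the inlier
# count of an unknown 2D rotation angle, with interval upper bounds.
import heapq
import numpy as np

def angular_intervals(x, y, eps):
    # theta is an inlier angle for pair (x, y) iff it lies in [c - eps, c + eps]
    c = np.arctan2(y[:, 1], y[:, 0]) - np.arctan2(x[:, 1], x[:, 0])
    c = np.angle(np.exp(1j * c))                     # wrap to (-pi, pi]
    return c - eps, c + eps

def bnb_max_inliers(lo, hi, tol=1e-4):
    def ub(a, b):                                    # pairs feasible in [a, b]
        return int(np.sum((lo <= b) & (hi >= a)))
    def inliers(theta):
        return int(np.sum((lo <= theta) & (theta <= hi)))
    best_n, best_theta = -1, None
    heap = [(-ub(-np.pi, np.pi), -np.pi, np.pi)]
    while heap:
        neg, a, b = heapq.heappop(heap)
        if -neg <= best_n:
            break                                    # certificate of optimality
        m = 0.5 * (a + b)
        if inliers(m) > best_n:
            best_n, best_theta = inliers(m), m
        if b - a > tol:
            heapq.heappush(heap, (-ub(a, m), a, m))
            heapq.heappush(heap, (-ub(m, b), m, b))
    return best_theta, best_n

rng = np.random.default_rng(0)
theta_true, eps = 0.7, 0.01
x = rng.normal(size=(100, 2))
R = np.array([[np.cos(theta_true), -np.sin(theta_true)],
              [np.sin(theta_true),  np.cos(theta_true)]])
y = x @ R.T
y[:30] = rng.normal(size=(30, 2))                    # 30% outliers
print(bnb_max_inliers(*angular_intervals(x, y, eps)))  # theta ~ 0.7, ~70 inliers
```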

Jean-Charles Bazin, Yongduek Seo, Richard Hartley, Marc Pollefeys
Match Selection and Refinement for Highly Accurate Two-View Structure from Motion

We present an approach to enhance the accuracy of structure from motion (SfM) in the two-view case. We first answer the question: “fewer data with higher accuracy, or more data with less accuracy?” For this, we establish a relation between SfM errors and a function of the number of matches and their epipolar errors. Using an accuracy estimator of individual matches, we then propose a method to select a subset of matches that has a good quality vs. quantity compromise. We also propose a variant of least squares matching to refine match locations based on a focused grid and a multi-scale exploration. Experiments show that both selection and refinement contribute independently to a better accuracy. Their combination reduces errors by a factor of 1.1 to 2.0 for rotation, and 1.6 to 3.8 for translation.
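The quality-vs-quantity compromise can be made concrete with a toy criterion (purely illustrative; the paper derives its tradeoff from an actual SfM error model): sort matches by a per-match accuracy estimate and pick the prefix size k that minimizes a proxy combining the residual level (quality) with a 1/sqrt(k) estimation-noise term (quantity).

```python
# Minimal sketch: choose how many of the most accurate matches to keep,
# trading mean residual (quality) against 1/sqrt(k) noise (quantity).
import numpy as np

def select_matches(residuals):
    order = np.argsort(residuals)                # most accurate first
    ks = np.arange(1, len(order) + 1)
    proxy = np.cumsum(residuals[order]) / ks + 1.0 / np.sqrt(ks)  # toy tradeoff
    return order[:int(np.argmin(proxy)) + 1]     # best prefix size

rng = np.random.default_rng(0)
residuals = np.abs(rng.normal(0.0, 0.1, 400))    # 300 accurate matches...
residuals[300:] += rng.uniform(1.0, 3.0, 100)    # ...and 100 inaccurate ones
sel = select_matches(residuals)
print(len(sel), float(residuals[sel].max()))     # keeps the accurate subset
```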

Zhe Liu, Pascal Monasse, Renaud Marlet
LSD-SLAM: Large-Scale Direct Monocular SLAM

We propose a direct (feature-less) monocular SLAM algorithm which, in contrast to current state-of-the-art direct methods, allows building large-scale, consistent maps of the environment. Along with highly accurate pose estimation based on direct image alignment, the 3D environment is reconstructed in real time as a pose graph of keyframes with associated semi-dense depth maps. These are obtained by filtering over a large number of pixelwise small-baseline stereo comparisons. The explicitly scale-drift-aware formulation allows the approach to operate on challenging sequences including large variations in scene scale. Major enablers are two key novelties: (1) a novel direct tracking method which operates on $\mathfrak{sim}(3)$, thereby explicitly detecting scale drift, and (2) an elegant probabilistic solution to include the effect of noisy depth values into tracking. The resulting direct monocular SLAM system runs in real time on a CPU.
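For intuition on why tracking on Sim(3) makes scale drift explicit: a similarity transform can be stored as a 4x4 matrix [[sR, t], [0, 1]], and the relative scale between two keyframes drops out of the composition. A minimal sketch (illustrative representation, not the LSD-SLAM implementation):

```python
# Minimal sketch: Sim(3) poses as 4x4 matrices [[sR, t], [0, 1]] --
# composing, inverting, and reading off the relative scale between
# two keyframes, which is what makes scale drift an explicit quantity.
import numpy as np

def sim3(s, R, t):
    T = np.eye(4)
    T[:3, :3] = s * R
    T[:3, 3] = t
    return T

def scale_of(T):
    return np.linalg.det(T[:3, :3]) ** (1.0 / 3.0)   # det(sR) = s^3

Rz = lambda a: np.array([[np.cos(a), -np.sin(a), 0],
                         [np.sin(a),  np.cos(a), 0],
                         [0, 0, 1]])
T1 = sim3(1.0, Rz(0.1), np.array([0.0, 0.0, 0.0]))   # keyframe 1 pose
T2 = sim3(1.3, Rz(0.4), np.array([1.0, 0.0, 0.0]))   # keyframe 2 pose

T_rel = np.linalg.inv(T1) @ T2        # relative keyframe transform
print(scale_of(T_rel))                # 1.3: the scale drift is explicit
```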

Jakob Engel, Thomas Schöps, Daniel Cremers
Backmatter
Metadata
Title
Computer Vision – ECCV 2014
Edited by
David Fleet
Tomas Pajdla
Bernt Schiele
Tinne Tuytelaars
Copyright Year
2014
Publisher
Springer International Publishing
Electronic ISBN
978-3-319-10605-2
Print ISBN
978-3-319-10604-5
DOI
https://doi.org/10.1007/978-3-319-10605-2