
2017 | Book

Pattern Recognition

39th German Conference, GCPR 2017, Basel, Switzerland, September 12–15, 2017, Proceedings


About this Book

This book constitutes the refereed proceedings of the 39th German Conference on Pattern Recognition, GCPR 2017, held in Basel, Switzerland, in September 2017. The 33 revised full papers presented were carefully reviewed and selected from 60 submissions. The papers are organized in topical sections on biomedical image processing and analysis; classification and detection; computational photography; image and video processing; machine learning and pattern recognition; mathematical foundations, statistical data analysis and models; motion and segmentation; pose, face and gesture; reconstruction and depth; and tracking.

Table of Contents

Frontmatter

Biomedical Image Processing and Analysis

Frontmatter
A Quantitative Assessment of Image Normalization for Classifying Histopathological Tissue of the Kidney

The advancing pervasion of digital pathology in research and clinical practice results in a strong need for image analysis techniques in the field of histopathology. For a variety of reasons, histopathological imaging generally exhibits a high degree of variability. As automated segmentation approaches are known to be vulnerable, especially to unseen variability, we investigate several stain normalization methods to compensate for variations between different whole slide images. In a large experimental study, we investigate all combinations of five image normalization (not only stain normalization) methods as well as five image representations with respect to the classification performance in two application scenarios in kidney histopathology. Finally, we also pose the question of whether color normalization is sufficient to compensate for the changed properties between whole slide images in an application scenario with little training data.
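The abstract does not name the five normalization methods compared; as a rough illustration of what such methods do, here is a minimal Reinhard-style sketch that matches the per-channel statistics of a source slide to a reference slide (the choice of color space and of mean/std matching is an assumption, not the paper's method):

```python
import numpy as np

def match_channel_stats(source, reference):
    """Minimal stain/color normalization sketch: shift and scale each
    channel of `source` so that its mean and standard deviation match
    `reference`. Both are float image arrays, e.g. converted to LAB."""
    src = source.astype(np.float64).copy()
    ref = reference.astype(np.float64)
    for c in range(src.shape[-1]):
        mu_s, sd_s = src[..., c].mean(), src[..., c].std()
        mu_r, sd_r = ref[..., c].mean(), ref[..., c].std()
        src[..., c] = (src[..., c] - mu_s) / (sd_s + 1e-8) * sd_r + mu_r
    return src
```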

Michael Gadermayr, Sean Steven Cooper, Barbara Klinkhammer, Peter Boor, Dorit Merhof

Classification and Detection

Frontmatter
Deep Learning for Vanishing Point Detection Using an Inverse Gnomonic Projection

We present a novel approach for vanishing point detection from uncalibrated monocular images. In contrast to the state of the art, we make no a priori assumptions about the observed scene. Our method is based on a convolutional neural network (CNN) which does not use natural images, but a Gaussian sphere representation arising from an inverse gnomonic projection of lines detected in an image. This allows us to rely on synthetic data for training, eliminating the need for labelled images. Our method achieves competitive performance on three horizon estimation benchmark datasets. We further highlight some additional use cases for which our vanishing point detection algorithm can be used.
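To make the input representation concrete: under a pinhole model with the image plane at z = 1, an image line corresponds to a great circle on the unit (Gaussian) sphere, and vanishing points lie where the great circles of parallel lines intersect. A minimal sketch of this mapping and a soft rasterization onto a spherical grid (grid resolution and kernel width are illustrative choices, not the paper's settings):

```python
import numpy as np

def line_to_great_circle_normal(a, b, c):
    """An image line ax + by + c = 0 in normalized camera coordinates
    spans, together with the camera center, a plane with normal
    (a, b, c); its trace on the unit sphere is a great circle."""
    n = np.array([a, b, c], dtype=np.float64)
    return n / np.linalg.norm(n)

def accumulate_on_sphere(normals, n_az=64, n_el=32, width=0.05):
    """Softly rasterize great circles into a coarse spherical grid: a
    point p lies on the circle with normal n iff p . n = 0, so each
    cell is weighted by a Gaussian of that dot product."""
    az = np.linspace(-np.pi, np.pi, n_az)
    el = np.linspace(-np.pi / 2, np.pi / 2, n_el)
    A, E = np.meshgrid(az, el)                       # (n_el, n_az)
    dirs = np.stack([np.cos(E) * np.cos(A),
                     np.cos(E) * np.sin(A),
                     np.sin(E)], axis=-1)            # unit directions
    acc = np.zeros((n_el, n_az))
    for n in normals:
        acc += np.exp(-(dirs @ n) ** 2 / (2 * width ** 2))
    return acc
```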

Florian Kluger, Hanno Ackermann, Michael Ying Yang, Bodo Rosenhahn
Learning Where to Drive by Watching Others

The most prominent approach for autonomous cars to learn what areas of a scene are drivable is to utilize tedious human supervision in the form of pixel-wise image labeling for training deep semantic segmentation algorithms. However, the underlying CNNs require vast amounts of this training information, rendering the expensive pixel-wise labeling of images a bottleneck. Thus, we propose a self-supervised approach that is able to utilize the myriad of easily available dashcam videos from YouTube or from autonomous vehicles to perform fully automatic training by simply watching others drive. We play training videos backwards in time and track patches that cars have driven over together with their spatio-temporal interrelations, which are a rich source of context information. Collecting large numbers of these local regions enables fully automatic self-supervision for training a CNN. The proposed method has the potential to extend and complement the popular supervised CNN learning of drivable pixels by using a rich, presently untapped source of unlabeled training data.

Miguel A. Bautista, Patrick Fuchs, Björn Ommer
Learning Dilation Factors for Semantic Segmentation of Street Scenes

Contextual information is crucial for semantic segmentation. However, finding the optimal trade-off between keeping desired fine details and at the same time providing sufficiently large receptive fields is non-trivial. This is even more so when objects or classes present in an image vary significantly in size. Dilated convolutions have proven valuable for semantic segmentation, because they allow the size of the receptive field to be increased without sacrificing image resolution. However, in current state-of-the-art methods, dilation parameters are hand-tuned and fixed. In this paper, we present an approach for learning dilation parameters adaptively per channel, consistently improving semantic segmentation results on street-scene datasets like Cityscapes and Camvid.
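For context, the receptive-field effect the paper builds on can be written down directly: a convolution kernel of size k with dilation factor d inserts d - 1 gaps between taps and thus covers the span of a dense kernel of effective size

```latex
k_{\mathrm{eff}} = k + (k - 1)(d - 1)
```

so stacking layers with growing d enlarges the receptive field without any downsampling; the paper's contribution is to learn d per channel rather than fixing it by hand.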

Yang He, Margret Keuper, Bernt Schiele, Mario Fritz
Learning to Filter Object Detections

Most object detection systems consist of three stages. First, a set of individual hypotheses for object locations is generated using a proposal generating algorithm. Second, a classifier scores every generated hypothesis independently to obtain a multi-class prediction. Finally, all scored hypotheses are filtered via a non-differentiable and decoupled non-maximum suppression (NMS) post-processing step. In this paper, we propose a filtering network (FNet), a method which replaces NMS with a differentiable neural network that allows joint reasoning and re-scoring of the generated set of hypotheses per image. This formulation enables end-to-end training of the full object detection pipeline. First, we demonstrate that FNet, a feed-forward network architecture, is able to mimic NMS decisions, despite the sequential nature of NMS. We further analyze NMS failures and propose a loss formulation that is better aligned with the mean average precision (mAP) evaluation metric. We evaluate FNet on several standard detection datasets. Results surpass standard NMS on highly occluded settings of a synthetic overlapping MNIST dataset and show competitive behavior on PascalVOC2007 and KITTI detection benchmarks.
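For reference, the decoupled post-processing step that FNet is trained to replace; a minimal greedy NMS in its usual form (the IoU threshold is a free parameter):

```python
import numpy as np

def iou(box, boxes):
    """IoU of one [x1, y1, x2, y2] box against an (M, 4) array."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def greedy_nms(boxes, scores, thresh=0.5):
    """Keep the highest-scoring box, drop overlapping ones, repeat.
    Non-differentiable and sequential, hence hard to train through."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        order = rest[iou(boxes[i], boxes[rest]) <= thresh]
    return keep
```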

Sergey Prokudin, Daniel Kappler, Sebastian Nowozin, Peter Gehler

Computational Photography

Frontmatter
Motion Deblurring in the Wild

We propose a deep learning approach to remove motion blur from a single image captured in the wild, i.e., in an uncontrolled setting. Thus, we consider motion blur degradations that are caused both by camera and object motion, and by occlusions and objects coming into view. In this scenario, a model-based approach would require a very large set of parameters, whose fitting is a challenge on its own. Hence, we take a data-driven approach and design both a novel convolutional neural network architecture and a dataset of blurry images with ground truth. The network directly produces the sharp image as output and is built in three pyramid stages, which allow blur to be removed gradually, from a small amount at the lowest scale to the full amount at the scale of the input image. To obtain corresponding blurry and sharp image pairs, we use videos from a high frame-rate video camera. For each small video clip we select the central frame as the sharp image and use the frame average as the corresponding blurred image. Finally, to ensure that the averaging process is a sufficient approximation to real blurry images, we estimate optical flow and select only clips whose inter-frame displacements are smaller than a pixel. We demonstrate state-of-the-art performance on datasets with both synthetic and real images.
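The data generation step described above is simple enough to sketch directly; a minimal version (the one-pixel flow check is only indicated in a comment):

```python
import numpy as np

def make_blur_pair(frames):
    """Build a (blurry, sharp) training pair from a short clip of a
    high frame-rate video: the temporal average approximates the
    blurry image, the central frame serves as the sharp ground truth.
    The paper additionally discards clips whose optical-flow
    displacements exceed one pixel between consecutive frames."""
    frames = np.asarray(frames, dtype=np.float64)
    blurry = frames.mean(axis=0)
    sharp = frames[len(frames) // 2]
    return blurry, sharp
```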

Mehdi Noroozi, Paramanand Chandramouli, Paolo Favaro
Robust Multi-image HDR Reconstruction for the Modulo Camera

Photographing scenes with high dynamic range (HDR) poses great challenges to consumer cameras with their limited sensor bit depth. To address this, Zhao et al. recently proposed a novel sensor concept – the modulo camera – which captures the least significant bits of the recorded scene instead of going into saturation. Similar to conventional pipelines, HDR images can be reconstructed from multiple exposures, but significantly fewer images are needed than with a typical saturating sensor. While the concept is appealing, we show that the original reconstruction approach assumes noise-free measurements and quickly breaks down otherwise. To address this, we propose a novel reconstruction algorithm that is robust to image noise and produces significantly fewer artifacts. We theoretically analyze correctness as well as limitations, and show that our approach significantly outperforms the baseline on real data.
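The wrap-around behavior at the heart of the modulo camera is easy to state; a noise-free sketch of the forward model (bit depth and gain are illustrative parameters), which also hints at the fragility the paper addresses: once noise perturbs values near the wrap boundary, naive unwrapping breaks down.

```python
import numpy as np

def modulo_capture(irradiance, bits=8, gain=1.0):
    """Idealized modulo sensor: instead of saturating at 2**bits - 1,
    the recorded value wraps around and keeps the least significant
    bits. Real sensors add noise on top, which is what breaks the
    original noise-free reconstruction approach."""
    levels = 2 ** bits
    return np.mod(np.floor(gain * irradiance), levels)
```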

Florian Lang, Tobias Plötz, Stefan Roth
Trainable Regularization for Multi-frame Superresolution

In this paper, we present a novel method for multi-frame superresolution (SR). Our main goal is to improve the spatial resolution of a multi-line scan camera for an industrial inspection task. High resolution output images are reconstructed using our proposed SR algorithm for multi-channel data, which is based on the trainable reaction-diffusion model. As this is a supervised learning approach, we simulate ground truth data for a real imaging scenario. We show that learning a regularizer for the SR problem improves the reconstruction results compared to an iterative reconstruction algorithm using TV or TGV regularization. We test the learned regularizer, trained on simulated data, on images acquired with the real camera setup and achieve excellent results.
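For orientation, a generic multi-frame SR energy of the kind whose hand-crafted regularizer the paper replaces with a trained one (the concrete data model is an assumption, the abstract does not spell it out): for K low-resolution frames y_k, downsampling D, blur B, warps W_k, and regularizer R (TV or TGV in the baselines),

```latex
\min_{x} \; \sum_{k=1}^{K} \frac{1}{2} \bigl\| D B W_k x - y_k \bigr\|_2^2 \;+\; \lambda \, R(x)
```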

Teresa Klatzer, Daniel Soukup, Erich Kobler, Kerstin Hammernik, Thomas Pock

Image and Video Processing

Frontmatter
A Comparative Study of Local Search Algorithms for Correlation Clustering

This paper empirically compares four local search algorithms for correlation clustering by applying them to a variety of instances of the correlation clustering problem for the tasks of image segmentation, hand-written digit classification and social network analysis. Although the local search algorithms establish neither lower bounds nor approximation certificates, they converge monotonically to a fixpoint, offering a feasible solution at any time. For some algorithms, the time of convergence is affordable for all instances we consider. This finding encourages a broader application of correlation clustering, especially in settings where the number of clusters is not known and needs to be estimated from data.
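As a concrete, minimal example of this algorithm class (not one of the four the paper compares): greedy single-node moves that strictly decrease the total weight of cut edges and therefore converge monotonically to a fixpoint.

```python
import numpy as np

def local_search(weights, labels, max_sweeps=100):
    """Greedy one-node moves for correlation clustering. `weights` is
    a symmetric matrix of signed edge costs; we minimize the total
    weight of edges cut by the clustering, so a move is accepted only
    if it strictly reduces that objective. `labels` is a list of ints."""
    n = len(labels)
    for _ in range(max_sweeps):
        changed = False
        for u in range(n):
            def cut_cost(lab):
                # only edges incident to u change when u is relabeled
                return sum(weights[u, v] for v in range(n)
                           if v != u and labels[v] != lab)
            candidates = set(labels) | {max(labels) + 1}  # incl. new cluster
            best = min(candidates, key=cut_cost)
            if cut_cost(best) < cut_cost(labels[u]):
                labels[u] = best
                changed = True
        if not changed:        # fixpoint reached
            break
    return labels
```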

Evgeny Levinkov, Alexander Kirillov, Bjoern Andres
Combined Precise Extraction and Topology of Points, Lines and Curves in Man-Made Environments

This article presents a novel method for the combined extraction of points, lines and arcs in images. Geometric primitives are fitted to extracted edge pixels. To obtain points, the intersections between the geometric primitives are calculated. The method allows precise and at the same time robust detection of image features. By constructing a graph describing the topology between the features, more complex structures can be described across multiple connected primitives.

Dominik Wolters, Reinhard Koch
Recurrent Residual Learning for Action Recognition

Action recognition is a fundamental problem in computer vision with many potential applications such as video surveillance, human computer interaction, and robot learning. Given pre-segmented videos, the task is to recognize actions happening within videos. Historically, hand-crafted video features were used to address the task of action recognition. With the success of deep ConvNets as an image analysis method, many extensions of standard ConvNets were proposed to process variable length video data. In this work, we propose a novel recurrent ConvNet architecture called recurrent residual networks to address the task of action recognition. The approach extends ResNet, a state-of-the-art model for image classification. While the original formulation of ResNet aims at learning spatial residuals in its layers, we extend the approach by introducing recurrent connections that allow a spatio-temporal residual to be learned. In contrast to fully recurrent networks, our temporal connections only allow a limited range of preceding frames to contribute to the output for the current frame, enabling efficient training and inference as well as limiting the temporal context to a reasonable local range around each frame. On a large-scale action recognition dataset, we show that our model improves over both the standard ResNet architecture and a ResNet extended by a fully recurrent layer.
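A rough sketch of the core idea, assuming a residual block augmented with one temporal connection (layer sizes and the exact placement of the recurrent term are guesses, not the paper's architecture):

```python
import torch
import torch.nn as nn

class RecurrentResidualBlock(nn.Module):
    """Residual block with an added temporal residual: the output for
    frame t depends on the current features x_t (as in ResNet) plus a
    learned contribution from the previous frame's features h_prev,
    restricting temporal context to a local range."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1)
        self.temporal = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x_t, h_prev):
        return x_t + torch.relu(self.spatial(x_t)) \
                   + torch.relu(self.temporal(h_prev))
```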

Ahsan Iqbal, Alexander Richard, Hilde Kuehne, Juergen Gall
A Local Spatio-Temporal Approach to Plane Wave Ultrasound Particle Image Velocimetry

We present a simple and efficient approach to plane wave ultrasound particle image velocimetry (Echo PIV). Specifically, a carefully designed bank of local motion-sensitive filters is introduced, together with a method for non-linear flow parameter estimation based on time-averaged local flow estimates. The approach is validated and quantitatively assessed using both simulated and in-vitro real data, in scenarios with laminar as well as with turbulent flow.

Ecaterina Bodnariuc, Stefania Petra, Christoph Schnörr, Jason Voorneveld

Machine Learning and Pattern Recognition

Frontmatter
Object Boundary Detection and Classification with Image-Level Labels

Semantic boundary and edge detection aims at simultaneously detecting object edge pixels in images and assigning class labels to them. Systematic training of predictors for this task requires the labeling of edges in images, which is a particularly tedious task. We propose a novel strategy for solving this task, when pixel-level annotations are not available, performing it in an almost zero-shot manner by relying on conventional whole-image neural net classifiers that were trained using large bounding boxes. Our method performs the following two steps at test time. First, it predicts the class labels by applying the trained whole-image network to the test images. Second, it computes pixel-wise scores from the obtained predictions by applying backprop gradients as well as recent visualization algorithms such as deconvolution and layer-wise relevance propagation. We show that high pixel-wise scores are indicative of the location of semantic boundaries, which suggests that the semantic boundary problem can be approached without using edge labels during the training phase.
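The simplest of the listed visualization methods, backprop gradients, fits in a few lines (deconvolution and layer-wise relevance propagation are not reproduced here):

```python
import torch

def input_gradient_saliency(model, image, class_idx):
    """Pixel-wise scores from backprop gradients: the magnitude of the
    class score's gradient w.r.t. the input. `image` is a
    (1, C, H, W) tensor; returns an (H, W) saliency map."""
    image = image.clone().requires_grad_(True)
    score = model(image)[0, class_idx]
    score.backward()
    return image.grad.abs().max(dim=1)[0].squeeze(0)
```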

Jing Yu Koh, Wojciech Samek, Klaus-Robert Müller, Alexander Binder
Semantic Segmentation of Outdoor Areas Using 3D Moment Invariants and Contextual Cues

In this paper, we propose an approach for the semantic segmentation of a 3D point cloud using local 3D moment invariants and the integration of contextual information. Specifically, we focus on the task of analyzing forestal and urban areas which were recorded by terrestrial LiDAR scanners. We demonstrate how 3D moment invariants can be leveraged as local features and that they are on a par with established descriptors. Furthermore, we show how an iterative learning scheme can increase the overall quality by taking neighborhood relationships between classes into account. Our experiments show that the approach achieves very good results for a variety of tasks including both binary and multi-class settings.
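As background on the feature type: raw 3D moments of a local point neighborhood are easy to compute, and moment invariants are specific combinations of them that cancel out rotation and scale (the combinations themselves are not derived here).

```python
import numpy as np

def raw_moments_3d(points, order=2):
    """Raw 3D moments m_pqr = sum x^p y^q z^r over a local point
    neighborhood, centered for translation invariance. `points` is an
    (N, 3) array; returns a dict keyed by (p, q, r)."""
    pts = points - points.mean(axis=0)
    moments = {}
    for p in range(order + 1):
        for q in range(order + 1 - p):
            for r in range(order + 1 - p - q):
                moments[(p, q, r)] = np.sum(
                    pts[:, 0] ** p * pts[:, 1] ** q * pts[:, 2] ** r)
    return moments
```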

Sven Sickert, Joachim Denzler
Neuron Pruning for Compressing Deep Networks Using Maxout Architectures

This paper presents an efficient and robust approach for reducing the size of deep neural networks by pruning entire neurons. It exploits maxout units for combining neurons into more complex convex functions and it makes use of a local relevance measurement that ranks neurons according to their activation on the training set for pruning them. Additionally, a parameter reduction comparison between neuron and weight pruning is shown. It will be empirically shown that the proposed neuron pruning reduces the number of parameters dramatically. The evaluation is performed on two tasks, the MNIST handwritten digit recognition and the LFW face verification, using a LeNet-5 and a VGG16 network architecture. The network size is reduced by up to 74% and 61%, respectively, without affecting the network’s performance. The main advantage of neuron pruning is its direct influence on the size of the network architecture. Furthermore, it will be shown that neuron pruning can be combined with subsequent weight pruning, reducing the size of the LeNet-5 and VGG16 up to 92% and 80%, respectively.
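The ranking step can be sketched with a simple stand-in relevance score (the paper's exact local relevance measurement may differ):

```python
import numpy as np

def rank_neurons(activations):
    """Rank the neurons of one layer by mean absolute activation over
    the training set; least relevant first. `activations` has shape
    (num_samples, num_neurons)."""
    relevance = np.abs(activations).mean(axis=0)
    return np.argsort(relevance)

# e.g. prune the 25% least active neurons (and the corresponding
# rows/columns of the adjacent weight matrices):
# prune_idx = rank_neurons(acts)[: int(0.25 * acts.shape[1])]
```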

Fernando Moya Rueda, Rene Grzeszick, Gernot A. Fink
A Primal Dual Network for Low-Level Vision Problems

In the past, classic energy optimization techniques were the driving force behind many innovations and are a building block for almost any problem in computer vision. Efficient algorithms are mandatory to achieve the real-time processing needed in many applications like autonomous driving. However, energy models, even if designed by human experts, might never be able to fully capture the complexity of natural scenes and images. Similar to optimization techniques, deep learning has changed the landscape of computer vision in recent years and has helped to push the performance of many models to unprecedented heights. Our idea of a primal-dual network is to combine the structure of regular energy optimization techniques, in particular of first-order methods, with the flexibility of deep learning to adapt to the statistics of the input data.
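The first-order structure alluded to is that of primal-dual schemes; one iteration of the Chambolle-Pock method reads as below, and in a primal-dual network the fixed operator K and the proximal maps become learnable (identifying the scheme this concretely is our illustration; the abstract does not name it):

```latex
y^{t+1} = \operatorname{prox}_{\sigma F^{*}}\bigl( y^{t} + \sigma K \bar{x}^{t} \bigr), \quad
x^{t+1} = \operatorname{prox}_{\tau G}\bigl( x^{t} - \tau K^{\top} y^{t+1} \bigr), \quad
\bar{x}^{t+1} = x^{t+1} + \theta \bigl( x^{t+1} - x^{t} \bigr)
```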

Christoph Vogel, Thomas Pock
End-to-End Learning of Video Super-Resolution with Motion Compensation

Learning approaches have shown great success in the task of super-resolving an image given a low-resolution input. Video super-resolution aims to additionally exploit the information from multiple images. Typically, the images are related via optical flow and consecutive image warping. In this paper, we provide an end-to-end video super-resolution network that, in contrast to previous works, includes the estimation of optical flow in the overall network architecture. We analyze the usage of optical flow for video super-resolution and find that common off-the-shelf image warping does not allow video super-resolution to benefit much from optical flow. We rather propose an operation for motion compensation that performs warping from low to high resolution directly. We show that with this network configuration, video super-resolution can benefit from optical flow and we obtain state-of-the-art results on the popular test sets. We also show that the processing of whole images rather than independent patches is responsible for a large increase in accuracy.
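The "common off-the-shelf image warping" the paper finds insufficient is standard backward warping with bilinear sampling; a sketch in PyTorch (the paper's proposed low-to-high-resolution warping operation is not reproduced here):

```python
import torch
import torch.nn.functional as F

def backward_warp(image, flow):
    """Warp `image` (N, C, H, W) by optical `flow` (N, 2, H, W),
    given in pixels, using bilinear sampling."""
    n, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w),
                            indexing="ij")
    grid = torch.stack((xs, ys)).float().unsqueeze(0).to(image)
    coords = grid + flow                      # target sampling positions
    gx = 2 * coords[:, 0] / (w - 1) - 1       # normalize to [-1, 1]
    gy = 2 * coords[:, 1] / (h - 1) - 1
    return F.grid_sample(image, torch.stack((gx, gy), dim=-1),
                         align_corners=True)
```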

Osama Makansi, Eddy Ilg, Thomas Brox
Convolutional Neural Networks for Movement Prediction in Videos

In this work we present a convolutional neural network-based (CNN) model that predicts future movements of a ball given a series of images depicting the ball and its environment. For training and evaluation, we use artificially generated image sequences. Two scenarios are analyzed: prediction in a simple table tennis environment and in a more challenging squash environment. Classical 2D convolution layers are compared with 3D convolution layers that extract the motion information of the ball from contiguous frames. Moreover, we investigate whether networks with stereo visual input perform better than those with monocular vision only. Our experiments suggest that CNNs can indeed predict physical behaviour with small error rates on unseen data but the performance drops for very complex underlying movements.

Alexander Warnecke, Timo Lüddecke, Florentin Wörgötter
Finding the Unknown: Novelty Detection with Extreme Value Signatures of Deep Neural Activations

Achieving or even surpassing human-level accuracy has recently become possible in a variety of application scenarios due to the rise of convolutional neural networks (CNNs) trained on large datasets. However, solving supervised visual recognition tasks by discriminating among known categories is only one side of the coin. In contrast to this, novelty detection is still an unsolved task where instances of yet unknown categories need to be identified. Therefore, we propose to leverage the powerful discriminative nature of CNNs for novelty detection tasks by investigating class-specific activation patterns. More precisely, we assume that a semantic category can be described by its extreme value signature, which specifies which dimensions of the deep neural activations have the largest values. Following this intuition, we show that already a small number of high-valued dimensions suffices to separate known from unknown categories. Our approach is simple, intuitive, and can easily be put on top of CNNs trained for vanilla classification tasks. We empirically validate the benefits of our approach in terms of accuracy and speed by comparing it against established methods in a variety of novelty detection tasks derived from ImageNet. Finally, we show that visualizing extreme value signatures allows us to inspect class-specific patterns learned during training, which may ultimately help to better understand CNN models.
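The signature and a simple overlap-based novelty score can be sketched directly (k and the max-overlap scoring rule are illustrative choices):

```python
import numpy as np

def extreme_value_signature(activation, k=10):
    """Indices of the k largest dimensions of a deep activation
    vector: the class's 'extreme value signature'."""
    return set(np.argsort(activation)[-k:])

def novelty_score(activation, class_signatures, k=10):
    """Little overlap with every known class signature suggests an
    unknown category; higher score means more novel."""
    sig = extreme_value_signature(activation, k)
    return 1.0 - max(len(sig & s) / k for s in class_signatures)
```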

Alexander Schultheiss, Christoph Käding, Alexander Freytag, Joachim Denzler
Improving Facial Landmark Detection via a Super-Resolution Inception Network

Modern convolutional neural networks for facial landmark detection have become increasingly robust against occlusions, lighting conditions and pose variations. With the predictions being close to pixel-accurate in some cases, intuitively, the input resolution should be as high as possible. We verify this intuition by thoroughly analyzing the impact of low image resolution on landmark prediction performance. Indeed, performance degradations are already measurable for faces smaller than 50×50 px. In order to mitigate those degradations, a new super-resolution inception network architecture is developed which outperforms recent super-resolution methods on various data sets. By enhancing low resolution images with our model, we are able to improve upon the state of the art in facial landmark detection.

Martin Knoche, Daniel Merget, Gerhard Rigoll

Mathematical Foundations, Statistical Data Analysis and Models

Frontmatter
Diverse M-Best Solutions by Dynamic Programming

Many computer vision pipelines involve dynamic programming primitives such as finding a shortest path or the minimum energy solution in a tree-shaped probabilistic graphical model. In such cases, extracting not merely the best, but the set of M-best solutions is useful to generate a rich collection of candidate proposals that can be used in downstream processing. In this work, we show how M-best solutions of tree-shaped graphical models can be obtained by dynamic programming on a special graph with M layers. The proposed multi-layer concept is optimal for searching M-best solutions, and so flexible that it can also approximate M-best diverse solutions. We illustrate the usefulness with applications to object detection, panorama stitching and centerline extraction.

Carsten Haubold, Virginie Uhlmann, Michael Unser, Fred A. Hamprecht
Adaptive Regularization in Convex Composite Optimization for Variational Imaging Problems

We propose an adaptive regularization scheme in a variational framework where a convex composite energy functional is optimized. We consider a number of imaging problems, including segmentation and motion estimation, whose solutions are considered as optima of energy functionals that mainly consist of a data fidelity term, a regularization term, and a control parameter for their trade-off. We present an algorithm to determine the relative weight between data fidelity and regularization based on the residual, which measures how well the observation fits the model. Our adaptive regularization scheme is designed to locally control the regularization at each pixel, based on the assumption that the diversity of the residual of a given imaging model varies spatially. The energy optimization is presented in the alternating direction method of multipliers (ADMM) framework, where the adaptive regularization is applied iteratively, along with a mathematical analysis of the proposed algorithm. We demonstrate the robustness and effectiveness of our adaptive regularization through experimental results showing that the qualitative and quantitative results for each imaging task are superior to those obtained with a constant regularization scheme. These desired properties of regularization parameter selection in a variational framework are achieved by merely replacing the static regularization parameter with our adaptive one.
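In symbols, the change is from one global trade-off to a per-pixel one; for an image f on domain Ω, a data term ρ and a TV-type regularizer (our placeholder, the concrete terms vary per task), the energy reads

```latex
\min_{u} \; \int_{\Omega} \lambda(x) \, \rho\bigl(u(x), f(x)\bigr) \, dx \;+\; \int_{\Omega} \lvert \nabla u(x) \rvert \, dx
```

with the weight λ(x) updated from the local residual inside the ADMM iterations.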

Byung-Woo Hong, Ja-Keoung Koo, Hendrik Dirks, Martin Burger
Variational Networks: Connecting Variational Methods and Deep Learning

In this paper, we introduce variational networks (VNs) for image reconstruction. VNs are fully learned models based on the framework of incremental proximal gradient methods. They provide a natural transition between classical variational methods and state-of-the-art residual neural networks. Due to their incremental nature, VNs are very efficient, but only approximately minimize the underlying variational model. Surprisingly, in our numerical experiments on image reconstruction problems it turns out that giving up exact minimization leads to a consistent performance increase, in particular in the case of convex models.
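In its simplest form, one VN stage is an incremental proximal gradient step with stage-dependent, learned components (a schematic rendering, not the paper's exact parametrization):

```latex
x^{t+1} = \operatorname{prox}_{\alpha^{t} D}\bigl( x^{t} - \alpha^{t} \nabla R^{t}(x^{t}) \bigr)
```

where R^t is a learned regularizer, D the data term, and α^t a step size.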

Erich Kobler, Teresa Klatzer, Kerstin Hammernik, Thomas Pock
Gradient Flows on a Riemannian Submanifold for Discrete Tomography

We present a smooth geometric approach to discrete tomography that jointly performs tomographic reconstruction and label assignment. The flow evolves on a submanifold equipped with a Hessian Riemannian metric and properly takes into account given projection constraints. The metric naturally extends the Fisher-Rao metric from labeling problems with directly observed data to the inverse problem of discrete tomography, where only projection data is available. We show that the flow can be numerically integrated by an implicit scheme based on a Bregman proximal point iteration. A numerical evaluation on standard test datasets in the few-angles scenario demonstrates an improvement in reconstruction quality compared to competitive methods.

Matthias Zisler, Fabrizio Savarino, Stefania Petra, Christoph Schnörr
Model Selection for Gaussian Process Regression

Gaussian processes are powerful tools since they can model non-linear dependencies between inputs, while remaining analytically tractable. A Gaussian process is characterized by a mean function and a covariance function (kernel), which are determined by a model selection criterion. The functions to be compared do not just differ in their parametrization but in their fundamental structure. It is often not clear which function structure to choose, for instance to decide between a squared exponential and a rational quadratic kernel. Based on the principle of posterior agreement, we develop a general framework for model selection to rank kernels for Gaussian process regression and compare it with maximum evidence (also called marginal likelihood) and leave-one-out cross-validation. Given the disagreement between current state-of-the-art methods in our experiments, we show the difficulty of model selection and the need for an information-theoretic approach.
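The maximum-evidence baseline mentioned above can be tried in a few lines; a sketch using scikit-learn on toy data (data and kernels are illustrative):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, RationalQuadratic

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (40, 1))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.standard_normal(40)

# Rank kernel structures by the log marginal likelihood (evidence);
# the paper argues such criteria can disagree with each other.
for kernel in (RBF(), RationalQuadratic()):
    gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-2).fit(X, y)
    print(type(kernel).__name__, gpr.log_marginal_likelihood_value_)
```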

Nico S. Gorbach, Andrew An Bian, Benjamin Fischer, Stefan Bauer, Joachim M. Buhmann

Motion and Segmentation

Frontmatter
Scalable Full Flow with Learned Binary Descriptors

We propose a method for large displacement optical flow in which local matching costs are learned by a convolutional neural network (CNN) and a smoothness prior is imposed by a conditional random field (CRF). We tackle the computation- and memory-intensive operations on the 4D cost volume by a min-projection, which reduces memory complexity from quadratic to linear, and by binary descriptors for efficient matching. This enables evaluation of the cost on the fly and allows learning and CRF inference to be performed on high resolution images without ever storing the 4D cost volume. To address the problem of learning binary descriptors we propose a new hybrid learning scheme. In contrast to current state-of-the-art approaches for learning binary CNNs, we can compute the exact non-zero gradient within our model. We compare several methods for training binary descriptors and show results on publicly available benchmarks.
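Part of why binary descriptors make on-the-fly cost evaluation cheap: the matching cost reduces to a Hamming distance, computable with XOR and popcount (a generic sketch, not the paper's implementation):

```python
import numpy as np

def hamming_cost(desc_a, desc_b):
    """Hamming distance between binary descriptors packed as uint8
    arrays of shape (..., B): XOR the bytes, unpack to bits, sum."""
    return np.unpackbits(np.bitwise_xor(desc_a, desc_b),
                         axis=-1).sum(axis=-1)
```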

Gottfried Munda, Alexander Shekhovtsov, Patrick Knöbelreiter, Thomas Pock
Edge Adaptive Seeding for Superpixel Segmentation

Finding a suitable seeding resolution when using superpixel segmentation methods is usually challenging. Different parts of the image contain different levels of clutter, resulting in an either too dense or too coarse segmentation. Since both possible outcomes cause problems with respect to subsequent processing, we propose an edge adaptive seeding for superpixel segmentation methods, generating more seeds in areas with more edges and vice versa. This follows the assumption that edges distinguish objects and thus are a good indicator of the level of clutter in an image region. We show in our evaluation on five datasets, using three popular superpixel segmentation methods, that edge adaptive seeding leads to improved results compared to other priors as well as to uniform seeding.
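A minimal version of the seeding rule (the cell grid and the proportional allocation are simplifications of the paper's method):

```python
import numpy as np

def edge_adaptive_seed_counts(edge_map, grid=(4, 4), total_seeds=400):
    """Distribute superpixel seeds over image cells proportionally to
    their edge density: more seeds where there are more edges.
    `edge_map` is a 2D array of edge strengths."""
    h, w = edge_map.shape
    gh, gw = grid
    density = np.array([[edge_map[i * h // gh:(i + 1) * h // gh,
                                  j * w // gw:(j + 1) * w // gw].sum()
                         for j in range(gw)] for i in range(gh)])
    density = density / density.sum()
    return np.round(density * total_seeds).astype(int)
```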

Christian Wilms, Simone Frintrop

Pose, Face and Gesture

Frontmatter
Optical Flow-Based 3D Human Motion Estimation from Monocular Video

This paper presents a method to estimate 3D human pose and body shape from monocular videos. While recent approaches infer the 3D pose from silhouettes and landmarks, we exploit properties of optical flow to temporally constrain the reconstructed motion. We estimate human motion by minimizing the difference between computed flow fields and the output of our novel flow renderer. By just using a single semi-automatic initialization step, we are able to reconstruct monocular sequences without joint annotation. Our test scenarios demonstrate that optical flow effectively regularizes the under-constrained problem of human shape and motion estimation from monocular video.

Thiemo Alldieck, Marc Kassubeck, Bastian Wandt, Bodo Rosenhahn, Marcus Magnor
On the Diffusion Process for Heart Rate Estimation from Face Videos Under Realistic Conditions

This work addresses the problem of estimating heart rate from face videos under real conditions, using a model based on the recursive inference problem that leverages the local invariance of the heart rate. The proposed solution is based on the canonical state space representation of an Itō process and a Wiener velocity model. Empirical results show excellent real-time estimation performance of heart rate in the presence of disturbing factors such as rigid head motion, talking and facial expressions under natural illumination conditions, making the process of heart rate estimation from face videos applicable in a much broader sense. To facilitate comparisons and to support research, we have made the code and data for reproducing the results publicly available.
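The Wiener velocity model mentioned above has a standard discrete-time state-space form; a sketch of its matrices (state: heart rate and its derivative; dt and q are free parameters):

```python
import numpy as np

def wiener_velocity_model(dt, q):
    """Transition, process-noise and measurement matrices of the
    discretized Wiener velocity model: constant-velocity dynamics
    driven by white-noise acceleration with diffusion strength q."""
    F = np.array([[1.0, dt],
                  [0.0, 1.0]])
    Q = q * np.array([[dt ** 3 / 3, dt ** 2 / 2],
                      [dt ** 2 / 2, dt]])
    H = np.array([[1.0, 0.0]])   # only the rate itself is observed
    return F, Q, H
```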

Christian S. Pilz, Jarek Krajewski, Vladimir Blazek

Reconstruction and Depth

Frontmatter
Multi-view Continuous Structured Light Scanning

We introduce a highly accurate and precise multi-view, multi-projector, and multi-pattern phase scanning method for shape acquisition that is able to handle occlusions and optically challenging materials. The 3D reconstruction is formulated as a two-step process which first estimates reliable measurement samples and then simultaneously optimizes over all cameras, projectors, and patterns. This holistic approach results in significant quality improvements. Furthermore, the acquisition time is drastically reduced by relying on just six high-frequency sinusoidal captures without the need of phase unwrapping, which is implicitly provided by the multi-view geometry.
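For background, per-pixel phase recovery from N phase-shifted sinusoidal captures is standard; a sketch of the N-step formula (the paper's multi-view optimization, which removes the need for unwrapping, is not reproduced):

```python
import numpy as np

def phase_from_shifts(images):
    """Per-pixel phase for the model I_n = A + B*cos(phi - 2*pi*n/N),
    from an (N, H, W) stack of captures."""
    n = len(images)
    shifts = 2 * np.pi * np.arange(n) / n
    num = np.tensordot(np.sin(shifts), images, axes=1)
    den = np.tensordot(np.cos(shifts), images, axes=1)
    return np.arctan2(num, den)
```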

Fabian Groh, Benjamin Resch, Hendrik P. A. Lensch
Down to Earth: Using Semantics for Robust Hypothesis Selection for the Five-Point Algorithm

The computation of the essential matrix using the five-point algorithm is a staple task usually considered as being solved. However, we show that the algorithm frequently selects erroneous solutions in the presence of noise and outliers. These errors arise when the supporting point correspondences supplied to the algorithm do not adequately cover all essential planes in the scene, leading to ambiguous essential matrix solutions. This is not merely a theoretical problem: such scene conditions often occur in 3D reconstruction of real-world data when fronto-parallel point correspondences, such as points on building facades, are captured but correspondences on obliquely observed planes, such as the ground plane, are missed. To solve this problem, we propose to leverage semantic labelings of image features to guide hypothesis selection in the five-point algorithm. More specifically, we propose a two-stage RANSAC procedure in which, in the first step, only features classified as ground points are processed. These inlier ground features are subsequently used to score two-view geometry hypotheses generated by the five-point algorithm using samples of non-ground points. Results for scenes with prominent ground regions demonstrate the ability of our approach to recover epipolar geometries that describe the entire scene, rather than only well-sampled scene planes.

Andreas Kuhn, True Price, Jan-Michael Frahm, Helmut Mayer
An Efficient Octree Design for Local Variational Range Image Fusion

We present a reconstruction pipeline for a large-scale 3D environment viewed by a single moving RGB-D camera. Our approach combines advantages of fast and direct, regularization-free depth fusion and accurate, but costly variational schemes. The scene’s depth geometry is extracted from each camera view and efficiently integrated into a large, dense grid as a truncated signed distance function, which is organized in an octree. To account for noisy real-world input data, variational range image integration is performed in local regions of the volume directly on this octree structure. We focus on algorithms which are easily parallelizable on GPUs, allowing the pipeline to be used in real-time scenarios where the user can interactively view the reconstruction and adapt camera motion as required.
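The direct, regularization-free fusion referred to above is a running weighted average per voxel; a sketch (octree bookkeeping and the local variational refinement are omitted):

```python
import numpy as np

def tsdf_update(tsdf, weight, new_dist, new_weight=1.0, trunc=0.05):
    """Integrate one view's signed distance observation into a voxel's
    truncated signed distance function (TSDF) and weight."""
    d = np.clip(new_dist, -trunc, trunc)
    fused = (tsdf * weight + d * new_weight) / (weight + new_weight)
    return fused, weight + new_weight
```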

Nico Marniok, Ole Johannsen, Bastian Goldluecke

Tracking

Frontmatter
Measuring the Accuracy of Object Detectors and Trackers

The accuracy of object detectors and trackers is most commonly evaluated by the Intersection over Union (IoU) criterion. To date, most approaches are restricted to axis-aligned or oriented boxes and, as a consequence, many datasets are only labeled with boxes. Nevertheless, axis-aligned or oriented boxes cannot accurately capture an object’s shape. To address this, a number of densely segmented datasets have started to emerge in both the object detection and the object tracking communities. However, evaluating the accuracy of object detectors and trackers that are restricted to boxes on densely segmented data is not straightforward. To close this gap, we introduce the relative Intersection over Union (rIoU) accuracy measure. The measure normalizes the IoU with the optimal box for the segmentation to generate an accuracy measure that ranges between 0 and 1 and allows a more precise measurement of accuracies. Furthermore, it enables an efficient and easy way to understand scenes and the strengths and weaknesses of an object detection or tracking approach. We show how the new measure can be efficiently calculated and present an easy-to-use evaluation framework. The framework is tested on the DAVIS and the VOT2016 segmentations and has been made available to the community.
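The measure itself is a one-liner once the IoUs are computed; a sketch on boolean masks (how the optimal box is found is left open here):

```python
import numpy as np

def mask_iou(a, b):
    """IoU of two boolean masks of equal shape."""
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

def riou(box_mask, opt_box_mask, seg_mask):
    """Relative IoU: the predicted box's IoU with the segmentation,
    normalized by the IoU of the optimal box for that segmentation."""
    return mask_iou(box_mask, seg_mask) / mask_iou(opt_box_mask, seg_mask)
```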

Tobias Böttger, Patrick Follmann, Michael Fauser
Backmatter
Metadata
Title
Pattern Recognition
Edited by
Volker Roth
Thomas Vetter
Copyright Year
2017
Electronic ISBN
978-3-319-66709-6
Print ISBN
978-3-319-66708-9
DOI
https://doi.org/10.1007/978-3-319-66709-6
