
2015 | Book

Pattern Recognition

37th German Conference, GCPR 2015, Aachen, Germany, October 7-10, 2015, Proceedings


About this book

This book constitutes the refereed proceedings of the 37th German Conference on Pattern Recognition, GCPR 2015, held in Aachen, Germany, in October 2015. The 45 revised full papers and one Young Researchers Forum presented were carefully reviewed and selected from 108 submissions. The papers are organized in topical sections on motion and reconstruction; mathematical foundations and image processing; biomedical image analysis and applications; human pose analysis; recognition and scene understanding.

Table of Contents

Frontmatter

Motion and Reconstruction

Frontmatter
Road Condition Estimation Based on Spatio-Temporal Reflection Models

Automated road condition estimation is a crucial basis for Advanced Driver Assistance Systems (ADAS) and even more for highly and fully automated driving functions in the future. In order to improve vehicle safety, relevant vehicle dynamics parameters, e.g. last-point-to-brake (LPB), last-point-to-steer (LPS), or vehicle curve speed, should be adapted depending on the current weather-related road surface conditions. As vision-based systems are already integrated in many of today’s vehicles, they constitute a beneficial resource for such a task. As a first contribution, we present a novel approach for reflection modeling which is a reliable and robust indicator for wet road surface conditions. We then extend our method by texture description features, since local structures enable the distinction of snow-covered and bare road surfaces. Based on a large real-life dataset we evaluate the performance of our approach and achieve results which clearly outperform other established vision-based methods while ensuring real-time capability.

Manuel Amthor, Bernd Hartmann, Joachim Denzler
Discrete Optimization for Optical Flow

We propose to look at large-displacement optical flow from a discrete point of view. Motivated by the observation that sub-pixel accuracy is easily obtained given pixel-accurate optical flow, we conjecture that computing the integral part is the hardest piece of the problem. Consequently, we formulate optical flow estimation as a discrete inference problem in a conditional random field, followed by sub-pixel refinement. Naïve discretization of the 2D flow space, however, is intractable due to the resulting size of the label set. In this paper, we therefore investigate three different strategies, each able to reduce computation and memory demands by several orders of magnitude. Their combination allows us to estimate large-displacement optical flow both accurately and efficiently and demonstrates the potential of discrete optimization for optical flow. We obtain state-of-the-art performance on MPI Sintel and KITTI.

Moritz Menze, Christian Heipke, Andreas Geiger
Multi-Camera Structure from Motion with Eye-to-Eye Calibration

Imaging systems consisting of multiple conventional cameras are of increasing interest for computer vision applications such as Structure from Motion (SfM) due to their large combined field of view and high composite image resolution. In this work we present an SfM framework for multi-camera systems without overlapping camera views that integrates on-line extrinsic camera calibration, local scene reconstruction, and global optimization based on combining hand-eye calibration methods with standard SfM. For this purpose, we propose a novel method for extrinsic calibration based on rigid motion constraints that uses visual measurements directly instead of motion correspondences. Only a single calibration pattern visible within the view of one camera is needed to provide an accurate reconstruction with absolute scale.

Sandro Esquivel, Reinhard Koch
Estimating Vehicle Ego-Motion and Piecewise Planar Scene Structure from Optical Flow in a Continuous Framework

We propose a variational approach for estimating egomotion and structure of a static scene from a pair of images recorded by a single moving camera. In our approach the scene structure is described by a set of 3D planar surfaces, which are linked to a SLIC superpixel decomposition of the image domain. The continuously parametrized planes are determined along with the extrinsic camera parameters by jointly minimizing a non-convex smooth objective function that comprises a data term based on the pre-calculated optical flow between the input images and suitable priors on the scene variables. Our experiments demonstrate that our approach estimates egomotion and scene structure with high quality that reaches the accuracy of state-of-the-art stereo methods, but relies on a single sensor that is more cost-efficient for autonomous systems.

Andreas Neufeld, Johannes Berger, Florian Becker, Frank Lenzen, Christoph Schnörr
Efficient Two-View Geometry Classification

Typical Structure-from-Motion systems spend major computational effort on geometric verification. Geometric verification recovers the epipolar geometry of two views for a moving camera by estimating a fundamental or essential matrix. The essential matrix describes the relative geometry for two views up to an unknown scale. Two-view triangulation or multi-model estimation approaches can reveal the relative geometric configuration of two views, e.g., small or large baseline and forward or sideward motion. Information about the relative configuration is essential for many problems in Structure-from-Motion. However, essential matrix estimation and assessment of the relative geometric configuration are computationally expensive. In this paper, we propose a learning-based approach for efficient two-view geometry classification, leveraging the by-products of feature matching. Our approach can predict whether two views have scene overlap and for overlapping views it can assess the relative geometric configuration. Experiments on several datasets demonstrate the performance of the proposed approach and its utility for Structure-from-Motion.

Johannes L. Schönberger, Alexander C. Berg, Jan-Michael Frahm

Mathematical Foundations and Image Processing

Frontmatter
A Convex Relaxation Approach to the Affine Subspace Clustering Problem

Prototypical data clustering is known to suffer from poor initializations. Recently, a semidefinite relaxation has been proposed to overcome this issue and to enable the use of convex programming instead of ad-hoc procedures. Unfortunately, this relaxation does not extend to the more involved case where clusters are defined by parametric models, and where the computation of means has to be replaced by parametric regression. In this paper, we provide a novel convex relaxation approach to this more involved problem class that is relevant to many scenarios of unsupervised data analysis. Our approach applies, in particular, to data sets where assumptions of model recovery through sparse regularization, like the independent subspace model, do not hold. Our mathematical analysis enables us to distinguish scenarios where the relaxation is tight enough and scenarios where the approach breaks down.

Francesco Silvestri, Gerhard Reinelt, Christoph Schnörr
Introducing Maximal Anisotropy into Second Order Coupling Models

On the one hand, anisotropic diffusion is a well-established concept that has improved numerous computer vision approaches by permitting direction-dependent smoothing. On the other hand, recent applications have uncovered the importance of second order regularisation. The goal of this work is to combine the benefits of both worlds. To this end, we propose a second order regulariser that allows us to penalise both jumps and kinks in a direction-dependent way. We start with an isotropic coupling model, and systematically introduce anisotropic concepts from first order approaches. We demonstrate the benefits of our model by experiments, and apply it to improve an existing focus fusion method.

David Hafner, Christopher Schroers, Joachim Weickert
Binarization Driven Blind Deconvolution for Document Image Restoration

Blind deconvolution is a common method for restoration of blurred text images, while binarization is employed to analyze and interpret the text semantics. In literature, these tasks are typically treated independently. This paper introduces a novel binarization driven blind deconvolution approach to couple both tasks in a common framework. The proposed method is derived as an energy minimization problem regularized by a novel consistency term to exploit text binarization as a prior for blind deconvolution. The binarization to establish our consistency term is inferred by spatially regularized soft clustering based on a set of discriminative features. Our algorithm is formulated by the alternating direction method of multipliers and iteratively refines blind deconvolution and binarization. In our experimental evaluation, we show that our joint framework is superior to treating binarization and deconvolution as independent subproblems. We also demonstrate the application of our method for the restoration and binarization of historic document images, where it improves the visual recognition of handwritten text.

Thomas Köhler, Andreas Maier, Vincent Christlein

Biomedical Image Analysis and Applications

Frontmatter
Unsupervised and Accurate Extraction of Primitive Unit Cells from Crystal Images

We present a novel method for the unsupervised estimation of a primitive unit cell, i.e. a unit cell that cannot be further simplified, from a crystal image. Significant peaks of the projective standard deviations of the image serve as candidate lattice vector angles. Corresponding fundamental periods are determined by clustering local minima of a periodicity energy. Robust unsupervised selection of the number of clusters is obtained from the likelihoods of multi-variance cluster models induced by the Akaike information criterion. Initial estimates for lattice angles and periods obtained in this manner are refined jointly using non-linear optimization. Results on both synthetic and experimental images show that the method is able to estimate complex primitive unit cells with sub-pixel accuracy, despite high levels of noise.

Niklas Mevenkamp, Benjamin Berkels
Copula Archetypal Analysis

We present an extension of classical archetypal analysis (AA). It is motivated by the observation that classical AA is not invariant against strictly monotone increasing transformations. Establishing such an invariance is desirable since it makes AA independent of the chosen measure: representing a data set in meters or log(meters) should lead to approximately the same archetypes. The desired invariance is achieved by introducing a semi-parametric Gaussian copula. This ensures the desired invariance and makes AA more robust against outliers and missing values. Furthermore, our framework can deal with mixed discrete/continuous data, which certainly is the most widely encountered type of data in real world applications. Since the proposed extension is presented in the form of a preprocessing step, updating existing classical AA models is especially effortless.

Dinu Kaufmann, Sebastian Keller, Volker Roth
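The invariance property described in this abstract can be illustrated with a short sketch: mapping each data column through its empirical ranks to standard-normal scores is the core of a semi-parametric Gaussian copula preprocessing step. The helper name below is ours, and ties are ignored for brevity; this is a minimal sketch of the idea, not the paper's implementation.

```python
import numpy as np
from statistics import NormalDist

def normal_scores(x):
    """Gaussian copula preprocessing (sketch): ranks -> (0, 1) -> normal quantiles."""
    ranks = np.argsort(np.argsort(x)) + 1      # ranks 1..n (no tie handling)
    u = ranks / (len(x) + 1.0)                 # keep strictly inside (0, 1)
    nd = NormalDist()
    return np.array([nd.inv_cdf(v) for v in u])

x = np.array([1.0, 3.0, 2.0, 10.0])
# Any strictly increasing transform (e.g. log) leaves the scores unchanged,
# which is exactly the meters-vs-log(meters) invariance mentioned above.
same = np.allclose(normal_scores(x), normal_scores(np.log(x)))
```

Because only ranks enter the transform, archetypal analysis run on the scores no longer depends on the chosen measurement scale.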
Interactive Image Retrieval for Biodiversity Research

On a daily basis, experts in biodiversity research are confronted with the challenging task of classifying individuals to build statistics over their distributions, their habitats, or the overall biodiversity. While the number of species is vast, experts with affordable time-budgets are rare. Image retrieval approaches could greatly assist experts: when new images are captured, a list of visually similar and previously collected individuals could be returned for further comparison. Following this observation, we start by transferring latest image retrieval techniques to biodiversity scenarios. We then propose to additionally incorporate an expert’s knowledge into this process by allowing them to select must-have regions. The obtained annotations are used to train exemplar-models for region detection. Detection scores efficiently computed with convolutions are finally fused with an initial ranking to reflect both sources of information, global and local aspects. The resulting approach received highly positive feedback from several application experts. On datasets for butterfly and bird identification, we quantitatively prove the benefit of including expert feedback, resulting in accuracy gains of up to 25%, and we extensively discuss current limitations and further research directions.

Alexander Freytag, Alena Schadt, Joachim Denzler
Temporal Acoustic Words for Online Acoustic Event Detection

The Bag-of-Features principle proved successful in many pattern recognition tasks ranging from document analysis and image classification to gesture recognition and even forensic applications. Lately these methods emerged in the field of acoustic event detection and showed very promising results. The detection and classification of acoustic events is an important task for many practical applications like video understanding, surveillance or speech enhancement. In this paper a novel approach for online acoustic event detection is presented that builds on top of the Bag-of-Features principle. Features are calculated for all frames in a given window. By applying the concept of feature augmentation, additional temporal information is encoded in each feature vector. These feature vectors are then softly quantized so that a Bag-of-Features representation is computed. These representations are evaluated by a classifier in a sliding window approach. The experiments on a challenging indoor dataset of acoustic events will show that the proposed method yields state-of-the-art results compared to other online event detection methods. Furthermore, it will be shown that the temporal feature augmentation significantly improves the recognition rates.

Rene Grzeszick, Axel Plinge, Gernot A. Fink
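The temporal feature augmentation step mentioned in the abstract can be sketched as appending each frame's normalized position within the analysis window to its feature vector before soft quantization. The function and array names below are illustrative assumptions, not the paper's code.

```python
import numpy as np

def augment_with_time(frames):
    """Append each frame's normalized temporal position (0..1 across the
    window) as an extra feature dimension.  frames: (T, D) array."""
    t = np.linspace(0.0, 1.0, num=len(frames))[:, None]
    return np.hstack([frames, t])

window = np.random.rand(10, 13)   # e.g. 10 frames of 13 MFCC-like features
augmented = augment_with_time(window)
```

After this step, two acoustically identical frames occurring at the start and the end of a window quantize to different soft assignments, which is how temporal context enters the Bag-of-Features histogram.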

Human Pose Analysis

Frontmatter
Biternion Nets: Continuous Head Pose Regression from Discrete Training Labels

While head pose estimation has been studied for some time, continuous head pose estimation is still an open problem. Most approaches either cannot deal with the periodicity of angular data or require very fine-grained regression labels. We introduce biternion nets, a CNN-based approach that can be trained on very coarse regression labels and still estimate fully continuous 360° head poses. We show state-of-the-art results on several publicly available datasets. Finally, we demonstrate how easy it is to record and annotate a new dataset with coarse orientation labels in order to obtain continuous head pose estimates using our biternion nets.

Lucas Beyer, Alexander Hermans, Bastian Leibe
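The biternion representation itself is compact enough to sketch: an angle is encoded as its (cos, sin) pair, which removes the wrap-around discontinuity at 0°/360° that plain angle regression suffers from. The helper names below are ours, not the authors'.

```python
import math

def to_biternion(deg):
    """Encode an angle in degrees as a unit (cos, sin) pair."""
    r = math.radians(deg)
    return (math.cos(r), math.sin(r))

def from_biternion(b):
    """Decode a (cos, sin) pair back to an angle in [0, 360)."""
    return math.degrees(math.atan2(b[1], b[0])) % 360.0

# Averaging the biternions of 350 deg and 10 deg yields (approximately) 0 deg;
# averaging the raw angle values would wrongly give 180 deg.
a, b = to_biternion(350.0), to_biternion(10.0)
mean = ((a[0] + b[0]) / 2, (a[1] + b[1]) / 2)
```

A network outputting such a pair (normalized to unit length) can be trained with coarse labels, since nearby orientations map to nearby points on the unit circle.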
A Physics-Based Statistical Model for Human Gait Analysis

Physics-based modeling is a powerful tool for human gait analysis and synthesis. Unfortunately, its application suffers from high computational cost regarding the solution of optimization problems and uncertainty in the choice of a suitable objective energy function and model parametrization. Our approach circumvents these problems by learning model parameters based on a training set of walking sequences. We propose a combined representation of motion parameters and physical parameters to infer missing data without the need for tedious optimization. Both a k-nearest-neighbour approach and asymmetrical principal component analysis are used to deduce ground reaction forces and joint torques directly from an input motion. We evaluate our methods by comparing with an iterative optimization-based method and demonstrate the robustness of our algorithm by reducing the input joint information. With decreasing input information the combined statistical model regression increasingly outperforms the iterative optimization-based method.

Petrissa Zell, Bodo Rosenhahn

Recognition and Scene Understanding

Frontmatter
Joint 3D Object and Layout Inference from a Single RGB-D Image

Inferring 3D objects and the layout of indoor scenes from a single RGB-D image captured with a Kinect camera is a challenging task. Towards this goal, we propose a high-order graphical model and jointly reason about the layout, objects and superpixels in the image. In contrast to existing holistic approaches, our model leverages detailed 3D geometry using inverse graphics and explicitly enforces occlusion and visibility constraints for respecting scene properties and projective geometry. We cast the task as MAP inference in a factor graph and solve it efficiently using message passing. We evaluate our method with respect to several baselines on the challenging NYUv2 indoor dataset using 21 object categories. Our experiments demonstrate that the proposed method is able to infer scenes with a large degree of clutter and occlusions.

Andreas Geiger, Chaohui Wang
Object Proposals Estimation in Depth Image Using Compact 3D Shape Manifolds

Man-made objects, such as chairs, often have very large shape variations, making it challenging to detect them. In this work we investigate the task of finding particular object shapes from a single depth image. We tackle this task by exploiting the inherently low dimensionality in the object shape variations, which we discover and encode as a compact shape space. Starting from any collection of 3D models, we first train a low dimensional Gaussian Process Latent Variable Shape Space. We then sample this space, effectively producing infinite amounts of shape variations, which are used for training. Additionally, to support fast and accurate inference, we improve the standard 3D object category proposal generation pipeline by applying a shallow convolutional neural network-based filtering stage. This combination leads to considerable improvements for proposal generation, in both speed and accuracy. We compare our full system to previous state-of-the-art approaches, on four different shape classes, and show a clear improvement.

Shuai Zheng, Victor Adrian Prisacariu, Melinos Averkiou, Ming-Ming Cheng, Niloy J. Mitra, Jamie Shotton, Philip H. S. Torr, Carsten Rother
The Long-Short Story of Movie Description

Generating descriptions for videos has many applications including assisting blind people and human-robot interaction. The recent advances in image captioning as well as the release of large-scale movie description datasets such as MPII-MD [28] and M-VAD [31] make it possible to study this task in more depth. Many of the proposed methods for image captioning rely on pre-trained object classifier CNNs and Long Short-Term Memory recurrent networks (LSTMs) for generating descriptions. While image description focuses on objects, we argue that it is important to distinguish verbs, objects, and places in the setting of movie description. In this work we show how to learn robust visual classifiers from the weak annotations of the sentence descriptions. Based on these classifiers we generate a description using an LSTM. We explore different design choices to build and train the LSTM and achieve the best performance to date on the challenging MPII-MD and M-VAD datasets. We compare and analyze our approach and prior work along various dimensions to better understand the key challenges of the movie description task.

Anna Rohrbach, Marcus Rohrbach, Bernt Schiele
Graph-Based Deformable 3D Object Matching

We present a method for efficient detection of deformed 3D objects in 3D point clouds that can handle large amounts of clutter, noise, and occlusion. The method generalizes well to different object classes and does not require an explicit deformation model. Instead, deformations are learned based on a few registered deformed object instances. The approach builds upon graph matching to find correspondences between scene and model points. The robustness is increased through a parametrization where each graph vertex represents a full rigid transformation. We speed up the matching through greedy multi-step graph pruning and a constant-time feature matching. Quantitative and qualitative experiments demonstrate that our method is robust, efficient, able to detect rigid and non-rigid objects and exceeds the state of the art.

Bertram Drost, Slobodan Ilic

Posters

Frontmatter
Line3D: Efficient 3D Scene Abstraction for the Built Environment

Extracting 3D information from a moving camera is traditionally based on interest point detection and matching. This is especially challenging in the built environment, where the number of distinctive interest points is naturally limited. While common Structure-from-Motion (SfM) approaches usually manage to obtain the correct camera poses, the number of accurate 3D points is very small due to the low number of matchable features. Subsequent Multi-view Stereo approaches may help to overcome this problem, but suffer from a high computational complexity. We propose a novel approach for the task of 3D scene abstraction, which uses straight line segments as underlying features. We use purely geometric constraints to match 2D line segments from different images, and formulate the reconstruction procedure as a graph-clustering problem. We show that our method generates accurate 3D models, with a low computational overhead compared to SfM alone.

Manuel Hofer, Michael Maurer, Horst Bischof
An Efficient Linearisation Approach for Variational Perspective Shape from Shading

Recently, variational methods have become increasingly popular for perspective shape from shading due to their robustness under noise and missing information. So far, however, due to the strong nonlinearity of the data term, existing numerical schemes for minimising the corresponding energy functionals were restricted to simple explicit schemes that require thousands or even millions of iterations to provide accurate results. In this paper we tackle the problem by proposing an efficient linearisation approach for the recent variational model of Ju et al. [14]. By embedding such a linearisation in a coarse-to-fine Gauß-Newton scheme, we show that we can reduce the runtime by more than three orders of magnitude without degrading the quality of results. Hence, it is not only possible to apply variational methods for perspective SfS to significantly larger image sizes. Our approach also allows a practical choice of the regularisation parameter so that noise can be suppressed efficiently at the same time.

Daniel Maurer, Yong Chul Ju, Michael Breuß, Andrés Bruhn
TomoGC: Binary Tomography by Constrained GraphCuts

We present an iterative reconstruction algorithm for binary tomography, called TomoGC, that solves the reconstruction problem based on a constrained graphical model by a sequence of graphcuts. TomoGC reconstructs objects even if only a small number of measurements is given, which enables shorter observation periods and lower radiation doses in industrial and medical applications. We additionally suggest some modifications of established methods that improve upon state-of-the-art methods. A comprehensive numerical evaluation demonstrates that the proposed method can reconstruct objects from a small number of projections more accurately and also faster than competitive methods.

Jörg Hendrik Kappes, Stefania Petra, Christoph Schnörr, Matthias Zisler
An Improved Eikonal Method for Surface Normal Integration

The integration of surface normals is a classic problem in computer vision. Recently, an approach to integration based on an equation of eikonal type has been proposed. A crucial component of this model is the data term in which the given data is complemented by a convex function describing a squared Euclidean distance. The resulting equation has been solved by a classic fast marching (FM) scheme. However, while that method is computationally efficient, the reconstruction error is considerable, especially in diagonal grid directions. In this paper, we present two improvements in order to deal with this problem. On the modeling side, we present a novel robust formulation of the data term. Moreover, we propose to use a semi-Lagrangian discretisation, which improves the rotational invariance while allowing us to keep the FM strategy. Our experiments confirm that our novel method gives superior quality compared to previous methods.

Martin Bähr, Michael Breuß
GraphFlow - 6D Large Displacement Scene Flow via Graph Matching

We present an approach for computing dense scene flow from two large displacement RGB-D images. When dealing with large displacements the crucial step is to estimate the overall motion correctly. While state-of-the-art approaches focus on RGB information to establish guiding correspondences, we explore the power of depth edges. To achieve this, we present a new graph matching technique that brings sparse depth edges into correspondence. An additional contribution is the formulation of a continuous-label energy which is used to densify the sparse graph matching output. We present results on challenging Kinect images, for which we outperform state-of-the-art techniques.

Hassan Abu Alhaija, Anita Sellent, Daniel Kondermann, Carsten Rother
Fast Techniques for Monocular Visual Odometry

In this paper, fast techniques are proposed to achieve real time and robust monocular visual odometry. We apply an iterative 5-point method to estimate instantaneous camera motion parameters in the context of a RANSAC algorithm to cope with outliers efficiently. In our method, landmarks are localized in space using a probabilistic triangulation method, which is utilized to enhance the estimation of the last camera pose. The enhancement is performed by multiple observations of landmarks and minimization of a cost function consisting of epipolar geometry constraints for far landmarks and projective constraints for close landmarks. The performance of the proposed method is demonstrated through application to the challenging KITTI visual odometry dataset.

M. Hossein Mirabdollah, Bärbel Mertsching
Iterative Automated Foreground Segmentation in Video Sequences Using Graph Cuts

In this paper we propose a method for foreground object segmentation in videos using an improved version of the GrabCut algorithm. Motivated by applications in de-identification, we consider a static camera scenario and take into account common problems with the original algorithm that can result in poor segmentation. Our improvements are as follows: (i) using background subtraction, we build GMM-based segmentation priors; (ii) in building foreground and background GMMs, the contributions of pixels are weighted depending on their distance from the boundary of the object prior; (iii) probabilities of pixels belonging to foreground or background are modified by taking into account the prior pixel classification as well as its estimated confidence; and (iv) the smoothness term of GrabCut is modified by discouraging boundaries further away from the object prior. We perform experiments on CDnet 2014 Pedestrian Dataset and show considerable improvements over a reference implementation of GrabCut.

Tomislav Hrkać, Karla Brkić
A Novel Tree Block-Coordinate Method for MAP Inference

Block-coordinate methods inspired by belief propagation are among the most successful methods for approximate MAP inference in graphical models. The set of unknowns optimally updated in such block-coordinate methods is typically very small and spans only single edges or shallow trees. We derive a method that optimally updates sets of unknowns spanned by an arbitrary tree that is different from the one reported in the literature. It provides some insight into why “tree block-coordinate” methods are not as useful as expected, and enables a simple technique to make these tree updates more effective.

Christopher Zach
A Parametric Spectral Model for Texture-Based Salience

We present a novel saliency mechanism based on texture. Local texture at each pixel is characterised by the 2D spectrum obtained from oriented Gabor filters. We then apply a parametric model and describe the texture at each pixel by a combination of two 1D Gaussian approximations. This results in a simple model which consists of only four parameters. These four parameters are then used as feature channels and standard Difference-of-Gaussian blob detection is applied in order to detect salient areas in the image, similar to the Itti and Koch model. Finally, a diffusion process is used to sharpen the resulting regions. Evaluation on a large saliency dataset shows a significant improvement of our method over the baseline Itti and Koch model.

Kasim Terzić, Sai Krishna, J. M. H. du Buf
High Speed Lossless Image Compression

We introduce a simple approach to lossless image compression, which makes use of SIMD vectorization at every processing step to provide very high speed on modern CPUs. This is achieved by basing the compression on delta coding for prediction and bit packing for the actual compression, allowing a tuneable tradeoff between efficiency and speed, via the block size used for bit packing. The maximum achievable speed surpasses main memory bandwidth on the tested CPU, as well as the speed of all previous methods that achieve at least the same coding efficiency.

Hendrik Siedelmann, Alexander Wender, Martin Fuchs
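The two core steps named in the abstract, left-neighbour delta coding for prediction and choosing a bit width per block for bit packing, can be sketched in scalar reference code (the paper's contribution is doing this with SIMD vectorization; function names and the zigzag detail are our assumptions).

```python
import numpy as np

def delta_encode(row):
    """Left-neighbour prediction: the first value is kept, then differences."""
    row = row.astype(np.int32)
    return np.diff(row, prepend=np.int32(0))

def delta_decode(deltas):
    """Inverse of delta coding: a cumulative sum."""
    return np.cumsum(deltas)

def bits_per_block(deltas):
    """Bit width needed to pack a block of residuals after zigzag mapping
    (signed -> unsigned).  Smaller blocks adapt better to local statistics
    (efficiency), larger blocks pack faster (speed) -- the tuneable tradeoff."""
    zz = (deltas << 1) ^ (deltas >> 31)   # zigzag for int32 values
    return max(1, int(zz.max()).bit_length())

row = np.array([100, 102, 101, 105, 104], dtype=np.uint8)
deltas = delta_encode(row)
```

On smooth image rows the residuals are small, so a block packs into far fewer bits per value than the original 8-bit pixels.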
Learning Reaction-Diffusion Models for Image Inpainting

In this paper we present a trained diffusion model for image inpainting based on the structural similarity measure. The proposed diffusion model uses several parametrized linear filters and influence functions. Those parameters are learned in a loss-based approach, where we first perform a greedy training before conducting a joint training to further improve the inpainting performance. We provide a detailed comparison to state-of-the-art inpainting algorithms based on the TUM-image inpainting database. The experimental results show that the proposed diffusion model is efficient and achieves superior performance. Moreover, we also demonstrate that the proposed method has a texture-preserving property that makes it stand out from previous PDE based methods.

Wei Yu, Stefan Heber, Thomas Pock
Image Orientation Estimation with Convolutional Networks

Rectifying the orientation of scanned documents has been an important problem that was solved long ago. In this paper, we focus on the harder case of estimating and correcting the exact orientation of general images, for instance, of holiday snapshots. Especially when the horizon or other horizontal and vertical lines in the image are missing, it is hard to find features that yield the canonical orientation of the image. We demonstrate that a convolutional network can learn subtle features to predict the canonical orientation of images. In contrast to prior works that just distinguish between portrait and landscape orientation, the network regresses the exact orientation angle. The approach runs in real time and, thus, can be applied also to live video streams.

Philipp Fischer, Alexey Dosovitskiy, Thomas Brox
Semi-Automatic Basket Catheter Reconstruction from Two X-Ray Views

Ablation guided by focal impulse and rotor mapping (FIRM) is a novel treatment option for atrial fibrillation, a frequent heart arrhythmia. This procedure is performed minimally invasively and at least partially under fluoroscopic guidance. It involves a basket catheter comprising 64 electrodes. The 3-D position of these electrodes is important during treatment. We propose a novel model-based method for 3-D reconstruction of this catheter using two X-ray images taken from different views. Our approach requires little user interaction. An evaluation of the method found that the electrodes of the basket catheter can be reconstructed with a median error of 1.5 mm for phantom data and 3.4 mm for clinical data.

Xia Zhong, Matthias Hoffmann, Norbert Strobel, Andreas Maier
Fast Brain MRI Registration with Automatic Landmark Detection Using a Single Template Image

Automatic registration of brain MR images is still a challenging problem. We have chosen an approach based on landmark matching. However, manual landmarking of the images is cumbersome, and existing algorithms for automatic identification of a pre-defined set of landmarks usually require manually landmarked training sets. We propose a registration algorithm that automatically detects landmarks using only one manually landmarked template image. Landmarks are detected using the Canny edge detector and point descriptors. An evaluation of four types of descriptors showed that SURF provides the best trade-off between speed and accuracy. A thin plate spline transformation is used for landmark-based registration. The proposed algorithm was compared with the best existing registration algorithm that does not use local features. Our algorithm showed a significant speed-up and better accuracy in matching anatomical structures surrounded by the landmarks. All experiments were performed on the IBSR database.

Olga V. Senyukova, Denis S. Zobnin
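The landmark-based registration step can be illustrated with a plain thin plate spline fit between corresponding 2-D landmarks; the detection side (Canny edges plus SURF descriptors) is not reproduced here:

```python
import numpy as np

def _tps_kernel(d):
    # U(r) = r^2 log(r^2), with U(0) = 0
    with np.errstate(divide='ignore', invalid='ignore'):
        return np.where(d > 0, (d ** 2) * np.log(d ** 2), 0.0)

def tps_fit(src, dst):
    """Fit a 2-D thin plate spline that maps src landmarks onto dst landmarks."""
    n = len(src)
    K = _tps_kernel(np.linalg.norm(src[:, None] - src[None, :], axis=2))
    P = np.hstack([np.ones((n, 1)), src])        # affine part
    A = np.zeros((n + 3, n + 3))
    A[:n, :n] = K
    A[:n, n:] = P
    A[n:, :n] = P.T
    b = np.vstack([dst, np.zeros((3, 2))])
    return src, np.linalg.solve(A, b)            # warp coefficients

def tps_apply(model, pts):
    """Warp arbitrary points with a fitted thin plate spline."""
    src, coef = model
    U = _tps_kernel(np.linalg.norm(pts[:, None] - src[None, :], axis=2))
    P = np.hstack([np.ones((len(pts), 1)), pts])
    return U @ coef[:len(src)] + P @ coef[len(src):]
```

The spline interpolates the landmarks exactly and extends smoothly in between, which is why a sparse landmark set suffices to deform the whole image.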
Photorealistic Face Transfer in 2D and 3D Video

3D face transfer has been employed in a wide range of settings such as videoconferencing, gaming, or Hollywood movie production. State-of-the-art algorithms often suffer from a high sensitivity to tracking errors, require manual post-processing, or are overly complex in terms of computation time. Addressing these issues, we propose a lightweight system which is capable of transferring facial features in both 2D and 3D. This is accomplished by finding a dense correspondence between a source and target face, and then performing Poisson cloning. We solve the correspondence problem efficiently by a sparse initial registration and a subsequent warping, which is refined in a surface matching step using topological projections. Additional processing power is saved by converting extrapolation problems to simple interpolation problems without loss of precision. The final results are photorealistic face transfers in either 2D or 3D between arbitrary facial video streams.

Daniel Merget, Philipp Tiefenbacher, Mohammadreza Babaee, Nikola Mitov, Gerhard Rigoll
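Poisson cloning itself can be sketched compactly: paste the source's gradients into the target region while keeping the target's boundary values. The toy solver below uses naive Gauss-Seidel sweeps on a small masked region (a real system would use a sparse direct or multigrid solver; the mask must stay away from the image border here):

```python
import numpy as np

def poisson_clone(target, source, mask, iters=500):
    """Gradient-domain compositing sketch: inside `mask`, solve the Poisson
    equation with the source's gradients and the target's boundary values."""
    out = target.astype(float).copy()
    src = source.astype(float)
    ys, xs = np.where(mask)
    for _ in range(iters):
        for y, x in zip(ys, xs):            # Gauss-Seidel sweep
            nb = [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
            s = sum(out[p] for p in nb)     # current composite neighbours
            g = sum(src[y, x] - src[p] for p in nb)  # source gradients
            out[y, x] = (s + g) / 4.0
    return out
```

The hallmark of the method is that a constant intensity offset between source and target vanishes: only gradients are transferred, so the seam disappears.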
FlowCap: 2D Human Pose from Optical Flow

We estimate 2D human pose from video using only optical flow. The key insight is that dense optical flow can provide information about 2D body pose. Like range data, flow is largely invariant to appearance, but unlike depth it can be directly computed from monocular video. We demonstrate that body parts can be detected from dense flow using the same random forest approach used by the Microsoft Kinect. Unlike range data, however, when people stop moving there is no optical flow and they effectively disappear. To address this, our FlowCap method uses a Kalman filter to propagate body part positions and velocities over time and a regression method to predict 2D body pose from part centers. No range sensor is required, and FlowCap estimates 2D human pose from monocular video sources containing human motion. Such sources include hand-held phone cameras and archival television video. We demonstrate 2D body pose estimation in a range of scenarios and show that the method works with real-time optical flow. The results suggest that optical flow shares invariances with range data that, when complemented with tracking, make it valuable for pose estimation.

Javier Romero, Matthew Loper, Michael J. Black
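The tracking component can be illustrated with a standard constant-velocity Kalman filter per body-part coordinate, which is what carries a part through frames where the flow (and hence the detector) goes silent. The noise values below are illustrative, not the paper's:

```python
import numpy as np

class PartTracker:
    """Constant-velocity Kalman filter for one body-part coordinate."""

    def __init__(self, x0, dt=1.0, q=1e-3, r=0.25):
        self.x = np.array([x0, 0.0])                # position, velocity
        self.P = np.eye(2)                          # state covariance
        self.F = np.array([[1.0, dt], [0.0, 1.0]])  # constant-velocity model
        self.Q = q * np.eye(2)                      # process noise
        self.H = np.array([[1.0, 0.0]])             # we observe position only
        self.R = np.array([[r]])                    # measurement noise

    def predict(self):
        self.x = self.F @ self.x
        self.P = self.F @ self.P @ self.F.T + self.Q
        return self.x[0]

    def update(self, z):
        y = z - self.H @ self.x                     # innovation
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)    # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(2) - K @ self.H) @ self.P
        return self.x[0]
```

When a detection is missing, one simply calls predict() without update(): the estimated velocity coasts the part forward, which is exactly the behaviour needed when a person stops moving and the flow-based detector loses them.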
3D Facial Landmark Detection: How to Deal with Head Rotations?

3D facial landmark detection is important for applications like facial expression analysis and head pose estimation. However, accurate estimation of facial landmarks in 3D under head rotations is still challenging due to perspective variations. Current state-of-the-art methods are based on random forests and rely on a large amount of training data covering the whole range of head rotations. We present a method based on regression forests which can handle rotations even if they are not included in the training data. To achieve this, we modify both the weak predictors of the trees and the leaf node regressors to better adapt to head rotations. Our evaluation on two benchmark datasets, Bosphorus and FRGC v2, shows that our method outperforms state-of-the-art methods with respect to head rotations when trained solely on frontal faces.

Anke Schwarz, Esther-Sabrina Wacker, Manuel Martin, M. Saquib Sarfraz, Rainer Stiefelhagen
Enhanced GPT Correlation for 2D Projection Transformation Invariant Template Matching

This paper describes a newly enhanced technique for 2D projection transformation invariant template matching, GPT (Global Projection Transformation) correlation. The key ideas are threefold. First, we show that an arbitrary 2D projection transformation (PT) with a total of eight parameters can be approximated by a simpler expression. Second, using the simpler PT expression, we propose an efficient computational model for determining sub-optimal values of the eight PT parameters that maximize a normalized cross-correlation value between a PT-superimposed input image and a template. Third, we obtain the optimal eight PT parameters via successive iteration. Experiments using templates and their artificially distorted images with random noise as input images demonstrate that the proposed method is far superior to the former GPT correlation method. Moreover, k-NN classification of handwritten numerals by the proposed method shows a high recognition accuracy through its distortion-tolerant template matching ability.

Toru Wakahara, Yukihiko Yamashita
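The score being maximised over the eight transformation parameters is zero-mean normalised cross-correlation, which is invariant to affine intensity changes of either image:

```python
import numpy as np

def ncc(a, b):
    """Zero-mean normalised cross-correlation of two equally shaped arrays.
    Returns a value in [-1, 1]; 0 if either input is constant."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float((a * b).sum() / denom) if denom else 0.0
```

GPT correlation evaluates this score for a template against the projectively warped input and searches the warp parameters that maximise it; the snippet shows only the score, not the parameter search.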
Semantic Segmentation Based Traffic Light Detection at Day and at Night

Traffic light detection from a moving vehicle is an important technology both for new safety driver assistance functions and for autonomous driving in the city. In this paper we present a machine learning framework for detection of traffic lights that can handle both day and night situations in real time in a unified manner. A semantic segmentation method is employed to generate traffic light candidates, which are then confirmed and classified by a classifier based on geometric and color features. Temporal consistency is enforced by a tracking-by-detection method.

We evaluate our method on a publicly available dataset recorded during the day in order to compare to existing methods, and we show similar performance. We also present an evaluation on two additional datasets containing more than 50 intersections with multiple traffic lights, recorded both during the day and at night, and we show that our method performs consistently in those situations.

Vladimir Haltakov, Jakob Mayr, Christian Unger, Slobodan Ilic
Pose Estimation and Shape Retrieval with Hough Voting in a Continuous Voting Space

In this paper we present a method for 3D shape classification and pose estimation. Our approach is related to the recently popular adaptations of Implicit Shape Models to 3D data, but differs in some key aspects. We propose to omit the quantization of feature descriptors in favor of a better descriptiveness of the training data. Additionally, a continuous voting space, in contrast to the discrete Hough spaces of state-of-the-art approaches, allows for more stable classification results under parameter variations. We evaluate and compare the performance of our approach with recently presented methods. The proposed algorithm achieves the best results on three challenging datasets for 3D shape retrieval.

Viktor Seib, Norman Link, Dietrich Paulus
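Mode finding in a continuous voting space is typically done with a kernel-based procedure such as mean shift rather than accumulator binning. The minimal Gaussian-kernel version below (started from the centroid, a simplifying assumption; the paper's exact mode-seeking scheme may differ) shows why the result degrades gracefully under parameter variations, unlike discrete bins:

```python
import numpy as np

def mean_shift_mode(votes, bandwidth=1.0, iters=50):
    """Locate the densest point of a continuous voting space by mean shift.

    votes : (n, d) array of continuous Hough votes.
    Each iteration moves the estimate to the Gaussian-weighted mean of
    all votes, converging to a density mode.
    """
    x = votes.mean(axis=0)                       # start from the centroid
    for _ in range(iters):
        w = np.exp(-np.sum((votes - x) ** 2, axis=1) / (2 * bandwidth ** 2))
        x = (w[:, None] * votes).sum(axis=0) / w.sum()
    return x
```

Because the bandwidth is a smooth parameter, small changes shift the recovered mode continuously, whereas changing a discrete bin size can flip votes between cells.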
Fast Approximate GMM Soft-Assign for Fine-Grained Image Classification with Large Fisher Vectors

We address two drawbacks of image classification with large Fisher vectors. The first drawback is the computational cost of assigning a large number of patch descriptors to a large number of GMM components. We propose to alleviate that by a generally applicable approximate soft-assignment procedure based on a balanced GMM tree. This approximation significantly reduces the computational complexity while only marginally affecting the fine-grained classification performance. The second drawback is the very high dimensionality of the image representation, which makes classifier learning and inference computationally complex and prone to overtraining. We propose to alleviate that by regularizing the classification model with group Lasso. The resulting block-sparse models achieve better fine-grained classification performance in addition to memory savings and faster prediction. We demonstrate and evaluate our contributions on a standard fine-grained categorization benchmark.

Josip Krapac, Siniša Šegvić
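The approximate soft-assign can be illustrated with a one-level tree over component means: descend to the nearest group, soft-assign only inside it, and leave all other posteriors at zero. The paper's tree is deeper and balanced by construction; this sketch only shows the principle for diagonal-covariance components:

```python
import numpy as np

def soft_assign(x, means, weights, inv_var):
    """Exact posterior over diagonal-Gaussian mixture components for x."""
    ll = np.log(weights) - 0.5 * np.sum((x - means) ** 2 * inv_var, axis=1)
    p = np.exp(ll - ll.max())          # shift for numerical stability
    return p / p.sum()

def tree_soft_assign(x, groups, means, weights, inv_var):
    """Approximate soft-assign: pick the group with the nearest centroid,
    soft-assign only within it, set all other component posteriors to 0."""
    centroids = np.array([means[g].mean(axis=0) for g in groups])
    g = groups[np.argmin(np.sum((x - centroids) ** 2, axis=1))]
    p = np.zeros(len(means))
    p[g] = soft_assign(x, means[g], weights[g], inv_var[g])
    return p
```

When the groups are well separated, the pruned posterior is numerically indistinguishable from the exact one, while only a fraction of the components is ever evaluated.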
Regressor Based Estimation of the Eye Pupil Center

The locations of the eye pupil centers are used in a wide range of computer vision applications. Although there are successful commercial eye gaze tracking systems, their practical employment is limited by the required specialized hardware and extra restrictions on the users. On the other hand, the precision and robustness of off-the-shelf camera-based systems are not at desirable levels. We propose a general-purpose eye pupil center estimation method without any specialized hardware. The system trains a regressor using HoG features on the distance between the ground-truth pupil center and the center of the training patches. We found HoG features to be very useful for capturing the unique gradient angle information around the eye pupils. The system uses a sliding window approach to produce a score image that contains the regressor-estimated distances to the pupil center. The best center positions of the two pupils among the candidate centers are selected from the produced score images. We evaluate our method on the challenging BioID and Columbia CAVE data sets. The results of the experiments are overall very promising, and the system exceeds the precision of comparable state-of-the-art methods. The performance of the proposed system is especially favorable on extreme eye gaze angles and head poses. The results of all test images are publicly available.

Necmeddin Said Karakoc, Samil Karahan, Yusuf Sinan Akgul
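The "unique gradient angle information" is captured by HoG-style orientation histograms. A single-cell version is sketched below (bin count and normalisation are common defaults, not necessarily the paper's settings); the trained distance regressor and the sliding-window scoring are omitted:

```python
import numpy as np

def hog_cell(patch, n_bins=9):
    """Unsigned gradient-orientation histogram of one cell, the HoG
    building block fed to the distance regressor."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0   # unsigned orientation
    idx = (ang / (180.0 / n_bins)).astype(int) % n_bins
    hist = np.zeros(n_bins)
    np.add.at(hist, idx.ravel(), mag.ravel())       # magnitude-weighted votes
    return hist / (np.linalg.norm(hist) + 1e-12)    # L2 normalisation
```

The dark circular pupil produces gradient angles pointing radially in all directions, a histogram signature that is hard to fake elsewhere on the face.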
Patch-Level Spatial Layout for Classification and Weakly Supervised Localization

We propose a discriminative patch-level model which combines appearance and spatial layout cues. We start from a block-sparse model of patch appearance based on the normalized Fisher vector representation. The appearance model is responsible for (i) selecting a discriminative subset of visual words, and (ii) identifying distinctive patches assigned to the selected subset. These patches are further filtered by a sparse spatial model operating on a novel representation of pairwise patch layout. We have evaluated the proposed pipeline in image classification and weakly supervised localization experiments on a public traffic sign dataset. The results show a significant advantage of the combined model over state-of-the-art appearance models.

Valentina Zadrija, Josip Krapac, Jakob Verbeek, Siniša Šegvić
A Deeper Look at Dataset Bias

The presence of a bias in each image data collection has recently attracted a lot of attention in the computer vision community, showing the limits in generalization of any learning method trained on a specific dataset. At the same time, with the rapid development of deep learning architectures, the activation values of Convolutional Neural Networks (CNN) are emerging as reliable and robust image descriptors. In this paper we propose to verify the potential of the DeCAF features when facing the dataset bias problem. We conduct a series of analyses looking at how existing datasets differ among each other and verifying the performance of existing debiasing methods under different representations. We learn important lessons on which part of the dataset bias problem can be considered solved and which open questions still need to be tackled.

Tatiana Tommasi, Novi Patricia, Barbara Caputo, Tinne Tuytelaars
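The classic probe for dataset bias is the "name the dataset" experiment: train a classifier to predict which collection a feature vector came from. A nearest-centroid stand-in (the paper runs stronger classifiers on DeCAF features; this is only a toy) is enough to show the idea, since any above-chance accuracy reveals that the datasets are distinguishable:

```python
import numpy as np

def name_that_dataset(train_feats, train_src, test_feats):
    """Nearest-centroid 'name the dataset' probe.

    train_feats : (n, d) feature matrix, train_src : list of dataset names.
    Returns the predicted dataset name for each row of test_feats.
    """
    labels = sorted(set(train_src))
    src = np.array(train_src)
    cents = np.array([train_feats[src == l].mean(axis=0) for l in labels])
    d = np.linalg.norm(test_feats[:, None] - cents[None], axis=2)
    return [labels[i] for i in d.argmin(axis=1)]
```

On unbiased data the probe should sit at chance level; perfect accuracy, as in the synthetic check below, is the signature of strong bias.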
What Is Holding Back Convnets for Detection?

Convolutional neural networks have recently shown excellent results in general object detection and many other tasks. Albeit very effective, they involve many user-defined design choices. In this paper we want to better understand these choices by inspecting two key aspects “what did the network learn?”, and “what can the network learn?”. We exploit new annotations (Pascal3D+), to enable a new empirical analysis of the R-CNN detector. Despite common belief, our results indicate that existing state-of-the-art convnets are not invariant to various appearance factors. In fact, all considered networks have similar weak points which cannot be mitigated by simply increasing the training data (architectural changes are needed). We show that overall performance can improve when using image renderings as data augmentation. We report the best known results on Pascal3D+ detection and view-point estimation tasks.

Bojan Pepik, Rodrigo Benenson, Tobias Ritschel, Bernt Schiele
A Modified Isomap Approach to Manifold Learning in Word Spotting

Word spotting is an effective paradigm for indexing document images with minimal human effort. Here, the use of the Bag-of-Features principle has been shown to achieve competitive results on different benchmarks. Recently, a spatial pyramid approach was used as a word image representation to improve the retrieval results even further. Attempts to counter the high dimensionality of the spatial pyramids by applying Latent Semantic Analysis, however, lead to increasingly worse results as the dimensionality is reduced. In this paper, we propose a new approach to reducing the dimensionality of word image descriptors which is based on a modified version of the Isomap manifold learning algorithm. This approach not only outperforms Latent Semantic Analysis but can also reduce a word image descriptor to as little as 0.12% of its original size without losing retrieval precision. We evaluate our approach on two different datasets.

Sebastian Sudholt, Gernot A. Fink
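The unmodified Isomap baseline the paper starts from is short enough to sketch in full: a k-NN graph, geodesic distances, and classical MDS. The paper's modification for word-image descriptors is not reproduced here:

```python
import numpy as np

def isomap(X, k=6, d_out=2):
    """Plain Isomap embedding of the rows of X into d_out dimensions."""
    n = len(X)
    D = np.linalg.norm(X[:, None] - X[None], axis=2)
    # symmetric k-nearest-neighbour graph (inf = no edge)
    G = np.full((n, n), np.inf)
    np.fill_diagonal(G, 0.0)
    nn = np.argsort(D, axis=1)[:, 1:k + 1]
    for i in range(n):
        G[i, nn[i]] = D[i, nn[i]]
        G[nn[i], i] = D[i, nn[i]]
    # geodesic distances via Floyd-Warshall
    for m in range(n):
        G = np.minimum(G, G[:, m:m + 1] + G[m:m + 1, :])
    # classical MDS on the geodesic distance matrix
    H = np.eye(n) - 1.0 / n
    B = -0.5 * H @ (G ** 2) @ H
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:d_out]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))
```

For descriptors lying on a curved manifold, the geodesic distances, rather than raw Euclidean ones, are what allow such aggressive dimensionality reduction without destroying neighbourhood structure.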
Offline Writer Identification Using Convolutional Neural Network Activation Features

Convolutional neural networks (CNNs) have recently become the state-of-the-art tool for large-scale image classification. In this work we propose the use of activation features from CNNs as local descriptors for writer identification. A global descriptor is then formed by means of GMM supervector encoding, which is further improved by normalization with the KL-kernel. We evaluate our method on two publicly available datasets: the ICDAR 2013 benchmark database and the CVL dataset. While we perform comparably to the state of the art on CVL, our proposed method yields an absolute improvement of about 0.21 mAP on the challenging bilingual ICDAR dataset.

Vincent Christlein, David Bernecker, Andreas Maier, Elli Angelopoulou
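GMM supervector encoding can be sketched as MAP adaptation of the component means followed by stacking. The relevance factor tau and the diagonal-covariance assumption below are illustrative choices, and the KL-kernel normalisation step is omitted:

```python
import numpy as np

def gmm_supervector(descs, means, inv_var, weights, tau=10.0):
    """Encode a set of local descriptors as a GMM supervector:
    softly assign descriptors to a background GMM, MAP-adapt each
    component mean, and concatenate the adapted means."""
    # posteriors of every descriptor under every diagonal component
    ll = np.log(weights)[None] - 0.5 * np.einsum(
        'nkd,kd->nk', (descs[:, None] - means[None]) ** 2, inv_var)
    p = np.exp(ll - ll.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    nk = p.sum(axis=0)                                   # soft counts
    mk = (p.T @ descs) / np.maximum(nk[:, None], 1e-12)  # weighted means
    alpha = (nk / (nk + tau))[:, None]                   # adaptation strength
    adapted = alpha * mk + (1 - alpha) * means
    return adapted.ravel()
```

Components that see no descriptors keep their background means, so the supervector encodes only how this particular writer's strokes deviate from the universal model.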

Young Researchers Forum

Frontmatter
Superpixel Segmentation: An Evaluation

In recent years, superpixel algorithms have become a standard tool in computer vision and many approaches have been proposed. However, different evaluation methodologies make direct comparison difficult. We address this shortcoming with a thorough and fair comparison of thirteen state-of-the-art superpixel algorithms. To include algorithms utilizing depth information, we present results on both the Berkeley Segmentation Dataset [3] and the NYU Depth Dataset [19]. Based on qualitative and quantitative aspects, our work helps guide algorithm selection by identifying important quality characteristics.

David Stutz
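As an example of the quantitative side of such an evaluation, here is the simple min-overlap form of the undersegmentation error, one of the standard superpixel metrics (several variants exist in the literature; this is not necessarily the exact definition used in the evaluation):

```python
import numpy as np

def undersegmentation_error(sp, gt):
    """Fraction of pixels by which superpixels 'leak' across ground-truth
    segment borders.

    sp : integer label image of superpixels.
    gt : integer label image of ground-truth segments.
    """
    err = 0
    for s in np.unique(sp):
        m = sp == s
        # pixels of this superpixel outside its best-matching gt segment
        overlaps = [np.logical_and(m, gt == g).sum() for g in np.unique(gt[m])]
        err += m.sum() - max(overlaps)
    return err / sp.size
```

A perfect oversegmentation, in which every superpixel lies entirely inside one ground-truth segment, scores 0; any boundary-straddling superpixel adds its leaked pixels to the error.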
Backmatter
Metadata
Title
Pattern Recognition
Edited by
Juergen Gall
Peter Gehler
Bastian Leibe
Copyright Year
2015
Electronic ISBN
978-3-319-24947-6
Print ISBN
978-3-319-24946-9
DOI
https://doi.org/10.1007/978-3-319-24947-6