2019 | Book

Pattern Recognition

40th German Conference, GCPR 2018, Stuttgart, Germany, October 9-12, 2018, Proceedings

About this book

This book constitutes the refereed proceedings of the 40th German Conference on Pattern Recognition, GCPR 2018, held in Stuttgart, Germany, in October 2018.

The 48 revised full papers presented were carefully reviewed and selected from 118 submissions. The German Conference on Pattern Recognition is the annual symposium of the German Association for Pattern Recognition (DAGM). It is the national venue for recent advances in image processing, pattern recognition, and computer vision, and it follows the long tradition of the DAGM conference series, which was renamed GCPR in 2013 to reflect its increasing internationalization. In 2018 in Stuttgart, the conference series celebrated its 40th anniversary.

Table of Contents

Frontmatter

Poster Session 1

Frontmatter
Topology-Based 3D Reconstruction of Mid-Level Primitives in Man-Made Environments

In this paper a novel reconstruction method is presented that uses the topological relationships of detected image features to create a highly abstract but semantically rich 3D model of the reconstructed scenes. In the first step, a combined image-based reconstruction of points and lines is performed based on current state-of-the-art structure-from-motion methods. Subsequently, connected planar three-dimensional structures are reconstructed by a novel method that uses the topological relationships between the detected image features. The reconstructed 3D models enable a simple extraction of geometric shapes, such as rectangles, in the scene.

Dominik Wolters, Reinhard Koch
Associative Deep Clustering: Training a Classification Network with No Labels

We propose a novel end-to-end clustering training schedule for neural networks that is direct, i.e. the output is a probability distribution over cluster memberships. A neural network maps images to embeddings. We introduce centroid variables that have the same shape as image embeddings. These variables are jointly optimized with the network’s parameters. This is achieved by a cost function that associates the centroid variables with embeddings of input images. Finally, an additional layer maps embeddings to logits, allowing for the direct estimation of the respective cluster membership. Unlike other methods, this does not require any additional classifier to be trained on the embeddings in a separate step. The proposed approach achieves state-of-the-art results in unsupervised classification and we provide an extensive ablation study to demonstrate its capabilities.

Philip Haeusser, Johannes Plapp, Vladimir Golkov, Elie Aljalbout, Daniel Cremers
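
A minimal sketch of the centroid idea, assuming a simple soft-assignment attraction cost in place of the paper's full association loss (all names and the loss form below are ours, not the authors' code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClusterHead(nn.Module):
    """Trainable centroids shaped like the embeddings plus a logit layer."""
    def __init__(self, embed_dim, n_clusters):
        super().__init__()
        # centroid variables have the same shape as image embeddings
        self.centroids = nn.Parameter(torch.randn(n_clusters, embed_dim))
        # additional layer mapping embeddings to cluster logits
        self.to_logits = nn.Linear(embed_dim, n_clusters)

    def forward(self, emb):
        logits = self.to_logits(emb)                    # direct cluster memberships
        dist = torch.cdist(emb, self.centroids)         # (batch, clusters)
        assign = F.softmax(-dist, dim=1)                # soft assignments
        attraction = (assign * dist).sum(dim=1).mean()  # pull embeddings to centroids
        return logits, attraction                       # attraction joins the total loss
```
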
A Table Tennis Robot System Using an Industrial KUKA Robot Arm

In recent years robotic table tennis has become a popular research challenge for image processing and robot control. Here we present a novel table tennis robot system with high-accuracy vision detection and fast robot reaction. Our system is based on an industrial KUKA Agilus R900 sixx robot with 6 DOF. Four cameras are used for ball position detection at 150 fps. We employ a multiple-camera calibration method and use iterative triangulation to reconstruct the 3D ball position with an accuracy of 2.0 mm. In order to detect the fast-flying ball in real time, we combine color and background thresholding. For predicting the ball’s trajectory we test both a curve fitting approach and an extended Kalman filter. Our robot is able to play rallies with a human of up to 50 consecutive strokes and has a general hitting rate of 87%.

Jonas Tebbe, Yapeng Gao, Marc Sastre-Rienietz, Andreas Zell
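
A toy version of the curve-fitting trajectory predictor, assuming a quadratic fitted per coordinate to recent triangulated ball positions and extrapolated to the hitting time (the authors' model and their EKF alternative are more involved):

```python
import numpy as np

def predict_ball_position(t, xyz, t_hit, deg=2):
    """t: (N,) timestamps, xyz: (N, 3) triangulated ball positions."""
    coeffs = [np.polyfit(t, xyz[:, i], deg) for i in range(3)]
    return np.array([np.polyval(c, t_hit) for c in coeffs])

# usage: 16 positions sampled at 150 fps, roughly ballistic z-motion
t = np.linspace(0.0, 0.1, 16)
xyz = np.column_stack([2.0 - 5.0 * t, 0.1 * t, 0.3 + 2.0 * t - 4.9 * t**2])
print(predict_ball_position(t, xyz, t_hit=0.2))
```
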
View-Aware Person Re-identification

Appearance-based person re-identification (PRID) is currently an active and challenging research topic. Recently proposed approaches have mostly dealt with low- and middle-level processing of images. Furthermore, very limited research has focused on view information. View variation limits the performance of most approaches, because a person’s appearance from one view can be completely different from that of another view, which makes re-identification challenging. In this work, we study the influence of the view on PRID and propose several fusion strategies that utilize multi-view information to handle the PRID problem. We perform experiments on a re-mapped version of the Market-1501 dataset and an internal dataset. Our proposed multi-view strategy increases the rank-one recognition rate by a large margin in comparison with that obtained via random view matching or multi-shot matching.

Gregor Blott, Jie Yu, Christian Heipke
MC2SLAM: Real-Time Inertial Lidar Odometry Using Two-Scan Motion Compensation

We propose a real-time, low-drift laser odometry approach that tightly integrates sequentially measured 3D multi-beam LIDAR data with inertial measurements. The laser measurements are motion-compensated using a novel algorithm based on non-rigid registration of two consecutive laser sweeps and a local map. IMU data is tightly integrated by means of factor-graph optimization on a pose graph. We evaluate our method on a public dataset and also obtain results on our own datasets that contain information not commonly found in existing datasets. At the time of writing, our method was ranked within the top five laser-only algorithms of the KITTI odometry benchmark.

Frank Neuhaus, Tilman Koß, Robert Kohnen, Dietrich Paulus
An Analysis by Synthesis Approach for Automatic Vertebral Shape Identification in Clinical QCT

Quantitative computed tomography (QCT) is a widely used tool for osteoporosis diagnosis and monitoring. The assessment of cortical markers like cortical bone mineral density (BMD) and thickness is a demanding task, mainly because of the limited spatial resolution of QCT. We propose a direct model-based method to automatically identify the surface through the center of the cortex of the human vertebra. We develop a statistical bone model and analyze its probability distribution after the imaging process. Using an as-rigid-as-possible deformation, we find the cortical surface that maximizes the likelihood of our model given the input volume. Using the European Spine Phantom (ESP) and a high-resolution µCT scan of a cadaveric vertebra, we show that the proposed method is able to accurately identify the real center of the cortex ex vivo. To demonstrate the in-vivo applicability of our method we use manually obtained surfaces for comparison.

Stefan Reinhold, Timo Damm, Lukas Huber, Reimer Andresen, Reinhard Barkmann, Claus-C. Glüer, Reinhard Koch
Parcel Tracking by Detection in Large Camera Networks

Inside parcel distribution hubs, several tenths of a percent of the up to 100,000 parcels processed each day get lost. Human operators have to tediously recover these parcels by searching through large amounts of video footage from the installed large-scale camera network. We want to assist these operators and work towards an automatic solution. The challenge lies both in the size of the hub with its high number of cameras and in the adverse conditions. We describe and evaluate an industry-scale tracking framework based on state-of-the-art methods such as Mask R-CNN. Moreover, we adapt a Siamese-network-inspired feature vector matching with a novel feature improver network, which increases tracking performance. Our calibration method exploits a calibration parcel and is suitable for both overlapping and non-overlapping camera views. It requires little manual effort and needs only a single drive-by of the calibration parcel for each conveyor belt. With these methods, most parcels can be tracked start-to-end.

Sascha Clausen, Claudius Zelenka, Tobias Schwede, Reinhard Koch
Segmentation of Head and Neck Organs at Risk Using CNN with Batch Dice Loss

This paper deals with the segmentation of organs at risk (OAR) in the head and neck area in CT images, which is a crucial step for reliable intensity-modulated radiotherapy treatment. We introduce a convolutional neural network with encoder-decoder architecture and a new loss function, the batch soft Dice loss function, used to train the network. The resulting model produces segmentations of every OAR in the public MICCAI 2015 Head And Neck Auto-Segmentation Challenge dataset. Despite the heavy class imbalance in the data, we improve on the accuracy of current state-of-the-art methods by 0.33 mm in terms of average surface distance and by 0.11 in terms of the Dice overlap coefficient on average.

Oldřich Kodym, Michal Španěl, Adam Herout
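
A plausible reading of the batch soft Dice loss as a sketch (not the authors' exact formulation): the Dice ratio is accumulated per class over the entire batch rather than per image, which keeps gradients stable when an organ is small or absent in individual samples:

```python
import torch

def batch_soft_dice_loss(probs, target_onehot, eps=1e-6):
    """probs, target_onehot: (B, C, H, W); probs are softmax outputs."""
    dims = (0, 2, 3)                                 # sum over batch and spatial dims
    intersect = (probs * target_onehot).sum(dims)    # per-class, whole batch
    denom = probs.sum(dims) + target_onehot.sum(dims)
    dice = (2 * intersect + eps) / (denom + eps)     # soft Dice per class
    return 1.0 - dice.mean()
```
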
Detection of Mechanical Damages in Sawn Timber Using Convolutional Neural Networks

The quality control of timber products is vital for the sawmill industry pursuing more efficient production processes. This paper considers the automatic detection of mechanical damage to wooden board surfaces that occurs during the sawing process. Due to the high variation in the appearance of the mechanical damage and the presence of several other surface defects on the boards, the detection task is challenging. In this paper, an efficient convolutional neural network based framework that can be trained with a limited amount of annotated training data is proposed. The framework includes a patch extraction step to produce multiple training samples from each damaged region in the board images, followed by patch classification and damage localization steps. In the experiments, multiple network architectures were compared: the VGG-16 architecture achieved the best results with over 92% patch classification accuracy and enabled accurate localization of the mechanical damage.

Nikolay Rudakov, Tuomas Eerola, Lasse Lensu, Heikki Kälviäinen, Heikki Haario
Compressed-Domain Video Object Tracking Using Markov Random Fields with Graph Cuts Optimization

We propose a method for tracking objects in H.264/AVC compressed videos using a Markov Random Field model. Given an initial segmentation of the target object in the first frame, our algorithm applies a graph-cuts-based optimization to output a binary segmentation map for the next frame. Our model uses only the motion vectors and block coding modes from the compressed bitstream. Thus, complexity and storage requirements are significantly reduced compared to pixel-domain algorithms. We evaluate our method over two datasets and compare its performance to a state-of-the-art compressed-domain algorithm. The results show that our method performs better on the more challenging sequences.

Fernando Bombardelli, Serhan Gül, Cornelius Hellge
Metric-Driven Learning of Correspondence Weighting for 2-D/3-D Image Registration

Registration of pre-operative 3-D volumes to intra-operative 2-D X-ray images is important in minimally invasive medical procedures. Rigid registration can be performed by estimating a global rigid motion that optimizes the alignment of local correspondences. However, inaccurate correspondences challenge the registration performance. To minimize their influence, we estimate optimal weights for correspondences using PointNet. We train the network directly with the criterion to minimize the registration error. We propose an objective function which includes point-to-plane correspondence-based motion estimation and projection error computation, thereby enabling the learning of a weighting strategy that optimally fits the underlying formulation of the registration task in an end-to-end fashion. For single-vertebra registration, we achieve an accuracy of $$0.74\pm 0.26$$ mm and highly improved robustness. The success rate is increased from 79.3% to 94.3% and the capture range from 3 mm to 13 mm.

Roman Schaffert, Jian Wang, Peter Fischer, Anja Borsdorf, Andreas Maier
Multi-view X-Ray R-CNN

Motivated by the detection of prohibited objects in carry-on luggage as a part of avionic security screening, we develop a CNN-based object detection approach for multi-view X-ray image data. Our contributions are two-fold. First, we introduce a novel multi-view pooling layer to perform a 3D aggregation of 2D CNN-features extracted from each view. To that end, our pooling layer exploits the known geometry of the imaging system to ensure geometric consistency of the feature aggregation. Second, we introduce an end-to-end trainable multi-view detection pipeline based on Faster R-CNN, which derives the region proposals and performs the final classification in 3D using these aggregated multi-view features. Our approach shows significant accuracy gains compared to single-view detection while even being more efficient than performing single-view detection in each view.

Jan-Martin O. Steitz, Faraz Saeedan, Stefan Roth
Ex Paucis Plura: Learning Affordance Segmentation from Very Few Examples

While annotating objects in images is already time-consuming, annotating finer details like object parts or affordances of objects is even more tedious. Given the fact that large datasets with object annotations already exist, we address the question whether we can leverage such information to train a convolutional neural network for segmenting affordances or object parts from very few examples with finer annotations. To achieve this, we use a semantic alignment network to transfer the annotations from the small set of annotated examples to a large set of images with only coarse annotations at object level. We then train a convolutional neural network weakly supervised on the small annotated training set and the additional images with transferred labels. We evaluate our approach on the IIT-AFF and Pascal Parts datasets, where our approach outperforms other weakly supervised approaches.

Johann Sawatzky, Martin Garbade, Juergen Gall

Oral Session 1: Learning I

Frontmatter
Domain Generalization with Domain-Specific Aggregation Modules

Visual recognition systems are meant to work in the real world. For this to happen, they must work robustly in any visual domain, and not only on the data used during training. Within this context, a very realistic scenario deals with domain generalization, i.e. the ability to build visual recognition algorithms able to work robustly in several visual domains, without having access to any information about target data statistics. This paper contributes to this research thread, proposing a deep architecture that keeps the information about the available source domains separate while at the same time leveraging generic perceptual information. We achieve this by introducing domain-specific aggregation modules that, through an aggregation layer strategy, are able to merge generic and specific information in an effective manner. Experiments on two different benchmark databases show the power of our approach, reaching the new state of the art in domain generalization.

Antonio D’Innocente, Barbara Caputo
X-GAN: Improving Generative Adversarial Networks with ConveX Combinations

Recent neural architectures for image generation are capable of producing photo-realistic results, but the distributions of real and fake images still differ. While the lack of a structured latent representation for GANs results in mode collapse, VAEs enforce a prior on the latent space that leads to an unnatural representation of the underlying real distribution. We introduce a method that preserves the natural structure of the latent manifold. By utilizing neighboring relations within the set of discrete real samples, we reproduce the full continuous latent manifold. We propose a novel image generation network, X-GAN, that creates latent input vectors from random convex combinations of adjacent real samples. This way we ensure a structured and natural latent space without requiring prior assumptions. In our experiments, we show that our model outperforms recent approaches in terms of the missing mode problem while maintaining a high image quality.

Oliver Blum, Biagio Brattoli, Björn Ommer
A Randomized Gradient-Free Attack on ReLU Networks

It has recently been shown that neural networks, but also other classifiers, are vulnerable to so-called adversarial attacks: in object recognition, for example, an almost imperceptible change of the image changes the decision of the classifier. Relatively fast heuristics have been proposed to produce these adversarial inputs, but the problem of finding the optimal adversarial input, that is, the one with the minimal change of the input, is NP-hard. While methods based on mixed-integer optimization which find the optimal adversarial input have been developed, they do not scale to large networks. Currently, the attack scheme proposed by Carlini and Wagner is considered to produce the best adversarial inputs. In this paper we propose a new attack scheme for the class of ReLU networks based on a direct optimization over the resulting linear regions. In our experimental validation we improve over the Carlini-Wagner attack in all but one of 18 experiments, with a relative improvement of up to 9%. As our approach is based on the geometrical structure of ReLU networks, it is less susceptible to defences targeting their functional properties.

Francesco Croce, Matthias Hein
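
The attack exploits the fact that a ReLU network is piecewise affine. The sketch below (ours, not the attack itself) recovers the affine map x -> Ax + b that a small fully connected ReLU network realizes on the linear region containing a given input, by composing the layers with the activation pattern frozen:

```python
import numpy as np

def region_affine_map(x, weights, biases):
    """weights[i]: (n_out, n_in), biases[i]: (n_out,). Returns (A, b) with
    net(y) == A @ y + b for every y in the linear region containing x."""
    A, b, h = np.eye(x.shape[0]), np.zeros_like(x), x
    for W, c in zip(weights[:-1], biases[:-1]):
        pre = W @ h + c
        mask = (pre > 0).astype(float)     # activation pattern at x
        A = (mask[:, None] * W) @ A        # masked layer composes affinely
        b = mask * (W @ b + c)
        h = np.maximum(pre, 0.0)
    W, c = weights[-1], biases[-1]         # final linear (logit) layer
    return W @ A, W @ b + c
```
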
Cross and Learn: Cross-Modal Self-supervision

In this paper we present a self-supervised method for representation learning utilizing two different modalities. Based on the observation that cross-modal information has a high semantic meaning we propose a method to effectively exploit this signal. For our approach we utilize video data since it is available on a large scale and provides easily accessible modalities given by RGB and optical flow. We demonstrate state-of-the-art performance on highly contested action recognition datasets in the context of self-supervised learning. We show that our feature representation also transfers to other tasks and conduct extensive ablation studies to validate our core contributions.

Nawid Sayed, Biagio Brattoli, Björn Ommer
KS(conf): A Light-Weight Test if a ConvNet Operates Outside of Its Specifications

Computer vision systems for automatic image categorization have become accurate and reliable enough that they can run continuously for days or even years as components of real-world commercial applications. A major open problem in this context, however, is quality control. Good classification performance can only be expected if systems run under the specific conditions, in particular data distributions, that they were trained for. Surprisingly, none of the currently used deep network architectures have a built-in functionality that could detect if a network operates on data from a distribution it was not trained for, such that potentially a warning to the human users could be triggered. In this work, we describe KS(conf), a procedure for detecting such outside-of-specifications (out-of-specs) operation, based on statistical testing of the network outputs. We show by extensive experiments using the ImageNet, AwA2 and DAVIS datasets on a variety of ConvNet architectures that KS(conf) reliably detects out-of-specs situations. It furthermore has a number of properties that make it a promising candidate for practical deployment: it is easy to implement, adds almost no overhead to the system, works with all networks, including pretrained ones, and requires no a priori knowledge of how the data distribution could change.

Rémy Sun, Christoph H. Lampert
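
A rough sketch of the testing idea, assuming a two-sample variant (the paper's exact procedure may differ): compare the deployed distribution of top-1 softmax confidences against a reference sample collected under in-specs conditions using a Kolmogorov-Smirnov test:

```python
import numpy as np
from scipy.stats import ks_2samp

def out_of_specs(conf_reference, conf_deployed, alpha=0.01):
    """conf_*: 1-D arrays of max-softmax confidence values."""
    stat, p_value = ks_2samp(conf_reference, conf_deployed)
    return p_value < alpha, stat           # flag if distributions differ

rng = np.random.default_rng(0)
ref = rng.beta(8, 2, size=5000)            # confident, in-specs outputs
dep = rng.beta(4, 4, size=500)             # shifted, less confident outputs
print(out_of_specs(ref, dep))              # -> (True, ...)
```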

Joint Oral Session 1

Frontmatter
Sublabel-Accurate Convex Relaxation with Total Generalized Variation Regularization

We propose a novel idea to introduce regularization based on second-order total generalized variation (TGV) into optimization frameworks based on functional lifting. The proposed formulation extends a recent sublabel-accurate relaxation for multi-label problems and thus allows for accurate solutions using only a small number of labels, significantly improving over previous approaches towards lifting the total generalized variation. Moreover, even recent sublabel-accurate methods exhibit staircasing artifacts when used in conjunction with common first-order regularizers such as the total variation (TV). This becomes very obvious, for example, when computing derivatives of disparity maps obtained with these methods in order to derive normals: the local flatness is immediately revealed and yields inaccurate normal maps. We show that our approach is effective in reducing these artifacts, obtaining disparity maps with a smooth normal field in a single optimization pass.

Michael Strecke, Bastian Goldluecke
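
For reference, the second-order TGV regularizer that the abstract lifts is commonly defined as follows (standard form from the literature, not quoted from the paper):

```latex
\[
  \mathrm{TGV}_{\alpha}^{2}(u) \;=\; \min_{w}\;
    \alpha_1 \int_{\Omega} \lvert \nabla u - w \rvert \, dx
    \;+\; \alpha_0 \int_{\Omega} \lvert \mathcal{E}(w) \rvert \, dx,
\]
% where $\mathcal{E}(w) = \tfrac{1}{2}(\nabla w + \nabla w^{\top})$ is the
% symmetrized derivative. Penalizing the deviation of $\nabla u$ from an
% auxiliary field $w$ favors piecewise affine solutions and avoids the
% staircasing that first-order TV produces.
```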

Oral Session 2: Motion and Video I

Frontmatter
On the Integration of Optical Flow and Action Recognition

Most of the top performing action recognition methods use optical flow as a “black box” input. Here we take a deeper look at the combination of flow and action recognition, and investigate why optical flow is helpful, what makes a flow method good for action recognition, and how we can make it better. In particular, we investigate the impact of different flow algorithms and input transformations to better understand how these affect a state-of-the-art action recognition method. Furthermore, we fine-tune two neural-network flow methods end-to-end on the most widely used action recognition dataset (UCF101). Based on these experiments, we make the following five observations: (1) optical flow is useful for action recognition because it is invariant to appearance, (2) optical flow methods are optimized to minimize end-point-error (EPE), but the EPE of current methods is not well correlated with action recognition performance, (3) for the flow methods tested, accuracy at boundaries and at small displacements is most correlated with action recognition performance, (4) training optical flow to minimize classification error instead of minimizing EPE improves recognition performance, and (5) optical flow learned for the task of action recognition differs from traditional optical flow especially inside the human body and at the boundary of the body. These observations may encourage optical flow researchers to look beyond EPE as a goal and guide action recognition researchers to seek better motion cues, leading to a tighter integration of the optical flow and action recognition communities.

Laura Sevilla-Lara, Yiyi Liao, Fatma Güney, Varun Jampani, Andreas Geiger, Michael J. Black
Context-driven Multi-stream LSTM (M-LSTM) for Recognizing Fine-Grained Activity of Drivers

Automatic recognition of in-vehicle activities has significant impact on next-generation intelligent vehicles. In this paper, we present a novel Multi-stream Long Short-Term Memory (M-LSTM) network for recognizing driver activities. We bring together ideas from recent works on LSTMs and on transfer learning for object detection and body pose by exploring the use of deep convolutional neural networks (CNNs). Recent work has also shown that representations such as hand-object interactions are important cues in characterizing human activities. The proposed M-LSTM integrates these ideas under one framework, where two streams focus on appearance information at two different levels of abstraction. The other two streams analyze the contextual information involving the configuration of body parts and body-object interactions. The proposed contextual descriptor is built to be semantically rich and meaningful, and, when coupled with appearance features, it turns out to be highly discriminating. We validate this on two challenging datasets consisting of driver activities.

Ardhendu Behera, Alexander Keidel, Bappaditya Debnath
3D Fluid Flow Estimation with Integrated Particle Reconstruction

The standard approach to densely reconstruct the motion in a volume of fluid is to inject high-contrast tracer particles and record their motion with multiple high-speed cameras. Almost all existing work processes the acquired multi-view video in two separate steps: first, a per-frame reconstruction of the particles, usually in the form of soft occupancy likelihoods in a voxel representation; followed by 3D motion estimation, with some form of dense matching between the precomputed voxel grids from different time steps. In this sequential procedure, the first step cannot use temporal consistency considerations to support the reconstruction, while the second step has no access to the original, high-resolution image data. We show, for the first time, how to jointly reconstruct both the individual tracer particles and a dense 3D fluid motion field from the image data, using an integrated energy minimization. Our hybrid Lagrangian/Eulerian model explicitly reconstructs individual particles, and at the same time recovers a dense 3D motion field in the entire domain. Making particles explicit greatly reduces the memory consumption and allows one to use the high-resolution input images for matching, while the dense motion field makes it possible to include physical a-priori constraints and to account for the incompressibility and viscosity of the fluid. The method exhibits greatly ($$\approx 70\%$$) improved results over a recent baseline with two separate steps for 3D reconstruction and motion estimation. Our results with only two time steps are comparable to those of state-of-the-art tracking-based methods that require much longer sequences.

Katrin Lasinger, Christoph Vogel, Thomas Pock, Konrad Schindler

Joint Oral Session 2

Frontmatter
NRST: Non-rigid Surface Tracking from Monocular Video

We propose an efficient method for non-rigid surface tracking from monocular RGB videos. Given a video and a template mesh, our algorithm sequentially registers the template non-rigidly to each frame. We formulate the per-frame registration as an optimization problem that includes a novel texture term specifically tailored towards tracking objects with uniform texture but fine-scale structure, such as the regular micro-structural patterns of fabric. Our texture term exploits the orientation information in the micro-structures of the objects, e.g., the yarn patterns of fabrics. This enables us to accurately track uniformly colored materials that have these high frequency micro-structures, for which traditional photometric terms are usually less effective. The results demonstrate the effectiveness of our method on both general textured non-rigid objects and monochromatic fabrics.

Marc Habermann, Weipeng Xu, Helge Rhodin, Michael Zollhöfer, Gerard Pons-Moll, Christian Theobalt

Oral Session 3: Applications

Frontmatter
Counting the Uncountable: Deep Semantic Density Estimation from Space

We propose a new method to count objects of specific categories that are significantly smaller than the ground sampling distance of a satellite image. This task is hard due to the cluttered nature of scenes where different object categories occur. Target objects can be partially occluded, vary in appearance within the same class, and look similar to objects of different categories. Since traditional object detection is infeasible due to the small size of objects with respect to the pixel size, we cast object counting as a density estimation problem. To distinguish objects of different classes, our approach combines density estimation with semantic segmentation in an end-to-end learnable convolutional neural network (CNN). Experiments show that deep semantic density estimation can robustly count objects of various classes in cluttered scenes. Experiments also suggest that we need specific CNN architectures in remote sensing instead of blindly applying existing ones from computer vision.

Andres C. Rodriguez, Jan D. Wegner
Acquire, Augment, Segment and Enjoy: Weakly Supervised Instance Segmentation of Supermarket Products

Grocery stores have thousands of products that are usually identified using barcodes with a human in the loop. For automated checkout systems, it is necessary to count and classify the groceries efficiently and robustly. One possibility is to use a deep learning algorithm for instance-aware semantic segmentation. Such methods achieve high accuracies but require a large amount of annotated training data. We propose a system to generate the training annotations in a weakly supervised manner, drastically reducing the labeling effort. We assume that for each training image, only the object class is known. The system automatically segments the corresponding object from the background. The obtained training data is augmented to simulate variations similar to those seen in real-world setups. Our experiments show that with appropriate data augmentation, our approach obtains competitive results compared to a fully-supervised baseline, while drastically reducing the amount of manual labeling.

Patrick Follmann, Bertram Drost, Tobias Böttger
Vehicle Re-identification in Context

Existing vehicle re-identification (re-id) evaluation benchmarks consider strongly artificial test scenarios by assuming the availability of high quality images and fine-grained appearance at an almost constant image scale, reminiscent of images required for Automatic Number Plate Recognition, e.g. VeRi-776. Such assumptions are often invalid in realistic vehicle re-id scenarios where arbitrarily changing image resolutions (scales) are the norm. This makes the existing vehicle re-id benchmarks limited for testing the true performance of a re-id method. In this work, we introduce a more realistic and challenging vehicle re-id benchmark, called Vehicle Re-Identification in Context (VRIC). In contrast to existing vehicle re-id datasets, VRIC is uniquely characterised by vehicle images subject to more realistic and unconstrained variations in resolution (scale), motion blur, illumination, occlusion, and viewpoint. It contains 60,430 images of 5,622 vehicle identities captured by 60 different cameras at heterogeneous road traffic scenes in both day-time and night-time. Given the nature of this new benchmark, we further investigate a multi-scale matching approach to vehicle re-id by learning more discriminative feature representations from multi-resolution images. Extensive evaluations show that the proposed multi-scale method outperforms the state-of-the-art vehicle re-id methods on three benchmark datasets: VehicleID, VeRi-776, and VRIC (available at http://qmul-vric.github.io ).

Aytaç Kanacı, Xiatian Zhu, Shaogang Gong
Low-Shot Learning of Plankton Categories

The size of current plankton image datasets renders manual classification virtually infeasible. The training of models for machine classification is complicated by the fact that a large number of classes consist of only a few examples. We employ the recently introduced weight imprinting technique in order to use the available training data to train accurate classifiers in the absence of enough examples for some classes. The model architecture used in this work succeeds in the identification of plankton despite its unique challenges, i.e. a limited number of training examples and a severely skewed class size distribution. Weight imprinting enables a neural network to recognize small classes immediately without re-training. This permits the mining of examples for novel classes.

Simon-Martin Schröder, Rainer Kiko, Jean-Olivier Irisson, Reinhard Koch

Poster Session 2

Frontmatter
Multimodal Dense Stereo Matching

In this paper, we propose a new approach for dense depth estimation based on multimodal stereo images. Our approach employs a combined cost function utilizing robust metrics and a transformation to an illumination-independent representation. Additionally, we present a confidence-based weighting scheme which allows a pixel-wise weight adjustment within the cost function. We demonstrate the capabilities of our approach using RGB and thermal images. The resulting depth maps are evaluated by comparing them to depth measurements of a Velodyne HDL-64E LiDAR sensor. We show that our method outperforms current state-of-the-art dense matching methods in depth estimation from multimodal input images.

Max Mehltretter, Sebastian P. Kleinschmidt, Bernardo Wagner, Christian Heipke
Deep Distance Transform to Segment Visually Indistinguishable Merged Objects

We design a two stage image segmentation method, comprising a distance transform estimating neural network and watershed segmentation. It allows segmentation and tracking of colliding objects without any assumptions on object behavior or global object appearance as the proposed machine learning step is trained on contour information only. Our method is also capable of segmenting partially vanishing contact surfaces of visually merged objects. The evaluation is performed on a dataset of collisions of Drosophila melanogaster larvae manually labeled with pixel accuracy. The proposed pipeline needs no manual parameter tuning and operates at high frame rates. We provide a detailed evaluation of the neural network design including 1200 trained networks.

Sören Klemm, Xiaoyi Jiang, Benjamin Risse
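
A compact sketch of the second stage of such a two-stage pipeline (ours; in the paper a network predicts the distance map, for which the exact Euclidean distance transform stands in here):

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

def split_merged_objects(binary_mask):
    # stand-in for the CNN-predicted distance map
    dist = ndi.distance_transform_edt(binary_mask)
    # seed one marker per local maximum of the distance map
    peaks = peak_local_max(dist, min_distance=5, labels=binary_mask.astype(int))
    markers = np.zeros(dist.shape, dtype=int)
    markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)
    # watershed on the inverted distance map separates touching objects
    return watershed(-dist, markers, mask=binary_mask)
```
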
Multi-class Cell Segmentation Using CNNs with F-measure Loss Function

Cell segmentation is one of the fundamental problems in biomedical image processing as it is often mandatory for the quantitative analysis of biological processes. Sometimes, a binary segmentation of the cells is not sufficient, for instance if biologists are interested in the appearance of specific cell parts. Such a setting requires multiple foreground classes, which can significantly increase the complexity of the segmentation task. This is especially the case if very fine structures need to be detected. Here, we propose a method for multi-class segmentation of Drosophila macrophages in in-vivo fluorescence microscopy images to segment complex cell structures such as the lamellipodium and filopodia. Our approach is based on a convolutional neural network, more specifically the U-net architecture. The network is trained using a loss function based on the $$F_1$$-measure, which we have extended for multi-class scenarios to account for class imbalances in the image data. We compare the $$F_1$$-measure loss function to a weighted cross entropy loss and show that the CNN outperforms other segmentation approaches.

Aaron Scherzinger, Philipp Hugenroth, Marike Rüder, Sven Bogdan, Xiaoyi Jiang
Improved Semantic Stixels via Multimodal Sensor Fusion

This paper presents a compact and accurate representation of 3D scenes that are observed by a LiDAR sensor and a monocular camera. The proposed method is based on the well-established Stixel model originally developed for stereo vision applications. We extend this Stixel concept to incorporate data from multiple sensor modalities. The resulting mid-level fusion scheme takes full advantage of the geometric accuracy of LiDAR measurements as well as the high resolution and semantic detail of RGB images. The obtained environment model provides a geometrically and semantically consistent representation of the 3D scene at a significantly reduced amount of data while minimizing information loss at the same time. Since the different sensor modalities are considered as input to a joint optimization problem, the solution is obtained with only minor computational overhead. We demonstrate the effectiveness of the proposed multimodal Stixel algorithm on a manually annotated ground truth dataset. Our results indicate that the proposed mid-level fusion of LiDAR and camera data improves both the geometric and semantic accuracy of the Stixel model significantly while reducing the computational overhead as well as the amount of generated data in comparison to using a single modality on its own.

Florian Piewak, Peter Pinggera, Markus Enzweiler, David Pfeiffer, Marius Zöllner
Convolve, Attend and Spell: An Attention-based Sequence-to-Sequence Model for Handwritten Word Recognition

This paper proposes Convolve, Attend and Spell, an attention-based sequence-to-sequence model for handwritten word recognition. The proposed architecture has three main parts: an encoder, consisting of a CNN and a bi-directional GRU; an attention mechanism devoted to focusing on the pertinent features; and a decoder formed by a one-directional GRU, able to spell the corresponding word, character by character. Compared with the recent state of the art, our model achieves competitive results on the IAM dataset without needing any pre-processing step, predefined lexicon or language model. Code and additional results are available at https://github.com/omni-us/research-seq2seq-HTR .

Lei Kang, J. Ignacio Toledo, Pau Riba, Mauricio Villegas, Alicia Fornés, Marçal Rusiñol
Illumination Estimation Is Sufficient for Indoor-Outdoor Image Classification

Indoor-outdoor image classification is a well-known problem for which multiple solutions have been proposed, many of which use both low-level and high-level features put into various models. Despite varying complexity, the accuracy of most of these models is reported to be around 90%. In this paper it is shown that the same accuracy can be obtained by simple manipulation of only low-level features extracted from the image in the early phase of image formation and based on the simplest forms of illumination estimation, namely methods such as Gray-World. Additionally, it is shown how using the built-in camera auto white balance is also enough to effectively achieve state-of-the-art indoor-outdoor classification accuracy. The results are presented and discussed.

Nikola Banić, Sven Lončarić
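
Gray-World, named in the abstract, fits in a few lines. The sketch below shows the illuminant estimate plus a chromaticity feature that a small indoor-outdoor classifier could consume; the pairing with a classifier is our illustration, not the paper's exact pipeline:

```python
import numpy as np

def gray_world_illuminant(img):
    """img: (H, W, 3) linear RGB. Gray-World assumes the scene average is
    achromatic, so the per-channel mean estimates the illuminant color."""
    e = img.reshape(-1, 3).mean(axis=0)
    return e / np.linalg.norm(e)

def chromaticity(e):
    """Intensity-free (r, g) coordinates, a natural low-level feature."""
    return e[:2] / e.sum()
```
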
DeepKey: Towards End-to-End Physical Key Replication from a Single Photograph

This paper describes DeepKey, an end-to-end deep neural architecture capable of taking a digital RGB image of an ‘everyday’ scene containing a pin tumbler key (e.g. lying on a table or carpet) and fully automatically inferring a printable 3D key model. We report on the key detection performance and describe how candidates can be transformed into physical prints. We show an example opening a real-world lock. Our system is described in detail, providing a breakdown of all components including key detection, pose normalisation, bitting segmentation and 3D model inference. We provide an in-depth evaluation and conclude by reflecting on limitations, applications, potential security risks and societal impact. We contribute the DeepKey Datasets of 5,300+ images covering a few test keys with bounding boxes, pose and unaligned mask data.

Rory Smith, Tilo Burghardt
Deriving Neural Network Architectures Using Precision Learning: Parallel-to-Fan Beam Conversion

In this paper, we derive a neural network architecture based on an analytical formulation of the parallel-to-fan beam conversion problem, following the concept of precision learning. The network allows the unknown operators in this conversion to be learned in a data-driven manner, avoiding interpolation and potential loss of resolution. Integration of known operators results in a small number of trainable parameters that can be estimated from synthetic data only. The concept is evaluated in the context of hybrid MRI/X-ray imaging, where a transformation of the parallel-beam MRI projections to fan-beam X-ray projections is required. The proposed method is compared to a traditional rebinning method. The results demonstrate that the proposed method is superior to ray-by-ray interpolation and is able to deliver sharper images using the same number of parallel-beam input projections, which is crucial for interventional applications. We believe that this approach forms a basis for further work uniting deep learning, signal processing, physics, and traditional pattern recognition.

Christopher Syben, Bernhard Stimpel, Jonathan Lommen, Tobias Würfl, Arnd Dörfler, Andreas Maier
Detecting Face Morphing Attacks by Analyzing the Directed Distances of Facial Landmarks Shifts

Face morphing attacks create face images that are verifiable against multiple identities. Associating such images with identity documents leads to faulty identity links, enabling attacks on operations like border crossing. Most previously proposed morphing attack detection approaches directly classify features extracted from the investigated image. We discuss the operational opportunity of having a live face probe to support the morphing detection decision and propose a detection approach that takes advantage of it. Our proposed solution considers the shifting patterns of facial landmarks between reference and probe images. These are represented by directed distances to avoid confusion with shifts caused by other variations. We validated our approach using a publicly available database built on 549 identities. Our proposed detection concept is tested with three landmark detectors and proves to outperform the baseline concept based on handcrafted and transferable CNN features.

Naser Damer, Viola Boller, Yaza Wainakh, Fadi Boutros, Philipp Terhörst, Andreas Braun, Arjan Kuijper
KloudNet: Deep Learning for Sky Image Analysis and Irradiance Forecasting

We present a novel image-based approach for estimating irradiance fluctuations from sky images. Our goal is a very short-term prediction of the irradiance state around a photovoltaic power plant 5–10 min ahead of time, in order to adjust alternative energy sources and ensure a stable energy network. To this end, we propose a convolutional neural network with residual building blocks that learns to predict the future irradiance state from a small set of sky images. Our experiments on two large datasets demonstrate that the network abstracts over local site-specific properties such as day- and month-dependent sun positions, as well as generic properties such as moving, forming, and dissolving clouds or seasonal changes. Moreover, our approach significantly outperforms the established baseline and state-of-the-art methods.

Dinesh Pothineni, Martin R. Oswald, Jan Poland, Marc Pollefeys
Learning Style Compatibility for Furniture

When judging style, a key question that often arises is whether or not a pair of objects are compatible with each other. In this paper we investigate how Siamese networks can be used efficiently for assessing the style compatibility between images of furniture items. We show that the middle layers of pretrained CNNs can capture essential information about furniture style, which allows for efficient applications of such networks for this task. We also use a joint image-text embedding method that allows for the querying of stylistically compatible furniture items, along with additional attribute constraints based on text. To evaluate our methods, we collect and present a large-scale dataset of images of furniture of different style categories accompanied by text attributes.

Divyansh Aggarwal, Elchin Valiyev, Fadime Sener, Angela Yao
Temporal Interpolation as an Unsupervised Pretraining Task for Optical Flow Estimation

The difficulty of annotating training data is a major obstacle to using CNNs for low-level tasks in video. Synthetic data often does not generalize to real videos, while unsupervised methods require heuristic losses. Proxy tasks can overcome these issues: a network is first trained for a task for which annotation is easier or which can be trained unsupervised, and is then fine-tuned for the original task using small amounts of ground truth data. Here, we investigate frame interpolation as a proxy task for optical flow. Using real movies, we train a CNN unsupervised for temporal interpolation. Such a network implicitly estimates motion, but cannot handle untextured regions. By fine-tuning on small amounts of ground truth flow, the network can learn to fill in homogeneous regions and compute full optical flow fields. Using this unsupervised pre-training, our network outperforms similar architectures that were trained supervised using synthetic optical flow.

Jonas Wulff, Michael J. Black
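
A minimal sketch of the proxy task, with the interpolation network and all names hypothetical: the network sees two real frames and is trained, with no labels at all, to reproduce the frame between them:

```python
import torch
import torch.nn.functional as F

def interpolation_step(net, optimizer, frame0, frame1, frame2):
    """frameX: (B, 3, H, W) consecutive real frames; frame1 is the target."""
    pred = net(torch.cat([frame0, frame2], dim=1))  # predict the middle frame
    loss = F.l1_loss(pred, frame1)                  # photometric L1, unsupervised
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
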
Decoupling Respiratory and Angular Variation in Rotational X-ray Scans Using a Prior Bilinear Model

Data-driven respiratory signal extraction from rotational X-ray scans is a challenge as angular effects overlap with respiration-induced change in the scene. In this paper, we use the linearity of the X-ray transform to propose a bilinear model based on a prior 4D scan to separate angular and respiratory variation. The bilinear estimation process is supported by a B-spline interpolation using prior knowledge about the trajectory angle. Consequently, extraction of respiratory features simplifies to a linear problem. Though the need for a prior 4D CT seems steep, our proposed use-case of driving a respiratory motion model in radiation therapy usually meets this requirement. We evaluate on DRRs of 5 patient 4D CTs in a leave-one-phase-out manner and achieve a mean estimation error of $$3.01\%$$ in the gray values for unseen viewing angles. We further demonstrate suitability of the extracted weights to drive a motion model for treatments with a continuously rotating gantry.

Tobias Geimer, Paul Keall, Katharina Breininger, Vincent Caillet, Michelle Dunbar, Christoph Bert, Andreas Maier

Oral Session 4: Learning II

Frontmatter
Inference, Learning and Attention Mechanisms that Exploit and Preserve Sparsity in CNNs

While CNNs naturally lend themselves to densely sampled data, and sophisticated implementations are available, they lack the ability to efficiently process sparse data. In this work we introduce a suite of tools that exploit sparsity in both the feature maps and the filter weights, and thereby allow for significantly lower memory footprints and computation times than the conventional dense framework, when processing data with a high degree of sparsity. Our scheme provides (i) an efficient GPU implementation of a convolution layer based on direct, sparse convolution; (ii) a filter step within the convolution layer, which we call attention, that prevents fill-in, i.e., the tendency of convolution to rapidly decrease sparsity, and guarantees an upper bound on the computational resources; and (iii) an adaptation of back-propagation that makes it possible to combine our approach with standard learning frameworks, while still exploiting sparsity in the data and the model.

Timo Hackel, Mikhail Usvyatsov, Silvano Galliani, Jan Dirk Wegner, Konrad Schindler
End-to-End Learning of Deterministic Decision Trees

Conventional decision trees have a number of favorable properties, including interpretability, a small computational footprint and the ability to learn from little training data. However, they lack a key quality that has helped fuel the deep learning revolution: that of being end-to-end trainable. Kontschieder et al. (2015) addressed this deficit, but at the cost of losing a main attractive trait of decision trees: the fact that each sample is routed along a small subset of tree nodes only. We here propose a model and an Expectation-Maximization training scheme for decision trees that are fully probabilistic at train time, but become deterministic at test time after an annealing process. We analyze the learned oblique split parameters on image datasets and show that neural networks can be trained at each split. In summary, we present an end-to-end learning scheme for deterministic decision trees and present results on par with or superior to published standard oblique decision tree algorithms.

Thomas M. Hehn, Fred A. Hamprecht
Taming the Cross Entropy Loss

We present the Tamed Cross Entropy (TCE) loss function, a robust derivative of the standard Cross Entropy (CE) loss used in deep learning for classification tasks. Unlike other robust losses, however, the TCE loss is designed to exhibit the same training properties as the CE loss in noiseless scenarios. Therefore, the TCE loss requires no modification of the training regime compared to the CE loss and, in consequence, can be applied in all applications where the CE loss is currently used. We evaluate the TCE loss using the ResNet architecture on four image datasets that we artificially contaminated with various levels of label noise. The TCE loss outperforms the CE loss in every tested scenario.

Manuel Martinez, Rainer Stiefelhagen
Supervised Deep Kriging for Single-Image Super-Resolution

We propose a novel single-image super-resolution approach based on the geostatistical method of kriging. Kriging is a zero-bias minimum-variance estimator that performs spatial interpolation based on a weighted average of known observations. Rather than solving for the kriging weights via the traditional method of inverting covariance matrices, we propose a supervised form in which we learn a deep network to generate said weights. We combine the kriging weight generation and the kriging process into a joint network that can be learned end-to-end. Our network achieves super-resolution results competitive with other state-of-the-art methods. In addition, since the super-resolution process follows a known statistical framework, we are able to estimate bias and variance, something which is rarely possible for other deep networks.

Gianni Franchi, Angela Yao, Andreas Kolb
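
For context, classic simple kriging obtains the weights by solving a covariance system, the step the paper replaces with a learned weight generator. A toy sketch with an assumed Gaussian covariance model:

```python
import numpy as np

def gaussian_cov(d, sill=1.0, length=1.5):
    return sill * np.exp(-(d / length) ** 2)

def simple_krige(coords, values, query):
    """coords: (N, 2) known sites, values: (N,), query: (2,) target site."""
    d_kk = np.linalg.norm(coords[:, None] - coords[None], axis=-1)
    d_kq = np.linalg.norm(coords - query, axis=-1)
    K = gaussian_cov(d_kk) + 1e-9 * np.eye(len(values))  # small nugget for stability
    weights = np.linalg.solve(K, gaussian_cov(d_kq))     # kriging weights
    return weights @ values                              # weighted average estimate
```
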
Information-Theoretic Active Learning for Content-Based Image Retrieval

We propose Information-Theoretic Active Learning (ITAL), a novel batch-mode active learning method for binary classification, and apply it for acquiring meaningful user feedback in the context of content-based image retrieval. Instead of combining different heuristics such as uncertainty, diversity, or density, our method is based on maximizing the mutual information between the predicted relevance of the images and the expected user feedback regarding the selected batch. We propose suitable approximations to this computationally demanding problem and also integrate an explicit model of user behavior that accounts for possible incorrect labels and unnameable instances. Furthermore, our approach does not only take the structure of the data but also the expected model output change caused by the user feedback into account. In contrast to other methods, ITAL turns out to be highly flexible and provides state-of-the-art performance across various datasets, such as MIRFLICKR and ImageNet.

Björn Barz, Christoph Käding, Joachim Denzler

Oral Session 5: Optimization and Clustering

Frontmatter
AFSI: Adaptive Restart for Fast Semi-Iterative Schemes for Convex Optimisation

Smooth optimisation problems arise in many fields, including image processing, and having fast methods for solving them has clear benefits. Widely and successfully used strategies to solve them are accelerated gradient methods, which accelerate standard gradient-based schemes by means of extrapolation. Unfortunately, most acceleration strategies are generic, in the sense that they ignore specific information about the objective function. In this paper, we incorporate adaptive restarting into a recently proposed efficient acceleration strategy that was coined the Fast Semi-Iterative (FSI) scheme. Our analysis shows clear advantages of the adaptive restarting in terms of a theoretical convergence rate guarantee and state-of-the-art performance on a challenging image processing task.

Jón Arnar Tómasson, Peter Ochs, Joachim Weickert
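
The restart idea can be illustrated on a generic extrapolated gradient scheme (a sketch of the well-known gradient-based restart criterion, not of the FSI scheme itself, whose extrapolation weights differ): the momentum is reset whenever it stops being a descent direction:

```python
import numpy as np

def accelerated_descent(grad, x0, step, iters=500):
    x, x_prev, k = x0.copy(), x0.copy(), 0
    for _ in range(iters):
        beta = k / (k + 3)                  # extrapolation weight
        y = x + beta * (x - x_prev)         # momentum / extrapolation step
        g = grad(y)
        x_prev, x = x, y - step * g
        if g @ (x - x_prev) > 0:            # update opposes the gradient:
            k = 0                           # restart the momentum
        else:
            k += 1
    return x

# usage on an ill-conditioned convex quadratic f(x) = 0.5 * x @ Q @ x
Q = np.diag([1.0, 100.0])
print(accelerated_descent(lambda z: Q @ z, np.ones(2), step=0.009))
```
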
Invexity Preserving Transformations for Projection Free Optimization with Sparsity Inducing Non-convex Constraints

Forward stagewise and Frank-Wolfe are popular gradient-based, projection-free optimization algorithms which both require convex constraints. We propose a method to extend the applicability of these algorithms to problems of the form $$\min_x f(x) \;\; \text{s.t.} \;\; g(x) \le \kappa$$, where $$f(x)$$ is an invex objective function (invexity is a generalization of convexity and ensures that all local optima are also global optima) and $$g(x)$$ is a non-convex constraint. We provide a theorem which defines a class of monotone component-wise transformation functions $$x_i = h(z_i)$$. These transformations lead to a convex constraint function $$G(z) = g(h(z))$$. Assuming invexity of the original function $$f(x)$$, the same transformation $$x_i = h(z_i)$$ leads to a transformed objective function $$F(z) = f(h(z))$$ which is also invex. For algorithms that rely on a non-zero gradient $$\nabla F$$ to produce new update steps, invexity ensures that these algorithms move forward as long as a descent direction exists.

Sebastian Mathias Keller, Damian Murezzan, Volker Roth
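
An illustrative instance of such a transformation, chosen by us to match the sparsity-inducing setting of the title (not quoted from the paper):

```latex
% Non-convex sparsity constraint: g(x) = \sum_i |x_i|^p with 0 < p < 1.
% The monotone component-wise transformation
\[
  x_i \;=\; h(z_i) \;=\; \operatorname{sign}(z_i)\,\lvert z_i \rvert^{1/p}
\]
% turns it into G(z) = g(h(z)) = \sum_i |z_i| \le \kappa, a convex
% \ell_1 constraint over which projection-free steps are straightforward.
```
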
Unsupervised Label Learning on Manifolds by Spatially Regularized Geometric Assignment

Manifold models of image features abound in computer vision. We present a novel approach that combines unsupervised computation of representative manifold-valued features, called labels, and the spatially regularized assignment of these labels to given manifold-valued data. Both processes evolve dynamically through two Riemannian gradient flows that are coupled. The representation of labels and assignment variables are kept separate, to enable the flexible application to various manifold data models. As a case study, we apply our approach to the unsupervised learning of covariance descriptors on the positive definite matrix manifold, through spatially regularized geometric assignment.

Artjom Zern, Matthias Zisler, Freddie Åström, Stefania Petra, Christoph Schnörr
Backmatter
Metadata
Title
Pattern Recognition
Editors
Thomas Brox
Andrés Bruhn
Mario Fritz
Copyright Year
2019
Electronic ISBN
978-3-030-12939-2
Print ISBN
978-3-030-12938-5
DOI
https://doi.org/10.1007/978-3-030-12939-2
