2019 | Book

Pattern Recognition

41st DAGM German Conference, DAGM GCPR 2019, Dortmund, Germany, September 10–13, 2019, Proceedings

About this book

This book constitutes the proceedings of the 41st DAGM German Conference on Pattern Recognition, DAGM GCPR 2019, held in Dortmund, Germany, in September 2019.

The 43 revised full papers presented were carefully reviewed and selected from 91 submissions. The German Conference on Pattern Recognition is the annual symposium of the German Association for Pattern Recognition (DAGM). It is the national venue for recent advances in image processing, pattern recognition, and computer vision and it follows the long tradition of the DAGM conference series.

Table of Contents

Frontmatter

Oral Session I: Image Processing and Analysis

Frontmatter
Learned Collaborative Stereo Refinement

In this work, we propose a learning-based method to denoise and refine the disparity maps of a given stereo method. The proposed variational network arises naturally from unrolling the iterates of a proximal gradient method applied to a variational energy defined in a joint disparity, color, and confidence image space. Our method allows us to learn a robust collaborative regularizer leveraging the joint statistics of the color image, the confidence map and the disparity map. Due to the variational structure of our method, the individual steps can be easily visualized, thus making the method interpretable. We can therefore provide interesting insights into how our method refines and denoises disparity maps. The efficiency of our method is demonstrated on the publicly available Middlebury 2014 and KITTI 2015 stereo benchmarks.

Patrick Knöbelreiter, Thomas Pock
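
The unrolling idea above can be made concrete with a small sketch. The following is a minimal, illustrative unrolled proximal gradient network, assuming a quadratic data term and a generic learned proximal map; the layer sizes and the residual form are assumptions, not the authors' architecture.

```python
# Minimal sketch of an unrolled proximal gradient ("variational") network.
# The quadratic data term, layer sizes, and residual proximal map are
# illustrative assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class ProxGradStage(nn.Module):
    """One unrolled iterate: a gradient step on the data term followed
    by a learned proximal step that acts as the regularizer."""
    def __init__(self, channels=5):  # e.g. disparity + RGB + confidence
        super().__init__()
        self.step = nn.Parameter(torch.tensor(0.1))   # learned step size
        self.prox = nn.Sequential(                    # learned proximal map
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, x, x0):
        x = x - self.step * (x - x0)   # gradient of 0.5 * ||x - x0||^2
        return x + self.prox(x)        # residual proximal update

class VariationalNet(nn.Module):
    def __init__(self, n_stages=8):
        super().__init__()
        self.stages = nn.ModuleList([ProxGradStage() for _ in range(n_stages)])

    def forward(self, x0):             # x0: stacked disparity/color/confidence
        x = x0
        for stage in self.stages:      # each stage is one visualizable iterate
            x = stage(x, x0)
        return x
```

Because every stage corresponds to one iterate, intermediate outputs can be inspected directly, which is the property behind the interpretability claim in the abstract.
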
Plane Completion and Filtering for Multi-View Stereo Reconstruction

Multi-View Stereo (MVS)-based 3D reconstruction is a major topic in computer vision for which a vast number of methods have been proposed over the last decades, showing impressive visual results. For a long time, benchmarks like Middlebury [32] have numerically ranked the individual methods, considering accuracy and completeness as quality attributes. While the Middlebury benchmark provides low-resolution images only, the recently published ETH3D [31] and Tanks and Temples [19] benchmarks allow for an evaluation of high-resolution and large-scale MVS from natural camera configurations. This benchmarking reveals that still only a few methods can be used for the reconstruction of large-scale models. We present an effective pipeline for large-scale 3D reconstruction which extends existing methods in several ways: (i) We introduce an outlier filtering considering the MVS geometry. (ii) To avoid incomplete models from local matching methods, we propose a plane completion method based on growing superpixels, allowing a generic generation of high-quality 3D models. (iii) Finally, we use deep learning for a subsequent filtering of outliers in segmented sky areas. We give experimental evidence on benchmarks that our contributions improve the quality of the 3D model and that our method is state-of-the-art in high-quality 3D reconstruction from high-resolution images or large image sets.

Andreas Kuhn, Shan Lin, Oliver Erdler
Simultaneous Semantic Segmentation and Outlier Detection in Presence of Domain Shift

Recent success on realistic road driving datasets has increased interest in exploring robust performance in real-world applications. One of the major unsolved problems is to identify image content which cannot be reliably recognized with a given inference engine. We therefore study approaches to recover a dense outlier map alongside the primary task with a single forward pass, by relying on shared convolutional features. We consider semantic segmentation as the primary task and perform extensive validation on WildDash val (inliers), LSUN val (outliers), and pasted objects from Pascal VOC 2007 (outliers). We achieve the best validation performance by training to discriminate inliers from pasted ImageNet-1k content, even though ImageNet-1k contains many road-driving pixels and, at least nominally, fails to account for the full diversity of the visual world. The proposed two-head model performs comparably to the C-way multi-class model trained to predict a uniform distribution in outliers, while outperforming several other validated approaches. We evaluate our best two models on the WildDash test dataset and set a new state of the art on the WildDash benchmark.

Petra Bevandić, Ivan Krešo, Marin Oršić, Siniša Šegvić
3D Bird’s-Eye-View Instance Segmentation

Recent deep learning models achieve impressive results on 3D scene analysis tasks by operating directly on unstructured point clouds. A lot of progress has been made in the field of object classification and semantic segmentation. However, the task of instance segmentation is currently less explored. In this work, we present 3D-BEVIS (3D bird’s-eye-view instance segmentation), a deep learning framework for joint semantic and instance segmentation of 3D point clouds. Following the idea of previous proposal-free instance segmentation approaches, our model learns a feature embedding and groups the obtained feature space into semantic instances. Current point-based methods process local sub-parts of a full scene independently, followed by a heuristic merging step. However, to perform instance segmentation by clustering on a full scene, globally consistent features are required. Therefore, we propose to combine local point geometry with global context information using an intermediate bird’s-eye view representation.

Cathrin Elich, Francis Engelmann, Theodora Kontogianni, Bastian Leibe
Classification-Specific Parts for Improving Fine-Grained Visual Categorization

Fine-grained visual categorization is a classification task for distinguishing categories with high intra-class and small inter-class variance. While global approaches aim at using the whole image for performing the classification, part-based solutions gather additional local information in terms of attentions or parts. We propose a novel classification-specific part estimation that uses an initial prediction as well as back-propagation of feature importance via gradient computations in order to estimate relevant image regions. The subsequently detected parts are then not only selected by a-posteriori classification knowledge, but also have an intrinsic spatial extent that is determined automatically. This is in contrast to most part-based approaches and even to available ground-truth part annotations, which only provide point coordinates and no additional scale information. In our experiments on various widely used fine-grained datasets, we show the effectiveness of this part selection method in conjunction with the extracted part features.

Dimitri Korsch, Paul Bodesheim, Joachim Denzler

Oral Session II: Imaging Techniques, Image Analysis

Frontmatter
Adjustment and Calibration of Dome Port Camera Systems for Underwater Vision

Dome ports act as spherical windows in underwater housings through which a camera can observe objects in the water. As compared to flat glass interfaces, they do not limit the field of view, and they do not cause refraction of light observed by a pinhole camera positioned exactly in the center of the dome. Mechanically adjusting a real lens to this position is a challenging task, in particular for those integrated in deep sea housings. In this contribution a mechanical adjustment procedure based on straight line observations above and below water is proposed that allows for accurate alignments. Additionally, we show a chessboard-based method employing an underwater/above-water image pair to estimate potentially remaining offsets from the dome center to allow refraction correction in photogrammetric applications. Besides providing intuition about the severity of refraction in certain settings, we demonstrate the methods on real data for acrylic and glass domes in the water.

Mengkun She, Yifan Song, Jochen Mohrmann, Kevin Köser
Training Auto-Encoder-Based Optimizers for Terahertz Image Reconstruction

Terahertz (THz) sensing is a promising imaging technology for a wide variety of different applications. Extracting the interpretable and physically meaningful parameters for such applications, however, requires solving an inverse problem in which a model function determined by these parameters needs to be fitted to the measured data. Since the underlying optimization problem is nonconvex and very costly to solve, we propose learning the prediction of suitable parameters from the measured data directly. More precisely, we develop a model-based autoencoder in which the encoder network predicts suitable parameters and the decoder is fixed to a physically meaningful model function, such that we can train the encoding network in an unsupervised way. We illustrate numerically that the resulting network is more than 140 times faster than classical optimization techniques while making predictions with only slightly higher objective values. Using such predictions as starting points of local optimization techniques allows us to converge to better local minima about twice as fast as optimizing without the network-based initialization.

Tak Ming Wong, Matthias Kahl, Peter Haring-Bolívar, Andreas Kolb, Michael Möller
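
The model-based autoencoder concept generalizes beyond THz imaging and is easy to sketch. Below, a trainable encoder predicts parameters and the decoder is a fixed, differentiable model function; the exponential-decay model, encoder layout, and toy signal are all assumptions, not the paper's actual THz model.

```python
# Sketch of a model-based autoencoder: trainable encoder, fixed physical
# decoder. The exponential-decay model function is a stand-in assumption.
import torch
import torch.nn as nn

def physical_model(params, t):
    """Fixed decoder: evaluates the model function on the time grid t.
    params[:, 0] = amplitude, params[:, 1] = decay rate (assumed form)."""
    a, k = params[:, 0:1], params[:, 1:2]
    return a * torch.exp(-k.abs() * t)

encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))

t = torch.linspace(0, 1, 128)                             # measurement grid
y = 0.8 * torch.exp(-3.0 * t) + 0.01 * torch.randn(128)   # toy measurement
y = y.unsqueeze(0)

opt = torch.optim.Adam(encoder.parameters(), lr=1e-3)
for _ in range(1000):
    params = encoder(y)                # predict parameters from the data
    y_hat = physical_model(params, t)  # re-synthesize the signal
    loss = ((y_hat - y) ** 2).mean()   # unsupervised reconstruction loss
    opt.zero_grad(); loss.backward(); opt.step()
```

Once trained, a single encoder forward pass replaces the costly per-sample optimization, and its output can seed a local optimizer, as the abstract describes.
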
Joint Viewpoint and Keypoint Estimation with Real and Synthetic Data

The estimation of viewpoints and keypoints effectively enhances object detection methods by extracting valuable traits of the object instances. While the outputs of the two processes differ, i.e., angles vs. a list of characteristic points, they share the same focus on how the object is placed in the scene, suggesting a certain level of correlation between them. Therefore, we propose a convolutional neural network that jointly computes the viewpoint and keypoints for different object categories. By training both tasks together, each task improves the accuracy of the other. Since the labelling of object keypoints is very time consuming for human annotators, we also introduce a new synthetic dataset with automatically generated viewpoint and keypoint annotations. Our proposed network can also be trained on datasets that contain both viewpoint and keypoint annotations or only one of them. The experiments show that the proposed approach successfully exploits this implicit correlation between the tasks and outperforms previous techniques that are trained independently.

Pau Panareda Busto, Juergen Gall
Non-causal Tracking by Deblatting

Tracking by Deblatting (deblatting = deblurring and matting) stands for solving an inverse problem of deblurring and image matting for tracking motion-blurred objects. We propose non-causal Tracking by Deblatting, which estimates continuous, complete and accurate object trajectories. Energy minimization by dynamic programming is used to detect abrupt changes of motion, called bounces. High-order polynomials are fitted to segments, which are parts of the trajectory separated by bounces. The output is a continuous trajectory function which assigns a location to every real-valued time stamp from zero to the number of frames. Additionally, we show that precise physical calculations are possible from the trajectory function, such as radius, gravity or sub-frame object velocity. Velocity estimation is compared to high-speed camera and radar measurements. Results show high performance of the proposed method in terms of Trajectory-IoU, recall and velocity estimation.

Denys Rozumnyi, Jan Kotera, Filip Šroubek, Jiří Matas

Oral Session III: Learning

Frontmatter
Group Pruning Using a Bounded-$\ell_p$ Norm for Group Gating and Regularization

Deep neural networks achieve state-of-the-art results on several tasks while increasing in complexity. It has been shown that neural networks can be pruned during training by imposing sparsity-inducing regularizers. In this paper, we investigate two techniques for group-wise pruning during training in order to improve network efficiency. We propose a gating factor after every convolutional layer to induce channel-level sparsity, encouraging insignificant channels to become exactly zero. Further, we introduce and analyse a bounded variant of the $\ell_1$ regularizer, which interpolates between the $\ell_1$ and $\ell_0$ norms to retain performance of the network at higher pruning rates. To underline the effectiveness of the proposed methods, we show that the number of parameters of ResNet-164, DenseNet-40 and MobileNetV2 can be reduced by 30%, 69%, and 75% on CIFAR-100, respectively, without a significant drop in accuracy. We achieve state-of-the-art pruning results for ResNet-50 with higher accuracy on ImageNet. Furthermore, we show that the lightweight MobileNetV2 can be compressed further on ImageNet without a significant drop in performance.

Chaithanya Kumar Mummadi, Tim Genewein, Dan Zhang, Thomas Brox, Volker Fischer
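
A rough sketch of the channel-gating idea follows. The capped-ℓ1 surrogate min(|g|/τ, 1) used here is one standard way to interpolate between ℓ1 (large τ) and ℓ0 (small τ); it is an assumption for illustration, not necessarily the paper's exact penalty.

```python
# Sketch of channel gating with a bounded sparsity regularizer. The
# capped-L1 surrogate min(|g|/tau, 1) is an assumed interpolation
# between L1 and L0, not necessarily the paper's exact formulation.
import torch
import torch.nn as nn

class GatedConv(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1)
        self.gate = nn.Parameter(torch.ones(c_out))  # one factor per channel

    def forward(self, x):
        # Scale each output channel by its gate; gates driven to exactly
        # zero mark channels that can be pruned after training.
        return self.conv(x) * self.gate.view(1, -1, 1, 1)

def gate_penalty(module, tau=0.5):
    """Sum of capped-L1 penalties over all gates in the network."""
    pens = [torch.clamp(m.gate.abs() / tau, max=1.0).sum()
            for m in module.modules() if isinstance(m, GatedConv)]
    return torch.stack(pens).sum()

# Usage: total_loss = task_loss + lam * gate_penalty(net)
```
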
On the Estimation of the Wasserstein Distance in Generative Models

Generative Adversarial Networks (GANs) have been used to model the underlying probability distribution of sample-based datasets. GANs are notorious for training difficulties and their dependence on arbitrary hyperparameters. One recent improvement in the GAN literature is to use the Wasserstein distance as loss function, leading to Wasserstein Generative Adversarial Networks (WGANs). Using this as a basis, we show various ways in which the Wasserstein distance is estimated for the task of generative modelling. Additionally, practical insights for training such models are presented and summarized at the end of this work. Where applicable, we extend current works to different algorithms, different cost functions, and different regularization schemes to improve generative models.

Thomas Pinetz, Daniel Soukup, Thomas Pock
Deep Archetypal Analysis

Deep Archetypal Analysis (DeepAA) generates latent representations of high-dimensional datasets in terms of intuitively understandable basic entities called archetypes. The proposed method extends linear Archetypal Analysis (AA), an unsupervised method to represent multivariate data points as convex combinations of extremal data points. Unlike the original formulation, DeepAA is generative and capable of handling side information. In addition, our model provides the ability for data-driven representation learning, which reduces the dependence on expert knowledge. We empirically demonstrate the applicability of our approach by exploring the chemical space of small organic molecules. In doing so, we employ the archetype constraint to learn two different latent archetype representations for the same dataset, with respect to two chemical properties. This type of supervised exploration marks a distinct starting point and lets us steer de novo molecular design.

Sebastian Mathias Keller, Maxim Samarin, Mario Wieser, Volker Roth

Oral Session IV: Image Analysis, Applications

Frontmatter
Single Level Feature-to-Feature Forecasting with Deformable Convolutions

Future anticipation is of vital importance in autonomous driving and other decision-making systems. We present a method to anticipate the semantic segmentation of future frames in driving scenarios based on feature-to-feature forecasting. Our method is based on a semantic segmentation model without lateral connections within the upsampling path. Such a design ensures that the forecasting addresses only the most abstract features at a very coarse resolution. We further propose to express feature-to-feature forecasting with deformable convolutions. This increases the modelling power by being able to represent different motion patterns within a single feature map. Experiments show that our models with deformable convolutions outperform their regular and dilated counterparts while minimally increasing the number of parameters. Our method achieves state-of-the-art performance on the Cityscapes validation set when forecasting nine timesteps into the future.

Josip Šarić, Marin Oršić, Tonći Antunović, Sacha Vražić, Siniša Šegvić
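
The forecasting step can be sketched with torchvision's deformable convolution: sampling offsets are predicted per location from past feature maps and steer where the convolution samples. The channel counts and the offset predictor below are illustrative assumptions, not the paper's F2F module.

```python
# Sketch of feature-to-feature forecasting with a deformable convolution.
# Shapes and the offset predictor are illustrative, not the paper's module.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

C, K = 128, 3
offset_pred = nn.Conv2d(4 * C, 2 * K * K, 3, padding=1)  # offsets from 4 past maps
f2f = DeformConv2d(C, C, K, padding=1)                    # deformable forecast conv

past = [torch.randn(1, C, 16, 32) for _ in range(4)]      # 4 past feature maps
stacked = torch.cat(past, dim=1)
offsets = offset_pred(stacked)        # per-location 2D sampling offsets
forecast = f2f(past[-1], offsets)     # predicted future feature map
```

Predicting offsets per location is what lets a single layer express several distinct motion patterns, which a regular or dilated convolution with one fixed sampling grid cannot.
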
Predicting Landscapes from Environmental Conditions Using Generative Networks

Landscapes are meaningful ecological units that strongly depend on the environmental conditions. Such dependencies between landscapes and the environment have been noted since the beginning of Earth sciences and cast into conceptual models describing the interdependencies of climate, geology, vegetation and geomorphology. Here, we ask whether landscapes, as seen from space, can be statistically predicted from pertinent environmental conditions. To this end we adapted a deep learning generative model in order to establish the relationship between the environmental conditions and the view of landscapes from the Sentinel-2 satellite. We trained a conditional generative adversarial network to generate multispectral imagery given a set of climatic, terrain and anthropogenic predictors. The generated landscape imagery shares many characteristics with the real imagery. Results based on landscape patch metrics, indicative of landscape composition and structure, show that the proposed generative model creates landscapes that are more similar to the targets than the baseline models, while overall reflectance and vegetation cover are predicted better. We demonstrate that for many purposes the generated landscapes behave like real ones, with immediate application to global change studies. We envision the application of machine learning as a tool to forecast the effects of climate change on the spatial features of landscapes, while we assess its limitations and breaking points.

Christian Requena-Mesa, Markus Reichstein, Miguel Mahecha, Basil Kraft, Joachim Denzler
Semi-supervised Segmentation of Salt Bodies in Seismic Images Using an Ensemble of Convolutional Neural Networks

Seismic image analysis plays a crucial role in a wide range of industrial applications and has been receiving significant attention. One of the essential challenges of seismic imaging is detecting subsurface salt structures, which is indispensable for the identification of hydrocarbon reservoirs and drill path planning. Unfortunately, the exact identification of large salt deposits is notoriously difficult, and professional seismic imaging often requires expert human interpretation of salt bodies. Convolutional neural networks (CNNs) have been successfully applied in many fields, and several attempts have been made in the field of seismic imaging. But the high cost of manual annotations by geophysics experts and scarce publicly available labeled datasets hinder the performance of the existing CNN-based methods. In this work, we propose a semi-supervised method for segmentation (delineation) of salt bodies in seismic images which utilizes unlabeled data for multi-round self-training. To reduce error amplification during self-training we propose a scheme which uses an ensemble of CNNs. We show that our approach outperforms the state of the art on the TGS Salt Identification Challenge dataset and ranked first among the 3234 competing methods. The source code is available on GitHub.

Yauhen Babakhin, Artsiom Sanakoyeu, Hirotoshi Kitamura
Entrack: A Data-Driven Maximum-Entropy Approach to Fiber Tractography

The combined effort of brain anatomy experts and computerized methods has continuously improved the quality of available gold-standard tractograms for diffusion-weighted MRI. These prototypical tractograms contain information that can be utilized by other brain mapping applications. However, this transfer requires data-driven tractography algorithms, which learn from example tractograms, to deliver the obtained knowledge to other diffusion-weighted MRI data. The value of these data-driven methods would be greatly enhanced if they could also estimate and control the uncertainty of their predictions. These reasons lead us to propose a generic machine learning method for probabilistic tractography. We demonstrate the general approach with a basic Fisher-von-Mises distribution to model local fiber direction. The distributional parameters are inferred from diffusion data by a neural network. For training the neural network, we derive an analytic, entropy-regularized cost function, which allows us to control model uncertainty in accordance with the level of noise in the data. We highlight the ability of our method to quantify the probability of a given fiber, which makes it a useful tool for outlier detection. The tracking performance of the model is evaluated on the ISMRM 2015 Tractography Challenge.

Viktor Wegmayr, Giacomo Giuliari, Joachim M. Buhmann
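
For orientation, a generic form of such an entropy-regularized directional objective is given below, using the Fisher-von-Mises density on the unit sphere. This is the textbook form of the density and of a maximum-entropy-regularized negative log-likelihood, not the paper's exact derivation.

```latex
% Fisher-von-Mises density on the unit sphere with mean direction \mu and
% concentration \kappa, both predicted by a network with parameters \theta,
% plus an entropy term H weighted by \lambda (generic form, assumed here).
\[
f(\mathbf{v}\mid\boldsymbol{\mu},\kappa)
  = \frac{\kappa}{4\pi\sinh\kappa}\,
    \exp\!\left(\kappa\,\boldsymbol{\mu}^{\top}\mathbf{v}\right),
\qquad
\mathcal{L}(\theta)
  = -\frac{1}{N}\sum_{i=1}^{N}
    \log f\!\left(\mathbf{v}_i\mid
      \boldsymbol{\mu}_\theta(\mathbf{x}_i),\kappa_\theta(\mathbf{x}_i)\right)
    - \lambda\, H\!\left[f\right].
\]
```

Larger λ favors broader (less concentrated) direction distributions, which is how the noise level in the data can be traded off against model confidence.
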

Posters

Frontmatter
Generative Aging of Brain MR-Images and Prediction of Alzheimer Progression

Predicting the age progression of individual brain images from longitudinal data has been a challenging problem, while its solution is considered key to improving dementia prognosis. Often, approaches are limited to group-level predictions, lack the ability to extrapolate, cannot scale to many samples, or do not operate directly on image inputs. We address these issues with the first approach to artificial aging of brain images based on Wasserstein Generative Adversarial Networks. We develop a novel recursive generator model for brain image time series, and train it on large-scale longitudinal data sets (ADNI/AIBL). In addition to a thorough analysis of results on healthy and demented subjects, we demonstrate the predictive value of our brain aging model in the context of conversion prognosis from mild cognitive impairment to Alzheimer’s disease. Conversion prognosis for a baseline image is achieved in two steps. First, we estimate the future brain image with the Generative Adversarial Network. This follow-up image is passed to a CNN classifier, pre-trained to discriminate between mild cognitive impairment and Alzheimer’s disease. It estimates the Alzheimer probability for the follow-up image, which represents an effective measure of future disease risk.

Viktor Wegmayr, Maurice Hörold, Joachim M. Buhmann
Nonlinear Causal Link Estimation Under Hidden Confounding with an Application to Time Series Anomaly Detection

Causality analysis represents one of the most important tasks when examining dynamical systems such as ecological time series. We propose to mitigate the problem of inferring nonlinear cause-effect dependencies in the presence of a hidden confounder by using deep learning with domain knowledge integration. Moreover, we suggest a time series anomaly detection approach using causal link intensity increase as an indicator of the anomaly. Our proposed method is based on the Causal Effect Variational Autoencoder (CEVAE) which we extend and apply to anomaly detection in time series. We evaluate our method on synthetic data having properties of ecological time series and compare to the vector autoregressive Granger causality (VAR-GC) baseline.

Violeta Teodora Trifunov, Maha Shadaydeh, Jakob Runge, Veronika Eyring, Markus Reichstein, Joachim Denzler
Iris Verification with Convolutional Neural Network and Unit-Circle Layer

We propose a novel convolutional neural network to verify a match between two normalized images of the human iris. The network is trained end-to-end and validated on three publicly available datasets, yielding state-of-the-art results against four baseline methods. The network outperforms the state-of-the-art method by a 10% margin on the CASIA.v4 dataset. In the network, we use a novel “Unit-Circle” layer which replaces the Gabor-filtering step in a common iris-verification pipeline. We show that the layer improves the performance of the model by up to 15% on previously unseen data.

Radim Špetlík, Ivan Razumenić
SDNet: Semantically Guided Depth Estimation Network

Autonomous vehicles and robots require a full scene understanding of the environment to interact with it. Such a perception typically incorporates pixel-wise knowledge of the depths and semantic labels for each image from a video sensor. Recent learning-based methods estimate both types of information independently using two separate CNNs. In this paper, we propose a model that is able to predict both outputs simultaneously, which leads to improved results and even reduced computational costs compared to independent estimation of depth and semantics. We also empirically show that the CNN is capable of learning more meaningful and semantically richer features. Furthermore, our SDNet estimates the depth based on ordinal classification. On the basis of these two enhancements, our proposed method achieves state-of-the-art results in semantic segmentation and depth estimation from single monocular input images on two challenging datasets.

Matthias Ochs, Adrian Kretz, Rudolf Mester
Object Segmentation Using Pixel-Wise Adversarial Loss

Recent deep learning based approaches have shown remarkable success on object segmentation tasks. However, there is still room for further improvement. Inspired by generative adversarial networks, we present a generic end-to-end adversarial approach, which can be combined with a wide range of existing semantic segmentation networks to improve their segmentation performance. The key element of our method is to replace the commonly used binary adversarial loss with a high-resolution pixel-wise loss. In addition, we train our generator in a stochastic weight averaging fashion, which further enhances the predicted output label maps, leading to state-of-the-art results. We show that this combination of pixel-wise adversarial training and weight averaging leads to significant and consistent gains in segmentation performance, compared to the baseline models.

Ricard Durall, Franz-Josef Pfreundt, Ullrich Köthe, Janis Keuper
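
The key loss replacement can be sketched briefly: the discriminator is fully convolutional and emits one real/fake logit per pixel, and binary cross-entropy is averaged over all pixels instead of a single image-level decision. The tiny discriminator and the 21-class label maps below are illustrative assumptions.

```python
# Sketch of a pixel-wise adversarial loss: the discriminator outputs a
# full-resolution real/fake map rather than one scalar. Architecture
# and channel counts are illustrative, not the paper's networks.
import torch
import torch.nn as nn

disc = nn.Sequential(                # fully convolutional discriminator
    nn.Conv2d(21, 64, 3, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(64, 1, 3, padding=1),  # one logit per pixel
)
bce = nn.BCEWithLogitsLoss()         # averages over all pixels

def d_loss(real_maps, fake_maps):
    # real_maps/fake_maps: (N, 21, H, W) label maps, e.g. 21 classes
    real_logits = disc(real_maps)
    fake_logits = disc(fake_maps.detach())
    return (bce(real_logits, torch.ones_like(real_logits)) +
            bce(fake_logits, torch.zeros_like(fake_logits)))

def g_adv_loss(fake_maps):
    # Generator (segmentation net) tries to make every pixel look "real".
    fake_logits = disc(fake_maps)
    return bce(fake_logits, torch.ones_like(fake_logits))
```
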
Visual Coin-Tracking: Tracking of Planar Double-Sided Objects

We introduce a new video analysis problem: tracking of rigid planar objects in sequences where both their sides are visible. Such coin-like objects often rotate fast with respect to an arbitrary axis, producing unique challenges such as fast incident light and aspect ratio changes and rotational motion blur. Despite being common, neither tracking sequences containing coin-like objects nor a suitable algorithm has been published. As a second contribution, we present a novel coin-tracking benchmark containing 17 video sequences annotated with object segmentation masks. Experiments show that the sequences differ significantly from the ones encountered in standard tracking datasets. We propose a baseline coin-tracking method based on convolutional neural network segmentation and explicit pose modeling. Its performance confirms that coin-tracking is an open and challenging problem.

Jonáš Šerých, Jiří Matas
Exploiting Attention for Visual Relationship Detection

Visual relationship detection aims at predicting the categories of predicates and object pairs, and also at locating the object pairs. Recognizing the relationships between individual objects is important for describing visual scenes in static images. In this paper, we propose a novel end-to-end framework for the visual relationship detection task. First, we design a spatial attention model for specializing predicate features. Compared to a normal ROI-pooling layer, this structure significantly improves Predicate Classification performance. Second, for extracting the relative spatial configuration, we propose to map simple geometric representations to a high dimension, which boosts relationship detection accuracy. Third, we implement a feature embedding model with a bi-directional RNN which considers subject, predicate and object as a time sequence. We evaluate our method on three tasks. The experiments demonstrate that our method achieves competitive results compared to state-of-the-art methods.

Tongxin Hu, Wentong Liao, Michael Ying Yang, Bodo Rosenhahn
Learning Task-Specific Generalized Convolutions in the Permutohedral Lattice

Dense prediction tasks typically employ encoder-decoder architectures, but the prevalent convolutions in the decoder are not image-adaptive and can lead to boundary artifacts. Different generalized convolution operations have been introduced to counteract this. We go beyond these by leveraging guidance data to redefine their inherent notion of proximity. Our proposed network layer builds on the permutohedral lattice, which performs sparse convolutions in a high-dimensional space allowing for powerful non-local operations despite small filters. Multiple features with different characteristics span this permutohedral space. In contrast to prior work, we learn these features in a task-specific manner by generalizing the basic permutohedral operations to learnt feature representations. As the resulting objective is complex, a carefully designed framework and learning procedure are introduced, yielding rich feature embeddings in practice. We demonstrate the general applicability of our approach in different joint upsampling tasks. When adding our network layer to state-of-the-art networks for optical flow and semantic segmentation, boundary artifacts are removed and the accuracy is improved.

Anne S. Wannenwetsch, Martin Kiefel, Peter V. Gehler, Stefan Roth
Achieving Generalizable Robustness of Deep Neural Networks by Stability Training

We study the recently introduced stability training as a general-purpose method to increase the robustness of deep neural networks against input perturbations. In particular, we explore its use as an alternative to data augmentation and validate its performance against a number of distortion types and transformations, including adversarial examples. In our image classification experiments using ImageNet data, stability training performs on par with or even outperforms data augmentation for specific transformations, while consistently offering improved robustness against a broader range of distortion strengths and types unseen during training, a considerably smaller hyperparameter dependence, and fewer potentially negative side effects compared to data augmentation.

Jan Laermann, Wojciech Samek, Nils Strodthoff
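
Stability training itself has a compact generic form: the task loss on a clean input plus a penalty that keeps the prediction on a perturbed copy close to the prediction on the original. The sketch below uses Gaussian perturbations and a KL stability term; the hyperparameter values are illustrative.

```python
# Sketch of a stability training objective: task loss on the clean input
# plus a penalty keeping outputs on a perturbed copy close to the clean
# outputs. Perturbation type and hyperparameters are illustrative.
import torch
import torch.nn.functional as F

def stability_loss(model, x, y, alpha=0.01, sigma=0.04):
    x_pert = x + sigma * torch.randn_like(x)     # small Gaussian perturbation
    logits, logits_pert = model(x), model(x_pert)
    task = F.cross_entropy(logits, y)            # ordinary classification loss
    # KL divergence between the two predictive distributions as stability term
    stab = F.kl_div(F.log_softmax(logits_pert, dim=1),
                    F.softmax(logits, dim=1), reduction="batchmean")
    return task + alpha * stab
```

Unlike data augmentation, the perturbed copy never receives a label of its own; it is only pulled toward the clean prediction, which is what keeps the hyperparameter dependence small.
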
2D and 3D Segmentation of Uncertain Local Collagen Fiber Orientations in SHG Microscopy

Collagen fiber orientations in bones, visible with Second Harmonic Generation (SHG) microscopy, represent the inner structure and its alteration due to influences like cancer. While analyses of these orientations are valuable for medical research, it is not feasible to analyze the needed large amounts of local orientations manually. Since these local orientations have uncertain borders, only rough regions can be segmented instead of a pixel-wise segmentation. We analyze the effect of these uncertain borders on human performance in a user study. Furthermore, we compare a variety of 2D and 3D methods, from classical approaches like Fourier analysis to state-of-the-art deep neural networks, for the classification of local fiber orientations. We present a general way to use pretrained 2D weights in 3D neural networks, such as Inception-ResNet-3D, a 3D extension of Inception-ResNet-v2. In a 10-fold cross-validation, our two-stage segmentation based on Inception-ResNet-3D and transferred 2D ImageNet weights achieves human-comparable accuracy.

Lars Schmarje, Claudius Zelenka, Ulf Geisen, Claus-C. Glüer, Reinhard Koch
Points2Pix: 3D Point-Cloud to Image Translation Using Conditional GANs

We present the first approach for 3D point-cloud to image translation based on conditional Generative Adversarial Networks (cGAN). The model handles multi-modal information sources from different domains, i.e., raw point-sets and images. The generator is capable of processing three conditions, where the point cloud is encoded as a raw point-set and a camera projection. An image background patch is used as a constraint to bias environmental texturing. A global approximation function within the generator is applied directly on the point cloud (PointNet). Hence, the representation learning model incorporates global 3D characteristics directly at the latent feature space. Conditions are used to bias the background and the viewpoint of the generated image. This opens up new ways of augmenting or texturing 3D data, aiming at the generation of fully individual images. We successfully evaluated our method on the KITTI and SunRGBD datasets with an outstanding object detection inception score.

Stefan Milz, Martin Simon, Kai Fischer, Maximillian Pöpperl, Horst-Michael Gross
MLAttack: Fooling Semantic Segmentation Networks by Multi-layer Attacks

Despite the immense success of deep neural networks, their applicability is limited because they can be fooled by adversarial examples, which are generated by adding visually imperceptible and structured perturbations to the original image. Semantic segmentation is required in several visual recognition tasks, but unlike image classification, only a few studies are available for attacking semantic segmentation networks. The existing semantic segmentation adversarial attacks employ different gradient-based loss functions which are defined using only the last layer of the network for gradient backpropagation. But some components of semantic segmentation networks (like multiscale analysis) implicitly mitigate several adversarial attacks, due to which the existing attacks perform poorly. This motivates us to introduce a new attack in this paper, known as MLAttack, i.e., Multiple Layers Attack. It carefully selects several layers and uses them to define a loss function for gradient-based adversarial attacks on semantic segmentation architectures. Experiments conducted on a publicly available dataset using state-of-the-art segmentation network architectures demonstrate that MLAttack performs better than existing state-of-the-art semantic segmentation attacks.

Puneet Gupta, Esa Rahtu
Not Just a Matter of Semantics: The Relationship Between Visual and Semantic Similarity

Knowledge transfer, zero-shot learning and semantic image retrieval are methods that aim at improving accuracy by utilizing semantic information, e.g., from WordNet. It is assumed that this information can augment or replace missing visual data in the form of labeled training images, because semantic similarity correlates with visual similarity. This assumption may seem trivial, but it is crucial for the application of such semantic methods. Any violation can cause mispredictions. Thus, it is important to examine the visual-semantic relationship for a certain target problem. In this paper, we use five different semantic and five different visual similarity measures to thoroughly analyze the relationship without relying too much on any single definition. We postulate and verify three highly consequential hypotheses on the relationship. Our results show that it indeed exists and that WordNet semantic similarity carries more information about visual similarity than just the knowledge that “different classes look different”. They suggest that classification is not the ideal application for semantic methods and that wrong semantic information is much worse than none.

Clemens-Alexander Brust, Joachim Denzler
DynGraph: Visual Question Answering via Dynamic Scene Graphs

Due to the rise of deep learning, reasoning across various domains, such as vision, language, robotics, and control, has seen major progress in recent years. A popular benchmark for evaluating models for visual reasoning is Visual Question Answering (VQA), which aims at answering questions about a given input image by joining the two modalities: (1) the text representing the question, as well as (2) the visual information extracted from the input image. In this work, we propose a structured approach for VQA that is based on dynamic graphs learned automatically from the input. Unlike the common approach for VQA that relies on an attention mechanism applied to a cell-structured global embedding of the image, our model leverages the rich structure of the image depicted in the object instances and their interactions. In our model, nodes in the graph correspond to object instances present in the image, while the edges represent relations among them. Our model automatically constructs the scene graph and attends to the relations among the nodes to answer the given question. Hence, our model can be trained end-to-end and does not require additional training labels in the form of predefined graphs or relations. We demonstrate the effectiveness of our approach on the challenging open-ended Visual Genome [14] benchmark for VQA.

Monica Haurilet, Ziad Al-Halah, Rainer Stiefelhagen
Training Invertible Neural Networks as Autoencoders

Autoencoders are able to learn useful data representations in an unsupervised manner and have been widely used in various machine learning and computer vision tasks. In this work, we present methods to train Invertible Neural Networks (INNs) as (variational) autoencoders, which we call INN (variational) autoencoders. Our experiments on MNIST, CIFAR and CelebA show that for low bottleneck sizes our INN autoencoder achieves results similar to the classical autoencoder. However, for large bottleneck sizes our INN autoencoder outperforms its classical counterpart. Based on the empirical results, we hypothesize that INN autoencoders might not have any intrinsic information loss and are thereby not bounded to a maximal number of layers (depth) after which only suboptimal results can be achieved (code available at https://github.com/Xenovortex/Training-Invertible-Neural-Networks-as-Autoencoders.git).

The-Gia Leo Nguyen, Lynton Ardizzone, Ullrich Köthe
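
The basic recipe of training an INN as an autoencoder can be sketched as follows: the forward pass yields a full-dimensional code, only the first k dimensions are kept as the bottleneck, and the inverse pass reconstructs from the truncated code with the remaining dimensions zeroed. The single affine coupling block below is a minimal stand-in; the paper's networks and losses may differ in detail.

```python
# Sketch of training an invertible network as an autoencoder with an
# explicit bottleneck. One affine coupling block stands in for a full
# INN; the reconstruction-from-truncated-code scheme is an assumption.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(nn.Linear(self.half, 64), nn.ReLU(),
                                 nn.Linear(64, 2 * (dim - self.half)))

    def forward(self, x):
        x1, x2 = x[:, :self.half], x[:, self.half:]
        s, t = self.net(x1).chunk(2, dim=1)
        return torch.cat([x1, x2 * torch.exp(s) + t], dim=1)

    def inverse(self, y):
        y1, y2 = y[:, :self.half], y[:, self.half:]
        s, t = self.net(y1).chunk(2, dim=1)
        return torch.cat([y1, (y2 - t) * torch.exp(-s)], dim=1)

dim, k = 8, 2                        # input size and bottleneck size
inn = AffineCoupling(dim)
x = torch.randn(32, dim)
z = inn(x)                           # full-dimensional code
z_trunc = torch.cat([z[:, :k], torch.zeros(32, dim - k)], dim=1)
x_rec = inn.inverse(z_trunc)         # reconstruct from truncated code
loss = ((x_rec - x) ** 2).mean()     # optimize inn's parameters with this
```
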
Weakly Supervised Learning of Dense Semantic Correspondences and Segmentation

Finding semantic correspondences is a challenging problem. With the breakthrough of CNNs, stronger features are available for tasks like classification, but not specifically for the requirements of semantic matching. In the following we present a weakly supervised learning approach which generates stronger features by encoding far more context than previous methods. First, we generate more suitable training data using a geometrically informed correspondence mining method which is less prone to spurious matches and requires only image category labels as supervision. Second, we introduce a new convolutional layer which is a learned mixture of differently strided convolutions and allows the network to encode much more context while preserving matching accuracy at the same time. The strong geometric encoding on the feature side enables us to learn a semantic flow network, which generates more natural deformations than parametric transformation based models and is able to predict foreground regions at the same time. Our semantic flow network outperforms the current state of the art on several semantic matching benchmarks, and the learned features show astonishing performance regarding simple nearest-neighbor matching.

Nikolai Ufer, Kam To Lui, Katja Schwarz, Paul Warkentin, Björn Ommer
A Neural-Symbolic Architecture for Inverse Graphics Improved by Lifelong Meta-learning

We follow the idea of formulating vision as inverse graphics and propose a new type of element for this task, a neural-symbolic capsule. It is capable of de-rendering a scene into semantic information feed-forward, as well as rendering it feed-backward. An initial set of capsules for graphical primitives is obtained from a generative grammar and connected into a full capsule network. Lifelong meta-learning continuously improves this network’s detection capabilities by adding capsules for new and more complex objects it detects in a scene using few-shot learning. Preliminary results demonstrate the potential of our novel approach.

Michael Kissner, Helmut Mayer
Unsupervised Multi-source Domain Adaptation Driven by Deep Adversarial Ensemble Learning

We address the problem of multi-source unsupervised domain adaptation (MS-UDA) for the purpose of visual recognition. As opposed to single-source UDA, MS-UDA deals with multiple labeled source domains and a single unlabeled target domain. Notice that conventional MS-UDA training is based on formalizing independent mappings between the target and the individual source domains without explicitly assessing the need for aligning the source domains among themselves. We argue that such a paradigm invariably overlooks the inherent category-level correlation among the source domains which, on the contrary, is deemed to bring meaningful complementarity to the learned shared feature space. In this regard, we propose a novel approach which simultaneously (i) aligns the source domains at the class level in a shared feature space, and (ii) maps the target domain data into the same space through an adversarially trained ensemble of source domain classifiers. Experimental results obtained on the Office-31, ImageCLEF-DA, and Office-Caltech datasets validate that our approach achieves superior accuracy compared to state-of-the-art methods.

Sayan Rakshit, Biplab Banerjee, Gemma Roig, Subhasis Chaudhuri
Time-Frequency Causal Inference Uncovers Anomalous Events in Environmental Systems

Causal inference in dynamical systems is a challenge for different research areas. So far, it has mostly been about understanding to what extent the underlying causal mechanisms can be derived from observed time series. Here we investigate whether anomalous events can also be identified based on the observed changes in causal relationships. We use a parametric time-frequency representation of vector autoregressive Granger causality for causal inference. The time-frequency approach allows for dealing with the nonstationarity of the time series as well as for defining the time scale on which changes occur. We present two representative examples in environmental systems: a land-atmosphere ecosystem and marine climate. We show that an anomalous event can be identified as one where the causal intensities differ, according to a distance measure, from the average causal intensities. The driver of the anomalous event can then be identified based on the analysis of changes in the causal effect relationships.

Maha Shadaydeh, Joachim Denzler, Yanira Guanche García, Miguel Mahecha
Tongue Contour Tracking in Ultrasound Images with Spatiotemporal LSTM Networks

Analysis of ultrasound images of the human tongue has many applications, such as tongue modeling, speech therapy, language education and speech disorder diagnosis. In this paper we propose a novel ultrasound tongue contour tracker that enforces constraints of ultrasound imaging of the tongue, such as spatial and temporal smoothness of the tongue contours. We use three different LSTM networks in sequence to satisfy these constraints. The first network uses only spatial image information from each video frame separately. The second and third networks add temporal information to the results of the first spatial network. Our networks are designed by considering the ultrasound image formation process of the human tongue. We use the polar brightness mode of the ultrasound images, which makes it possible to assume that each column of the image can contain at most one contour position. We tested our system on a dataset that we collected from four volunteers while they read written text. The final accuracy results are very promising and exceed state-of-the-art results while keeping run times at very reasonable levels (several frames per second). We provide the complete results of our system as supplementary material.

Enes Aslan, Yusuf Sinan Akgul
Localized Interactive Instance Segmentation

In current interactive instance segmentation works, the user is granted a free hand when providing clicks to segment an object; clicks are allowed on background pixels and other object instances far from the target object. This form of interaction is highly inconsistent with the end goal of efficiently isolating objects of interest. In our work, we propose a clicking scheme wherein user interactions are restricted to the proximity of the object. In addition, we propose a novel transformation of the user-provided clicks to generate a weak localization prior on the object which is consistent with image structures such as edges and textures. We demonstrate the effectiveness of our proposed clicking scheme and localization strategy through detailed experimentation in which we raise the state of the art on several standard interactive segmentation benchmarks.

Soumajit Majumder, Angela Yao
Iterative Greedy Matching for 3D Human Pose Tracking from Multiple Views

In this work we propose an approach for estimating the 3D human poses of multiple people from a set of calibrated cameras. Estimating 3D human poses from multiple views has several compelling properties: human poses are estimated within a global coordinate space, and multiple cameras provide an extended field of view which helps in resolving ambiguities, occlusions and motion blur. Our approach builds upon a real-time 2D multi-person pose estimation system and greedily solves the association problem between multiple views. We utilize bipartite matching to track multiple people over multiple frames. This proves to be especially efficient, as problems associated with greedy matching, such as occlusion, can be easily resolved in 3D. Our approach achieves state-of-the-art results on popular benchmarks and may serve as a baseline for future work.

Julian Tanke, Juergen Gall
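
The per-frame association step maps naturally onto a standard bipartite assignment, e.g. via the Hungarian algorithm. The sketch below matches existing 3D pose tracks to new detections using mean joint distance; the cost design and gating threshold are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of frame-to-frame track association via bipartite matching.
# Cost function and gating threshold are illustrative assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(tracks, detections, max_dist=0.5):
    """tracks: (T, J, 3) and detections: (D, J, 3) arrays of 3D joint
    positions for T existing tracks and D new pose detections."""
    cost = np.array([[np.linalg.norm(t - d, axis=-1).mean()  # mean joint dist
                      for d in detections] for t in tracks])
    rows, cols = linear_sum_assignment(cost)       # optimal assignment
    matches = [(r, c) for r, c in zip(rows, cols)
               if cost[r, c] < max_dist]           # gate implausible pairs
    return matches  # unmatched detections would spawn new tracks
```

Working in metric 3D space is what makes the gating threshold meaningful: an occluded person reappearing nearby stays within the gate, whereas in 2D image space the same situation is ambiguous.
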
Visual Person Understanding Through Multi-task and Multi-dataset Learning

We address the problem of learning a single model for person re-identification, attribute classification, body part segmentation, and pose estimation. With predictions for these tasks we gain a more holistic understanding of persons, which is valuable for many applications. This is a classical multi-task learning problem. However, no dataset exists from which these tasks could be jointly learned. Hence several datasets need to be combined during training, which in other contexts has often led to reduced performance in the past. We extensively evaluate how the different tasks and datasets influence each other and how different degrees of parameter sharing between the tasks affect performance. Our final model matches or outperforms its single-task counterparts without creating significant computational overhead, rendering it highly interesting for resource-constrained scenarios such as mobile robotics.

Kilian Pfeiffer, Alexander Hermans, István Sárándi, Mark Weber, Bastian Leibe
Dynamic Classifier Chains for Multi-label Learning

In this paper, we deal with the task of building a dynamic ensemble of chain classifiers for multi-label classification. To do so, we propose two classifier chain algorithms that are able to change the label order of the chain without rebuilding the entire model. Such models allow anticipating an instance-specific chain order without a significant increase in computational burden. The proposed chain models are built using the Naive Bayes classifier and nearest-neighbour approaches. To reap the benefits of the proposed algorithms, we developed a simple heuristic that allows the system to find a relatively good label order. The experimental results show that the proposed models and the heuristic are efficient tools for building dynamic chain classifiers.

Pawel Trajdos, Marek Kurzynski
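
For reference, the chain mechanism itself looks as follows: each label's classifier sees the input features plus the labels predicted earlier in the chain. This sketch implements a plain chain with a fixed order supplied at fit and prediction time; the paper's contribution, changing the order without rebuilding the models, is not reproduced here.

```python
# Sketch of a classifier chain for multi-label learning: each label model
# receives the input features plus the labels earlier in the chain.
# A static chain for illustration; the dynamic reordering is the paper's.
import numpy as np
from sklearn.naive_bayes import GaussianNB

class ClassifierChain:
    def __init__(self, n_labels):
        self.models = [GaussianNB() for _ in range(n_labels)]
        self.n_labels = n_labels

    def fit(self, X, Y, order):
        # Y: (n_samples, n_labels) binary matrix; order: label sequence
        for pos, label in enumerate(order):
            prev = Y[:, order[:pos]]            # earlier labels as extra features
            self.models[label].fit(np.hstack([X, prev]), Y[:, label])

    def predict(self, X, order):
        preds = np.zeros((X.shape[0], self.n_labels))
        for pos, label in enumerate(order):
            prev = preds[:, order[:pos]]        # propagate earlier predictions
            preds[:, label] = self.models[label].predict(np.hstack([X, prev]))
        return preds

# Usage: chain = ClassifierChain(4); chain.fit(X, Y, order=[2, 0, 3, 1])
```
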
Learning 3D Semantic Reconstruction on Octrees

We present a fully convolutional neural network that jointly predicts a semantic 3D reconstruction of a scene as well as a corresponding octree representation. This approach leverages the efficiency of an octree data structure to improve the capacities of volumetric semantic 3D reconstruction methods, especially in terms of scalability. At every octree level, the network predicts a semantic class for every voxel and decides which voxels should be further split in order to refine the reconstruction, thus working in a coarse-to-fine manner. The semantic prediction part of our method builds on recent work that combines traditional variational optimization and neural networks. In contrast to previous networks that work on dense voxel grids, our network is much more efficient in terms of memory consumption and inference efficiency, while achieving similar reconstruction performance. This allows for a high resolution reconstruction in case of limited memory. We perform experiments on the SUNCG and ScanNetv2 datasets on which our network shows comparable reconstruction results to the corresponding dense network while consuming less memory.

Xiaojuan Wang, Martin R. Oswald, Ian Cherabier, Marc Pollefeys
Learning to Disentangle Latent Physical Factors for Video Prediction

Physical scene understanding is a fundamental human ability. Empowering artificial systems with such understanding is an important step towards flexible and adaptive behavior in the real world. As a step in this direction, we propose a novel approach to physical scene understanding in video. We train a deep neural network for video prediction which embeds the video sequence in a low-dimensional recurrent latent space representation. We optimize the total correlation of the latent dimensions within a variational recurrent auto-encoder framework. This encourages the representation to disentangle the latent physical factors of variation in the training data. To train and evaluate our approach, we use synthetic video sequences in three different physical scenarios with various degrees of difficulty. Our experiments demonstrate that our model can disentangle several appearance-related properties in the unsupervised case. If we add supervision signals for the latent code, our model can further improve the disentanglement of dynamics-related properties.

Deyao Zhu, Marco Munderloh, Bodo Rosenhahn, Jörg Stückler
Learning to Train with Synthetic Humans

Neural networks need big annotated datasets for training. However, manual annotation can be too expensive or even infeasible for certain tasks, like multi-person 2D pose estimation with severe occlusions. A remedy for this is synthetic data with perfect ground truth. Here we explore two variations of synthetic data for this challenging problem: a dataset with purely synthetic humans and a real dataset augmented with synthetic humans. We then study which approach better generalizes to real data, as well as the influence of virtual humans on the training loss. Using the augmented dataset, without considering synthetic humans in the loss, leads to the best results. We observe that not all synthetic samples are equally informative for training, and that the informative samples are different at each training stage. To exploit this observation, we employ an adversarial student-teacher framework; the teacher improves the student by providing the hardest samples for its current state as a challenge. Experiments show that the student-teacher framework outperforms normal training on the purely synthetic dataset.

David T. Hoffmann, Dimitrios Tzionas, Michael J. Black, Siyu Tang
Backmatter
Metadata
Title
Pattern Recognition
Editors
Gernot A. Fink
Dr. Simone Frintrop
Prof. Xiaoyi Jiang
Copyright Year
2019
Electronic ISBN
978-3-030-33676-9
Print ISBN
978-3-030-33675-2
DOI
https://doi.org/10.1007/978-3-030-33676-9
