## About this book

The six-volume set comprising LNCS volumes 11129-11134 constitutes the refereed proceedings of the workshops that took place in conjunction with the 15th European Conference on Computer Vision, ECCV 2018, held in Munich, Germany, in September 2018. 43 workshops were selected from 74 workshop proposals for inclusion in the proceedings. The workshop topics present a good orchestration of new trends and traditional issues, build bridges into neighboring fields, and discuss fundamental technologies and novel applications.

## Table of Contents

### Real-Time Embedded Computer Vision on UAVs

UAVision2018 Workshop Summary

In this paper we present an overview of the contributed work presented at the UAVision2018 ECCV workshop. This workshop focused on real-time image processing on board Unmanned Aerial Vehicles (UAVs). For such applications the computational complexity of state-of-the-art computer vision algorithms often conflicts with the need for real-time operation and the extreme resource limitations of the hardware. Apart from summarizing the accepted workshop papers, this work also aims to identify common challenges and concerns that were addressed by multiple authors during the workshop, together with their proposed solutions.

Kristof Van Beeck, Tinne Tuytelaars, Davide Scaramuzza, Toon Goedemé

### Teaching UAVs to Race: End-to-End Regression of Agile Controls in Simulation

Automating the navigation of unmanned aerial vehicles (UAVs) in diverse scenarios has gained much attention in recent years. However, teaching UAVs to fly in challenging environments remains an unsolved problem, mainly due to the lack of training data. In this paper, we train a deep neural network to predict UAV controls from raw image data for the task of autonomous UAV racing in a photo-realistic simulation. Training is done through imitation learning with data augmentation to allow for the correction of navigation mistakes. Extensive experiments demonstrate that our trained network (when sufficient data augmentation is used) outperforms state-of-the-art methods and flies more consistently than many human pilots. Additionally, we show that our optimized network architecture can run in real-time on embedded hardware, allowing for efficient on-board processing critical for real-world deployment. From a broader perspective, our results underline the importance of extensive data augmentation techniques to improve robustness in end-to-end learning setups.

Matthias Müller, Vincent Casser, Neil Smith, Dominik L. Michels, Bernard Ghanem

### Onboard Hyperspectral Image Compression Using Compressed Sensing and Deep Learning

We propose a real-time onboard compression scheme for hyperspectral datacubes which consists of a very low complexity encoder and a deep learning based parallel decoder architecture for fast decompression. The encoder creates a set of coded snapshots from a given datacube using a measurement code matrix. The decoder decompresses the coded snapshots by using a sparse recovery algorithm. We solve this sparse recovery problem using a deep neural network for fast reconstruction. We present experimental results which demonstrate that our technique performs very well in terms of reconstruction quality and computational requirements compared to other transform based techniques, with some tradeoff in PSNR. The proposed technique also enables faster inference in the compressed domain, which suits on-board requirements.

Saurabh Kumar, Subhasis Chaudhuri, Biplab Banerjee, Feroz Ali
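The encoder step described in this abstract (coded snapshots obtained from a datacube via a measurement code matrix) can be illustrated with a minimal sketch. The array layout, the binary code, and the function name `encode_datacube` are assumptions made for illustration; the paper's actual coding scheme may differ, and the learned decoder is not shown.

```python
import numpy as np


def encode_datacube(cube: np.ndarray, code: np.ndarray) -> np.ndarray:
    """Project each pixel's spectrum onto the rows of a measurement code matrix.

    cube: (H, W, B) hyperspectral datacube with B spectral bands (assumed layout).
    code: (M, B) measurement code matrix with M < B (assumed layout).
    Returns an (H, W, M) array, i.e. M coded snapshots of size H x W.
    """
    if cube.ndim != 3 or code.ndim != 2 or code.shape[1] != cube.shape[2]:
        raise ValueError("expected cube of shape (H, W, B) and code of shape (M, B)")
    # One matrix-vector product per pixel, written as a single einsum.
    return np.einsum("hwb,mb->hwm", cube, code)


# Hypothetical example: 32 bands compressed into 8 coded snapshots.
rng = np.random.default_rng(0)
cube = rng.random((4, 5, 32))
code = rng.choice([0.0, 1.0], size=(8, 32))  # random binary code, an assumption
snapshots = encode_datacube(cube, code)
```

The encoder is a single linear projection, which is what keeps its on-board cost low; all reconstruction effort is pushed to the decoder.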

### SafeUAV: Learning to Estimate Depth and Safe Landing Areas for UAVs from Synthetic Data

The emergence of relatively low cost UAVs has prompted a global concern about the safe operation of such devices. Since most of them can ‘autonomously’ fly by means of GPS way-points, the lack of a higher logic for emergency scenarios leads to an abundance of incidents involving property damage or personal injury. In order to tackle this problem, we propose a small, embeddable ConvNet for both depth and safe landing area estimation. Furthermore, since labeled training data in the 3D aerial field is scarce and ground images are unsuitable, we capture a novel synthetic aerial 3D dataset obtained from 3D reconstructions. We use the synthetic data to learn to estimate depth from in-flight images and segment them into ‘safe-landing’ and ‘obstacle’ regions. Our experiments demonstrate compelling results in practice on both synthetic data and real RGB drone footage.

Alina Marcu, Dragoş Costea, Vlad Licăreţ, Mihai Pîrvu, Emil Sluşanschi, Marius Leordeanu

### Aerial GANeration: Towards Realistic Data Augmentation Using Conditional GANs

Environmental perception for autonomous aerial vehicles is a rising field. Recent years have shown a strong increase of performance in terms of accuracy and efficiency with the aid of convolutional neural networks. Thus, the community has established data sets for benchmarking several kinds of algorithms. However, public data for multi-sensor approaches is rare or not large enough to train very accurate algorithms. For this reason, we propose a method to generate multi-sensor data sets using realistic data augmentation based on conditional generative adversarial networks (cGAN). cGANs have shown impressive results for image-to-image translation. We use this principle for sensor simulation. Hence, there is no need for expensive and complex 3D engines. Our method encodes ground truth data, e.g. semantics or object boxes that could be drawn randomly, in the conditional image to generate realistic, consistent sensor data. Our method is proven for aerial object detection and semantic segmentation on visual data, such as 3D Lidar reconstruction, using the ISPRS and DOTA data sets. We demonstrate qualitative accuracy improvements for state-of-the-art object detection (YOLO) using our augmentation technique.

Stefan Milz, Tobias Rüdiger, Sebastian Süss

### Metrics for Real-Time Mono-VSLAM Evaluation Including IMU Induced Drift with Application to UAV Flight

Vision based algorithms have become popular for state estimation and subsequent (local) control of mobile robots. Currently a large variety of such algorithms exists, and their performance is often characterized through their drift relative to the total trajectory traveled. However, this metric has relatively low relevance for local vehicle control/stabilization. In this paper, we propose a set of metrics that allows the evaluation of a vision based algorithm with respect to its usability for state estimation and subsequent (local) control of highly dynamic autonomous mobile platforms such as multirotor UAVs. As such platforms usually make use of inertial measurements to mitigate the relatively low update rate of the visual algorithm, we particularly focus on a new metric that takes the expected IMU-induced drift between visual readings into consideration, based on the probabilistic properties of the sensor. We demonstrate this set of metrics by comparing ORB-SLAM, LSD-SLAM and DSO on different datasets.

Alexander Hardt-Stremayr, Matthias Schörghuber, Stephan Weiss, Martin Humenberger

### ShuffleDet: Real-Time Vehicle Detection Network in On-Board Embedded UAV Imagery

On-board real-time vehicle detection is of great significance for UAVs and other embedded mobile platforms. We propose a computationally inexpensive detection network for vehicle detection in UAV imagery which we call ShuffleDet. In order to improve speed, we construct our method primarily using channel shuffling and grouped convolutions. We apply inception modules and deformable modules to account for the size and geometric shape of the vehicles. ShuffleDet is evaluated on the CARPK and PUCPR+ datasets and compared against state-of-the-art real-time object detection networks. ShuffleDet requires 3.8 GFLOPs while providing competitive performance on the test sets of both datasets. We show that our algorithm achieves real-time performance by running at 14 frames per second on an NVIDIA Jetson TX2, showing high potential for real-time processing on UAVs.

Seyed Majid Azimi
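Channel shuffling, one of the two building blocks named in this abstract, can be sketched in a few lines. This is the generic operation as commonly paired with grouped convolutions, written in NumPy for illustration; it is not taken from the ShuffleDet implementation, and the function name is an assumption.

```python
import numpy as np


def channel_shuffle(x: np.ndarray, groups: int) -> np.ndarray:
    """Interleave channels across groups for a tensor of shape (N, C, H, W)."""
    n, c, h, w = x.shape
    if c % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    # Split channels into (groups, channels_per_group), swap the two axes,
    # then flatten back so that consecutive channels come from different groups.
    x = x.reshape(n, groups, c // groups, h, w)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(n, c, h, w)


# Six channels in two groups: [0, 1, 2 | 3, 4, 5] becomes [0, 3, 1, 4, 2, 5].
x = np.arange(6).reshape(1, 6, 1, 1)
shuffled = channel_shuffle(x, groups=2)
```

Without this step, stacked grouped convolutions would never mix information between groups; the shuffle restores cross-group flow at the cost of a reshape only.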

### Joint Exploitation of Features and Optical Flow for Real-Time Moving Object Detection on Drones

Moving object detection is an imperative task in computer vision, where it is primarily used for surveillance applications. With the increasing availability of low-altitude aerial vehicles, new challenges for moving object detection have surfaced, both for academia and industry. In this paper, we propose a new approach that can detect moving objects efficiently and handle parallax cases. By introducing sparse flow based parallax handling and downscale processing, we push the boundaries of real-time performance with 16 FPS on limited embedded resources (a five-fold improvement over existing baselines), while performing comparably to, or even improving on, the state of the art on two different datasets. We also present a roadmap for extending our approach to exploit multi-modal data in order to mitigate the need for parameter tuning.

Hazal Lezki, I. Ahu Ozturk, M. Akif Akpinar, M. Kerim Yucel, K. Berker Logoglu, Aykut Erdem, Erkut Erdem

### UAV-GESTURE: A Dataset for UAV Control and Gesture Recognition

Current UAV-recorded datasets are mostly limited to action recognition and object tracking, whereas gesture signal datasets were mostly recorded in indoor spaces. Currently, there is no outdoor recorded public video dataset for UAV commanding signals. Gesture signals can be effectively used with UAVs by leveraging the UAV's visual sensors and operational simplicity. To fill this gap and enable research in wider application areas, we present a UAV gesture signals dataset recorded in an outdoor setting. We selected 13 gestures suitable for basic UAV navigation and command from general aircraft handling and helicopter handling signals. We provide 119 high-definition video clips consisting of 37151 frames. The overall baseline gesture recognition performance computed using a Pose-based Convolutional Neural Network (P-CNN) is 91.9%. All the frames are annotated with body joints and gesture classes in order to extend the dataset’s applicability to a wider research area including gesture recognition, action recognition, human pose recognition and situation awareness.

Asanka G. Perera, Yee Wei Law, Javaan Chahl

### ChangeNet: A Deep Learning Architecture for Visual Change Detection

Growing urban populations necessitate the development of smart cities that can offer better services to their citizens. Drone technology plays a crucial role in the smart city environment and is already involved in a number of functions in smart cities such as traffic control and construction monitoring. A major challenge in fast growing cities is the encroachment of public spaces. A robotic solution using visual change detection can be used for such purposes. For the detection of encroachment, a drone can monitor outdoor urban areas over a period of time to infer the visual changes. Visual change detection is a higher level inference task that aims at accurately identifying variations between a reference image (historical) and a new test image depicting the current scenario. In the case of images, the task is complicated by variations due to environmental conditions, which do not correspond to actual changes. The human mind interprets change by comparing the current status with historical data at the intelligence level rather than using only visual information. In this paper, we present a deep architecture called ChangeNet for detecting changes between pairs of images and expressing them semantically (labeling the change). It is a parallel deep convolutional neural network (CNN) architecture for localizing and identifying the changes between an image pair. The architecture is evaluated with the VL-CMU-CD street view change detection, TSUNAMI and Google Street View (GSV) datasets that resemble drone captured images. The performance of the model under different lighting and seasonal conditions is evaluated quantitatively and qualitatively. The results show that ChangeNet outperforms the state of the art by achieving 98.3% pixel accuracy, 77.35% object based Intersection over Union (IoU) and 88.9% area under the Receiver Operating Characteristic (ROC) curve.

Ashley Varghese, Jayavardhana Gubbi, Akshaya Ramaswamy, P. Balamuralidhar
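Two of the metrics reported in this abstract, pixel accuracy and Intersection over Union, can be computed from a predicted and a reference change mask as sketched below. This is a plain binary-mask version written for illustration; the paper's "object based" IoU protocol may aggregate differently, and the function names are assumptions.

```python
import numpy as np


def pixel_accuracy(pred: np.ndarray, target: np.ndarray) -> float:
    """Fraction of pixels whose predicted change label matches the reference."""
    pred, target = np.asarray(pred, dtype=bool), np.asarray(target, dtype=bool)
    return float((pred == target).mean())


def mask_iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Intersection over Union of two binary change masks."""
    pred, target = np.asarray(pred, dtype=bool), np.asarray(target, dtype=bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return float(np.logical_and(pred, target).sum() / union)


# Toy 2 x 3 masks: four of six pixels agree, one of three changed pixels overlaps.
pred = np.array([[1, 1, 0], [0, 0, 0]])
target = np.array([[1, 0, 0], [0, 1, 0]])
accuracy = pixel_accuracy(pred, target)
iou = mask_iou(pred, target)
```

Pixel accuracy is dominated by the unchanged background when changes are small, which is why it is usually reported together with IoU.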

### DeeSIL: Deep-Shallow Incremental Learning

Incremental Learning (IL) is an interesting AI problem when the algorithm is assumed to work on a budget. This is especially true when IL is modeled using a deep learning approach, where two complex challenges arise: limited memory, which induces catastrophic forgetting, and delays related to the retraining needed in order to incorporate new classes. Here we introduce DeeSIL, an adaptation of a known transfer learning scheme that combines a fixed deep representation used as a feature extractor with independent shallow classifiers learned to increase recognition capacity. This scheme tackles the two aforementioned challenges since it works well with a limited memory budget and each new concept can be added within a minute. Moreover, since no deep retraining is needed when the model is incremented, DeeSIL can integrate larger amounts of initial data that provide more transferable features. Performance is evaluated on ImageNet LSVRC 2012 against three state of the art algorithms. Results show that, at scale, DeeSIL performance is 23 and 33 points higher than the best baseline when using the same and more initial data respectively.
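The scheme in this abstract, a fixed feature extractor with one independently trained shallow classifier per class, can be sketched as follows. Everything here is illustrative: the class name is hypothetical, a ridge-regression scorer stands in for whatever shallow classifier the paper uses, and random 3-D clusters stand in for frozen deep features.

```python
import numpy as np


class ShallowIncrementalClassifier:
    """One independent ridge-regression scorer per class on fixed features."""

    def __init__(self, ridge: float = 1e-3):
        self.ridge = ridge
        self.weights = {}  # class label -> weight vector (last entry is the bias)

    @staticmethod
    def _with_bias(features):
        features = np.asarray(features, dtype=float)
        return np.hstack([features, np.ones((features.shape[0], 1))])

    def add_class(self, label, positives, negatives):
        """Train a scorer for `label` only; existing scorers are left untouched."""
        x = self._with_bias(np.vstack([positives, negatives]))
        y = np.concatenate([np.ones(len(positives)), -np.ones(len(negatives))])
        gram = x.T @ x + self.ridge * np.eye(x.shape[1])
        self.weights[label] = np.linalg.solve(gram, x.T @ y)

    def predict(self, features):
        labels = list(self.weights)
        w = np.stack([self.weights[name] for name in labels], axis=1)
        scores = self._with_bias(features) @ w
        return [labels[i] for i in scores.argmax(axis=1)]


# Stand-in "deep features": tight clusters around one center per class.
rng = np.random.default_rng(0)
centers = {"cat": [3.0, 0.0, 0.0], "dog": [0.0, 3.0, 0.0], "bird": [0.0, 0.0, 3.0]}
feats = {k: np.array(c) + 0.1 * rng.standard_normal((100, 3)) for k, c in centers.items()}

clf = ShallowIncrementalClassifier()
clf.add_class("cat", feats["cat"], feats["dog"])
clf.add_class("dog", feats["dog"], feats["cat"])
cat_weights_before = clf.weights["cat"].copy()

# Incremental step: add "bird" later without retraining "cat" or "dog".
clf.add_class("bird", feats["bird"], np.vstack([feats["cat"], feats["dog"]]))
predictions = clf.predict(np.array([centers["cat"], centers["dog"], centers["bird"]]))
```

Because each class owns its own weight vector, adding a class costs one small linear solve and cannot overwrite earlier classifiers, which is the property the abstract relies on to sidestep deep retraining.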

### Dynamic Adaptation on Non-stationary Visual Domains

Domain adaptation aims to learn models on a supervised source domain that perform well on an unsupervised target. Prior work has examined domain adaptation in the context of stationary domain shifts, i.e. static data sets. However, with large-scale or dynamic data sources, data from a defined domain is not usually available all at once. For instance, in a streaming data scenario, dataset statistics effectively become a function of time. We introduce a framework for adaptation over non-stationary distribution shifts applicable to large-scale and streaming data scenarios. The model is adapted sequentially over incoming unsupervised streaming data batches. This enables improvements over several batches without the need for any additionally annotated data. To demonstrate the effectiveness of our proposed framework, we modify associative domain adaptation to work well on source and target data batches with unequal class distributions. We apply our method to several adaptation benchmark datasets for classification and show improved classifier accuracy not only for the currently adapted batch, but also when applied on future stream batches. Furthermore, we show the applicability of our associative learning modifications to semantic segmentation, where we achieve competitive results.

Sindi Shkodrani, Michael Hofmann, Efstratios Gavves

### Domain Adaptive Semantic Segmentation Through Structure Enhancement

Although fully convolutional networks have recently achieved great advances in semantic segmentation, the performance leaps heavily rely on supervision with pixel-level annotations, which are extremely expensive and time-consuming to collect. Training models on synthetic data is a feasible way to relieve the annotation burden. However, the domain shift between synthetic and real images usually leads to poor generalization performance. In this work, we propose an effective method to adapt the segmentation network trained on synthetic images to real scenarios in an unsupervised fashion. To improve the adaptation performance for semantic segmentation, we enhance the structure information of the target images at both the feature level and the output level. Specifically, we enforce the segmentation network to learn a representation that encodes the target images’ visual cues through image reconstruction, which is beneficial to the structured prediction of the target images. Furthermore, we implement adversarial training at the output space of the segmentation network to align the structured prediction of the source and target images based on the similar spatial structure they share. To validate the performance of our method, we conduct comprehensive experiments on the “GTA5 to Cityscapes” dataset, which is a standard domain adaptation benchmark for semantic segmentation. The experimental results clearly demonstrate that our method can effectively bridge the synthetic and real image domains and obtain better adaptation performance compared with the existing state-of-the-art methods.

Fengmao Lv, Qing Lian, Guowu Yang, Guosheng Lin, Sinno Jialin Pan, Lixin Duan

Massimiliano Mancini, Elisa Ricci, Barbara Caputo, Samuel Rota Bulò

### Generating Shared Latent Variables for Robots to Imitate Human Movements and Understand Their Physical Limitations

Assistive robotics and particularly robot coaches may be very helpful for rehabilitation healthcare. In this context, we propose a method based on Gaussian Process Latent Variable Model (GP-LVM) to transfer knowledge between a physiotherapist, a robot coach and a patient. Our model is able to map visual human body features to robot data in order to facilitate the robot learning and imitation. In addition, we propose to extend the model to adapt the robots’ understanding to patients’ physical limitations during assessment of rehabilitation exercises. Experimental evaluation demonstrates promising results for both robot imitation and model adaptation according to patients’ limitations.

Maxime Devanne, Sao Mai Nguyen

### Model Selection for Generalized Zero-Shot Learning

In the problem of generalized zero-shot learning, the datapoints from unknown classes are not available during training. The main challenge for generalized zero-shot learning is the unbalanced data distribution, which makes it hard for the classifier to distinguish whether a given testing sample comes from a seen or unseen class. However, using a Generative Adversarial Network (GAN) to generate auxiliary datapoints from the semantic embeddings of unseen classes alleviates the above problem. Current approaches combine the auxiliary datapoints and original training data to train the generalized zero-shot learning model and obtain state-of-the-art results. Inspired by such models, we propose to feed the generated data via a model selection mechanism. Specifically, we leverage two sources of datapoints (observed and auxiliary) to train a classifier to recognize which test datapoints come from seen and which from unseen classes. This way, generalized zero-shot learning can be divided into two disjoint classification tasks, thus reducing the negative influence of the unbalanced data distribution. Our evaluations on four publicly available datasets for generalized zero-shot learning show that our model obtains state-of-the-art results.

Hongguang Zhang, Piotr Koniusz

### Multi-Domain Pose Network for Multi-Person Pose Estimation and Tracking

Multi-person human pose estimation and tracking in the wild is important and challenging. For training a powerful model, large-scale training data are crucial. While there are several datasets for human pose estimation, the best practice for training on multiple datasets has not been investigated. In this paper, we present a simple network called Multi-Domain Pose Network (MDPN) to address this problem. By treating the task as multi-domain learning, our method can learn a better representation for pose prediction. Together with prediction head fine-tuning and multi-branch combination, it shows significant improvement over baselines and achieves the best performance on the PoseTrack ECCV 2018 Challenge without additional datasets other than MPII and COCO.

Hengkai Guo, Tang Tang, Guozhong Luo, Riwei Chen, Yongchen Lu, Linfu Wen

### Enhanced Two-Stage Multi-person Pose Estimation

In this paper we introduce an enhanced multi-person pose estimation method for the competition of the PoseTrack [6] workshop at ECCV 2018. We employ a two-stage human pose detector, where human region detection and keypoint detection are performed separately. A strong encoder-decoder network for keypoint detection achieves 70.4% mAP on the PoseTrack 2018 validation dataset.

Hiroto Honda, Tomohiro Kato, Yusuke Uchida

### Multi-person Pose Estimation for Pose Tracking with Enhanced Cascaded Pyramid Network

Multi-person pose estimation is a fundamental yet challenging task in machine learning. In parallel, recent developments in pose estimation have increased interest in pose tracking. In this work, we propose an efficient and powerful method to locate and track human pose. Our proposed method builds upon a state-of-the-art single person pose estimation system (Cascaded Pyramid Network) and adopts the IOU-tracker module to identify people in the wild. We conduct experiments on the released multi-person video pose estimation benchmark (PoseTrack2018) to validate the effectiveness of our network. Our model achieves an accuracy of 80.9% on the validation set and 77.1% on the test set using the Mean Average Precision (MAP) metric, and an accuracy of 64.0% on the validation set and 57.4% on the test set using the Multi-Object Tracking Accuracy (MOTA) metric.

Dongdong Yu, Kai Su, Jia Sun, Changhu Wang
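The idea behind an IoU-based tracker, as mentioned in this abstract, is to carry an identity from one frame to the next whenever a new detection overlaps an existing track strongly enough. The sketch below uses greedy matching on bounding boxes; the function names, the greedy strategy, and the 0.5 threshold are illustrative assumptions, not the paper's module.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over Union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def associate(tracks: Dict[int, Box], detections: List[Box],
              threshold: float = 0.5) -> Dict[int, Box]:
    """Greedily assign each detection to the best-overlapping unmatched track.

    Detections without a match above `threshold` start a new track id.
    Tracks that receive no detection are dropped in this simplified version.
    """
    next_id = max(tracks, default=-1) + 1
    unmatched = dict(tracks)
    updated: Dict[int, Box] = {}
    for det in detections:
        best_id, best_iou = None, threshold
        for track_id, box in unmatched.items():
            score = box_iou(box, det)
            if score >= best_iou:
                best_id, best_iou = track_id, score
        if best_id is None:
            best_id, next_id = next_id, next_id + 1
        else:
            del unmatched[best_id]
        updated[best_id] = det
    return updated


# Two existing people move slightly; a third person appears.
tracks = {0: (0.0, 0.0, 10.0, 10.0), 1: (20.0, 20.0, 30.0, 30.0)}
detections = [(21.0, 20.0, 31.0, 30.0), (1.0, 0.0, 11.0, 10.0), (50.0, 50.0, 60.0, 60.0)]
updated = associate(tracks, detections)
```

In this toy frame the two shifted boxes keep ids 1 and 0, and the new box receives id 2. Identity switches, which the MOTA metric penalizes, occur when such overlap-based matching picks the wrong track.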

### A Top-Down Approach to Articulated Human Pose Estimation and Tracking

Both the tasks of multi-person human pose estimation and pose tracking in videos are quite challenging. Existing methods can be categorized into two groups: top-down and bottom-up approaches. In this paper, following the top-down approach, we aim to build a strong baseline system with three modules: human candidate detector, single-person pose estimator and human pose tracker. Firstly, we choose a generic object detector among state-of-the-art methods to detect human candidates. Then, cascaded pyramid network is used to estimate the corresponding human pose. Finally, we use a flow-based pose tracker to render keypoint-association across frames, i.e., assigning each human candidate a unique and temporally-consistent id, for the multi-target pose tracking purpose. We conduct extensive ablative experiments to validate various choices of models and configurations. We take part in two ECCV’18 PoseTrack challenges ( https://posetrack.net/workshops/eccv2018/posetrack_eccv_2018_results.html ): pose estimation and pose tracking.

Guanghan Ning, Ping Liu, Xiaochuan Fan, Chi Zhang

### Deep Fusion Network for Splicing Forgery Localization

Digital splicing is a common type of image forgery: some regions of an image are replaced with contents from other images. Locating altered regions in a tampered picture is challenging because the difference between the altered regions and the original regions is unknown, and it is thus necessary to search a large hypothesis space for a convincing result. In this paper, we propose a novel deep fusion network to locate the tampered area by tracing its border. A group of deep convolutional neural networks called Base-Net are first trained, each to respond to a certain type of splicing forgery. Then, some layers of the Base-Net are selected and combined as a deep fusion neural network (Fusion-Net). After fine-tuning on a very small number of pictures, Fusion-Net is able to discern whether an image block is synthesized from different origins. Experiments on the benchmark datasets show that our method is effective in various situations and outperforms state-of-the-art methods.

Bo Liu, Chi-Man Pun

### Image Splicing Localization via Semi-global Network and Fully Connected Conditional Random Fields

We address the problem of image splicing localization: given an input image, localizing the spliced region which is cut from another image. We formulate this as a classification task but, critically, instead of classifying the spliced region by local patch alone, we leverage features from the whole image and the local patch together to classify each patch. We call this structure Semi-Global Network. Our approach exploits the observation that the spliced region should relate strongly not only to local features (spliced edges), but also to global features (semantic information, illumination, etc.) from the whole image. Furthermore, we are the first to integrate Fully Connected Conditional Random Fields as a post-processing technique in image splicing to improve the consistency between the input image and the output of the network. We show that our method outperforms other state-of-the-art methods on three popular datasets.

Xiaodong Cun, Chi-Man Pun

### Bridging Machine Learning and Cryptography in Defence Against Adversarial Attacks

Olga Taran, Shideh Rezaeifar, Slava Voloshynovskiy

### Bidirectional Convolutional LSTM for the Detection of Violence in Videos

The field of action recognition has gained tremendous traction in recent years. A subset of this, detection of violent activity in videos, is of great importance, particularly in unmanned surveillance or crowd footage videos. In this work, we explore this problem on three standard benchmarks widely used for violence detection: the Hockey Fights, Movies, and Violent Flows datasets. To this end, we introduce a Spatiotemporal Encoder, built on the Bidirectional Convolutional LSTM (BiConvLSTM) architecture. The addition of bidirectional temporal encodings and an elementwise max pooling of these encodings in the Spatiotemporal Encoder is novel in the field of violence detection. This addition is motivated by a desire to derive better video representations by leveraging long-range information in both temporal directions of the video. We find that the Spatiotemporal Encoder is comparable in performance with existing methods on all of the above datasets. A simplified version of this network, the Spatial Encoder, is sufficient to match state-of-the-art performance on the Hockey Fights and Movies datasets. However, on the Violent Flows dataset, the Spatiotemporal Encoder outperforms the Spatial Encoder.

Alex Hanson, Koutilya PNVR, Sanjukta Krishnagopal, Larry Davis

### Are You Tampering with My Data?

We propose a novel approach towards adversarial attacks on neural networks (NN), focusing on tampering with the data used for training instead of generating attacks on trained models. Our network-agnostic method creates a backdoor during training which can be exploited at test time to force a neural network to exhibit abnormal behaviour. We demonstrate on two widely used datasets (CIFAR-10 and SVHN) that a universal modification of just one pixel per image for all the images of a class in the training set is enough to corrupt the training procedure of several state-of-the-art deep neural networks, causing the networks to misclassify any images to which the modification is applied. Our aim is to bring to the attention of the machine learning community the possibility that even learning-based methods that are personally trained on public datasets can be subject to attacks by a skillful adversary.

Michele Alberti, Vinaychandran Pondenkandath, Marcel Würsch, Manuel Bouillon, Mathias Seuret, Rolf Ingold, Marcus Liwicki

### Adversarial Examples Detection in Features Distance Spaces

Maliciously manipulated inputs for attacking machine learning methods, in particular deep neural networks, are emerging as a relevant issue for the security of recent artificial intelligence technologies, especially in computer vision. In this paper, we focus on attacks targeting image classifiers implemented with deep neural networks, and we propose a method for detecting adversarial images which focuses on the trajectory of internal representations (i.e. hidden-layer neuron activations, also known as deep features) from the very first layer up to the last. We argue that the representations of adversarial inputs follow a different evolution with respect to genuine inputs, and we define a distance-based embedding of features to efficiently encode this information. We train an LSTM network that analyzes the sequence of deep features embedded in a distance space to detect adversarial examples. The results of our preliminary experiments are encouraging: our detection scheme is able to detect adversarial inputs targeted to the ResNet-50 classifier pre-trained on the ILSVRC’12 dataset and generated by a variety of crafting algorithms.

Fabio Carrara, Rudy Becarelli, Roberto Caldelli, Fabrizio Falchi, Giuseppe Amato

### Give Ear to My Face: Modelling Multimodal Attention to Social Interactions

We address the deployment of perceptual attention to social interactions as displayed in conversational clips, when relying on multimodal information (audio and video). A probabilistic modelling framework is proposed that goes beyond the classic saliency paradigm while integrating multiple information cues. Attentional allocation is determined not just by stimulus-driven selection but, importantly, by social value as modulating the selection history of relevant multimodal items. Thus, the construction of attentional priority is the result of a sampling procedure conditioned on the potential value dynamics of socially relevant objects emerging moment to moment within the scene. Preliminary experiments on a publicly available dataset are presented.

Giuseppe Boccignone, Vittorio Cuculo, Alessandro D’Amelio, Giuliano Grossi, Raffaella Lanzarotti

### Investigating Depth Domain Adaptation for Efficient Human Pose Estimation

Convolutional Neural Networks (CNN) are the leading models for human body landmark detection from RGB vision data. However, as such models require high computational load, an alternative is to rely on depth images which, due to their simpler nature, can allow the use of less complex CNNs and hence can lead to a faster detector. As learning CNNs from scratch requires large amounts of labeled data, which is not always available or is expensive to obtain, we propose to rely on simulations and synthetic examples to build a large training dataset with precise labels. Nevertheless, the final performance on real data will suffer from the mismatch between the training and test data, also called domain shift between the source and target distributions. Thus in this paper, our main contribution is to investigate the use of unsupervised domain adaptation techniques to fill the gap in performance introduced by these distribution differences. The challenge lies in the substantial noise differences (not only Gaussian noise, but many missing values around body limbs) between synthetic and real data, as well as the fact that we address a regression task rather than a classification one. In addition, we introduce a new public dataset of synthetically generated depth images to cover the cases of multi-person pose estimation. Our experiments show that domain adaptation provides some improvement, but that further network fine-tuning with real annotated data is worth including to supervise the adaptation process.

Angel Martínez-González, Michael Villamizar, Olivier Canévet, Jean-Marc Odobez

### Filling the Gaps: Predicting Missing Joints of Human Poses Using Denoising Autoencoders

State of the art pose estimators are able to deal with different challenges present in real-world scenarios, such as varying body appearance, lighting conditions and rare body poses. However, when body parts are severely occluded by objects or other people, the resulting poses might be incomplete, negatively affecting applications where estimating a full body pose is important (e.g. gesture and pose-based behavior analysis). In this work, we propose a method for predicting the missing joints from incomplete human poses. In our model we consider missing joints as noise in the input and we use an autoencoder-based solution to enhance the pose prediction. The method can be easily combined with existing pipelines and, by using only 2D coordinates as input data, the resulting model is small and fast to train, yet powerful enough to learn a robust representation of the low dimensional domain. Finally, results show improved predictions over existing pose estimation algorithms.

Nicolò Carissimi, Paolo Rota, Cigdem Beyan, Vittorio Murino

### Pose Guided Human Image Synthesis by View Disentanglement and Enhanced Weighting Loss

View synthesis aims at generating a novel, unseen view of an object. This is a challenging task in the presence of occlusions and asymmetries. In this paper, we present View-Disentangled Generator (VDG), a two-stage deep network for pose-guided human-image generation that performs coarse view prediction followed by a refinement stage. In the first stage, the network predicts the output from a target human pose, the source-image and the corresponding human pose, which are processed in different branches separately. This enables the network to learn a disentangled representation from the source and target view. In the second stage, the coarse output from the first stage is refined by adversarial training. Specifically, we introduce a masked version of the structural similarity loss that facilitates the network to focus on generating a higher quality view. Experiments on Market-1501 and DeepFashion demonstrate the effectiveness of the proposed generator.

Mohamed Ilyes Lakhal, Oswald Lanz, Andrea Cavallaro

### A Semi-supervised Data Augmentation Approach Using 3D Graphical Engines

Deep learning approaches have been rapidly adopted across a wide range of fields because of their accuracy and flexibility, but they require large labeled training datasets. This presents a fundamental problem for applications with limited, expensive, or private data (i.e. small data), such as human pose and behavior estimation/tracking, which could be highly personalized. In this paper, we present a semi-supervised data augmentation approach that can synthesize large-scale labeled training datasets using 3D graphical engines based on a physically valid, low-dimensional pose descriptor. To evaluate the performance of our synthesized datasets in training deep learning-based models, we generated a large synthetic human pose dataset, called ScanAva, using 3D scans of only 7 individuals based on our proposed augmentation approach. A state-of-the-art human pose estimation deep learning model was then trained from scratch on our ScanAva dataset and achieved a pose estimation accuracy of 91.2% at the PCK0.5 criterion after applying an efficient domain adaptation on the synthetic images. This accuracy is comparable to that of the same model trained on large-scale pose data from real humans, such as the MPII dataset, and much higher than that of the model trained on other synthetic human datasets such as SURREAL.
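
As a rough illustration of the PCK0.5 criterion mentioned above, the following sketch counts a predicted joint as correct when it lies within `alpha` times a reference length of the ground-truth joint. The function name `pck`, the choice of reference length, and the toy coordinates are assumptions made for illustration; the abstract does not specify the exact normalisation used.

```python
import math


def pck(pred, gt, ref_length, alpha=0.5):
    """Fraction of joints whose prediction is within alpha * ref_length of the ground truth.

    pred, gt: lists of (x, y) joint coordinates in the same order.
    ref_length: normalisation length (hypothetical here, e.g. a head or torso size).
    """
    threshold = alpha * ref_length
    correct = 0
    for (px, py), (gx, gy) in zip(pred, gt):
        # Euclidean distance between the predicted and the annotated joint.
        if math.hypot(px - gx, py - gy) <= threshold:
            correct += 1
    return correct / len(gt)


# Toy data (made up): four joints, reference length 10, so the threshold is 5 pixels.
toy_pred = [(0, 0), (10, 10), (20, 20), (30, 30)]
toy_gt = [(0, 1), (10, 16), (20, 20), (40, 30)]
toy_score = pck(toy_pred, toy_gt, ref_length=10, alpha=0.5)
```

In this toy case two of the four joints fall inside the threshold, so `toy_score` is 0.5.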

### Towards Learning a Realistic Rendering of Human Behavior

Realistic rendering of human behavior is of great interest for applications such as video animations, virtual reality and gaming engines. Commonly animations of persons performing actions are rendered by articulating explicit 3D models based on sequences of coarse body shape representations simulating a certain behavior. While the simulation of natural behavior can be efficiently learned, the corresponding 3D models are typically designed in manual, laborious processes or reconstructed from costly (multi-)sensor data. In this work, we present an approach towards a holistic learning framework for rendering human behavior in which all components are learned from easily available data. To enable control over the generated behavior, we utilize motion capture data and generate realistic motions based on user inputs. Alternatively, we can directly copy behavior from videos and learn a rendering of characters using RGB camera data only. Our experiments show that we can further improve data efficiency by training on multiple characters at the same time. Overall our approach shows a new path towards easily available, personalized avatar creation.

Patrick Esser, Johannes Haux, Timo Milbich, Björn Ommer

### Human Action Recognition Based on Temporal Pose CNN and Multi-dimensional Fusion

To take advantage of recent advances in human pose estimation from images, we develop a deep neural network model for action recognition from videos by computing temporal human pose features with a 3D CNN model. The proposed temporal pose features can provide more discriminative human action information than previous video features, such as appearance and short-term motion. In addition, we propose a novel fusion network that combines temporal pose, spatial and motion feature maps for the classification by bridging the dimension difference between 3D and 2D CNN feature maps. Experiments on the Sub-JHMDB and PennAction datasets show that the proposed action recognition system provides superior accuracy compared to previous methods.

Yi Huang, Shang-Hong Lai, Shao-Heng Tai

### Rendering Realistic Subject-Dependent Expression Images by Learning 3DMM Deformation Coefficients

Automatic analysis of facial expressions is now attracting an increasing interest, thanks to the many potential applications it can enable. However, collecting images with labeled expression for large sets of images or videos is a quite complicated operation that, in most of the cases, requires substantial human intervention. In this paper, we propose a solution that, starting from a neutral image of a subject, is capable of producing a realistic expressive face image of the same subject. This is possible thanks to the use of a particular 3D morphable model (3DMM) that can effectively and efficiently fit to 2D images, and then deform itself under the action of deformation parameters learned expression-by-expression in a subject-independent manner. Ultimately, the application of such deformation parameters to the neutral model of a subject allows the rendering of realistic expressive images of the subject. Experiments demonstrate that such deformation parameters can be learned from a small set of training data using simple statistical tools; despite this simplicity, very realistic subject-dependent expression renderings can be obtained. Furthermore, robustness to cross dataset tests is also evidenced.

Claudio Ferrari, Stefano Berretti, Pietro Pala, Alberto Del Bimbo

### Deep Multitask Gaze Estimation with a Constrained Landmark-Gaze Model

As an indicator of attention, gaze is an important cue for human behavior and social interaction analysis. Recent deep learning methods for gaze estimation rely on plain regression of the gaze from images without accounting for potential mismatches in eye image cropping and normalization. This may impact the estimation of the implicit relation between visual cues and the gaze direction when dealing with low resolution images or when training with a limited amount of data. In this paper, we propose a deep multitask framework for gaze estimation, with the following contributions. (i) We propose a multitask framework which relies on both synthetic data and real data for end-to-end training. During training, each dataset provides the label of only one task but the two tasks are combined in a constrained way. (ii) We introduce a Constrained Landmark-Gaze Model (CLGM) modeling the joint variation of eye landmark locations (including the iris center) and gaze directions. By explicitly relating visual information (landmarks) to the more abstract gaze values, we demonstrate that the estimator is more accurate and easier to learn. (iii) We decompose our deep network into a network jointly inferring the parameters of the CLGM model and the scale and translation parameters of eye regions on one hand, and a CLGM-based decoder deterministically inferring landmark positions and gaze from these parameters and head pose on the other hand. In this way, our framework decouples gaze estimation from irrelevant geometric variations in the eye image (scale, translation), resulting in a more robust model. Thorough experiments on public datasets demonstrate that our method achieves competitive results, improving over state-of-the-art results in challenging free head pose gaze estimation tasks and on eye landmark localization (iris location) ones.

Yu Yu, Gang Liu, Jean-Marc Odobez

### Photorealistic Facial Synthesis in the Dimensional Affect Space

This paper presents a novel approach for synthesizing facial affect, which is based on our annotating 600,000 frames of the 4DFAB database in terms of valence and arousal. The input of this approach is a pair of these emotional state descriptors and a neutral 2D image of a person to whom the corresponding affect will be synthesized. Given this target pair, a set of 3D facial meshes is selected, which is used to build a blendshape model and generate the new facial affect. To synthesize the affect on the 2D neutral image, 3DMM fitting is performed and the reconstructed face is deformed to generate the target facial expressions. Last, the new face is rendered into the original image. Both qualitative and quantitative experimental studies illustrate the generation of realistic images, when the neutral image is sampled from a variety of well known databases, such as the Aff-Wild, AFEW, Multi-PIE, AFEW-VA, BU-3DFE, Bosphorus.

Dimitrios Kollias, Shiyang Cheng, Maja Pantic, Stefanos Zafeiriou

### Generating Synthetic Video Sequences by Explicitly Modeling Object Motion

Recent GAN-based video generation approaches model videos as the combination of a time-independent scene component and a time-varying motion component, thus factorizing the generation problem into generating background and foreground separately. One of the main limitations of current approaches is that both factors are learned by mapping one source latent space to videos, which complicates the generation task as a single data point must be informative of both background and foreground content. In this paper we propose a GAN framework for video generation that, instead, employs two latent spaces in order to structure the generative process in a more natural way: (1) a latent space to generate the static visual content of a scene (background), which remains the same for the whole video, and (2) a latent space where motion is encoded as a trajectory between sampled points and whose dynamics are modeled through an RNN encoder (jointly trained with the generator and the discriminator) and then mapped by the generator to visual objects’ motion. Performance evaluation showed that our approach is able to control effectively the generation process as well as to synthesize more realistic videos than state-of-the-art methods.

S. Palazzo, C. Spampinato, P. D’Oro, D. Giordano, M. Shah

### A Semi-supervised Deep Generative Model for Human Body Analysis

Deep generative modelling for human body analysis is an emerging problem with many interesting applications. However, the latent space learned by such models is typically not interpretable, resulting in less flexible models. In this work, we adopt a structured semi-supervised approach and present a deep generative model for human body analysis where the body pose and the visual appearance are disentangled in the latent space. Such a disentanglement allows independent manipulation of pose and appearance, and hence enables applications such as pose-transfer without being explicitly trained for such a task. In addition, our setting allows for semi-supervised pose estimation, relaxing the need for labelled data. We demonstrate the capabilities of our generative model on the Human3.6M and on the DeepFashion datasets.

Rodrigo de Bem, Arnab Ghosh, Thalaiyasingam Ajanthan, Ondrej Miksik, N. Siddharth, Philip Torr

### Role of Group Level Affect to Find the Most Influential Person in Images

Group affect analysis is an important cue for predicting various group traits. Generally, the group affect, emotional responses, eye gaze and position of people in images are the important cues to identify an important person from a group of people. The main focus of this paper is to explore the importance of group affect in finding the representative of a group. We call that person the “Most Influential Person” (for the first impression) or “leader” of a group. In order to identify the main visual cues for “Most Influential Person”, we conducted a user survey. Based on the survey statistics, we annotate the “influential persons” in 1000 images of the Group AFfect database (GAF 2.0) via the LabelMe toolbox and propose the “GAF-personage database”. In order to identify the “Most Influential Person”, we propose a DNN-based Multiple Instance Learning (Deep MIL) method which takes deep facial features as input. To leverage the deep facial features, we first predict the individual emotion probabilities via CapsNet and rank the detected faces on that basis. Then, we extract deep facial features of the top-3 faces via a VGG-16 network. Our method performs better than maximum facial area and saliency-based importance methods and achieves the human-level perception of “Most Influential Person” at group-level.

Shreya Ghosh, Abhinav Dhall

### Residual Stacked RNNs for Action Recognition

Action recognition pipelines that use Recurrent Neural Networks (RNN) are currently 5–10% less accurate than Convolutional Neural Networks (CNN). While most works that use RNNs employ a 2D CNN on each frame to extract descriptors for action recognition, we extract spatiotemporal features from a 3D CNN and then learn the temporal relationship of these descriptors through a stacked residual recurrent neural network (Res-RNN). We introduce for the first time residual learning to counter the degradation problem in multi-layer RNNs, which have been successful for temporal aggregation in two-stream action recognition pipelines. Finally, we use a late fusion strategy to combine RGB and optical flow data of the two-stream Res-RNN. Experimental results show that the proposed pipeline achieves competitive results on UCF-101 and state-of-the-art results for RNN-like architectures on the challenging HMDB-51 dataset.

Mohamed Ilyes Lakhal, Albert Clapés, Sergio Escalera, Oswald Lanz, Andrea Cavallaro

### Semantically Selective Augmentation for Deep Compact Person Re-Identification

We present a deep person re-identification approach that combines semantically selective, deep data augmentation with clustering-based network compression to generate high performance, light and fast inference networks. In particular, we propose to augment limited training data via sampling from a deep convolutional generative adversarial network (DCGAN), whose discriminator is constrained by a semantic classifier to explicitly control the domain specificity of the generation process. Thereby, we encode information in the classifier network which can be utilized to steer adversarial synthesis, and which fuels our CondenseNet ID-network training. We provide a quantitative and qualitative analysis of the approach and its variants on a number of datasets, obtaining results that outperform the state-of-the-art on the LIMA dataset for long-term monitoring in indoor living spaces.

Víctor Ponce-López, Tilo Burghardt, Sion Hannunna, Dima Damen, Alessandro Masullo, Majid Mirmehdi

### Recognizing People in Blind Spots Based on Surrounding Behavior

Recent advances in computer vision have achieved remarkable performance improvements. These technologies mainly focus on recognition of visible targets. However, there are many invisible targets in blind spots in real situations. Humans may be able to recognize such invisible targets based on contexts (e.g. visible human behavior and environments) around the targets, and use such recognition to predict situations in blind spots on a daily basis. As the first step towards recognizing targets in blind spots captured in videos, we propose a convolutional neural network that recognizes whether or not there is a person in a blind spot. In experiments on the volleyball dataset, which includes various interactions of players, with artificial occlusions, our proposed method achieved 90.3% accuracy in the recognition.

Kensho Hara, Hirokatsu Kataoka, Masaki Inaba, Kenichi Narioka, Yutaka Satoh

### Visual Relationship Prediction via Label Clustering and Incorporation of Depth Information

In this paper, we investigate the use of an unsupervised label clustering technique and demonstrate that it enables substantial improvements in visual relationship prediction accuracy on the Person in Context (PIC) dataset. We propose to group object labels with similar patterns of relationship distribution in the dataset into fewer categories. Label clustering not only mitigates both the large classification space and class imbalance issues, but also potentially increases data samples for each clustered category. We further propose to incorporate depth information as an additional feature into the instance segmentation model. The additional depth prediction path supplements the relationship prediction model in a way that bounding boxes or segmentation masks are unable to deliver. We have rigorously evaluated the proposed techniques and performed various ablation analyses to validate their benefits.

Hsuan-Kung Yang, An-Chieh Cheng, Kuan-Wei Ho, Tsu-Jui Fu, Chun-Yi Lee

### Human-Centric Visual Relation Segmentation Using Mask R-CNN and VTransE

In this paper, we propose a novel human-centric visual relation segmentation method based on Mask R-CNN model and VTransE model. We first retain the Mask R-CNN model, and segment both human and object instances. Because Mask R-CNN may omit some human instances in instance segmentation, we further detect the omitted faces and extend them to localize the corresponding human instances. Finally, we retrain the last layer of VTransE model, and detect the visual relations between each pair of human instance and human/object instance. The experimental results show that our method obtains 0.4799, 0.4069, and 0.2681 on the criteria of R@100 with the m-IoU of 0.25, 0.50 and 0.75, respectively, which outperforms other methods in Person in Context Challenge.

Fan Yu, Xin Tan, Tongwei Ren, Gangshan Wu

### Learning Spatiotemporal 3D Convolution with Video Order Self-supervision

The purpose of this work is to explore a self-supervised learning (SSL) strategy to capture better features with spatiotemporal 3D convolution. Although spatiotemporal 3D CNNs are one of the next frontiers in video recognition, the convergence of 3D convolutions is difficult because of their enormous number of parameters or missing temporal (motion) features. One effective solution is to collect a $10^5$-order video database such as Kinetics or Moments in Time. However, this is not efficient because of the burden of manual annotation. In this paper, we train a 3D CNN on wrong video-sequence detection tasks in a self-supervised manner (without any manual annotation). The shuffling and verification of consecutive video-frame order is effective for the 3D CNN to capture temporal features and obtain a good starting point for the parameters to be fine-tuned. In the experimental section, we verify that pretraining our 3D CNN on wrong clip detection improves performance on UCF101 (+3.99% better than the baseline, namely training 3D convolution from scratch).

Tomoyuki Suzuki, Takahiro Itazuri, Kensho Hara, Hirokatsu Kataoka
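
The shuffle-and-verify idea in the abstract above can be sketched as a label generator: a clip is either kept in order (label 0) or shuffled (label 1), and a network would then be trained to tell the two apart. This is only a minimal sketch under stated assumptions. The function name `make_order_sample`, the parameter `p_wrong`, and the use of frame indices in place of images are hypothetical, and the authors' actual sampling scheme may differ.

```python
import random


def make_order_sample(clip, rng, p_wrong=0.5):
    """Return (frames, label): label 0 for the original order, 1 for a shuffled clip.

    clip: sequence of distinct frame identifiers (assumed distinct so that a
          shuffled clip can always differ from the original).
    rng:  a random.Random instance, passed in for reproducibility.
    """
    frames = list(clip)
    if len(frames) < 2 or rng.random() >= p_wrong:
        return frames, 0  # keep the correct temporal order
    shuffled = frames[:]
    # Reshuffle until the order actually changes, so label 1 is never wrong.
    while shuffled == frames:
        rng.shuffle(shuffled)
    return shuffled, 1


# Build a small batch of self-labelled samples from an 8-frame clip.
rng = random.Random(0)
batch = [make_order_sample(range(8), rng) for _ in range(50)]
```

No manual annotation is involved: the label follows entirely from whether the sketch shuffled the clip.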

### What Was Monet Seeing While Painting? Translating Artworks to Photo-Realistic Images

State-of-the-art Computer Vision techniques exploit the availability of large-scale datasets, most of which consist of images captured from the world as it is. This leads to an incompatibility between such methods and digital data from the artistic domain, on which current techniques under-perform. A possible solution is to reduce the domain shift at the pixel level, thus translating artistic images to realistic copies. In this paper, we present a model capable of translating paintings to photo-realistic images, trained without paired examples. The idea is to enforce a patch-level similarity between real and generated images, aiming to reproduce photo-realistic details from a memory bank of real images. This is subsequently adopted in the context of an unpaired image-to-image translation framework, mapping each image from one distribution to a new one belonging to the other distribution. Qualitative and quantitative results are presented on Monet, Cezanne and Van Gogh painting translation tasks, showing that our approach increases the realism of generated images with respect to the CycleGAN approach.

Matteo Tomei, Lorenzo Baraldi, Marcella Cornia, Rita Cucchiara

### Saliency-Driven Variational Retargeting for Historical Maps

We study the problem of georeferencing artistic historical maps. Since they were primarily conceived as works of art rather than as accurate cartographic tools, the common warping approaches implemented in Geographic Information Systems (GIS) usually lead to an overly stretched image in which the actual pictorial content (like written text, compass roses, buildings, etc.) is unnaturally deformed. On the other hand, domain transformation of images driven by the perceived salient visual content is a well-known topic known as “image retargeting”, which has been mostly limited to a change of scale of the image (i.e. changing the width and height) rather than a more general control-point-based warping. In this work we propose a variational image retargeting approach in which the local transformations are estimated to accommodate a set of control points instead of image boundaries. The direction and severity of warping are modulated by a novel tensor-based saliency formulation considering both the visual content and the shape of the underlying features to transform. The optimization includes a flow projection step based on isotonic regression to avoid singularities and flip-overs of the resulting distortion map.

Filippo Bergamasco, Arianna Traviglia, Andrea Torsello

### Deep Transfer Learning for Art Classification Problems

In this paper we investigate whether Deep Convolutional Neural Networks (DCNNs), which have obtained state-of-the-art results on the ImageNet challenge, are able to perform equally well on three different art classification problems. In particular, we assess whether it is beneficial to fine-tune the networks instead of just using them as off-the-shelf feature extractors for a separately trained softmax classifier. Our experiments show that the first approach yields significantly better results and allows the DCNNs to develop new selective attention mechanisms over the images, which provide powerful insights about which pixel regions allow the networks to successfully tackle the proposed classification challenges. Furthermore, we also show that DCNNs which have been fine-tuned on a large artistic collection outperform the same architectures pre-trained on the ImageNet dataset only, when it comes to the classification of heritage objects from a different dataset.

Matthia Sabatelli, Mike Kestemont, Walter Daelemans, Pierre Geurts

### Reflecting on How Artworks Are Processed and Analyzed by Computer Vision

The intersection between computer vision and art history has resulted in new ways of seeing, engaging and analyzing digital images. Innovative methods and tools have assisted with the evaluation of large datasets, performing tasks such as classification, object detection, image description and style transfer or assisting with a form and content analysis. At this point, in order to progress, past works and established practices must be revisited and evaluated on the ground of their usability for art history. This paper provides a reflection from an art historical perspective to point to erroneous assumptions and where improvements are still needed.

Sabine Lang, Björn Ommer

### Seeing the World Through Machinic Eyes: Reflections on Computer Vision in the Arts

Today, computer vision is broadly implemented and operates in the background of many systems. For users of these technologies, there is often no visual feedback, making it hard to understand the mechanisms that drive it. When computer vision is used to generate visual representations like Google Earth, it remains difficult to perceive the particular process and principles that went into its creation. This text examines computer vision as a medium and a system of representation by analyzing the work of design studio Onformative, designer Bernhard Hopfengärtner and artist Clement Valla. By using technical failures and employing computer vision in unforeseen ways, these artists and designers expose the differences between computer vision and human perception. Since computer vision is increasingly used to facilitate (visual) communication, artistic reflections like these help us understand the nature of computer vision and how it shapes our perception of the world.

Marijke Goeting

### A Digital Tool to Understand the Pictorial Procedures of 17th Century Realism

To unveil the mystery of the exquisitely rendered materials in Dutch 17th century paintings, we need to understand the pictorial procedures of this period. We focused on the Dutch master Jan de Heem, known for his highly convincing still-lifes. We reconstructed his systematic multi-layered approach to paint grapes, based on pigment distribution maps, layers stratigraphy, and a 17th century textual source. We digitised the layers reconstruction to access the temporal information of the painting procedure. We combined the layers via optical mixing into a digital tool that can be used to answer “what if” art historical questions about the painting composition, by editing the order, weight and colour of the layers.

Francesca Di Cicco, Lisa Wiersma, Maarten Wijntjes, Joris Dik, Jeroen Stumpel, Sylvia Pont

### How to Read Paintings: Semantic Art Understanding with Multi-modal Retrieval

Automatic art analysis has been mostly focused on classifying artworks into different artistic styles. However, understanding an artistic representation involves more complex processes, such as identifying the elements in the scene or recognizing author influences. We present SemArt, a multi-modal dataset for semantic art understanding. SemArt is a collection of fine-art painting images in which each image is associated with a number of attributes and a textual artistic comment, such as those that appear in art catalogues or museum collections. To evaluate semantic art understanding, we envisage the Text2Art challenge, a multi-modal retrieval task where relevant paintings are retrieved according to an artistic text, and vice versa. We also propose several models for encoding visual and textual artistic representations into a common semantic space. Our best approach is able to find the correct image within the top 10 ranked images in 45.5% of the test samples. Moreover, our models show remarkable levels of art understanding when compared against human evaluation.

Noa Garcia, George Vogiatzis
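
The top-10 figure quoted in the abstract above is a recall-at-K style retrieval metric. A minimal sketch of how such a score can be computed from a query-by-candidate similarity matrix follows. The function name `recall_at_k`, the convention that query `i` matches candidate `i`, and the toy scores are assumptions for illustration and are not taken from the paper.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose matching item is among the k highest-scoring candidates.

    similarity[i][j] is the score between query i (e.g. a text) and
    candidate j (e.g. a painting); the correct candidate for query i is
    assumed to be candidate i.
    """
    hits = 0
    for i, scores in enumerate(similarity):
        # Zero-based rank of the correct candidate: how many candidates score strictly higher.
        rank = sum(1 for s in scores if s > scores[i])
        if rank < k:
            hits += 1
    return hits / len(similarity)


# Toy example with made-up scores: 3 text queries against 3 paintings.
toy_scores = [
    [0.9, 0.1, 0.3],  # query 0: correct painting ranked first
    [0.8, 0.5, 0.2],  # query 1: correct painting ranked second
    [0.7, 0.6, 0.1],  # query 2: correct painting ranked third
]
toy_recall_at_2 = recall_at_k(toy_scores, 2)
```

With these toy scores, two of the three queries find their painting within the top 2, so `toy_recall_at_2` is 2/3.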

### Weakly Supervised Object Detection in Artworks

We propose a method for the weakly supervised detection of objects in paintings. At training time, only image-level annotations are needed. This, combined with the efficiency of our multiple-instance learning method, enables one to learn new classes on-the-fly from globally annotated databases, avoiding the tedious task of manually marking objects. We show on several databases that dropping the instance-level annotations only yields mild performance losses. We also introduce a new database, IconArt, on which we perform detection experiments on classes that could not be learned on photographs, such as Jesus Child or Saint Sebastian. To the best of our knowledge, these are the first experiments dealing with the automatic (and in our case weakly supervised) detection of iconographic elements in paintings. We believe that such a method is of great benefit for helping art historians to explore large digital databases.

Nicolas Gonthier, Yann Gousseau, Said Ladjal, Olivier Bonfait

### Images of Image Machines. Visual Interpretability in Computer Vision for Art

Despite the emergence of interpretable machine learning as a distinct area of research, the role and possible uses of interpretability in digital art history are still unclear. Focusing on feature visualization as the most common technical manifestation of visual interpretability, we argue that in computer vision for art visual interpretability is desirable, if not indispensable. We propose that feature visualization images can be a useful tool if they are used in a non-traditional way that embraces their peculiar representational status. Moreover, we suggest that exactly because of this peculiar representational status, feature visualization images themselves deserve more attention from the computer vision and digital art history communities.

Fabian Offert

### Backmatter
