
2019 | Book

Artificial Neural Networks and Machine Learning – ICANN 2019: Image Processing

28th International Conference on Artificial Neural Networks, Munich, Germany, September 17–19, 2019, Proceedings, Part III

Edited by: Igor V. Tetko, Dr. Věra Kůrková, Pavel Karpov, Prof. Fabian Theis

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

The proceedings set LNCS 11727, 11728, 11729, 11730, and 11731 constitute the proceedings of the 28th International Conference on Artificial Neural Networks, ICANN 2019, held in Munich, Germany, in September 2019. The total of 277 full papers and 43 short papers presented in these proceedings was carefully reviewed and selected from 494 submissions. They were organized in 5 volumes focusing on theoretical neural computation; deep learning; image processing; text and time series; and workshop and special sessions.

Table of Contents

Frontmatter

Image Denoising

Frontmatter
Unsharp Masking Layer: Injecting Prior Knowledge in Convolutional Networks for Image Classification

Image enhancement refers to the enrichment of certain image features such as edges, boundaries, or contrast. The main objective is to process the original image so that the overall performance of visualization, classification, and segmentation tasks is considerably improved. Traditional techniques require manual fine-tuning of the parameters to control enhancement behavior, and recent Convolutional Neural Network (CNN) approaches frequently employ these techniques as an enriched pre-processing step. In this work, we present the first intrinsic CNN pre-processing layer based on the well-known unsharp masking algorithm. The proposed layer injects prior knowledge about how to enhance the image, by adding high-frequency information to the input, to subsequently emphasize meaningful image features. The layer optimizes the unsharp masking parameters during model training, without any manual intervention. We evaluate the network performance and impact on two applications: CIFAR100 image classification and the PlantCLEF identification challenge. The results show a significant improvement over popular CNNs, yielding improvements of 9.49% and 2.42% for PlantCLEF and general-purpose CIFAR100, respectively. The unsharp enhancement layer plainly boosts accuracy with negligible performance cost on simple CNN models, as prior knowledge is directly injected to improve their robustness.
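As a rough illustration of the idea (not the authors' implementation), a trainable unsharp-masking layer can be written as a fixed Gaussian blur plus a learned sharpening amount; the PyTorch sketch below uses hypothetical parameter choices (kernel size, sigma, initial amount).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnsharpMaskLayer(nn.Module):
    """Illustrative sketch (not the authors' code): a trainable unsharp-masking
    pre-processing layer, output = x + amount * (x - blur(x)), where the
    sharpening `amount` is learned jointly with the rest of the CNN."""

    def __init__(self, channels, kernel_size=5, sigma=1.0):
        super().__init__()
        # Fixed Gaussian kernel, applied depthwise to every input channel.
        coords = torch.arange(kernel_size, dtype=torch.float32) - kernel_size // 2
        g = torch.exp(-coords ** 2 / (2 * sigma ** 2))
        kernel_2d = g[:, None] * g[None, :]
        kernel_2d = kernel_2d / kernel_2d.sum()
        weight = kernel_2d.view(1, 1, kernel_size, kernel_size).repeat(channels, 1, 1, 1)
        self.register_buffer("blur_weight", weight)
        self.channels = channels
        self.padding = kernel_size // 2
        # Learnable sharpening strength, optimized by backpropagation.
        self.amount = nn.Parameter(torch.tensor(1.0))

    def forward(self, x):
        blurred = F.conv2d(x, self.blur_weight, padding=self.padding, groups=self.channels)
        return x + self.amount * (x - blurred)
```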

Jose Carranza-Rojas, Saul Calderon-Ramirez, Adán Mora-Fallas, Michael Granados-Menani, Jordina Torrents-Barrena
Distortion Estimation Through Explicit Modeling of the Refractive Surface

Precise calibration is a must for highly reliable 3D computer vision algorithms. A challenging case is when the camera is behind a protective glass or transparent object: due to refraction, the image is heavily distorted; the pinhole camera model alone cannot be used and a distortion correction step is required. By directly modeling the geometry of the refractive media, we build the image generation process by tracing individual light rays from the camera to a target. Comparing the generated images to their distorted – observed – counterparts, we estimate the geometry parameters of the refractive surface via model inversion, employing an RBF neural network. We present an image collection methodology that produces data suited for finding the distortion parameters, test our algorithm on synthetic and real-world data, and analyze the results.

Szabolcs Pável, Csanád Sándor, Lehel Csató
Eye Movement-Based Analysis on Methodologies and Efficiency in the Process of Image Noise Evaluation

Noise level (image quality) evaluation is an important and popular topic in many applications. However, knowledge of how people visually explore distorted images when making decisions on noise evaluation is rather limited. In this paper, we conducted psychophysical eye-tracking studies to better understand the process of image noise evaluation. We identified two different types of methodologies in the evaluation process, speed-driven and accuracy-driven, in terms of both evaluation time and decision error. The speed-driven methodology, compared with the accuracy-driven one, uses less time to give evaluation results, with shorter fixation duration and stronger central bias. Furthermore, based on a temporal-spatial entropy analysis of the eye movement data, a quantitative measure is obtained that shows significant correlation with the decision-making efficiency of the evaluation process, characterized by evaluation time and decision error. As a result, the new measure may be used as a proxy for this decision-making efficiency.

Cheng Peng, Qing Xu, Yuejun Guo, Klaus Schoeffmann
IBDNet: Lightweight Network for On-orbit Image Blind Denoising

To reduce the data transmission pressure from the satellite to the ground, it is worthwhile to process images directly on the satellite. As a cornerstone of image processing, image denoising greatly improves image quality and benefits subsequent tasks. For on-orbit image denoising, we propose an end-to-end trainable image blind denoising network, named IBDNet. Unlike existing image denoising methods, which either have a large number of parameters or are unable to perform blind denoising, the proposed network is lightweight because residual bottleneck blocks form its main structure. Although our network does not use clean images for training, experimental results on public datasets indicate that the quality of images blindly denoised by our method is roughly on par with that of state-of-the-art denoisers. Furthermore, we deploy the model (only 513 KB) on the same equipment as used on a satellite, which verifies the feasibility of running it on the satellite.

Ling Li, Junxing Hu, Yijun Lin, Fengge Wu, Junsuo Zhao

Object Detection

Frontmatter
Aggregating Rich Deep Semantic Features for Fine-Grained Place Classification

This paper proposes a method that aggregates rich deep semantic features for fine-grained place classification. As is well known, the category of an image depends on its objects and text as well as on various semantic regions, hierarchical structure, and spatial layout. However, most recently designed fine-grained classification systems ignore this: the complex multi-level semantic structure of images associated with fine-grained classes has not yet been well explored. Therefore, our approach is composed of two modules: a Content Estimator (CNE) and a Context Estimator (CXE). CNE generates deep content features by encoding global visual cues of images. CXE obtains rich context features of images and consists of three child estimators: a Text Context Estimator (TCE), an Object Context Estimator (OCE), and a Scene Context Estimator (SCE). Given an input image, TCE encodes text cues to identify word-level semantic information, OCE extracts high-dimensional features and maps them to object semantic information, and SCE captures hierarchical structure and spatial layout information by recognizing scene cues. To aggregate rich deep semantic features, we fuse the information from CNE and CXE for fine-grained classification. To the best of our knowledge, this is the first work to leverage text information from an arbitrary-oriented scene text detector for extracting context information. Moreover, our method explores the fusion of semantic features and demonstrates that scene features provide information complementary to the other cues. The proposed approach achieves state-of-the-art performance on a fine-grained classification dataset, 84.3% on Con-Text.

Tingyu Wei, Wenxin Hu, Xingjiao Wu, Yingbin Zheng, Hao Ye, Jing Yang, Liang He
Improving Reliability of Object Detection for Lunar Craters Using Monte Carlo Dropout

In the task of detecting craters on the lunar surface, some craters are difficult to detect correctly, and a Deep Neural Network (DNN) by itself cannot represent the uncertainty of such detections. However, a measure of uncertainty can be expressed as the variance of the prediction by using Monte Carlo Dropout Sampling (MC Dropout). Although MC Dropout has often been applied to fully connected layers in recent studies, many convolutional layers are used to recognize the subtle features of a crater in the crater-detecting network. In this paper, we extend the application of MC Dropout to a network with a number of convolutional layers and also evaluate methodologies for dropping out the convolutional layers. As a result, in the convolutional neural network, we obtain a more accurate variance estimate by using filter-based dropout and evaluating the uncertainty for each feature map size. The precision of lunar crater detection was improved by 2.1% by rejecting predictions whose variance is high, compared with the variance observed when predicting the training data, as false positives.
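The core MC Dropout step is generic: keep dropout active at inference time, run several stochastic forward passes, and use the variance of the outputs as the uncertainty. The sketch below is an illustration of that idea in PyTorch, not the authors' code; the sigmoid output and threshold-based rejection are assumptions.

```python
import torch

def mc_dropout_predict(model, x, n_samples=20):
    """Illustrative sketch (not the authors' code): keep dropout active at test
    time and treat the variance over stochastic forward passes as uncertainty."""
    model.train()  # keeps dropout stochastic; note this also affects batch-norm layers
    with torch.no_grad():
        preds = torch.stack([torch.sigmoid(model(x)) for _ in range(n_samples)])
    mean = preds.mean(dim=0)  # averaged detection map
    var = preds.var(dim=0)    # per-pixel uncertainty estimate
    return mean, var

# Detections whose variance exceeds a chosen threshold can be rejected as
# likely false positives, as described in the abstract.
```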

Tomoyuki Myojin, Shintaro Hashimoto, Kenji Mori, Keisuke Sugawara, Naoki Ishihama
An Improved Convolutional Neural Network for Steganalysis in the Scenario of Reuse of the Stego-Key

The topic of this paper is the use of deep learning techniques, more specifically convolutional neural networks, for steganalysis of digital images. The steganalysis scenario of repeated use of the stego-key is considered. First, a study of the influence of the depth and width of the convolutional layers on classification effectiveness was conducted. Next, the influence of the depth and width of the fully connected layers on classification effectiveness was studied. Based on the conclusions of these studies, an improved convolutional neural network was created, which achieves state-of-the-art classification accuracy while containing 20 times fewer parameters to learn during training. A smaller number of learnable parameters results in faster network learning, easier convergence, and smaller memory and computing power requirements. The paper describes the current state of the art, the experimental environment, the structures of the studied networks, and the classification accuracy results.

Bartosz Czaplewski
A New Learning-Based One Shot Detection Framework for Natural Images

Existing object detection methods based on deep learning usually need vast amounts of training data and do not handle unseen object classes well. In this paper, we propose a new framework that applies one-shot learning to object detection. During training, the network learns from known object classes the ability to compare the similarity of two image parts. For an image of a new category, selective search first generates proposals. Then a comparison based on traditional features is used to screen out inaccurate proposals. Next, our deep learning model extracts features and measures similarity through feature fusion (i.e., concatenating the channels of two feature maps). After these steps, we obtain a temporary result. Based on this result and proposals related to it, we refine the proposals through their intersection. We then conduct a second round of detection with the new proposals to improve accuracy. Experiments on different datasets demonstrate that our method is effective and exhibits a certain transferability.
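The feature fusion mentioned in the abstract (channel concatenation of two feature maps followed by a similarity score) could look roughly like the sketch below; the layer sizes and head structure are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SimilarityHead(nn.Module):
    """Sketch of channel-wise feature fusion for one-shot comparison: features of
    the query image and of a proposal are concatenated along the channel axis and
    a small head scores their similarity. Layer sizes are illustrative only."""

    def __init__(self, feat_dim=256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(2 * feat_dim, feat_dim, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(feat_dim, 1),
        )

    def forward(self, query_feat, proposal_feat):
        fused = torch.cat([query_feat, proposal_feat], dim=1)  # channel concatenation
        return torch.sigmoid(self.head(fused))                 # similarity in [0, 1]
```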

Sen Na, Ruoyu Yan
Dense Receptive Field Network: A Backbone Network for Object Detection

Although training object detectors with ImageNet pre-trained models is very common, models designed for classification are not well suited to detection tasks. Designing a dedicated backbone network for detection is therefore one of the best solutions. In this paper, a backbone network named Dense Receptive Field Network (DRFNet) is proposed for object detection. DRFNet is based on Darknet-60 (our modified version of Darknet-53) and contains a novel architecture named the Dense Receptive Field Block (DenseRFB) module. DenseRFB is a densely connected variant of RFB and forms much denser effective receptive fields, which greatly improves the feature representation of DRFNet while keeping it fast. The proposed DRFNet is first tested with ScratchDet for fast evaluation. Moreover, as a model pre-trained on ImageNet, DRFNet is also tested with SSD. All experiments show that DRFNet is an effective and efficient backbone network for object detection.

Fei Gao, Chengguang Yang, Yisu Ge, Shufang Lu, Qike Shao
Referring Expression Comprehension via Co-attention and Visual Context

As a research hotspot of multimodal media analysis, referring expression comprehension locates the referred object region in an image from a natural language expression. Since the localization accuracy for similar objects is often affected by the presence or absence of supporting objects in the referring expression, we propose a referring expression comprehension method based on co-attention and visual context. When the referring expression lacks supporting objects, co-attention enhances the attention on attributes for the subject module. When supporting objects are present, we introduce visual context to explore the latent link between the candidate object and its supporters. Experiments on three datasets, RefCOCO, RefCOCO+, and RefCOCOg, show that our approach outperforms published approaches by a considerable margin.

Youming Gao, Yi Ji, Ting Xu, Yunlong Xu, Chunping Liu
Comparison Between U-Net and U-ReNet Models in OCR Tasks

The goal of this paper is to explore the benefits of using RNNs instead of CNNs for image transformation tasks. We are interested in two models for image transformation: U-Net (based on CNNs) and U-ReNet (partially based on CNNs and RNNs). In this work, we propose a novel U-ReNet which is almost entirely RNN based. We compare U-Net, U-ReNet (partially RNN), and our U-ReNet (almost entirely RNN based) on two datasets based on MNIST. The task is to transform text lines of overlapping digits into text lines of separated digits. Our model reaches the best performance on one dataset and comparable results on the other. Additionally, the proposed U-ReNet with RNN upsampling has fewer parameters than U-Net and is more robust to translation transformations.

Brian B. Moser, Federico Raue, Jörn Hees, Andreas Dengel
Severe Convective Weather Classification in Remote Sensing Images by Semantic Segmentation

Severe convective weather is a type of catastrophic weather that can cause great harm to the public. Recognizing severe convective weather accurately and effectively is a key task for meteorological practitioners and an important issue in government climate risk management. However, most existing methods extract features from satellite data by classifying individual pixels instead of using tightly integrated spatial information, ignoring the fact that clouds are highly dynamic. In this paper, we propose a new classification model based on deep-learning image segmentation, which uses the U-Net architecture to accurately identify all weather conditions in the datasets. Heavy rainfall is one of the most frequent and widespread severe weather hazards: when storms come ashore with high wind speeds, precipitation lasts longer and causes serious damage. Therefore, we suggest a new evaluation metric to assess the performance of detecting heavy rainfall. Compared with existing methods, the model trained on the Himawari-8 dataset shows better performance. Further, we explore the representations learned by our model in order to better understand this important dataset. The results can play a crucial role in the prediction of climate change risks and the formulation of government policies on climate change.

Ming Yuan, Zhilei Chai, Wenlai Zhao
Action Recognition Based on Divide-and-Conquer

Recently, deep convolutional neural networks have made great breakthroughs in the field of action recognition. Since sequential video frames contain a lot of redundant information, sparse sampling networks can achieve good results compared with dense sampling. Because sparse sampling limits access to information, this paper mainly discusses how to further improve the learning ability of models based on sparse sampling. We propose a model based on divide-and-conquer, which uses a threshold α to determine whether action data require sparse sampling or dense local sampling for learning. Finally, our approach obtains state-of-the-art performance on the HMDB51 (72.4%) and UCF101 (95.3%) datasets.

Guanghua Tan, Rui Miao, Yi Xiao
An Adaptive Feature Channel Weighting Scheme for Correlation Tracking

Most Discriminative Correlation Filter (DCF) based trackers use a fixed weight for each feature channel for all incoming frames. However, in our experiments we find that different features have pros and cons under different scenarios. In this paper, we propose to couple the response of a DCF based tracker with the weights of different feature channels to strengthen their positive effects while simultaneously weakening their negative effects. This coupling is achieved by an adaptive feature channel weighting scheme. Tracking is formulated as a two-stage optimization problem: the tracker is learned using the alternating direction method of multipliers (ADMM), and the weights of the feature channels are adaptively adjusted by a least-squares estimation. We integrate the adaptive feature channel weighting scheme into two state-of-the-art handcrafted DCF based trackers and evaluate them on two benchmarks, OTB2013 and VOT2016. The experimental results demonstrate its accuracy and efficiency compared with several state-of-the-art handcrafted DCF based trackers.

Zhen Zhang, Chao Wang, Xiqun Lu
In-Silico Staining from Bright-Field and Fluorescent Images Using Deep Learning

Fluorescent markers are commonly used to characterize single cells and to uncover molecular properties. Unfortunately, fluorescent staining is laborious and costly, damages tissue, and suffers from inconsistencies. Recently, deep learning approaches have been successfully applied to predict fluorescent markers from bright-field images [1–3]. These approaches can save costs and time and speed up the classification of tissue properties. However, it is currently not clear how different image channels can be meaningfully combined to improve prediction accuracy. Thus, we investigated the benefits of multi-channel input for predicting a specific transcription factor antibody staining. Our image dataset consists of three channels: bright-field, fluorescent GFP reporter, and transcription factor antibody staining. Fluorescent GFP is constantly expressed in the genetically modified cells from a particular differentiation step onwards. The cells are additionally stained with a specific transcription factor antibody that marks a subtype of GFP-positive cells. For data acquisition we used a Leica SP8 and a Zeiss LSM780 microscope with 20x objectives. We trained a deep neural network, a modified U-Net [4], to predict the transcription factor antibody staining from bright-field and GFP channels. To this end, we trained on 2432 three-dimensional images containing roughly 7600 single cells and compared the prediction accuracy for the transcription factor antibody staining using bright-field only, GFP only, and both channels together on a test set of 576 images with approximately 1800 single cells. The same training and test set was used for all experiments (Fig. 1). The prediction error, measured as the mean relative pixel-wise error over the test set, was 61% for prediction from bright-field, 55% for prediction from GFP, and 51% for prediction from both bright-field and GFP images. The median pixel-wise Pearson correlation coefficient increases from 0.12 for prediction from bright-field channels to 0.17 for prediction from GFP channels, and to 0.31 for prediction from bright-field and GFP channels (Fig. 2). Our work demonstrates that prediction performance can be increased by combining multiple channels for in-silico prediction of stainings. We anticipate this research to be a starting point for further investigations of which stainings can be predicted from other stainings using deep learning. These approaches bear huge potential for saving laborious and costly work for researchers and clinical technicians, and could reveal biological relationships between fluorescent markers.

Dominik Jens Elias Waibel, Ulf Tiemann, Valerio Lupperger, Henrik Semb, Carsten Marr

Image Segmentation

Frontmatter
A Lightweight Neural Network for Hard Exudate Segmentation of Fundus Image

The fundus image is an important indicator for diagnosing diabetic retinopathy (DR), a leading cause of blindness in adults. Moreover, hard exudate, a particular kind of lesion in the fundus image, serves as a basis for evaluating the severity level of DR. Therefore, it is crucial to segment hard exudates accurately. However, the segmentation results of existing deep learning-based methods are rather coarse due to successive pooling operations. In this paper, we propose a lightweight segmentation network with only one down-sampling operation and 1.9M parameters. Further, we propose a two-stage training algorithm to train our network from scratch. We conduct experiments on the IDRiD and e-ophtha EX datasets. Experimental results show that our network achieves superior performance with the fewest parameters and the fastest speed compared with baseline methods on the IDRiD dataset. Specifically, with 1/20 of the parameters and 1/3 of the inference time, our method is over 10% higher than DeepLab v3+ in terms of F1-score on the IDRiD dataset. The source code of LWENet is available at https://github.com/guomugong/LWENet .

Song Guo, Tao Li, Kai Wang, Chan Zhang, Hong Kang
Attentional Residual Dense Factorized Network for Real-Time Semantic Segmentation

Semantic segmentation is a pixel-level dense image labeling task and plays a core role in autonomous driving. In this regard, how to balance precision and speed is a frequently studied issue. In this paper, we propose an attentional residual dense factorized network (AttRDFNet) to address this issue. Specifically, we design a residual dense factorized convolution block (RDFB), which reaps the benefits of low-level and high-level layer-wise features through dense connections to boost segmentation precision, whilst enjoying efficient computation by factorizing a large convolution kernel into the product of two smaller kernels. This reduces the computational burden and makes real-time operation possible. To further leverage layer-wise features, we explore graininess-aware channel and spatial attention modules to model different levels of salient features of interest. As a result, AttRDFNet can run on inputs of resolution 512 × 1024 at 55.6 frames per second on a single Titan X GPU with a solid 68.5% mean IoU on the Cityscapes test set. Experiments on the Cityscapes dataset show that AttRDFNet offers real-time inference whilst achieving competitive precision against well-behaved counterparts.
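The factorization mentioned in the abstract, replacing a large square kernel with two smaller one-dimensional kernels, is a standard trick; the block below is a minimal sketch of it (the class name, activation choice, and dilation handling are assumptions, not the RDFB design).

```python
import torch.nn as nn

class FactorizedConv(nn.Module):
    """Sketch of the kernel factorization mentioned above (names hypothetical):
    a k x k convolution is replaced by a k x 1 followed by a 1 x k convolution,
    reducing the per-channel cost from k*k to 2*k weights."""

    def __init__(self, channels, k=3, dilation=1):
        super().__init__()
        p = (k // 2) * dilation
        self.conv_v = nn.Conv2d(channels, channels, (k, 1), padding=(p, 0), dilation=(dilation, 1))
        self.conv_h = nn.Conv2d(channels, channels, (1, k), padding=(0, p), dilation=(1, dilation))
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.conv_h(self.act(self.conv_v(x))))
```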

Lulu Yang, Long Lan, Xiang Zhang, Xuhui Huang, Zhigang Luo
Random Drop Loss for Tiny Object Segmentation: Application to Lesion Segmentation in Fundus Images

Convolutional neural networks (CNNs) have achieved state-of-the-art performance in computer vision tasks. The segmentation of dense objects has been fully studied, but research on tiny object segmentation, which is very common in medical images, is insufficient. For instance, the proportion of lesions or tumors can be as low as 0.1%, which can easily lead to misclassification. In this paper, we propose a random drop loss function to improve the segmentation of tiny lesions in medical image analysis by dropping negative samples randomly according to their classification difficulty. In addition, we design three drop functions to map the classification difficulty to a drop probability, following the principle that easy negative samples are dropped with high probability and hard samples are retained with high probability. In this manner, not only can the sorting step required by the Top-k BCE loss be avoided, but the CNN can also learn more discriminative features, thereby reducing misclassification. We evaluated our method on the segmentation of microaneurysms and hemorrhages in color fundus images. Experimental results show that our method outperforms other methods in terms of segmentation performance and computational cost. The source code of our method is available at https://github.com/guomugong/randomdrop .
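To make the idea concrete, the sketch below shows one plausible form of such a loss: a per-pixel binary cross-entropy where easy negatives are dropped with high probability. The linear difficulty-to-keep-probability mapping is only an illustrative assumption; the paper studies three different drop functions.

```python
import torch
import torch.nn.functional as F

def random_drop_bce(logits, targets):
    """Hedged sketch of a random-drop loss: negatives are dropped at random, with
    easy negatives (low predicted foreground probability) dropped more often and
    hard negatives kept more often. The linear mapping from difficulty to keep
    probability is only one of several possible drop functions."""
    prob = torch.sigmoid(logits)
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # Keep every positive pixel; keep a negative with probability equal to its
    # predicted foreground probability (hard negatives are rarely dropped).
    keep_prob = torch.where(targets > 0.5, torch.ones_like(prob), prob)
    mask = (torch.rand_like(prob) < keep_prob).float()
    return (bce * mask).sum() / mask.sum().clamp(min=1.0)
```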

Song Guo, Tao Li, Chan Zhang, Ning Li, Hong Kang, Kai Wang
Flow2Seg: Motion-Aided Semantic Segmentation

Motion is an important cue for segmentation. In this paper, we leverage motion information, densely represented by optical flow, to assist the semantic segmentation task. Specifically, our framework takes both an image and its optical flow as input, where the image goes through a state-of-the-art deep network and the optical flow goes through a relatively shallow network, and results from both paths are fused in a residual manner. Unlike the image, optical flow is only weakly related to semantics but can separate different objects according to motion consistency, which motivates us to use a relatively shallow network to process optical flow to avoid overfitting and preserve spatial information. In our experiments on Cityscapes, we find that optical flow improves image-based segmentation on object boundaries, especially on small thin objects. Aided by motion, we achieve results comparable with state-of-the-art methods.

Xiangtai Li, Jiangang Bai, Kuiyuan Yang, Yunhai Tong
COCO_TS Dataset: Pixel–Level Annotations Based on Weak Supervision for Scene Text Segmentation

The absence of large scale datasets with pixel–level supervisions is a significant obstacle for the training of deep convolutional networks for scene text segmentation. For this reason, synthetic data generation is normally employed to enlarge the training dataset. Nonetheless, synthetic data cannot reproduce the complexity and variability of natural images. In this paper, a weakly supervised learning approach is used to reduce the shift between training on real and synthetic data. Pixel–level supervisions are generated for a text detection dataset (i.e. one where only bounding–box annotations are available). In particular, the COCO–Text–Segmentation (COCO_TS) dataset, which provides pixel–level supervisions for the COCO–Text dataset, is created and released. The generated annotations are used to train a deep convolutional neural network for semantic segmentation. Experiments show that the proposed dataset can be used instead of synthetic data, allowing us to use only a fraction of the training samples while significantly improving performance.

Simone Bonechi, Paolo Andreini, Monica Bianchini, Franco Scarselli

Occluded Object Recognition

Frontmatter
Learning Deep Structured Multi-scale Features for Crisp and Object Occlusion Edge Detection

A key challenge for edge detection is that predicted edges are thick and need Non-Maximum Suppression as post-processing to obtain crisp edges. In addition, object occlusion edge detection is an important research problem in computer vision. To effectively increase the crispness and the accuracy of occlusion relationships, we propose a novel edge detection method called MSDF (Multi Scale Decode and Fusion), based on deep structured multi-scale features, to generate crisp salient edges. The decoder layer of MSDF fuses adjacent-scale features and increases the affinity between the features. We also propose a novel loss function to address the class imbalance in object occlusion edge detection, and a two-stream learning framework to predict edge and occlusion orientation. Extensive experiments on the BSDS500 dataset and the larger NYUD dataset show the effectiveness of the proposed model and of the overall hierarchical framework. We also surpass the state of the art on the BSDS ownership dataset in occlusion edge detection.

Zihao Dong, Ruixun Zhang, Xiuli Shao
Graph-Boosted Attentive Network for Semantic Body Parsing

Human body parsing remains a challenging problem in natural scenes due to multi-instance and inter-part semantic confusions as well as occlusions. This paper proposes a novel approach to decomposing multiple human bodies into semantic part regions in unconstrained environments. Specifically, we propose a convolutional neural network (CNN) architecture which comprises novel semantic and contour attention mechanisms across the feature hierarchy to resolve the semantic ambiguities and boundary localization issues related to semantic body parsing. We further propose to encode the estimated pose as higher-level contextual information, which is combined with local semantic cues in a novel graphical model in a principled manner. In this model, lower-level semantic cues can be recursively updated by propagating higher-level contextual information from the estimated pose, and vice versa, across the graph, so as to alleviate erroneous pose information and pixel-level predictions. We further propose an optimization technique to efficiently derive the solutions. Our proposed method achieves state-of-the-art results on the challenging Pascal Person-Part dataset.

Tinghuai Wang, Huiling Wang
A Global-Local Architecture Constrained by Multiple Attributes for Person Re-identification

Person re-identification (person re-ID) is often considered a sub-problem of image retrieval, which aims to match pedestrians across non-overlapping cameras. In this work, we present a novel global and local network structure integrating pedestrian identities with multiple attributes to improve the performance of person re-ID. The proposed framework consists of three modules: a shared module, a global module, and a local module. The shared module, based on a pre-trained residual network, extracts low-level and mid-level features, while the global module, guided by an identification loss, learns high-level semantic feature representations. To achieve accurate localization of local attribute features, we propose a multi-attribute partitioning learning method and use pedestrian attributes as supervised information for the local module. Meanwhile, we employ whole-to-part spatial transformer networks (STNs) to obtain coarse-to-fine meaningful feature locations. Applying a multi-task learning strategy, we design objective functions including identification and multiple attribute classification losses for training our model. Experimental results on several challenging datasets show that our method significantly improves person re-ID performance and surpasses most state-of-the-art methods. Specifically, our model achieves 87.49% attribute recognition accuracy on the Market1501 dataset.

Chao Liu, Hongyang Quan
Recurrent Connections Aid Occluded Object Recognition by Discounting Occluders

Recurrent connections in the visual cortex are thought to aid object recognition when part of the stimulus is occluded. Here we investigate if and how recurrent connections in artificial neural networks similarly aid object recognition. We systematically test and compare architectures comprised of bottom-up (B), lateral (L) and top-down (T) connections. Performance is evaluated on a novel stereoscopic occluded object recognition dataset. The task consists of recognizing one target digit occluded by multiple occluder digits in a pseudo-3D environment. We find that recurrent models perform significantly better than their feedforward counterparts, which were matched in parametric complexity. Furthermore, we analyze how the network’s representation of the stimuli evolves over time due to recurrent connections. We show that the recurrent connections tend to move the network’s representation of an occluded digit towards its un-occluded version. Our results suggest that both the brain and artificial neural networks can exploit recurrent connectivity to aid occluded object recognition.

Markus Roland Ernst, Jochen Triesch, Thomas Burwick
Learning Relational-Structural Networks for Robust Face Alignment

Unconstrained face alignment usually involves extreme deformations and severe occlusions, which likely give rise to biased shape predictions. Most existing methods exploit shape structure simply by directly concatenating all landmarks, which loses facial details in regions of extreme deformation. In this paper, we propose a relational-structural network (RSN) approach to learn both local and global feature representations for robust face alignment. To achieve this goal, we build a structural branch network to disentangle the local geometric relationships among neighboring facial sub-regions. Moreover, we develop a reinforcement learning approach to reason about the robust iterative process. Our RSN generates three candidate shapes; a Q-net then evaluates the three candidate shapes with a reward function and selects the best shape to re-initialize the network input, alleviating the local optimization problem of cascade regression methods. Experimental results indicate that our approach consistently outperforms most state-of-the-art methods on widely evaluated challenging datasets.

Congcong Zhu, Xing Wang, Suping Wu, Zhenhua Yu

Gesture Recognition

Frontmatter
An Efficient 3D-NAS Method for Video-Based Gesture Recognition

The 3D convolutional neural network (3DCNN) is a powerful and effective model for exploiting spatial-temporal features, especially for gesture recognition. Unfortunately, 3DCNNs have so many parameters that many researchers choose 2DCNNs or hybrid models instead, but these models are designed manually. In this paper, we propose a framework to automatically construct a model based on 3DCNN via network architecture search (NAS) [1]. In our method, called 3DNAS, a 3D teacher network is first trained from scratch as a pre-trained model to accelerate the convergence of the child networks. Then a series of child networks with various architectures is generated randomly, and each is trained under the direction of the converted teacher model. Finally, the controller predicts a network architecture according to the rewards of all the child networks. We evaluate our method on a video-based gesture recognition dataset, the 20BN-Jester dataset v1 [2], and the results show that our approach is superior to prior methods in both efficiency and accuracy.

Ziheng Guo, Yang Chen, Wei Huang, Junhao Zhang
Robustness of Deep LSTM Networks in Freehand Gesture Recognition

We present an analysis of the robustness of deep LSTM networks for freehand gesture recognition against temporal shifts of the performed gesture w.r.t. the “temporal receptive field”. Such shifts inevitably occur when not only the gesture type but also its onset needs to be determined from sensor data, and it is imperative that recognizers be as invariant as possible to this effect which we term gesture onset variability. Based on a real-world hand gesture classification task we find that LSTM networks are very sensitive to this type of variability, which we confirm by creating a synthetic sequence classification task of similar dimensionality. Lastly, we show that including gesture onset variability in the training data by a simple data augmentation strategy leads to a high robustness against all tested effects, so we conclude that LSTM networks can be considered good candidates for real-time and real-world gesture recognition.
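The data augmentation strategy described above amounts to randomly shifting the gesture inside its temporal window. A minimal NumPy sketch of such an augmentation is shown below; the edge-frame padding and the function interface are assumptions made for illustration, not the paper's implementation.

```python
import numpy as np

def random_onset_shift(sequence, max_shift):
    """Hypothetical sketch of an onset-variability augmentation: the gesture
    sequence (frames x features) is shifted by a random temporal offset,
    padding with edge frames so the sequence length is preserved."""
    shift = np.random.randint(-max_shift, max_shift + 1)
    if shift == 0:
        return sequence
    if shift > 0:  # delay the onset: pad at the start, drop frames at the end
        pad = np.repeat(sequence[:1], shift, axis=0)
        return np.concatenate([pad, sequence[:-shift]], axis=0)
    # advance the onset: drop frames at the start, pad at the end
    pad = np.repeat(sequence[-1:], -shift, axis=0)
    return np.concatenate([sequence[-shift:], pad], axis=0)
```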

Monika Schak, Alexander Gepperth

Saliency Detection

Frontmatter
Delving into the Impact of Saliency Detector: A GeminiNet for Accurate Saliency Detection

Although many saliency detection methods based on CNNs have shown impressive performance, we observe that these methods adopt single-scale convolutional layers as saliency detectors after feature extraction to predict saliency maps, which causes serious missed detections, especially for targets with small scales, irregular shapes, and sporadic locations in complex multi-target scenes. In addition, the edges of salient objects predicted by these methods are often confused with the background, making these regions very blurred. To deal with these issues, we delved into the impact of diverse unified detectors based on convolutional layers and of nearest neighbor optimization on saliency detection. We found that (1) the flattened design contributes to improved accuracy, but due to the inherent characteristics of convolutional layers, it is not an effective way to solve the problems; (2) nearest neighbor optimization helps to remove background regions from salient objects and restore the missing sections while refining their boundaries, yielding a more reliable final prediction. Building on these findings, we built a GeminiNet for accurate saliency detection. Quantitative and qualitative experiments on six benchmark datasets demonstrate that our proposed GeminiNet performs favorably against state-of-the-art methods under different evaluation metrics.

Tao Zheng, Bo Li, Delu Zeng, Zhiheng Zhou
FCN Salient Object Detection Using Region Cropping

An important issue in salient object detection is how to improve the resulting saliency map, since it is the basis of many subsequent operations in computer vision. In this paper, we propose a region-based salient object detection model that combines a fully convolutional neural network (FCN) with a traditional visual saliency method. We introduce region cropping and jumping operations into the FCN for more target-oriented feature extraction, which is a low-level cue based processing. It processes the training images into patches of various sizes and makes these patches jump to convolutional layers of corresponding depths as their input data during training. This operation preserves the main structure of objects while decreasing background redundancy. It also takes topological properties into account, emphasizing the topological integrity of objects. Experimental results on four datasets show that the proposed model performs effectively on salient object detection compared with ten other approaches, including state-of-the-art ones.

Yikai Hua, Xiaodong Gu
Object-Level Salience Detection by Progressively Enhanced Network

Saliency detection plays an important role in computer vision. However, most previous works focus on detecting salient regions rather than salient objects, although the latter is more reasonable in many practical applications. In this paper, a framework is proposed for detecting the salient objects in input images. The framework is composed of two main components: (1) a progressively enhanced network (PEN) for amplifying the specified layers of the network and merging the global context simultaneously; (2) an object-level boundary extraction module (OBEM) for extracting the complete boundary of the salient object. Experiments and comparisons show that the proposed framework achieves state-of-the-art results. Especially on many challenging datasets, our method performs much better than other methods.

Wang Yuan, Haichuan Song, Xin Tan, Chengwei Chen, Shouhong Ding, Lizhuang Ma

Perception

Frontmatter
Action Unit Assisted Facial Expression Recognition

Facial expression recognition is vital to many intelligent applications such as human-computer interaction and social networks. For machines, learning to classify the six basic human expressions (anger, disgust, fear, happiness, sadness, and surprise) is still a big challenge. This paper proposes a convolutional neural network based on AlexNet combined with a Bayesian network. Besides traditional features, the relationships between facial action units (AUs) and expressions are captured. First, a convolutional neural network is constructed to extract features from images. Then, a Bayesian network is established to learn the dependencies between AUs and expressions from joint and conditional probabilities. Finally, ensemble learning is used to combine the features of expressions, AUs, and the dependencies between the two. Our experiments on popular datasets show that the proposed method performs well compared with recent approaches.

Fangjun Wang, Liping Shen
Discriminative Feature Learning Using Two-Stage Training Strategy for Facial Expression Recognition

Although deep convolutional neural networks (CNNs) have achieved state-of-the-art results for facial expression recognition (FER), FER remains challenging due to two issues: class imbalance and hard expression examples. However, most existing FER methods train CNN models with the cross-entropy (CE) loss in a single stage, which has limited ability to deal with these problems because each expression example is assigned an equal loss weight. Inspired by the recently proposed focal loss, which reduces the relative loss for well-classified expression examples and pays more attention to misclassified ones, we can mitigate these problems by introducing the focal loss into an existing FER system when facing imbalanced data or hard expression examples. Considering that the focal loss allows the network to further extract discriminative features based on the already learned feature-separating capability, we present a two-stage training strategy that uses the CE loss in the first stage and the focal loss in the second stage to boost FER performance. Extensive experiments have been conducted on two well-known FER datasets, CK+ and Oulu-CASIA. We obtain improvements over the common one-stage training strategy and achieve state-of-the-art results on these datasets in terms of average classification accuracy, which demonstrates the effectiveness of our proposed two-stage training strategy.
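For reference, the focal loss used in the second training stage has the standard form below; this is a generic multi-class sketch (gamma value and the two-stage schedule comment are illustrative assumptions).

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Sketch of the focal loss used in the second stage: the cross-entropy term
    is scaled by (1 - p_t)^gamma, so well-classified expression examples are
    down-weighted and hard examples dominate the gradient."""
    log_p = F.log_softmax(logits, dim=1)
    ce = F.nll_loss(log_p, targets, reduction="none")
    p_t = log_p.gather(1, targets.unsqueeze(1)).squeeze(1).exp()
    return ((1.0 - p_t) ** gamma * ce).mean()

# Two-stage strategy (schematic): train the model with F.cross_entropy first,
# then continue training the same model with focal_loss.
```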

Lei Gan, Yuexian Zou, Can Zhang
Action Units Classification Using ClusWiSARD

This paper presents the use of the WiSARD and ClusWiSARD weightless neural network models for classifying the contraction and extension of Action Units, the facial muscles involved in emotive expressions. This is a complex problem due to the large number of very similar classes and because it is a multi-label classification task, where the positive expression of one class can modify the response of the others. WiSARD and ClusWiSARD solutions are proposed and validated using the CK+ dataset, producing responses with an accuracy of 89.66%. Some of the major works in the field are cited here, but a proper comparison is not possible due to a lack of information about such solutions, such as the subset of classes used and the training/testing time. The contributions of this paper are the pioneering use of weightless neural networks in an AU classification task, the novel application of the WiSARD and ClusWiSARD models to multi-label tasks, and the new unsupervised expansion of ClusWiSARD proposed here.

Leopoldo A. D. Lusquino Filho, Gabriel P. Guarisa, Luiz F. R. Oliveira, Aluizio Lima Filho, Felipe M. G. França, Priscila M. V. Lima
Automatic Estimation of Dog Age: The DogAge Dataset and Challenge

Automatic age estimation is a challenging problem attracting the attention of the computer vision and pattern recognition communities due to its many practical applications. Artificial neural networks, such as CNNs, are a popular tool for tackling this problem, and several datasets which can be used for training models are available. Despite the fact that dogs are the most well studied species in animal science, and that ageing processes in dogs are in many aspects similar to those of humans, the problem of age estimation for dogs has so far been overlooked. In this paper we present the DogAge dataset and an associated challenge, hoping to spark the interest of the scientific community in the yet unexplored problem of automatic dog age estimation.

Anna Zamansky, Aleksandr M. Sinitca, Dmitry I. Kaplun, Luisa M. L. Dutra, Robert J. Young

Motion Analysis

Frontmatter
Neural Network 3D Body Pose Tracking and Prediction for Motion-to-Photon Latency Compensation in Distributed Virtual Reality

Distributed Virtual Reality (DVR) systems enable geographically dispersed users to interact in a shared virtual environment. The realism of the interaction is crucial to increase the feeling of co-presence. Latency, produced either by hard- or software components of DVR applications, impedes reaching high realism levels of the DVR experience. For example, the time delay between the user’s motion and the corresponding display rendering of the DVR system might lead to adverse effects such as a reduced sense of presence or motion sickness. One way of minimizing the latency is to predict user’s motion and thus compensate for the inherent latency in the system. In order to address this problem, we propose a neural network 3D pose tracking and prediction system with latency guarantees for end-to-end avatar reconstruction. We evaluate and compare our system against multiple traditional methods and provide a thorough analysis on real-world human motion data.

Sebastian Pohl, Armin Becher, Thomas Grauschopf, Cristian Axenie
Variational Deep Embedding with Regularized Student-t Mixture Model

This paper proposes a new motion classifier, named VaDE-RT, that uses variational deep embedding with a regularized student-t mixture model as prior to improve robustness to outliers while maintaining continuity in the latent space. Standard VaDE uses a Gaussian mixture model, which is sensitive to outliers; furthermore, all components of the mixture model can move freely in the latent space, which can destroy the continuity of the latent space. In contrast, VaDE-RT exploits the heavy-tailed nature of the student-t distribution for robustness and regularizes the mixture model toward the standard normal distribution employed as prior in a standard variational autoencoder. To do so, three reasonable approximations for (i) the reparameterization trick, (ii) the Kullback-Leibler (KL) divergence between student-t distributions, and (iii) the KL divergence of the mixture model are introduced to make backpropagation in VaDE-RT possible. As a result, VaDE-RT outperforms the original VaDE and a simple deep-learning-based classifier in terms of classification accuracy. In addition, VaDE-RT yields both continuity and a natural topology of clusters in the latent space, which enable smooth, adaptive robot control.

Taisuke Kobayashi
A Mixture-of-Experts Model for Vehicle Prediction Using an Online Learning Approach

Predicting future motion of other vehicles or, more generally, the development of traffic situations, is an essential step towards secure, context-aware automated driving. On the one hand, human drivers are able to anticipate driving situations continuously based on the currently perceived behavior of other traffic participants while incorporating prior experience. On the other hand, the most successful data-driven prediction models are typically trained on large amounts of recorded data before deployment achieving remarkable results. In this paper, we present a mixture-of-experts online learning model encapsulating both ideas. Our system learns at run time to choose between several models, which have been previously trained offline, based on the current situational context. We show that our model is able to improve over the offline models already after a short ramp-up phase. We evaluate our system on real world driving data.

Florian Mirus, Terrence C. Stewart, Chris Eliasmith, Jörg Conradt
Analysis of Dogs’ Sleep Patterns Using Convolutional Neural Networks

Video-based analysis is one of the most important tools of animal behavior and animal welfare scientists. While automatic analysis systems exist for many species, the problem has not yet been adequately addressed for one of the most studied species in animal science—dogs. In this paper we describe a system developed for analyzing the sleeping patterns of kenneled dogs, which may serve as an indicator of their welfare. The system combines convolutional neural networks with classical data processing methods and works with very low quality video from cameras installed in dog shelters.

Anna Zamansky, Aleksandr M. Sinitca, Dmitry I. Kaplun, Michael Plazner, Ivana G. Schork, Robert J. Young, Cristiano S. de Azevedo
On the Inability of Markov Models to Capture Criticality in Human Mobility

We examine the non-Markovian nature of human mobility by exposing the inability of Markov models to capture criticality in human mobility. In particular, the assumed Markovian nature of mobility was used to establish an upper bound on the predictability of human mobility, based on the temporal entropy. Since its inception, this bound has been widely used to validate the performance of mobility prediction models. We show that variants of recurrent neural network architectures can achieve significantly higher prediction accuracy, surpassing this upper bound. The central objective of our work is to show that human-mobility dynamics exhibit criticality characteristics which contribute to this discrepancy. To explain this anomaly, we shed light on the underlying assumption that human mobility characteristics follow an exponential decay, which has resulted in this bias. By evaluating predictability on real-world datasets, we show that human mobility exhibits scale-invariant long-distance dependencies, resembling a power-law decay, in contrast with the initial Markovian assumption. We experimentally validate that this assumption inflates the estimated mobility entropy, consequently lowering the upper bound on predictability. We demonstrate that the existing approach to entropy computation tends to overlook the presence of long-distance dependencies and structural correlations in human mobility, and we justify why recurrent neural network architectures designed to handle long-distance dependencies surpass the previously computed upper bound on mobility predictability.
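The temporal entropy behind the predictability bound discussed above is typically estimated with a Lempel-Ziv style entropy-rate estimator; the sketch below is a common textbook form of that estimator (an illustrative assumption, not code from this paper), whose value is then plugged into Fano's inequality to obtain the upper bound on prediction accuracy.

```python
import math

def lz_entropy_rate(seq):
    """Common Lempel-Ziv style entropy-rate estimator (an illustrative assumption,
    not code from this paper): S is estimated as n*log2(n) divided by the sum of
    the lengths of the shortest substrings starting at each position i that do
    not appear anywhere in seq[:i]."""
    n = len(seq)
    if n < 2:
        return 0.0
    total = 0
    for i in range(n):
        k = 1
        while i + k <= n and any(seq[j:j + k] == seq[i:i + k] for j in range(i - k + 1)):
            k += 1
        total += k
    return n * math.log2(n) / total
```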

Vaibhav Kulkarni, Abhijit Mahalunkar, Benoit Garbinato, John D. Kelleher
LSTM with Uniqueness Attention for Human Activity Recognition

Deep neural networks have promoted the development of human activity recognition research and have become indispensable tools for it. Deep neural networks such as LSTMs can automatically learn important features from human activity data. However, parts of these data are irrelevant and correspond to the Null activity [3], which can affect recognition performance. Therefore, we propose a uniqueness attention mechanism to solve this problem. Every human activity consists of many atom motions, some of which occur in only one single human activity. This kind of motion is more effective for discriminating human activities and should therefore receive more attention. We design a model named LSTM with Uniqueness Attention. When identifying the category of an unknown activity, our model first discovers the activity's atom motions that are unique to a known activity and then uses these motions to discriminate the unknown activity. In this way, irrelevant information can be filtered out. Moreover, by discovering an activity's unique atom motions, we gain more insight into that human activity. We evaluate our approach on two public datasets and obtain state-of-the-art results. We also visualize the uniqueness attention, which is highly interpretable and agrees well with common sense.

Zengwei Zheng, Lifei Shi, Chi Wang, Lin Sun, Gang Pan
Comparative Research on SOM with Torus and Sphere Topologies for Peculiarity Classification of Flat Finishing Skill Training

This paper compares the classification performance of Self-Organizing Maps (SOMs) with torus and spherical topologies in the case of classifying peculiarities of flat finishing motion with an iron file, measured by a 3D stylus. In manufacturing skill training, the peculiarities of tool motion are useful information for learners. Classified peculiarities are also useful, especially for trainers, to effectively grasp the tendency of the learners' peculiarities in their class. In the authors' former studies, a torus SOM was shown to be a powerful tool for classifying and visualizing peculiarities thanks to its borderless topological feature map structure. In this paper, the authors compare the classification performance of two kinds of borderless topological SOMs, the torus SOM and the spherical SOM, using quality measures.

Masaru Teranishi, Shimpei Matsumoto, Hidetoshi Takeno

Generating Images

Frontmatter
Generative Creativity: Adversarial Learning for Bionic Design

Generative creativity in the context of visual data refers to the generation process of new and creative images by composing features of existing ones. In this work, we aim to achieve generative creativity by learning to combine spatial features of images from different domains. We focus on bionic design as an ideal task for this study, in which a target object (e.g. a floor lamp) is designed to contain features of biological source objects (e.g. flowers), resulting in creative biologically-inspired design. Specifically, given an input image of a design target object, a generative model should learn to generate images that (1) maintain shape features of the input design target image, (2) contain shape features of images from the specified biological source domain, (3) are plausible and diverse. We propose DesignGAN, a novel unsupervised deep generative approach to realising shape-oriented bionic design. DesignGAN employs an adversarial learning architecture with designated losses to generate images that meet the three aforementioned requirements of bionic design modelling. We perform qualitative and quantitative experiments to evaluate our method, and demonstrate that our proposed framework successfully generates creative images of bionic design.

Simiao Yu, Hao Dong, Pan Wang, Chao Wu, Yike Guo
Self-attention StarGAN for Multi-domain Image-to-Image Translation

In this paper, we propose Self-attention StarGAN by introducing the self-attention mechanism into StarGAN to deal with multi-domain image-to-image translation, aiming to generate images with high-quality details and consistent backgrounds. The self-attention mechanism models the long-range dependencies among the feature maps at all positions and is not limited to local image regions. Simultaneously, we take advantage of batch normalization to reduce the reconstruction error and generate fine-grained texture details. We adopt spectral normalization in the network to stabilize the training of Self-attention StarGAN. Both quantitative and qualitative experiments on a public dataset have been conducted. The experimental results demonstrate that the proposed model achieves lower reconstruction error and generates higher-quality images compared to StarGAN. We use Amazon Mechanical Turk (AMT) for perceptual evaluation, and 68.1% of 1,000 AMT workers agree that the backgrounds of the images generated by Self-attention StarGAN are more consistent with the original images.
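For orientation, a self-attention block of the kind referred to above is commonly implemented in the SAGAN style sketched below; this is an assumed generic form, not the paper's exact module, and the channel reduction factor of 8 is an illustrative choice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Sketch of a SAGAN-style self-attention block, assumed here as the form of
    the mechanism described above: every position attends to all other positions
    of the feature map, modeling dependencies beyond the local receptive field."""

    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual blending weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).view(b, -1, h * w)                        # B x C' x N
        k = self.key(x).view(b, -1, h * w)                          # B x C' x N
        v = self.value(x).view(b, c, h * w)                         # B x C  x N
        attn = F.softmax(torch.bmm(q.transpose(1, 2), k), dim=-1)   # B x N x N
        out = torch.bmm(v, attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x
```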

Ziliang He, Zhenguo Yang, Xudong Mao, Jianming Lv, Qing Li, Wenyin Liu
Generative Adversarial Networks for Operational Scenario Planning of Renewable Energy Farms: A Study on Wind and Photovoltaic

For the integration of renewable energy sources, power grid operators need realistic information about the effects of energy production and consumption to assess grid stability. Recently, research in scenario planning has benefited from utilizing generative adversarial networks (GANs) as generative models for operational scenario planning. In these scenarios, operators examine temporal as well as spatial influences of different energy sources on the grid. Analyzing how renewable energy resources affect the grid enables operators to evaluate stability and to identify potential weak points such as a limiting transformer. However, due to their novelty, there are limited studies on how well GANs model the underlying power distribution. This analysis is essential because, e.g., extreme situations with low or high power generation are particularly important for evaluating grid stability. We conduct a comparative study of the Wasserstein distance, the binary cross-entropy loss, and a Gaussian copula as the baseline, applied to two wind and two solar datasets with limited data compared to previous studies. Both GANs achieve good results considering the limited amount of data, but the Wasserstein GAN is superior in modeling temporal and spatial relations and the power distribution. Besides evaluating the generated power distribution over all farms, it is essential to assess terrain-specific distributions for wind scenarios. These terrain-specific power distributions affect the grid through the differences in their generated power magnitude. Therefore, in a second study, we show that even when simultaneously learning distributions from wind parks with terrain-specific patterns, GANs are capable of modeling these individualities, also when faced with limited data. These results motivate the further use of GANs as generative models in scenario planning as well as in other areas of renewable energy.

Jens Schreiber, Maik Jessulat, Bernhard Sick
Constraint-Based Visual Generation

In the last few years, the systematic adoption of deep learning for visual generation has produced impressive results that, amongst others, clearly benefit from the massive exploration of convolutional architectures. In this paper, we propose a general approach to visual generation that combines learning capabilities with logic descriptions of the target to be generated. The generation process is regarded as a constraint satisfaction problem, where the constraints describe a set of properties that characterize the target. Interestingly, the constraints can also involve logic variables, and all of them are converted into real-valued functions by means of t-norm theory. We use deep architectures to model the involved variables, and propose a computational scheme in which the learning process carries out the satisfaction of the constraints. We present examples in which the theory can naturally be used, including the modeling of GANs and auto-encoders, and report promising results in image translation of human faces.
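For intuition on how logic constraints become real-valued functions, here is a minimal sketch of the product t-norm (Goguen) semantics in Python; the constraint used at the end is an illustrative assumption, not the paper's actual constraint set.

```python
# Product t-norm semantics for turning logical connectives into real-valued functions on [0, 1].
def t_and(a: float, b: float) -> float:       # conjunction: a AND b
    return a * b

def t_or(a: float, b: float) -> float:        # disjunction: a OR b
    return a + b - a * b

def t_not(a: float) -> float:                 # negation: NOT a
    return 1.0 - a

def t_implies(a: float, b: float) -> float:   # Goguen implication under the product t-norm
    return 1.0 if a == 0.0 else min(1.0, b / a)

# An illustrative constraint such as "generated(x) => realistic(x)" then becomes the
# real-valued penalty 1 - t_implies(p_generated, p_realistic), which the learning
# process drives towards zero.
```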

Giuseppe Marra, Francesco Giannini, Michelangelo Diligenti, Marco Gori
Text to Image Synthesis Based on Multiple Discrimination

We propose a novel and simple text-to-image synthesizer (MD-GAN) based on multiple discrimination. Building on the Generative Adversarial Network (GAN), we introduce segmentation images to the discriminator to improve its discrimination ability, which in turn strengthens the generator and yields high-resolution results. Experiments validate the strong performance of our algorithm. On the CUB dataset, our inception score is 27.7% and 1.7% higher than GAN-CLS-INT and GAWWN, respectively. On the flower dataset, it outperforms GAN-CLS-INT and StackGAN by 21.8% and 1.25%, respectively. At the same time, our model has a more concise structure, and its training time is only half that of StackGAN.

Zhiqiang Zhang, Yunye Zhang, Wenxin Yu, Jingwei Lu, Li Nie, Gang He, Ning Jiang, Gang He, Yibo Fan, Zhuo Yang
Disentangling Latent Factors of Variational Auto-encoder with Whitening

After deep generative models were successfully applied to image generation tasks, learning disentangled latent variables of data became a crucial part of deep generative model research. Many models have been proposed to learn an interpretable and factorized representation of the latent variables by modifying the objective function or the model architecture. However, to disentangle the latent variables, some models sacrifice the quality of the reconstructed images, while others increase the model complexity and become hard to train. In this paper, we propose a simple disentangling method based on a traditional whitening process. The proposed method is applied to the latent variables of a variational auto-encoder (VAE), although it can be applied to any generative model with latent variables. In experiments, we apply the proposed method to simple VAE models, and the results confirm that our method finds more interpretable factors in the latent space while keeping the reconstruction error the same as that of the conventional VAE.
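A traditional whitening step of the kind the abstract refers to can be sketched as follows: a ZCA-style whitening of a batch of latent codes in PyTorch. The epsilon value and the choice of ZCA over PCA whitening are assumptions made for illustration.

```python
import torch

def zca_whiten(z: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Whiten a batch of latent codes z of shape (batch, dim):
    decorrelate the dimensions and scale them to unit variance."""
    z_centered = z - z.mean(dim=0, keepdim=True)
    cov = z_centered.t() @ z_centered / (z.shape[0] - 1)
    # Eigendecomposition of the covariance matrix of the latent codes.
    eigvals, eigvecs = torch.linalg.eigh(cov)
    whitening = eigvecs @ torch.diag((eigvals + eps).rsqrt()) @ eigvecs.t()
    return z_centered @ whitening
```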

Sangchul Hahn, Heeyoul Choi
Training Discriminative Models to Evaluate Generative Ones

Generative models are known to be difficult to assess. Recent works, especially on generative adversarial networks (GANs), produce good visual samples across varied categories of images. However, validating their quality is still difficult, and there is no agreement on the best evaluation process. This paper aims to take a step toward an objective evaluation process for generative models. It presents a new method to assess a trained generative model by evaluating the test accuracy of a classifier trained on generated data, with the test set composed of real images. The classifier accuracy is therefore used as a proxy for how well the generative model fits the true data distribution. By comparing results across different generated datasets, we are able to rank and compare generative models. A further motivation of this approach is to evaluate whether generative models can help discriminative neural networks learn, i.e., to measure whether training on generated data can make a model succeed when tested in real settings. Our experiments compare different generators from the Variational Auto-Encoder (VAE) and Generative Adversarial Network (GAN) frameworks on the MNIST and Fashion-MNIST datasets. Our results show that none of the generative models is able to completely replace real data for training a discriminative model, but they also show that the original GAN and WGAN are the best choices for generating data on the MNIST (Modified National Institute of Standards and Technology) and Fashion-MNIST databases.
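The evaluation protocol can be summarised by the following sketch, in which `generator.sample`, `classifier_factory`, `fit`, and `score` are hypothetical placeholders rather than an actual API: a classifier is trained purely on generated, labelled samples and its accuracy is reported on real test images.

```python
def evaluate_generative_model(generator, classifier_factory,
                              real_test_images, real_test_labels,
                              n_synthetic: int = 60000) -> float:
    """Test accuracy of a classifier trained on generated data, used as a proxy for generator quality."""
    # 1) Draw a labelled synthetic training set from the generative model (hypothetical interface).
    synthetic_images, synthetic_labels = generator.sample(n_synthetic)
    # 2) Train a fresh discriminative model on the synthetic data only.
    classifier = classifier_factory()
    classifier.fit(synthetic_images, synthetic_labels)
    # 3) Score it on held-out *real* images; higher accuracy suggests a better fit to the true distribution.
    return classifier.score(real_test_images, real_test_labels)
```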

Timothée Lesort, Andrei Stoian, Jean-François Goudou, David Filliat
Scene Graph Generation via Convolutional Message Passing and Class-Aware Memory Embeddings

Detecting visual relationships between objects in an image remains challenging, because the relationships are difficult to model and the class imbalance problem tends to jeopardize predictions. To alleviate these problems, we propose an end-to-end approach for scene graph generation. The proposed method employs ResNet as the backbone network to extract the appearance features of the objects and relationships. An attention-based graph convolutional network is adopted and modified to extract contextual information. Language and geometric priors are also utilized and fused with the visual features to better describe the relationships. Finally, a novel memory module is designed to alleviate the class imbalance problem. Experimental results demonstrate the validity of our model and its superiority over the baseline technique.

Yidong Zhang, Yunhong Wang, Yuanfang Guo

Attacks on Images

Frontmatter
Change Detection in Satellite Images Using Reconstruction Errors of Joint Autoencoders

With the growing number of open-source satellite image time series, such as SPOT or Sentinel-2, the number of potential change detection applications is increasing every year. However, due to image quality and resolution, the change detection process remains challenging. In this work, we propose an approach that uses the reconstruction losses of joint autoencoders to detect non-trivial changes (permanent changes and seasonal changes that do not follow the common tendency) between two co-registered images in a satellite image time series. The autoencoder learns a transformation model that reconstructs one co-registered image from another. Since trivial changes, such as changes in luminosity or seasonal changes between two dates, tend to repeat in different areas of the image, their transformation model can be learned easily. Non-trivial changes, however, tend to be unique and cannot be correctly translated from one date to another, hence an elevated reconstruction error where there is change. We compare two models in order to find the better-performing one. The proposed approach is completely unsupervised and gives promising results on an open-source time series when compared with competing methods.
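As an illustration of the detection principle (not the authors' exact pipeline), the sketch below thresholds the per-pixel reconstruction error of a learned date-to-date translation model; the `translator.predict` interface and the fixed-threshold rule are assumptions.

```python
import numpy as np

def detect_changes(image_t1: np.ndarray, image_t2: np.ndarray,
                   translator, threshold: float) -> np.ndarray:
    """Return a boolean change mask between two co-registered images of shape (H, W, bands).

    `translator` is assumed to be a model that reconstructs the second date from the first
    (e.g. a Keras-like object exposing `predict`); pixels it fails to reconstruct well are
    flagged as non-trivial changes.
    """
    predicted_t2 = translator.predict(image_t1[np.newaxis, ...])[0]
    per_pixel_error = np.linalg.norm(predicted_t2 - image_t2, axis=-1)  # error over spectral bands
    return per_pixel_error > threshold
```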

Ekaterina Kalinicheva, Jérémie Sublime, Maria Trocan
Physical Adversarial Attacks by Projecting Perturbations

Research on adversarial attacks analyses how to slightly manipulate patterns, such as images, to make a classifier believe it recognises a pattern with a wrong label, although the correct label is obvious to humans. In traffic sign recognition, previous physical adversarial attacks were mainly based on stickers or graffiti on the sign's surface. In this paper, we propose and experimentally verify a new threat model that projects perturbations onto street signs via projectors or simulated laser pointers. No physical manipulation is required, which makes the attack difficult to detect. Attacks via projection imply new constraints, such as exclusively increasing colour intensities or manipulating only certain colour channels. As exemplary experiments, we fool neural networks into classifying stop signs as priority signs merely by projecting optimised perturbations onto original traffic signs.

Nils Worzyk, Hendrik Kahlen, Oliver Kramer
Improved Forward-Backward Propagation to Generate Adversarial Examples

Deep neural networks (DNNs) have been widely applied in many areas. However, they are quite vulnerable to well-designed perturbations. Most recent methods for generating adversarial examples fail to limit the perturbations while maintaining good transferability. In this work, we propose a new method to address these problems by combining a local attack with a gradient-descent optimization method to generate adversarial examples. Specifically, in forward propagation, we select sensitive pixels and add perturbations to them. In backward propagation, we propose a novel loss function to reduce the difference between adversarial examples and benign images. Extensive experiments demonstrate that our method achieves strong attack ability with lower distortion.

Yuying Hao, Tuanhui Li, Yang Bai, Li Li, Yong Jiang, Xuanye Cheng
Incremental Learning of GAN for Detecting Multiple Adversarial Attacks

Neural networks are vulnerable to adversarial attacks: carefully crafted small perturbations can cause neural network classifiers to misclassify. As adversarial attacks are a serious potential problem in many neural-network-based applications and new attacks constantly emerge, it is urgent to explore detection strategies that can adapt to new attacks quickly. Moreover, a detector is hard to train with limited samples. To solve these problems, we propose a GAN-based incremental learning framework with Jacobian-based data augmentation to detect adversarial samples. To show that the proposed framework works on multiple adversarial attacks, we implement FGSM, LocSearchAdv, and a PSO-based attack on the MNIST and CIFAR-10 datasets. The experiments show that our detection framework performs well on these adversarial attacks.
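Of the attacks listed, FGSM is the simplest to state; a minimal PyTorch sketch follows, with an assumed epsilon of 0.03 on images scaled to [0, 1] (the original work may use different settings).

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, images: torch.Tensor, labels: torch.Tensor,
                epsilon: float = 0.03) -> torch.Tensor:
    """Fast Gradient Sign Method: perturb each pixel by +/- epsilon along the loss gradient sign."""
    images = images.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(images), labels)
    loss.backward()
    adversarial = images + epsilon * images.grad.sign()
    return adversarial.clamp(0, 1).detach()
```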

Zibo Yi, Jie Yu, Shasha Li, Yusong Tan, Qingbo Wu
Evaluating Defensive Distillation for Defending Text Processing Neural Networks Against Adversarial Examples

Adversarial examples are artificially modified input samples which lead to misclassifications while not being detectable by humans. They are a challenge for many tasks such as image and text classification, especially as research shows that many adversarial examples are transferable between different classifiers. In this work, we evaluate the performance of a popular defensive strategy called defensive distillation, which can successfully harden neural networks against adversarial examples in the image domain. Instead of applying defensive distillation to networks for image classification, however, we examine, for the first time, its performance on text classification tasks and also evaluate its effect on the transferability of adversarial text examples. Our results indicate that defensive distillation has only a minimal impact on text-classifying neural networks: it neither helps increase their robustness against adversarial examples nor prevents the transferability of adversarial examples between neural networks.
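For readers unfamiliar with the defence being evaluated: defensive distillation trains a second network on the temperature-softened outputs of the first. A minimal sketch of the softened targets and the matching loss is given below; the temperature of 20 is an assumption chosen purely for illustration.

```python
import torch.nn.functional as F

def distillation_targets(teacher_logits, temperature: float = 20.0):
    """Soft labels produced by the teacher network at a high softmax temperature."""
    return F.softmax(teacher_logits / temperature, dim=-1)

def distillation_loss(student_logits, soft_targets, temperature: float = 20.0):
    """Train the distilled (student) network to match the teacher's softened output distribution."""
    log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
```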

Marcus Soll, Tobias Hinz, Sven Magg, Stefan Wermter
DCT: Differential Combination Testing of Deep Learning Systems

Deep learning (DL) systems are increasingly used in security-related fields, where the accuracy and predictability of DL systems are critical. However, DL models are difficult to test, and existing DL testing relies heavily on manually labeled data and often fails to expose erroneous behavior on corner-case inputs. In this paper, we propose Differential Combination Testing (DCT), an automated DL testing tool for systematically detecting erroneous behavior on more corner cases without relying on manually labeled input data or manually checking the correctness of the output behavior. Our tool automatically generates test cases by applying image combination transformations to seed images, systematically producing synthetic images that achieve high neuron coverage and trigger inconsistencies between multiple similar DL models. In addition, DCT uses multiple DL models with similar functions as cross-references, so that input data no longer needs to be manually labeled and the correctness of the output behavior can be checked automatically. The results show that DCT can find thousands of erroneous corner-case behaviors in the most commonly used DL models effectively and quickly, and can thus better assess the reliability and robustness of DL systems.

Chunyan Wang, Weimin Ge, Xiaohong Li, Zhiyong Feng
Restoration as a Defense Against Adversarial Perturbations for Spam Image Detection

Spam image detection is essential for protecting the security and privacy of Internet users and saving network resources. However, we observe that a spam image detection system can be put out of order by adversarial perturbations, which force a classification model to misclassify the input images. To defend against adversarial perturbations, previous research disrupts the perturbations with basic image processing techniques, which shows limited success. Instead, we apply image restoration as a defense, which focuses on restoring perturbed adversarial images to their original versions. The restoration is achieved by a lightweight preprocessing network, which takes the adversarial images as input and outputs their restored versions for classification. The evaluation results demonstrate that our defense significantly improves the performance of classification models, requires little cost, and outperforms other representative defenses.

Jianguo Jiang, Boquan Li, Min Yu, Chao Liu, Weiqing Huang, Lejun Fan, Jianfeng Xia
HLR: Generating Adversarial Examples by High-Level Representations

Neural networks can be fooled by adversarial examples. Recently, many methods have been proposed to generate adversarial examples, but these works mainly concentrate on pixel-wise information, which limits the transferability of the adversarial examples. Different from these methods, we introduce a perceptual module to extract high-level representations and change the manifold of the adversarial examples. In addition, we propose a novel network structure to replace the generative adversarial network (GAN). The improved structure ensures high similarity of the adversarial examples and promotes the stability of the training process. Extensive experiments demonstrate that our method significantly improves transferability. Furthermore, the adversarial training defence is ineffective against our attack.
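A perceptual module of the general kind described here can be sketched as a fixed pretrained feature extractor whose activations define a distance between images. The choice of a truncated VGG-16 below is an illustrative assumption, not the paper's architecture.

```python
import torch
import torch.nn.functional as F
import torchvision.models as models

class PerceptualLoss(torch.nn.Module):
    """Distance between high-level representations of two images, computed with a frozen
    pretrained feature extractor (a truncated VGG-16, used here only for illustration)."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT).features[:16]
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.features = vgg.eval()

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        return F.mse_loss(self.features(x), self.features(y))
```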

Yuying Hao, Tuanhui Li, Li Li, Yong Jiang, Xuanye Cheng
Backmatter
Metadata
Title
Artificial Neural Networks and Machine Learning – ICANN 2019: Image Processing
edited by
Igor V. Tetko
Dr. Věra Kůrková
Pavel Karpov
Prof. Fabian Theis
Copyright Year
2019
Electronic ISBN
978-3-030-30508-6
Print ISBN
978-3-030-30507-9
DOI
https://doi.org/10.1007/978-3-030-30508-6
