2021 | Book

Artificial Neural Networks and Machine Learning – ICANN 2021

30th International Conference on Artificial Neural Networks, Bratislava, Slovakia, September 14–17, 2021, Proceedings, Part III

Edited by: Prof. Igor Farkaš, Paolo Masulli, Dr. Sebastian Otte, Stefan Wermter

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

The proceedings set LNCS 12891, LNCS 12892, LNCS 12893, LNCS 12894 and LNCS 12895 constitute the proceedings of the 30th International Conference on Artificial Neural Networks, ICANN 2021, held in Bratislava, Slovakia, in September 2021.* The 265 full papers presented in these proceedings were carefully reviewed and selected from 496 submissions and organized in five volumes.

In this volume, the papers focus on topics such as generative neural networks, graph neural networks, hierarchical and ensemble models, human pose estimation, image processing, image segmentation, knowledge distillation, and medical image processing.

*The conference was held online in 2021 due to the COVID-19 pandemic.

Table of Contents

Frontmatter

Generative Neural Networks

Frontmatter
Binding and Perspective Taking as Inference in a Generative Neural Network Model

The ability to flexibly bind features into coherent wholes from different perspectives is a hallmark of cognition and intelligence. This binding problem is not only relevant for vision but also for general intelligence, sensorimotor integration, event processing, and language. Various artificial neural network models have tackled this problem. Here we focus on a generative encoder-decoder model, which adapts its perspective and binds features by means of retrospective inference. We first train the model to learn sufficiently accurate generative models of dynamic biological, or other harmonic, motion patterns. We then scramble the input and vary the perspective onto it. To properly route the input and adapt the internal perspective onto a known frame of reference, we propagate the prediction error (i) back onto a binding matrix, that is, hidden neural states that determine feature binding, and (ii) further back onto perspective-taking neurons, which rotate and translate the input features. Evaluations show that the resulting gradient-based inference process solves the perspective-taking and binding problems in the considered motion domains, essentially yielding a Gestalt perception mechanism. Ablation studies show that redundant motion features and population encodings are highly useful.

Mahdi Sadeghi, Fabian Schrodt, Sebastian Otte, Martin V. Butz
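
The retrospective inference loop described above lends itself to a compact sketch. The following is a minimal, hypothetical PyTorch rendering (the function name, the soft binding matrix, and the single 2-D rotation parameter are our assumptions, not the authors' code): the generative model's weights stay frozen while the prediction error is minimized with respect to the binding and perspective parameters.

import torch

def retrospective_inference(model, observed, n_features, n_slots,
                            steps=100, lr=0.05):
    # model: frozen generative network mapping bound features to a prediction
    # observed: (n_features, 2) scrambled 2-D feature coordinates
    binding_logits = torch.zeros(n_slots, n_features, requires_grad=True)
    angle = torch.zeros(1, requires_grad=True)      # perspective parameter
    opt = torch.optim.Adam([binding_logits, angle], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        B = torch.softmax(binding_logits, dim=-1)   # soft feature binding
        c, s = torch.cos(angle), torch.sin(angle)
        R = torch.stack([torch.cat([c, -s]), torch.cat([s, c])])
        routed = (B @ observed) @ R.T               # re-bind, then re-orient
        loss = torch.mean((model(routed) - observed) ** 2)  # prediction error
        loss.backward()                             # error drives inference
        opt.step()
    return torch.softmax(binding_logits, dim=-1), angle
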
Advances in Password Recovery Using Generative Deep Learning Techniques

Password guessing approaches via deep learning have recently been investigated, with significant breakthroughs in their ability to generate novel, realistic password candidates. In the present work we study a broad collection of deep learning and probabilistic models in the light of password guessing: attention-based deep neural networks, autoencoding mechanisms, and generative adversarial networks. We provide novel generative deep-learning models in the form of variational autoencoders exhibiting state-of-the-art sampling performance and yielding additional latent-space features such as interpolations and targeted sampling. Lastly, we perform a thorough empirical analysis in a unified controlled framework over well-known datasets (RockYou, LinkedIn, MySpace, Youku, Zomato, Pwnd). Our results not only identify the most promising schemes driven by deep neural networks, but also illustrate the strengths of each approach in terms of generation variability and sample uniqueness.

David Biesner, Kostadin Cvejoski, Bogdan Georgiev, Rafet Sifa, Erik Krupicka
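
Two of the latent-space features mentioned in the abstract, interpolation and targeted sampling, are easy to illustrate. Below is a minimal sketch assuming hypothetical encoder/decoder interfaces (the names and the Gaussian-perturbation scheme are ours, not the paper's):

import numpy as np

def interpolate(encoder, decoder, pw_a, pw_b, n=8):
    # Walk a straight line between two passwords' latent codes.
    za, zb = encoder(pw_a), encoder(pw_b)
    return [decoder((1 - a) * za + a * zb) for a in np.linspace(0.0, 1.0, n)]

def targeted_sampling(encoder, decoder, seed_pw, n=100, sigma=0.3):
    # Gaussian perturbations around a seed's latent code yield candidates
    # that stay stylistically close to the seed password.
    z = encoder(seed_pw)
    return [decoder(z + sigma * np.random.randn(*z.shape)) for _ in range(n)]
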
Dilated Residual Aggregation Network for Text-Guided Image Manipulation

Text-guided image manipulation aims to modify the visual attributes of images according to textual descriptions. Existing works either produce mismatches between generated images and textual descriptions or pollute text-irrelevant image regions. In this paper, we propose a dilated residual aggregation network (denoted as DRA) for text-guided image manipulation, which exploits a long-distance residual with dilated convolutions (RD) to aggregate the encoded visual content and style features and the textual features of the guiding descriptions. In particular, the dilated convolutions increase the receptive field without sacrificing the spatial resolution of intermediate features, which benefits the reconstruction of texture details that match the textual descriptions. Furthermore, we propose an attention-guided injection module (AIM) that injects textual semantics into the feature maps of DRA without polluting text-irrelevant image regions, by combining a triplet attention mechanism with central biasing instance normalization. Quantitative and qualitative experiments conducted on the CUB-200-2011 and Oxford-102 datasets demonstrate the superior performance of the proposed DRA.

Siwei Lu, Di Luo, Zhenguo Yang, Tianyong Hao, Qing Li, Wenyin Liu
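
A generic dilated residual block of the kind the abstract describes is sketched below: dilation enlarges the receptive field while stride-1 convolutions and matched padding keep the spatial resolution intact. This is an illustrative PyTorch sketch, not the authors' exact RD module.

import torch.nn as nn

class DilatedResidualBlock(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        pad = dilation  # for 3x3 kernels, padding = dilation keeps H x W fixed
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=pad, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=pad, dilation=dilation),
        )

    def forward(self, x):
        return x + self.body(x)  # long-distance residual aggregation
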
Denoising AutoEncoder Based Delete and Generate Approach for Text Style Transfer

The text style transfer task transfers sentences to other styles while preserving their semantics as much as possible. In this work, we study a two-step text style transfer method on non-parallel datasets. In the first step, the style-relevant words are detected and deleted from the sentences in the source style corpus. In the second step, the remaining style-devoid contents are fed into a Natural Language Generation model to produce sentences in the target style. The model consists of a style encoder and a pre-trained Denoising AutoEncoder. The former extracts style features of each style corpus; the latter reconstructs source sentences during training and generates sentences in the target style from given contents during inference. We conduct experiments on two text sentiment transfer datasets and comprehensive comparisons with other relevant methods in terms of several evaluation aspects. Evaluation results show that our method outperforms others in terms of sentence fluency and achieves a decent tradeoff between content preservation and style transfer intensity. The superior performance on the Caption dataset illustrates our method’s potential advantage in settings with limited data.

Ting Hu, Haojin Yang, Christoph Meinel
GUIS2Code: A Computer Vision Tool to Generate Code Automatically from Graphical User Interface Sketches

It is a typical task for front-end developers to repetitively transform the graphical user interface model provided by the designer into code. Automatically converting the design draft into code can simplify the front-end engineer's work and avoid a lot of simple and repetitive labor. In this paper, we propose GUIS2Code, a deep neural network trained on datasets of design drafts to detect the UI elements of input sketches and generate the corresponding code through a UI parser. Our method can generate code for three different platforms (i.e., iOS, Android, and Web). Our experimental results illustrate that GUIS2Code achieves an average GUI-component classification accuracy of 95.04% and generates code that restores the target sketches accurately while exhibiting reasonable code structure.

Zhen Feng, Jiaqi Fang, Bo Cai, Yingtao Zhang
Generating Math Word Problems from Equations with Topic Consistency Maintaining and Commonsense Enforcement

The data-to-text generation task aims at generating text from structured data. In this work, we focus on a relatively new and challenging equation-to-text generation task, generating math word problems from equations, and propose a novel equation-to-problem text generation model. Our model first utilizes a template-aware equation encoder and a Variational AutoEncoder (VAE) to bridge the gap between abstract math tokens and text. We then introduce a topic selector and a topic controller to prevent topic drifting. To avoid commonsense violations, we design a pre-training stage together with a commonsense enforcement mechanism. We construct a dataset to evaluate our model through both automatic metrics and human evaluation. Experiments show that our model significantly outperforms baseline models. Further analysis shows our model is effective in tackling topic drifting and commonsense violation problems.

Tianyang Cao, Shuang Zeng, Songge Zhao, Mairgup Mansur, Baobao Chang
Generative Properties of Universal Bidirectional Activation-Based Learning

UBAL is a novel bidirectional neural network model with bio-inspired learning. It enhances the contrastive Hebbian learning rule with an internal echo mechanism that enables self-supervised learning. UBAL approaches any problem as a bidirectional heteroassociation, which gives rise to emergent properties, such as the generation of patterns while being trained for classification. We briefly discuss and illustrate these properties using the MNIST dataset and conclude that, with a slight trade-off in accuracy, we can achieve feasible image generation without explicitly setting up the objective to do so.

Kristína Malinovská, Igor Farkaš

Graph Neural Networks I

Frontmatter
Joint Graph Contextualized Network for Sequential Recommendation

Sequential recommendation aims to suggest items to users based on sequential dependencies. Graph neural networks (GNNs) have recently been proposed to capture transitions of items by treating session sequences as graph-structured data. However, existing graph construction approaches mainly focus on the directional dependency of items and ignore the benefits of feature aggregation over undirected relationships. In this paper, we innovatively propose a joint graph contextualized network (JGCN) for sequential recommendation, which constructs both directed and undirected graphs to jointly capture current interests and global preferences. Specifically, we introduce gated graph neural networks and model the combined embedding of weighted position and node information from directed graphs to capture current interests. In addition, to learn global preferences, we propose a graph collaborative attention network with correlation-based similarity of items from undirected graphs. Finally, a feed-forward layer with a residual connection is applied to synthetically obtain accurate transitions of items. Extensive experiments conducted on three datasets show that JGCN outperforms state-of-the-art methods.

Ruoran Huang, Chuanqi Han, Li Cui
Relevance-Aware Q-matrix Calibration for Knowledge Tracing

Knowledge tracing (KT) lies at the core of intelligent education; it aims to diagnose students’ changing knowledge level over time based on their historical performance. Most existing KT models either ignore the significance of the Q-matrix, which associates exercises with knowledge concepts (KCs), or fail to eliminate the subjective tendency of the experts behind the Q-matrix, and are thus insufficient for capturing the complex interaction between students and exercises. In this paper, we propose a novel Relevance-Aware Q-matrix Calibration method for knowledge tracing (RAQC), which incorporates the calibrated Q-matrix into a Long Short-Term Memory (LSTM) network to model the complex student learning process and obtain both accurate and interpretable diagnosis results. Specifically, we first leverage the message passing mechanism in a Graph Convolution Network (GCN) to fully exploit the high-order connectivity between exercises and KCs and obtain a potential KC list. Then, we propose a Q-matrix calibration method that uses relevance scores between exercises and KCs to mitigate the subjective bias present in the human-labeled Q-matrix. After that, the embedding of each exercise, aggregated with the calibrated Q-matrix and the corresponding response log, is fed into the LSTM to trace students’ knowledge states (KS). Extensive experimental results on two real-world datasets show the effectiveness of the proposed method.

Wentao Wang, Huifang Ma, Yan Zhao, Zhixin Li, Xiangchun He
LGACN: A Light Graph Adaptive Convolution Network for Collaborative Filtering

To relieve the information flood on the web, recommender systems have been widely used to retrieve personalized information. In recommender systems, the Graph Convolutional Network (GCN) has become a new frontier technology for collaborative filtering. However, existing methods usually assume that neighbor nodes have only positive effects on the target node. A few methods analyze the design of traditional GCNs and eliminate some invalid operations, but they do not consider the possible negative effects of neighbors when adapting GCNs to collaborative filtering. Thus, we argue that it is crucial to take both the positive and negative effects of neighbors into consideration for collaborative filtering. In this paper, we aim to alter the neighbor aggregation method and layer combination mechanism of GCNs to make them more applicable to recommendation. Inspired by LightGCN, we propose a new model named LGACN (Light Graph Adaptive Convolution Network), which takes the most important components in GCNs, neighborhood aggregation and layer combination, and alters them to fit collaborative filtering. Specifically, LGACN learns user and item embeddings by propagating their positive and negative information on the user-item interaction graph with an adaptive attention-based method, and uses the self-attention mechanism to combine the embeddings learned at each layer into the final embedding. Such a neat model is not only easy to implement but also interpretable, outperforming strong recommender baselines. Our model achieves about a 15% relative improvement on Amazon-book and a 5% relative improvement on Yelp2018 compared with LightGCN.

Weiguang Jiang, Su Wang, Jun Zheng, Wenxin Hu
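
One plausible reading of the adaptive aggregation idea is sketched below as a hypothetical PyTorch module (the tanh-scored signed attention is our assumption): each neighbor receives a learned weight that may be negative, so neighbors can push the target embedding away as well as pull it closer.

import torch
import torch.nn as nn

class AdaptiveAggregation(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)  # signed attention score

    def forward(self, target, neighbors):
        # target: (d,), neighbors: (n, d)
        pairs = torch.cat([target.expand_as(neighbors), neighbors], dim=-1)
        w = torch.tanh(self.score(pairs))   # in (-1, 1): positive or negative
        return (w * neighbors).sum(dim=0)   # weighted, possibly repulsive, sum
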
HawkEye: Cross-Platform Malware Detection with Representation Learning on Graphs

Malicious software, widely known as malware, is one of the biggest threats to our interconnected society. Cybercriminals can utilize malware to carry out their nefarious tasks. To address this issue, analysts have developed systems that can prevent malware from successfully infecting a machine. Unfortunately, these systems come with two significant limitations. First, they frequently target one specific platform/architecture, and thus, they cannot be ubiquitous. Second, code obfuscation techniques used by malware authors can negatively influence their performance. In this paper, we design and implement HawkEye, a control-flow-graph-based cross-platform malware detection system, to tackle the problems mentioned above. In more detail, HawkEye utilizes a graph neural network to convert the control flow graphs of executables to vectors with trainable instruction embeddings, and then uses a machine-learning-based classifier to create a malware detection system. We evaluate HawkEye by testing real samples on different platforms and operating systems, including Linux (x86, x64, and ARM-32), Windows (x86 and x64), and Android. The results outperform most existing works, with an accuracy of 96.82% on Linux, 93.39% on Windows, and 99.6% on Android. To the best of our knowledge, HawkEye is the first approach to apply graph neural networks combined with natural language processing to the malware detection field.

Peng Xu, Youyi Zhang, Claudia Eckert, Apostolis Zarras
An Empirical Study of the Expressiveness of Graph Kernels and Graph Neural Networks

Graph neural networks and graph kernels have achieved great success in solving machine learning problems on graphs. Recently, there has been considerable interest in determining the expressive power of graph neural networks and, to a lesser extent, of graph kernels. Most studies have focused on the ability of these approaches to distinguish non-isomorphic graphs or to identify specific graph properties. However, there is often a need for algorithms whose produced graph representations can accurately capture the similarity/distance of graphs. This paper studies the expressive power of graph neural networks and graph kernels from an empirical perspective. Specifically, we compare the graph representations and similarities produced by these algorithms against those generated by a well-accepted, but intractable, graph similarity function. We also investigate the impact of node attributes on the performance of the different models and kernels. Our results reveal interesting findings. For instance, we find that theoretically more powerful models do not necessarily yield higher-quality representations, while graph kernels are shown to be very competitive with graph neural networks.

Giannis Nikolentzos, George Panagopoulos, Michalis Vazirgiannis
Multi-resolution Graph Neural Networks for PDE Approximation

Deep learning algorithms have recently received growing interest for learning, from examples of existing solutions, accurate approximations of the solutions of complex physical problems, in particular by applying Graph Neural Networks to a mesh of the domain at hand. On the other hand, state-of-the-art deep approaches to image processing use different resolutions to better handle the different scales of images, thanks to pooling and up-scaling operations. But no such operators can be easily defined for Graph Convolutional Neural Networks (GCNN). This paper defines such operators based on meshes of different granularities, from which multi-resolution GCNNs can then be built. We propose the MGMI approach, as well as an architecture based on the famed U-Net. These approaches are experimentally validated on a diffusion problem and compared with a projected CNN approach; the experiments demonstrate their efficiency as well as their generalization capabilities.

Wenzhuo Liu, Mouadh Yagoubi, Marc Schoenauer
Link Prediction on Knowledge Graph by Rotation Embedding on the Hyperplane in the Complex Vector Space

The large-scale exploitation of knowledge graphs has promoted research efforts on graph construction and completion at many organizations such as Google and Apple. The problem of predicting missing links in a knowledge graph often depends heavily on the method of embedding the vertices into a low-dimensional space, mostly considering relations as translations. Recently, an approach based on rotation embedding has shown remarkable improvements. In this paper, we therefore propose an approach that rotates entity embeddings in a low-dimensional vector space. Specifically, we start by projecting the entities onto relation-specific hyperplanes before rotating them so that the head entities are as close as possible to the tail entities. Each relation is thus a rotation from the head entities to the tail entities on the hyperplane in complex vector space. Experiments on well-known datasets show the improvement of the proposed model over other models.

Thanh Le, Ngoc Huynh, Bac Le
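
A worked sketch of the scoring idea (our reconstruction, with conventions borrowed from TransH-style projection and RotatE-style rotation): project head and tail onto the relation-specific hyperplane, then treat the relation as an elementwise complex rotation from head to tail.

import numpy as np

def score(h, t, w, r):
    # h, t: complex entity embeddings; w: unit normal of the hyperplane;
    # r: unit-modulus complex vector acting as a rotation.
    h_p = h - np.vdot(w, h) * w           # project head onto the hyperplane
    t_p = t - np.vdot(w, t) * w           # project tail onto the hyperplane
    return np.linalg.norm(h_p * r - t_p)  # lower score = more plausible triple
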

Graph Neural Networks II

Frontmatter
Contextualise Entities and Relations: An Interaction Method for Knowledge Graph Completion

The incompleteness of Knowledge Graphs (KGs) stimulates substantial research on knowledge graph completion. However, current state-of-the-art embedding-based methods represent entities and relations in a semantically separated manner, overlooking the interacting semantics between them. In this paper, we introduce a novel entity-relation interaction mechanism, which learns entity and relation representations contextualised by each other. We obtain entity interaction embeddings by adopting a translation-distance-based method that projects entities into a relation-interacted semantic space, and we augment relation embeddings using a bi-linear projection. Built upon our interaction mechanism, we experiment with two decoders, namely a simple Feed-forward based Interaction Model (FIM) and a Convolutional network based Interaction Model (CIM). Through extensive experiments conducted on three benchmark datasets, we demonstrate the advantages of our interaction mechanism, with both decoders consistently achieving state-of-the-art performance.

Kai Chen, Ye Wang, Yitong Li, Aiping Li, Xiaojuan Zhao
Civil Unrest Event Forecasting Using Graphical and Sequential Neural Networks

Having the ability to forecast civil unrest events, such as violent protests, is crucial because they can lead to severe violent conflict and social instability. Civil unrest is the comprehensive consequence of multiple factors, which can be related to political, economic, cultural, and other types of historical events. Therefore, people naturally organize such historical data into time series and feed them into an RNN-like model to perform the forecasting. However, how to encode discrete historical information into a unified vector space is very important. Different events may have extensive and complex relationships in time, space, and participants. Traditional methods, such as collecting indicators of various fields as features, miss the vital correlation information between events. In this work, we propose a Graph Neural Network based model to learn the representation of correlated historical event information. Using dates, events, participants, and locations as nodes, we construct an event graph so that the relationships between events can be expressed unambiguously. We organize the date nodes’ representations into time-series data and use an LSTM to predict whether there will be a violent protest or demonstration in the next few days. In the experiments, we use historical events from Hong Kong to evaluate our system’s forecasting ability at 1-day, 2-day, and 3-day lead times. Our system achieves recall rates of 0.85, 0.86, and 0.88, and precision rates of 0.75, 0.77, and 0.75, respectively. We also discuss the impact of longer prediction lead times and of external events in Mainland China, the United States, and the United Kingdom on Hong Kong civil unrest event prediction.

Zheng Chen, Yifan Wang
Parameterized Hypercomplex Graph Neural Networks for Graph Classification

Despite recent advances in representation learning in hypercomplex (HC) space, this subject is still vastly unexplored in the context of graphs. Motivated by the complex and quaternion algebras, which have been found in several contexts to enable effective representation learning that inherently incorporates a weight-sharing mechanism, we develop graph neural networks that leverage the properties of hypercomplex feature transformation. In particular, in our proposed class of models, the multiplication rule specifying the algebra itself is inferred from the data during training. Given a fixed model architecture, we present empirical evidence that our proposed model incorporates a regularization effect, alleviating the risk of overfitting. We also show that for fixed model capacity, our proposed method outperforms its corresponding real-formulated GNN, providing additional confirmation for the enhanced expressivity of HC embeddings. Finally, we test our proposed hypercomplex GNN on several open graph benchmark datasets and show that our models reach state-of-the-art performance while consuming a much lower memory footprint with 70% fewer parameters. Our implementations are available at https://github.com/bayer-science-for-a-better-life/phc-gnn .

Tuan Le, Marco Bertolini, Frank Noé, Djork-Arné Clevert
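
The core of a parameterized hypercomplex layer, as we read the abstract, is a weight built from Kronecker products whose small "algebra" factors are learned from data: W = sum_i A_i ⊗ S_i. With n components, the layer needs roughly 1/n of the parameters of a dense real layer, matching the reported savings. Illustrative sketch only:

import torch
import torch.nn as nn

class PHMLinear(nn.Module):
    def __init__(self, n, in_features, out_features):
        super().__init__()
        assert in_features % n == 0 and out_features % n == 0
        self.A = nn.Parameter(torch.randn(n, n, n))  # learned multiplication rule
        self.S = nn.Parameter(
            torch.randn(n, out_features // n, in_features // n) * 0.02)

    def forward(self, x):
        # Assemble the full weight from n Kronecker products, then apply it.
        W = sum(torch.kron(self.A[i], self.S[i]) for i in range(self.A.shape[0]))
        return x @ W.T
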
Feature Interaction Based Graph Convolutional Networks for Image-Text Retrieval

To bridge the heterogeneous gap between visual and linguistic data in the image-text retrieval task, many methods have been proposed and significant progress has been made. Recently, some works have used more refined information about the relations between regions in images or the semantic connections between words in text to further improve the representations of text and image data, while the cross-modal relation between image regions and text words is not well explored in the representation. Current methods lack feature interaction in the data representation. For this purpose, we propose a novel image-text retrieval method which introduces inter-modal feature interaction into the graph convolutional networks (GCN) of image and text fragments. Through the feature interaction between fragments of different modalities and the information propagation of the GCN, the proposed method can capture more inter-modal interaction information for image-text retrieval. The experimental results on the MS COCO and Flickr30K datasets show that the proposed method outperforms state-of-the-art methods.

Yongli Hu, Feili Gao, Yanfeng Sun, Junbin Gao, Baocai Yin
Generalizing Message Passing Neural Networks to Heterophily Using Position Information

Message Passing Neural Networks (MPNNs) are a promising architecture for machine learning on graphs, which iteratively propagates information among nodes. Existing MPNN methods are best suited to homophily graphs, in which geometrically close nodes have similar features and class labels. However, real-world applications also contain graphs with heterophily, and the performance of MPNNs may be limited when dealing with them. We analyze the limitations of MPNNs when facing heterophily graphs and attribute them to the indistinguishability of nodes during aggregation and combination. To this end, we propose a method under the MPNN architecture called the Position Enhanced Message Passing model (PEMP), which endows each node with position information to make it distinguishable. Extensive experiments on nine real-world datasets show that our method achieves state-of-the-art performance on most heterophily graphs while preserving the performance of MPNNs on homophily graphs.

Wenzheng Zhang, Jie Liu, Liting Liu
Local and Non-local Context Graph Convolutional Networks for Skeleton-Based Action Recognition

Graph convolutional networks (GCNs) for skeleton-based action recognition have achieved considerable progress recently. However, two shortcomings remain unresolved. One is that the input data lack the high-level motion information of discriminant features. The other is that access to long-range action features is limited by the local sampling scale. In this work, we propose a new model called local and non-local context graph convolutional networks (LnLC-GCN). The first innovation is a motion-enhanced graph containing high-level motion information, which serves as the multi-stream input. Secondly, to overcome the limitations of the local receptive field, we present a local and non-local context module based on the global context mechanism. Moreover, we use two optimization strategies, front-end fusion and non-local context feedback, to further improve the accuracy of LnLC-GCN. To validate performance, numerous experiments were deployed on three public datasets, NTU-RGB+D 60 & 120 and Kinetics-Skeleton, demonstrating that our approach achieves state-of-the-art results.

Zikai Gao, Yang Zhao, Zhe Han, Kang Wang, Yong Dou
STGATP: A Spatio-Temporal Graph Attention Network for Long-Term Traffic Prediction

Traffic prediction is essential to public transportation management in cities. However, long-term traffic prediction involves complex, dynamically changing spatio-temporal correlations, which are highly challenging to capture in road networks. We focus on these dynamic correlations and propose a spatio-temporal graph modeling method to solve the long-term traffic prediction problem. Our proposed method builds a Spatio-Temporal Graph Attention network for Traffic Prediction (STGATP), exploring and capturing the complex spatio-temporal nature of traffic networks. We apply dilated causal convolution with gated fusion in the temporal modeling block, and diffusion convolution with an attention mechanism in the spatial modeling block. As a result, STGATP can simultaneously capture spatial and temporal dependencies in road networks. Finally, we conduct experiments on the public traffic datasets METR-LA and PEMS-BAY, and our method reaches superior performance. In particular, STGATP surpasses state-of-the-art methods by up to an 11% improvement in RMSE on the PEMS-BAY dataset.

Mengting Zhu, Xianqiang Zhu, Cheng Zhu
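
The temporal block's gated dilated causal convolution is a standard construction that can be sketched compactly (details such as kernel size are assumed here, not taken from the paper): left-only padding keeps the convolution causal, and a tanh/sigmoid pair gates the output.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedCausalConv(nn.Module):
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # pad on the left only
        self.filt = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                        # x: (B, C, T)
        x = F.pad(x, (self.pad, 0))              # no peeking at future steps
        return torch.tanh(self.filt(x)) * torch.sigmoid(self.gate(x))
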

Hierarchical and Ensemble Models

Frontmatter
Integrating N-Gram Features into Pre-trained Model: A Novel Ensemble Model for Multi-target Stance Detection

Multi-target stance detection in tweets aims to detect the stance of given texts towards a specific target entity. Most existing models for stance detection take word embeddings as input; however, recent developments have pointed out that it would be beneficial to incorporate feature-based information appropriately. Motivated by the strong performance of pre-trained models in many Natural Language Processing fields, and by n-gram features that have been proven effective in prior competitions, we present a novel combination module that obtains both advantages. This paper proposes a pre-trained model integrated with an n-gram feature module (PMINFM) to better utilize multi-scale feature representations and semantic features, which is then connected to a Bidirectional Long Short-Term Memory network with a target-specific attention mechanism. The experimental results show that our proposed model outperforms other baseline models on the SemEval-2016 stance detection dataset and achieves state-of-the-art performance.

Pengyuan Chen, Kai Ye, Xiaohui Cui
Hierarchical Ensemble for Multi-view Clustering

Multi-view clustering is a challenging task due to the distinct feature distributions among different views. To permit complementarity while exploiting consistency among views, some multi-layer models have been developed. These models usually enforce a consistent representation on the top layer for clustering purposes while allowing the other layers to represent various attributes of the multi-view data. However, a single consistent layer is often insufficient, especially for complicated real-world tasks. In addition, the existing models often represent different views using the same number of layers without taking the various levels of complexity of different views into account. Furthermore, different views are often considered equal in the clustering process, which does not necessarily hold in many applications. To address these issues, in this paper, we present a hierarchical ensemble framework for multi-view clustering (HEMVC). It is superior to the existing methods in three facets. Firstly, HEMVC allows different views to share more than one consistent layer and implements ensemble clustering on all shared layers. Secondly, it facilitates an adaptive clustering scheme by automatically quantifying the contribution of each layer and each view in the ensemble learning process. Thirdly, it represents different views using different numbers of layers to accommodate the varying complexities of different views. To realize HEMVC, a two-stage algorithm has been derived and implemented. The experimental results on five benchmark datasets demonstrate its performance in comparison with state-of-the-art methods.

Fei Gao, Liu Yang
Structure-Aware Multi-scale Hierarchical Graph Convolutional Network for Skeleton Action Recognition

In recent years, graph convolutional neural networks (GCNNs) have achieved the most advanced results in skeleton action recognition tasks. However, existing models mainly focus on extracting local information at the joint level and part level, but ignore the global information at the frame level and the relevance between multiple levels, which leads to the loss of hierarchical information. Moreover, these models consider the non-physical connection relationships between nodes but neglect the dependence between body parts. This loss of topology information directly results in poor model performance. In this paper, we propose a structure-aware multi-scale hierarchical graph convolutional network (SAMS-HGCN) model, which includes two modules: a structure-aware hierarchical graph pooling block (SA-HGP Block) and a multi-scale fusion module (MSF module). Specifically, the SA-HGP Block establishes a hierarchical network that captures the topological information of multiple levels using the hierarchical graph pooling (HGP) operation and models the dependence among parts via the structure-aware learning (SA Learning) operation. The MSF module fuses information of different scales at each level to obtain multi-scale global structural information. Experiments show that our method achieves performance comparable to state-of-the-art methods on the NTU-RGB+D and Kinetics-Skeleton datasets.

Changxiang He, Shuting Liu, Ying Zhao, Xiaofei Qin, Jiayuan Zeng, Xuedian Zhang
Learning Hierarchical Reasoning for Text-Based Visual Question Answering

The text-based visual question answering (TextVQA) task answers questions based on the objects and text in an image, which involves joint reasoning over three modalities: the question, visual objects, and text in the image. Recent approaches to TextVQA treat the three modalities as joint input to transformers. However, these implicit reasoning methods do not make full use of multi-modal information, especially the visual modality. To this end, we propose a novel model for TextVQA that reasons explicitly in a human-like manner. Firstly, the relevance between the different objects and the question is obtained. Then, the object modality is fused into the text modality, weighted by the obtained relevance. Finally, the amended text modality is used to predict the answer. In contrast to the previous free-fusion multi-modal strategy, our method makes the reasoning process more explicit and robust. Moreover, a prior-based loss is proposed to constrain object-question relevance. Extensive experimental results on several benchmark datasets demonstrate the superior performance of our hierarchical reasoning framework over current state-of-the-art methods.

Caiyuan Li, Qinyi Du, Qingqing Wang, Yaohui Jin
Hierarchical Deep Gaussian Processes Latent Variable Model via Expectation Propagation

Gaussian Processes (GPs) and related unsupervised learning techniques such as Gaussian Process Latent Variable Models (GP-LVMs) have been very successful in the accurate modeling of high-dimensional data based on limited amounts of training data. Usually these techniques have the disadvantage of a high computational complexity. This makes it difficult to solve the associated learning problems for complex hierarchical models and large data sets, since the related computations, as opposed to neural networks, are not node-local. Combining sparse approximation techniques for GPs and Power Expectation Propagation, we present a framework for the computationally efficient implementation of hierarchical deep Gaussian process (latent variable) models. We provide implementations of this approach on the GPU as well as on the CPU, and we benchmark efficiency comparing different optimization algorithms. We present the first implementation of such deep hierarchical GP-LVMs and demonstrate the computational efficiency of our GPU implementation.

Nick Taubert, Martin A. Giese
Adaptive Consensus-Based Ensemble for Improved Deep Learning Inference Cost

Deep learning models are continuously improving the state-of-the-art in nearly every domain, achieving increased levels of accuracy. To sustain this performance, however, these models have become larger and more computationally intensive at a staggering rate. Using an ensemble of deep learning models to improve accuracy (in comparison to running a single model) is a well-known approach, but using it in real-world settings is challenging due to its exorbitant inference cost. In this paper we present a novel method for reducing the cost associated with an ensemble of models by approximately 50% on average while maintaining comparable accuracy. The proposed method is simple to implement and fully agnostic to the model and the problem domain. The experimental results presented demonstrate that our method can be used in a number of configurations, all of which provide much better “performance per cost” than standard ensembles, whether using an ensemble of N instances of the same model architecture (trained from scratch each time) or an ensemble of completely different models.

Nelly David, Nathan S. Netanyahu
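
One plausible reading of the consensus mechanism, as a hypothetical sketch (the stopping rule is our assumption): query ensemble members one at a time and stop as soon as enough of the models seen so far agree, so easy inputs never pay for the full ensemble.

import numpy as np

def consensus_predict(models, x, min_votes=2):
    # models: list of callables returning class scores, cheapest first
    votes = []
    for model in models:
        votes.append(int(np.argmax(model(x))))
        best = max(set(votes), key=votes.count)
        if votes.count(best) >= min_votes:       # early exit on agreement
            return best
    return max(set(votes), key=votes.count)      # fall back to majority vote
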

Human Pose Estimation

Frontmatter
Multi-Branch Network for Small Human Pose Estimation

The task of 2D human pose estimation aims to obtain the positions of the body’s articulation points in a picture; it is the basis for many other tasks. However, current human pose estimation networks have shortcomings when dealing with small objects. Due to the ambiguity caused by inadequate expansion, a small human object contains insufficient semantic information, so the prediction of coordinate points becomes imprecise. In this paper, to address the problem of small human pose estimation, we present a novel network structure called the Multi-Branch Network (MBN), consisting of three modules: a Multi-Branch Expansion Module (MBEM), a Multi-Branch Downsample Module (MBDM), and a Refine Module (RM). The MBEM reduces the input bias before the image enters the backbone network. The MBDM adds an extra downsampling branch to obtain richer semantic information. The RM locates hard joints through a refinement operation. Experimental studies on the COCO benchmark show that our approach gains noteworthy improvements over the state-of-the-art single-stage models ResNet and RSN.

Yuchen Ge, Zhongqiu Zhao, Yao Gao, Weidong Tian, Hai Min
PNO: Personalized Network Optimization for Human Pose and Shape Reconstruction

Most previous human pose and shape reconstruction methods focus on generalization ability and learn a prior of the general pose and shape; however, personalized features are often ignored. We argue that personalized features such as appearance and body shape are consistent for a specific person and can further improve accuracy. In this paper, we propose a Personalized Network Optimization (PNO) method to maintain both generalization and personality for human pose and shape reconstruction. The generally trained network is adapted to a personalized network by optimizing with only a few unlabeled video frames of the target person. Moreover, we propose geometry-aware temporal constraints that help the network better exploit the geometric knowledge of the target person. To prove the effectiveness of PNO, we re-design the benchmark of pose and shape reconstruction to test on each person independently. Experiments show that our method achieves state-of-the-art results on both the 3DPW and MPI-INF-3DHP datasets.

Zhijie Cao, Min Wang, Shanyan Guan, Wentao Liu, Chen Qian, Lizhuang Ma
JointPose: Jointly Optimizing Evolutionary Data Augmentation and Prediction Neural Network for 3D Human Pose Estimation

3D human pose estimation plays important roles in various human-machine interactive applications, but the lack of diversity in existing labeled 3D human pose datasets restricts the generalization ability of deep learning based models. Data augmentation is therefore an important method to solve this problem. However, data augmentation and pose estimation network training are usually treated as two isolated processes, limiting the performance of the pose estimation network. In this paper, we develop an improved data augmentation method which jointly performs pose estimation network training and data augmentation by designing a reward/penalty strategy for effective joint training, making model training and data augmentation improve each other. In particular, an improved evolutionary data augmentation method is proposed to generate the distribution of nodes in crossover and of rotation angles in mutation through the process of evolution. Extensive experiments show that our approach not only significantly improves state-of-the-art models without additional data effort but is also extremely competitive with other advanced methods.

Zhiwei Yuan, Songlin Du
DeepRehab: Real Time Pose Estimation on the Edge for Knee Injury Rehabilitation

Human pose estimation is a crucial step towards understanding and characterizing people’s behavior in images and videos. Current state-of-the-art results on human pose estimation were achieved by large Deep Learning models that are restricted to cloud computing for real-time applications. However, with the development of edge computing, Deep Learning is moving more from the cloud to the edge. In this work we present DeepRehab, a Deep Learning based 2D pose estimator optimized for Edge TPU processing. We first improve an existing Edge TPU compatible model named PoseNet by refining its predictions with filtering methods. Subsequently, as the performance of the filters is limited by the model’s inaccuracies, specifically on the lower body parts, we developed DeepRehab, trained on 23 keypoints from the COCO-WholeBody dataset. We achieve 0.65 AP with DeepRehab and quantize it such that, losing only 3% AP, it runs at 15 FPS on the Coral USB Accelerator, which suits real-time evaluation.

Bruno Carlos Dos Santos Melício, Gábor Baranyi, Zsófia Gaál, Sohil Zidan, András Lőrincz
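
The prediction filtering mentioned above can be as simple as an exponential moving average over time; the sketch below is an assumed stand-in, not necessarily the filters used in the paper. Smoothing suppresses frame-to-frame jitter in the keypoint predictions.

import numpy as np

def smooth_keypoints(frames, alpha=0.6):
    # frames: (T, K, 2) array of K keypoint coordinates over T frames
    out = np.empty_like(frames, dtype=float)
    out[0] = frames[0]
    for t in range(1, len(frames)):
        out[t] = alpha * frames[t] + (1 - alpha) * out[t - 1]
    return out
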

Image Processing

Frontmatter
Subspace Constraint for Single Image Super-Resolution

Recently, single image super-resolution (SISR) algorithms based on convolutional neural networks (CNN) have proliferated and achieved significant success. However, most of them apply the same constraint to both low-frequency and high-frequency features in the loss function. They do not discriminate between high-frequency details and low-frequency information, which limits the representation capacity of high-frequency information. This paper presents a subspace constraint approach for SISR that discriminates between high-frequency and low-frequency information and enhances the reconstruction of high-frequency features. In our approach, the constraint is introduced in the wavelet domain. Meanwhile, our approach adopts multi-level residual learning to improve training efficiency. Extensive experimental results on five benchmark datasets show that our model is superior to state-of-the-art methods in both accuracy and visual comparisons.

Yanlin Zhang, Ding Qin, Xiaodong Gu
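
A sketch of what a wavelet-domain subspace constraint can look like (our reconstruction, with the Haar transform and the weighting chosen for illustration): split super-resolved and ground-truth images into subbands and penalize errors in the high-frequency subbands more heavily.

import torch
import torch.nn.functional as F

def haar_subbands(x):
    # x: (B, C, H, W) with even H and W; one-level Haar analysis
    a, b = x[..., ::2, ::2], x[..., ::2, 1::2]
    c, d = x[..., 1::2, ::2], x[..., 1::2, 1::2]
    return ((a + b + c + d) / 4,   # LL: low-frequency approximation
            (a + b - c - d) / 4,   # LH
            (a - b + c - d) / 4,   # HL
            (a - b - c + d) / 4)   # HH: diagonal details

def subspace_loss(sr, hr, w_high=2.0):
    loss = 0.0
    for k, (s, h) in enumerate(zip(haar_subbands(sr), haar_subbands(hr))):
        weight = 1.0 if k == 0 else w_high  # emphasize detail subbands
        loss = loss + weight * F.l1_loss(s, h)
    return loss
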
Towards Fine-Grained Control over Latent Space for Unpaired Image-to-Image Translation

We address the open problem of unpaired image-to-image (I2I) translation using a generative model with fine-grained control over the latent space. The goal is to learn the conditional distribution of translated images given images from a source domain without access to the joint distribution. Previous works, such as MUNIT and DRIT, which simply keep the content latent codes and exchange the style latent codes, generate images of inferior quality. In this paper, we propose a new framework for unpaired I2I translation. Our framework first assumes that the latent space can be decomposed into content and style sub-spaces. Instead of naively exchanging style codes when translating, our framework uses an interpolator that guides the transformation and is able to produce intermediate results under different strengths of translation. Domain-specific information, which might still exist in content codes, is excluded in our framework. Extensive experiments show that the images translated by our framework are superior or comparable to state-of-the-art baselines. Code is available upon publication.

Lei Luo, William Hsu, Shangxian Wang
FMSNet: Underwater Image Restoration by Learning from a Synthesized Dataset

Underwater images suffer from various degradations, which can significantly lower visual quality and the accuracy of subsequent applications. Moreover, artificial light sources tend to invalidate many image restoration algorithms. In this paper, an underwater image restoration (UIR) method using a novel Convolutional Neural Network (CNN) architecture and a synthesized underwater dataset is proposed. We discuss the reason for the over-enhancement that exists in current UIR methods and revise the underwater image formation model (IFM) to alleviate the problem. With the revised IFM, we propose an underwater image synthesizing method that can create a realistic underwater dataset. In order to effectively conduct end-to-end supervised learning, we design a network based on the characteristics of image restoration tasks, namely FMSNet. Different from existing networks, the decomposition and fusion operations in FMSNet can process the feature maps more efficiently and improve contrast more prominently. The UIR method built on FMSNet can directly recover degraded underwater images without any pre-processing or post-processing. The experimental results indicate that FMSNet performs favorably against widely used network architectures and our UIR method outperforms state-of-the-art methods in both qualitative and quantitative evaluations. Compared with the original underwater images, experiments on a subsequent task show that 285% more feature points can be detected in the images restored by our method.

Xiangyu Yin, Xiaohong Liu, Huan Liu
Towards Measuring Bias in Image Classification

Convolutional Neural Networks (CNNs) have become the de facto state-of-the-art for the main computer vision tasks. However, due to their complex underlying structure, their decisions are hard to understand, which limits their use in some industrial contexts. A common and hard-to-detect challenge in machine learning (ML) tasks is data bias. In this work, we present a systematic approach to uncover data bias by means of attribution maps. For this purpose, an artificial dataset with a known bias is first created and used to train intentionally biased CNNs. The networks’ decisions are then inspected using attribution maps. Finally, meaningful metrics are used to measure the attribution maps’ representativeness with respect to the known bias. The proposed study shows that some attribution map techniques highlight the presence of bias in the data better than others, and that metrics can support the identification of bias.

Nina Schaaf, Omar de Mitri, Hang Beom Kim, Alexander Windberger, Marco F. Huber
Towards Image Retrieval with Noisy Labels via Non-deterministic Features

For large image retrieval applications, collecting images from the Internet is a relatively low-cost way to obtain labeled training data. However, images from the Internet are often falsely annotated, which may lead to unsatisfactory performance of trained models. To alleviate the impact of label noise, our work adopts non-deterministic features for training deep convolutional neural networks. We suppose that the non-deterministic features of images obey a multi-dimensional Gaussian distribution, so a feature is represented by its mean and variance. During training, when images are mapped to non-deterministic features, the model tends to assign large variance to mislabeled samples, signaling low confidence, instead of causing negative optimization by directly shifting the estimated mean. During testing, the mean values of the non-deterministic features are used for image retrieval. This method increases redundancy by adding a variance term to the features, so the model has a stronger tolerance to label noise. Experimental results and analysis show that on datasets with label noise, using non-deterministic features leads to better image retrieval results than using deterministic features.

Hengwei Liu, Jinyu Ma, Xiaodong Gu
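
The variance mechanism can be made concrete with a Gaussian negative log-likelihood over embeddings, a minimal sketch under our assumptions (the paper's exact objective may differ): the backbone emits a mean and log-variance per dimension, large variance down-weights the error of suspect samples, and only the mean is used at retrieval time.

import torch

def uncertain_embedding_loss(mu, log_var, target):
    # mu, log_var, target: (B, D) feature means, log-variances, and targets
    inv_var = torch.exp(-log_var)
    # A mislabeled sample can absorb its error into a large variance instead
    # of dragging the mean estimate toward the wrong target; the log_var
    # term stops the trivial solution of infinite variance.
    return (inv_var * (mu - target) ** 2 + log_var).mean()
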

Image Segmentation

Frontmatter
Improving Visual Question Answering by Semantic Segmentation

Most recent visual question answering (VQA) methods extract object regions (bounding-boxes) by Faster R-CNN and use these region features in the visual encoder. Because extracted bounding-boxes are often located around things (countable objects), information on stuff (amorphous background regions such as grass and sky) is not reflected well in the visual encoder. Because stuff is amorphous and uncountable, it is common to use semantic segmentation to extract its features. In this work, we extend conventional thing-centric regions-of-interest (ROIs) by adding ROIs distributed around stuff regions and use semantic segmentation labels to encode stuff features in the visual encoder. The results of our experiments revealed that our method improved on existing VQA models and produced state-of-the-art results on VQA-v2 val, even though this dataset was not designed specifically for evaluating stuff, and most of its questions are thing-centric.

Viet-Quoc Pham, Nao Mishima, Toshiaki Nakasu
Weakly Supervised Semantic Segmentation with Patch-Based Metric Learning Enhancement

Weakly supervised semantic segmentation (WSSS) methods are more flexible and less costly than supervised ones since no pixel-level annotation is required. Class activation maps (CAMs) are commonly used in existing WSSS methods with image-level annotations to identify seed localization cues. However, as CAMs are obtained from a classification network that mainly focuses on the most discriminative parts of an object, less discriminative parts may be ignored and not identified. This study aims to improve the classification network's local visual understanding of objects by considering an additional metric learning task on patches sampled from each CAM-based object proposal. As the patches contain different object parts and surrounding backgrounds, not only the most discriminative object parts but the entire objects are learned by leveraging patch similarity. After the joint training process with the proposed patch-based metric learning and classification tasks, we expect more discriminative local features to be learned by the backbone network. As a result, more complete class-specific regions of an object can be identified. Extensive experiments on the PASCAL VOC 2012 dataset validate the superiority of our method, which achieves improvements over state-of-the-art methods.

Patrick P. K. Chan, Keke Chen, Linyi Xu, Xiaoman Hu, Daniel S. Yeung
ComBiNet: Compact Convolutional Bayesian Neural Network for Image Segmentation

Fully convolutional U-shaped neural networks have largely been the dominant approach for pixel-wise image segmentation. In this work, we tackle two defects that hinder their deployment in real-world applications: 1) predictions lack the uncertainty quantification that may be crucial to many decision-making systems; 2) large memory storage and computational consumption demand extensive hardware resources. To address these issues and improve their practicality, we demonstrate a few-parameter compact Bayesian convolutional architecture that achieves a marginal improvement in accuracy in comparison to related work while using significantly fewer parameters and compute operations. The architecture combines parameter-efficient operations such as separable convolutions, bilinear interpolation, multi-scale feature propagation, and Bayesian inference for per-pixel uncertainty quantification through Monte Carlo Dropout. The best performing configurations required fewer than 2.5 million parameters on diverse challenging datasets with few observations.

Martin Ferianc, Divyansh Manocha, Hongxiang Fan, Miguel Rodrigues
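
Per-pixel uncertainty via Monte Carlo Dropout follows the standard recipe the abstract refers to; the sketch below is generic, not ComBiNet-specific: keep dropout stochastic at test time, average several forward passes, and report per-pixel predictive entropy.

import torch

@torch.no_grad()
def mc_dropout_segment(model, x, n_samples=20):
    model.train()  # keeps dropout active (assumes no batch-norm layers)
    probs = torch.stack([torch.softmax(model(x), dim=1)
                         for _ in range(n_samples)]).mean(dim=0)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1)  # (B, H, W)
    return probs.argmax(dim=1), entropy  # labels and per-pixel uncertainty
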
Depth Mapping Hybrid Deep Learning Method for Optic Disc and Cup Segmentation on Stereoscopic Ocular Fundus

Optic disc and cup segmentation on ocular fundus images is an important prerequisite for diagnosing glaucoma. For the segmentation of the optic disc (OD) and optic cup (OC), many previously proposed deep learning methods utilize monoscopic-view images that lack spatial depth information, limiting their diagnostic ability and overall performance. According to ophthalmologists’ clinical insights, stereoscopic views of the ocular fundus hold great potential for improving optic cup segmentation. We propose a depth mapping hybrid (DeMaH) deep learning method that effectively adopts depth mappings to segment the OD and OC (ODC) on ocular fundus images. Experimental results demonstrate that our method achieves significant improvements in ODC segmentation, especially OC segmentation, validating the effectiveness of incorporating clinical prior knowledge.

Gang Yang, Yunfeng Du, Yanni Wang, Donghong Li, Dayong Ding, Jingyuan Yang, Gangwei Cheng
RATS: Robust Automated Tracking and Segmentation of Similar Instances

Continuous identification of objects with identical appearance is crucial to analyzing the behavior of laboratory animals. Most existing methods attempt to avoid this problem by excluding direct social interactions or facilitate it with implants or artificial markers. Unfortunately, these techniques may distort the results, as they can affect the behavior of the observed animals. In this paper, we present a simple, deep learning-based approach that can overcome these problems by providing reliable segmentation and tracking of similar instances. Since instance tracking must be free of mistakes, recognizing frames where the system cannot reliably locate the objects and flagging them for human supervision is central to the system. Manual annotation of these data quickly improves tracking and decreases annotation needs. The proposed method achieves higher segmentation accuracy and more stable tracking than previous methods despite requiring only a small set of manually annotated data.

László Kopácsi, Árpád Dobolyi, Áron Fóthi, Dávid Keller, Viktor Varga, András Lőrincz

Knowledge Distillation

Frontmatter
Data Diversification Revisited: Why Does It Work?

Data Diversification is a recently proposed data augmentation method for Neural Machine Translation (NMT). While it has attracted broad attention due to its effectiveness, the reason for its success is unclear. In this paper, we first establish a connection between data diversification and knowledge distillation, and prove that data diversification reduces modality complexity. We also find that knowledge distillation yields a lower data modality complexity than data diversification, yet struggles to boost performance. Our analysis reveals that knowledge distillation has a negative impact on the word frequency distribution, increasing the number of rare words with unreliable representations. Furthermore, data diversification trains multiple models to further decrease modality complexity, at an unbearable computational expense. To reduce the computational cost, we propose adjustable sampling, which samples a model multiple times instead of training multiple models. Different from other sampling methods, our method introduces entropy to adjust the quality and diversity of the generated sentences, achieving the goal of reducing modality complexity while limiting noise introduction. Extensive experimental results show our method dramatically reduces the computational cost of data diversification without loss of accuracy, and achieves improvements over other strong sampling methods.

Yuheng Song, Tianyi Liu, Weijia Jia
A Generalized Meta-loss Function for Distillation Based Learning Using Privileged Information for Classification and Regression

Learning using privileged information (LUPI) is a powerful heterogeneous-feature-space machine learning framework that allows models to learn from highly informative (privileged) features which are available during training only. These models then generate test predictions using input-space features which are available during both training and testing. LUPI can significantly improve prediction performance in a variety of machine learning problems. However, existing large-margin and neural network implementations of learning using privileged information are mostly designed for classification tasks. In this work, we propose a simple yet effective formulation that allows general application of LUPI to classification, regression, and other related problems. We have verified the correctness, applicability, and effectiveness of our method on regression and classification tasks over various synthetic and real-world problems. To test the usefulness of the proposed model in real-world settings, we have further evaluated our method on the problem of protein binding affinity prediction, where the proposed scheme has been shown to outperform the current state-of-the-art predictor.

Amina Asif, Muhammad Dawood, Fayyaz ul Amir Afsar Minhas
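
A generic distillation-style LUPI objective consistent with the abstract can be written in a few lines (lam is an assumed trade-off weight; swap the MSE task term for cross-entropy in classification): the student sees only input-space features, while the teacher was trained on privileged features available at training time only.

import torch.nn.functional as F

def lupi_loss(student_out, teacher_out, y, lam=0.5):
    task = F.mse_loss(student_out, y)                 # supervision from labels
    imitation = F.mse_loss(student_out, teacher_out)  # mimic privileged teacher
    return (1 - lam) * task + lam * imitation
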
Empirical Study of Data-Free Iterative Knowledge Distillation

Iterative Knowledge Distillation (IKD) [20] is an iterative variant of Hinton’s knowledge distillation framework for deep neural network compression. IKD has shown promising model compression results for image classification tasks where a large amount of training data is available for training the teacher and student models. In this paper, we consider problems where training data is not available, making it impractical to use the usual IKD approach. We propose a variant of the IKD framework, called Data-Free IKD (or DF-IKD), that adopts recent results from data-free learning of deep models [2]. This exploits generative adversarial networks (GANs), in which a readily available pre-trained teacher model is regarded as a fixed discriminator, and a generator (a deep network) is used to generate training samples. The goal of the generator is to generate samples that can obtain a maximum predictive response from the discriminator. In DF-IKD, the student model at every IKD iteration is a compressed version of the original discriminator (‘teacher’). Our experiments suggest: (a) DF-IKD results in a student model that is significantly smaller in size than the original parent model; (b) the predictive performance of the compressed student model is comparable to that of the parent model.

Het Shah, Ashwin Vaswani, Tirtharaj Dash, Ramya Hebbalaguppe, Ashwin Srinivasan
Adversarial Variational Knowledge Distillation

Knowledge Distillation (KD) is one of the most popular and effective techniques for model compression and knowledge transfer. However, most existing KD approaches rely heavily on labeled training data, which is often unavailable due to privacy concerns. Data-free KD therefore focuses on reconstructing the training data with Generative Adversarial Networks (GANs), by either catering to the pre-trained teacher or fooling the student. In this paper we introduce Adversarial Variational Knowledge Distillation (AVKD), a framework that formulates the reconstruction process as a Variational Autoencoder (VAE). Unlike vanilla VAEs, AVKD is specified by a pre-trained teacher model $p(y|x)$ of the visible labels $y$ given the latent $x$, a prior $p(x)$ over the latent variables, and an approximate generative model $q(x|y)$. In practice, we take the prior $p(x)$ to be an alternative unlabeled data distribution from other related domains. Following Adversarial Variational Bayes (AVB), we estimate the KL-divergence term between $p(x)$ and $q(x|y)$ by introducing a discriminator network. Although the original training data are unavailable, we argue that prior data drawn from other related domains can be obtained easily, enabling efficient knowledge distillation. Extensive experiments confirm that our method outperforms state-of-the-art algorithms in the absence of the original training data, with performance approaching the case where the original training data are provided.
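The AVB-style KL estimation can be sketched as follows: a discriminator is trained to separate prior samples from generated samples, and its logit on generated samples provides a plug-in KL estimate. This is the standard AVB construction; the network shapes and optimizer wiring are assumptions:

```python
import torch
import torch.nn as nn

def avb_kl_step(disc, d_opt, x_prior, x_q):
    """One discriminator update plus a plug-in KL(q || p) estimate.

    disc: nn.Module mapping samples to a scalar logit. At optimality the
    logit equals log q(x) - log p(x), so its mean under q estimates the KL.
    x_prior: batch from the prior p(x) (e.g. related-domain data);
    x_q: batch produced by the generative model q(x|y).
    """
    bce = nn.BCEWithLogitsLoss()
    logit_q, logit_p = disc(x_q), disc(x_prior)
    d_loss = (bce(logit_q, torch.ones_like(logit_q))
              + bce(logit_p, torch.zeros_like(logit_p)))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()
    with torch.no_grad():
        kl_estimate = disc(x_q).mean()       # E_q[T(x)] ≈ KL(q || p)
    return kl_estimate
```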

Xuan Tang, Tong Lin
Extract then Distill: Efficient and Effective Task-Agnostic BERT Distillation

Task-agnostic knowledge distillation, a teacher-student framework, has proved effective for BERT compression. Although achieving promising results on NLP tasks, it requires enormous computational resources. In this paper, we propose Extract Then Distill (ETD), a generic and flexible strategy for reusing the teacher's parameters for efficient and effective task-agnostic distillation, applicable to students of any size. Specifically, we introduce two variants, ETD-Rand and ETD-Impt, which extract the teacher's parameters randomly and according to an importance metric, respectively. In this way, the student has already acquired some knowledge at the beginning of the distillation process, which makes distillation converge faster. We demonstrate the effectiveness of ETD on the GLUE benchmark and SQuAD. The experimental results show that: (1) compared with the baseline without the ETD strategy, ETD saves 70% of the computation cost, and achieves better results than the baseline when using the same computing resources; (2) ETD is generic and effective for different distillation methods (e.g., TinyBERT and MiniLM) and students of different sizes. Code is available at https://github.com/huawei-noah/Pretrained-Language-Model .
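A sketch of the extraction idea for a single linear layer; the weight-norm importance score below is an illustrative assumption (ETD-Impt's actual metric may differ), and passing random scores reproduces ETD-Rand:

```python
import torch
import torch.nn as nn

def extract_linear(teacher_fc: nn.Linear, keep_out: int, importance=None):
    """Initialize a narrower student layer from the teacher's parameters.

    importance: per-output-neuron scores; the top-`keep_out` neurons are
    kept. Defaults to weight norms (an assumption, not the paper's metric).
    """
    w = teacher_fc.weight.data                      # [out_features, in_features]
    if importance is None:
        importance = w.norm(dim=1)                  # score each output neuron
    idx = importance.topk(keep_out).indices
    student_fc = nn.Linear(teacher_fc.in_features, keep_out,
                           bias=teacher_fc.bias is not None)
    student_fc.weight.data.copy_(w[idx])
    if teacher_fc.bias is not None:
        student_fc.bias.data.copy_(teacher_fc.bias.data[idx])
    return student_fc

# ETD-Rand variant: extract_linear(fc, k, importance=torch.rand(fc.out_features))
```

Because the student starts from extracted teacher weights rather than a random initialization, the subsequent distillation has less ground to cover, which is the mechanism behind the reported convergence speed-up.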

Cheng Chen, Yichun Yin, Lifeng Shang, Zhi Wang, Xin Jiang, Xiao Chen, Qun Liu

Medical Image Processing

Frontmatter
Semi-supervised Learning Based Right Ventricle Segmentation Using Deep Convolutional Boltzmann Machine Shape Model

Automated Right Ventricle (RV) segmentation is challenging due to the RV's variable shape and the scarcity of labelled data. This paper proposes a semi-supervised learning method based on a convolutional deep Boltzmann machine (CDBM). The CDBM is trained with short-run MCMC to learn the complex shape of the RV. A semi-supervised network composed of the CDBM and two CNNs is then constructed. The CNNs and the CDBM are trained alternately: labelled data are used to train the CNNs, while the CDBM reconstructs the CNNs' predictions on unlabelled data to further guide their training; the CDBM itself is updated during this procedure. The main idea is to extract the shape information of the RV and use it to improve the CNNs' performance. Our approach helps avoid overfitting, requires less labelled data, and adds no extra computational cost or parameters during inference. Experimental results show that our approach improves segmentation accuracy when labelled training data are scarce.
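A hedged training skeleton for this alternating scheme; `cdbm.reconstruct` and `cdbm.update` are hypothetical methods standing in for the shape-model projection and the CDBM's own learning step, and the loss composition is an assumption:

```python
def semi_supervised_epoch(cnn, cdbm, labeled, unlabeled, opt, seg_loss, shape_loss):
    """One epoch of the alternating CNN/CDBM scheme (sketch).

    labeled yields (image, mask) pairs; unlabeled yields images only.
    The CDBM projects unlabeled predictions onto the learned RV shape
    manifold, and that reconstruction supervises the CNN in turn.
    """
    for (x, y), u in zip(labeled, unlabeled):
        pred_l = cnn(x)                                   # supervised branch
        pred_u = cnn(u)                                   # unsupervised branch
        shape_target = cdbm.reconstruct(pred_u.detach())  # shape-regularized mask
        loss = seg_loss(pred_l, y) + shape_loss(pred_u, shape_target)
        opt.zero_grad(); loss.backward(); opt.step()
        cdbm.update(pred_l.detach(), y)   # hypothetical alternating CDBM step
```

Note the CDBM appears only in training; at inference the CNN runs alone, which is why no extra cost or parameters are incurred.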

Kaimin Liao, Ziyu Gan, Xuan Yang
Improved U-Net for Plaque Segmentation of Intracoronary Optical Coherence Tomography Images

Optical coherence tomography (OCT) has been widely used in the assessment of coronary atherosclerotic plaques. Traditional machine learning methods rely mainly on image texture features for plaque segmentation. However, texture features only capture local information, which may lead to unsatisfactory results. U-Net and its improved versions use successive convolution and pooling to extract higher-level features, resulting in the loss of spatial information and low plaque segmentation accuracy. This paper introduces a spatial pyramid pooling module and a multi-scale dilated convolution module into U-Net to capture higher-level features while retaining sufficient spatial information. With our method, the F1 scores for the four classes (fibrosis, calcification, lipid, and background) are 0.85, 0.81, 0.80, and 0.99, and the mIOU is 0.7663. Compared to other state-of-the-art methods, our method achieves better plaque segmentation accuracy.
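A minimal sketch of a multi-scale dilated convolution block of the kind described; the dilation rates and 1x1 fusion are common defaults, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class MultiScaleDilated(nn.Module):
    """Parallel dilated 3x3 branches fused by a 1x1 convolution (sketch)."""
    def __init__(self, ch, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, padding=r, dilation=r) for r in rates)
        self.fuse = nn.Conv2d(ch * len(rates), ch, 1)

    def forward(self, x):
        # Dilation widens the receptive field without pooling, so the spatial
        # detail needed for thin plaque boundaries is preserved.
        return self.fuse(torch.cat([b(x) for b in self.branches], dim=1))
```

Dropping such a block into a U-Net encoder is what lets the network see larger context without the downsampling that erodes boundary accuracy.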

Xinyu Cao, Jiawei Zheng, Zhe Liu, Peilin Jiang, Dengfeng Gao, Rong Ma
Approximated Masked Global Context Network for Skin Lesion Segmentation

The number of skin cancer cases worldwide is increasing by millions every year. The resulting caseload puts great pressure on diagnosis and treatment, so automatic segmentation of skin lesions is urgently needed to aid diagnosis and the evaluation of recovery. Automatic skin lesion segmentation still faces challenges, including blurred and irregular lesion boundaries, low contrast between the lesion and surrounding skin, and interference from bubbles, lighting, and hair. We found that modeling contextual relationships with a consistent masked global context allows the network to focus tightly on the lesion region. Based on this observation, we propose an approximated masked global context network (AMGC-Net). It first constructs an approximation of the masked global context, then computes the similarity between each pixel and this approximated global information at the spatial level to form a gating coefficient matrix, and finally captures dependencies between channels to improve segmentation performance. AMGC-Net is assessed on three public skin challenge datasets: PH2, ISBI2016, and ISIC2018. It achieves state-of-the-art sensitivity compared with recent methods.
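A hedged sketch of the spatial gating step: a soft mask (here predicted by a 1x1 convolution, which is an assumption) pools a masked global context vector, and per-pixel similarity to that vector gates the features. The paper's construction of the approximated mask and its channel-level stage may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ApproxMaskedGC(nn.Module):
    """Approximated masked global-context gating at the spatial level (sketch)."""
    def __init__(self, ch):
        super().__init__()
        self.mask_head = nn.Conv2d(ch, 1, 1)            # soft lesion mask

    def forward(self, x):                               # x: [B, C, H, W]
        m = torch.sigmoid(self.mask_head(x))
        # Masked global context: average features inside the soft mask.
        ctx = (x * m).sum(dim=(2, 3)) / m.sum(dim=(2, 3)).clamp_min(1e-6)
        # Per-pixel similarity to the context forms the gating coefficients.
        sim = F.cosine_similarity(x, ctx[:, :, None, None], dim=1)  # [B, H, W]
        return x * torch.sigmoid(sim).unsqueeze(1)
```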

Chunguang Jiang, Yueling Zhang, Jiangtao Wang, Weiting Chen
DSNet: Dynamic Selection Network for Biomedical Image Segmentation

This paper focuses on uterine segmentation, an important and long-underestimated clue for understanding MRI images and the medical analysis of expectant mothers. Related work has shown that the receptive field is crucial in computer vision. However, current methods usually enlarge the receptive field with successive pooling operations, which causes inevitable information loss. In this paper, we design a Dynamic Selection Module (DSM) to effectively capture the spatial layout of medical images. Specifically, the DSM adopts dynamic convolution kernels to adaptively adjust the receptive field in the horizontal and vertical directions. We combine the DSM with a residual block to construct a Dynamic Residual Unit (DRU) that further learns feature representations, and embed the DRU in a standard U-Net to form the Dynamic Selection Network (DSNet). We evaluate our method on a Uterus dataset we acquired and, to validate its generalization, run the same experiments on the Gland Segmentation and Lung datasets. The results demonstrate that DSNet significantly boosts medical image segmentation performance compared with other related encoder-decoder architectures.
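One plausible reading of the DSM, sketched in a selective-kernel style: horizontal and vertical strip convolutions whose mixture is re-weighted per channel from global context. The strip-convolution choice and softmax selection are assumptions about the paper's design:

```python
import torch
import torch.nn as nn

class DynamicSelection(nn.Module):
    """Adaptive mixing of horizontal and vertical receptive fields (sketch)."""
    def __init__(self, ch, k=7):
        super().__init__()
        self.h_conv = nn.Conv2d(ch, ch, (1, k), padding=(0, k // 2))  # horizontal
        self.v_conv = nn.Conv2d(ch, ch, (k, 1), padding=(k // 2, 0))  # vertical
        self.select = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                    nn.Conv2d(ch, 2 * ch, 1))

    def forward(self, x):
        h, v = self.h_conv(x), self.v_conv(x)
        # Per-channel selection weights computed from global context.
        w = self.select(x).view(x.size(0), 2, -1, 1, 1).softmax(dim=1)
        return w[:, 0] * h + w[:, 1] * v
```

Wrapping this in a residual connection gives a DRU-like unit that can replace the plain convolutions in a U-Net encoder or decoder.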

Xiaofei Qin, Yan Liu, Liang Tang, Shuhui Zhao, Xingchen Zhou, Xuedian Zhang, Dengbin Wang
Computational Approach to Identifying Contrast-Driven Retinal Ganglion Cells

The retina acts as the primary stage for the encoding of visual stimuli in the central nervous system. It comprises numerous functionally distinct cells tuned to particular types of visual stimuli. This work presents an analytical approach to identifying contrast-driven retinal cells. Machine learning approaches as well as traditional regression models are used to represent the input-output behaviour of retinal ganglion cells. The findings demonstrate that cells can be separated based on how they respond to changes in mean contrast upon presentation of single images. This separation allows us to identify, in a computationally inexpensive way, the retinal ganglion cells that are likely to yield good model performance.
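A minimal sketch of the regression-based screening idea, assuming responses keyed by cell and per-image mean contrasts; the linear model and the R² threshold are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def find_contrast_driven(cell_responses, image_contrasts, r2_threshold=0.5):
    """Flag cells whose firing is well explained by mean image contrast.

    cell_responses: {cell_id: array of responses, one per image}.
    image_contrasts: mean contrast of each presented image.
    """
    contrast_driven = []
    X = np.asarray(image_contrasts).reshape(-1, 1)
    for cell_id, y in cell_responses.items():
        # Cross-validated R^2 of a simple contrast -> response regression.
        r2 = cross_val_score(LinearRegression(), X, np.asarray(y),
                             scoring="r2", cv=5).mean()
        if r2 >= r2_threshold:
            contrast_driven.append(cell_id)
    return contrast_driven
```

Cells passing this cheap screen are the candidates worth the expense of fitting richer input-output models.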

Richard Gault, Philip Vance, T. Martin McGinnity, Sonya Coleman, Dermot Kerr
Radiological Identification of Hip Joint Centers from X-ray Images Using Fast Deep Stacked Network and Dynamic Registration Graph

Locating the hip joint center (HJC) from X-ray images is frequently required for the evaluation of hip dysplasia. Existing state-of-the-art methods focus on developing functional methods or regression equations with some radiographic landmarks. Such developments employ shallow networks or single equations to locate the HJC, and little attention has been given to deep stacked networks. In addition, existing methods ignore the connections between static and dynamic landmarks, and their prediction capacity is limited. This paper proposes an innovative hybrid framework for HJC identification. The proposed method is based on fast deep stacked network (FDSN) and dynamic registration graph with four improvements: (1) an anatomical landmark extraction module obtains comprehensive prominent bony landmarks from multipose X-ray images; (2) an attribute optimization module based on grey relational analysis (GRA) guides the network to focus on useful external anatomical landmarks; (3) a multiverse optimizer (MVO) module appended to the framework automatically and efficiently determines the optimal model parameters; and (4) the dynamic fitting and two-step registration approach are integrated into the model to further improve the accuracy of HJC localization. By integrating the above improvements in series, the models’ performances are gradually enhanced. Experimental results show that our model achieves superior results to existing HJC prediction approaches.
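As one concrete piece of this pipeline, the GRA-based attribute scoring can be sketched as follows. The min-max-normalized inputs and the distinguishing coefficient rho = 0.5 are common GRA defaults, assumed here rather than taken from the paper:

```python
import numpy as np

def grey_relational_grade(reference, candidates, rho=0.5):
    """Rank landmark attributes by grey relational grade.

    reference: (n,) target series (e.g. measured HJC coordinates);
    candidates: (m, n) attribute series, both min-max normalized.
    Returns one grade per attribute; higher means more relevant.
    """
    delta = np.abs(candidates - reference)            # deviation sequences
    d_min, d_max = delta.min(), delta.max()
    coeff = (d_min + rho * d_max) / (delta + rho * d_max)  # relational coefficients
    return coeff.mean(axis=1)                         # grey relational grades
```

Attributes with the highest grades would be the ones the optimization module keeps as inputs to the stacked network.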

Fuchang Han, Shenghui Liao, Renzhong Wu, Shu Liu, Yuqian Zhao, Xiantao Shen
A Two-Branch Neural Network for Non-Small-Cell Lung Cancer Classification and Segmentation

Immunotherapy has great potential in the treatment of Non-Small-Cell Lung Cancer (NSCLC). Treatment decisions are based on the pathologist's analysis of NSCLC biopsy images. Using deep learning (DL) methods to automatically segment and classify tissues enables quantitative analysis of biopsy images. However, distinguishing positive tumor tissue from immune tissue remains challenging due to the similarity between the two. In this paper, we present a two-branch convolutional neural network (TBNet) combining a segmentation network and a patch-based classification network. The segmentation branch feeds additional information to the classification branch to improve patch classification, and the classification results are fed back to the segmentation branch to obtain segmented tissue regions labelled as positive tumor or immune. Experimental results show that the proposed method improves classification accuracy by an average of 4.3% over a single classification model and achieves Dice coefficients of 0.864 and 0.907 for the positive tumor and immune tissue regions in the segmentation task.
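A hedged structural sketch of such a two-branch design, with a shared encoder whose features and segmentation output jointly feed the classifier; the backbones, fusion by concatenation, and matching resolutions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TwoBranchSketch(nn.Module):
    """Segmentation branch informing a patch classification head (sketch).

    Assumes seg_head preserves the spatial size of the encoder features so
    the two can be concatenated channel-wise.
    """
    def __init__(self, encoder, seg_head, n_classes):
        super().__init__()
        self.encoder, self.seg_head = encoder, seg_head
        self.cls_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.LazyLinear(n_classes))

    def forward(self, patch):
        feats = self.encoder(patch)                 # shared features
        seg = self.seg_head(feats)                  # tissue masks
        # Segmentation evidence augments the classifier's input.
        cls = self.cls_head(torch.cat([feats, seg], dim=1))
        return seg, cls
```

At inference, the patch-level class decision can then relabel the segmented regions, mirroring the feedback loop the paper describes.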

Borui Gao, Guangtai Ding, Kun Fang, Peilin Chen
Uncertainty Quantification and Estimation in Medical Image Classification

Deep Neural Networks (DNNs) have shown tremendous success in numerous AI-related fields. However, despite their remarkable performance, DNNs can still make mistakes. Estimating and quantifying uncertainty has therefore become essential in practical deep learning applications, especially in medical imaging, where measuring uncertainty can support better decision making, early diagnosis, and a variety of other tasks. In this paper, we explore uncertainty quantification (UQ) approaches and propose a UQ system for general medical imaging classification tasks. For practical use, we adapt the system to the problem of medical pre-screening, where patients are referred to a medical specialist if the DNN's classification or diagnosis is too uncertain. In experiments, we apply the UQ system to two medical imaging databases, a SARS-CoV-2 CT dataset and the BreaKHis dataset, and show how to capture the most uncertain samples and predict the most uncertain category. For medical pre-screening, we demonstrate that removing a percentage of the most uncertain inputs yields more accurate results than the initial model.
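One common UQ recipe that fits this pre-screening setting is Monte Carlo dropout, sketched below under stated assumptions (dropout layers present in the model, predictive entropy as the uncertainty score, and a fixed referral fraction); the paper's own UQ system may use different estimators:

```python
import torch

def mc_dropout_referral(model, x, n_samples=30, refer_frac=0.2):
    """Refer the most uncertain inputs to a specialist (MC dropout sketch).

    model.train() keeps dropout stochastic at test time; in practice one
    would toggle only the dropout layers, leaving normalization in eval mode.
    """
    model.train()
    with torch.no_grad():
        probs = torch.stack([model(x).softmax(-1) for _ in range(n_samples)])
    mean = probs.mean(0)                                     # [B, classes]
    entropy = -(mean * mean.clamp_min(1e-9).log()).sum(-1)   # predictive entropy
    n_refer = max(1, int(refer_frac * x.size(0)))
    referred = entropy.topk(n_refer).indices                 # most uncertain cases
    return mean, entropy, referred
```

Dropping the referred fraction and scoring the model only on the retained cases is exactly the accuracy-vs-coverage trade the abstract reports.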

Sidi Yang, Thomas Fevens
Labeling Chest X-Ray Reports Using Deep Learning

One of the primary challenges in the development of Chest X-Ray (CXR) interpretation models has been the lack of large datasets with multilabel image annotations extracted from radiology reports. This paper proposes a CXR labeler, abbreviated CXRlabeler, that simultaneously extracts fourteen observations from free-text radiology reports as positive or negative. It fine-tunes a pre-trained language model, AWD-LSTM, on the corpus of CXR radiology impressions and then uses it as the base of a multilabel classifier. Experimentation demonstrates that language model fine-tuning increases the classifier's F1 score by 12.53%. Overall, CXRlabeler achieves a 96.17% F1 score on the MIMIC-CXR dataset. To further test its generalization, the model is evaluated on the PadChest dataset, showing that the CXRlabeler approach carries over to a different language environment. The model (available at https://github.com/MaramMonshi/CXRlabeler ) can assist researchers in labeling CXR datasets with fourteen observations.
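A minimal sketch of the multilabel classification stage: fourteen independent positive/negative logits over an encoded report, trained with binary cross-entropy. Here `encoder` stands in for the fine-tuned AWD-LSTM backbone, and the head design is a common assumption rather than the paper's exact architecture:

```python
import torch
import torch.nn as nn

class MultiLabelHead(nn.Module):
    """Fourteen-way multilabel head over a sentence encoder (sketch)."""
    def __init__(self, encoder, enc_dim, n_labels=14):
        super().__init__()
        self.encoder = encoder                  # fine-tuned language model
        self.fc = nn.Linear(enc_dim, n_labels)  # one logit per observation

    def forward(self, tokens):
        return self.fc(self.encoder(tokens))

# Each observation is an independent binary decision, so the natural
# objective is binary cross-entropy over all fourteen logits.
criterion = nn.BCEWithLogitsLoss()
```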

Maram Mahmoud A. Monshi, Josiah Poon, Vera Chung, Fahad Mahmoud Monshi
Backmatter
Metadata
Title
Artificial Neural Networks and Machine Learning – ICANN 2021
Edited by
Prof. Igor Farkaš
Paolo Masulli
Dr. Sebastian Otte
Stefan Wermter
Copyright Year
2021
Electronic ISBN
978-3-030-86365-4
Print ISBN
978-3-030-86364-7
DOI
https://doi.org/10.1007/978-3-030-86365-4