2021 | Book

Pattern Recognition. ICPR International Workshops and Challenges

Virtual Event, January 10–15, 2021, Proceedings, Part I

Edited by: Prof. Alberto Del Bimbo, Prof. Rita Cucchiara, Prof. Stan Sclaroff, Dr. Giovanni Maria Farinella, Tao Mei, Prof. Dr. Marco Bertini, Hugo Jair Escalante, Dr. Roberto Vezzani

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science

About this book

This 8-volume set constitutes the refereed proceedings of the 25th International Conference on Pattern Recognition Workshops, ICPR 2020, held virtually in Milan, Italy, and rescheduled to January 10–11, 2021, due to the Covid-19 pandemic. The 416 full papers presented in these 8 volumes were carefully reviewed and selected from about 700 submissions. The 46 workshops cover a wide range of areas including machine learning, pattern analysis, healthcare, human behavior, environment, surveillance, forensics and biometrics, robotics and egovision, cultural heritage and document analysis, retrieval, and women at ICPR2020.

Table of Contents

Frontmatter

3DHU 2020 - 3D Human Understanding

Frontmatter
Subject Identification Across Large Expression Variations Using 3D Facial Landmarks

In this work, we propose to use 3D facial landmarks for the task of subject identification over a range of expressed emotions. Landmarks are detected using a Temporal Deformable Shape Model and used to train a Support Vector Machine (SVM), a Random Forest (RF), and a Long Short-Term Memory (LSTM) neural network for subject identification. As we are interested in subject identification under large variations in expression, we conducted experiments on three emotion-based databases, namely the BU-4DFE, BP4D, and BP4D+ 3D/4D face databases. We show that our proposed method outperforms current state-of-the-art methods for subject identification on BU-4DFE and BP4D. To the best of our knowledge, this is the first work to investigate subject identification on BP4D+, resulting in a baseline for the community.

SK Rahatul Jannat, Diego Fabiano, Shaun Canavan, Tempestt Neal
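
A minimal sketch, not from the paper, of the classifier stage described above: training an SVM on flattened 3D landmark coordinates for subject identification. The data shapes, class count, and parameters are illustrative assumptions; landmark detection itself is assumed to have produced the features already.

```python
# Sketch: subject identification from 3D facial landmarks with an SVM.
# Shapes and hyperparameters are illustrative, not the authors' values.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

n_frames, n_landmarks = 2000, 83                  # hypothetical dataset size
X = np.random.randn(n_frames, n_landmarks * 3)    # x, y, z per landmark, flattened
y = np.random.randint(0, 41, size=n_frames)       # subject identity labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
clf.fit(X_tr, y_tr)
print("subject-ID accuracy:", clf.score(X_te, y_te))
```
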
3D Human Pose Estimation Based on Multi-Input Multi-Output Convolutional Neural Network and Event Cameras: A Proof of Concept on the DHP19 Dataset

Nowadays, Human Pose Estimation (HPE) represents one of the main research themes in the field of computer vision. Despite innovative methods and solutions introduced for frame-processing algorithms, the use of standard frame-based cameras still has several drawbacks, such as data redundancy and a fixed frame rate. The use of event-based cameras guarantees higher temporal resolution with lower memory and computational cost while preserving the significant information to be processed, and thus represents a new solution for real-time applications. In this paper, the DHP19 dataset was employed: the first and, to date, the only one with HPE data recorded from Dynamic Vision Sensor (DVS) event-based cameras. Starting from the baseline single-input single-output (SISO) Convolutional Neural Network (CNN) model proposed in the literature, a novel multi-input multi-output (MIMO) CNN-based architecture was proposed in order to simultaneously model two different single-camera views. Experimental results show that the proposed MIMO approach outperforms the standard SISO model in terms of accuracy and training time.

Alessandro Manilii, Leonardo Lucarelli, Riccardo Rosati, Luca Romeo, Adriano Mancini, Emanuele Frontoni
Image-Based Out-of-Distribution-Detector Principles on Graph-Based Input Data in Human Action Recognition

Living in a complex world like ours makes it unacceptable for a practical implementation of a machine learning system to assume a closed world. Therefore, it is necessary for such a learning-based system in a real-world environment to be aware of its own capabilities and limits and to be able to distinguish between confident and unconfident results of the inference, especially if the sample cannot be explained by the underlying distribution. This knowledge is particularly essential in safety-critical environments and tasks, e.g. self-driving cars or medical applications. Towards this end, we transfer image-based Out-of-Distribution (OoD) methods to graph-based data and show their applicability in action recognition. The contributions of this work are (i) the examination of the portability of recent image-based OoD-detectors to graph-based input data, (ii) a Metric Learning-based approach to detect OoD samples, and (iii) the introduction of a novel semi-synthetic action recognition dataset. The evaluation shows that image-based OoD-methods can be applied to graph-based data. Additionally, there is a gap between the performance on intraclass and intradataset results. Simpler methods such as the examined baseline or ODIN provide reasonable results. More sophisticated network architectures, in contrast to their image-based application, were surpassed in the intradataset comparison and even led to lower classification accuracy.

Jens Bayer, David Münch, Michael Arens
A Novel Joint Points and Silhouette-Based Method to Estimate 3D Human Pose and Shape

This paper presents a novel method for 3D human pose and shape estimation from images with sparse views, using joint points and silhouettes, based on a parametric model. First, the parametric model is fitted to the joint points estimated by deep learning-based human pose estimation. Then, we extract the correspondence between the pose-fitted parametric model and the silhouettes in 2D and 3D space. A novel energy function based on this correspondence is built and minimized to fit the parametric model to the silhouettes. Our approach uses comprehensive shape information because the energy function of the silhouettes is built in both 2D and 3D space. This also means that our method only needs images from sparse views, which balances the data used against the required prior information. Results on synthetic data and real data demonstrate the competitive performance of our approach on pose and shape estimation of the human body.

Zhongguo Li, Anders Heyden, Magnus Oskarsson
Pose Based Trajectory Forecast of Vulnerable Road Users Using Recurrent Neural Networks

In this work, we use Recurrent Neural Networks (RNNs) in the form of Gated Recurrent Unit (GRU) networks to forecast trajectories of vulnerable road users (VRUs), such as pedestrians and cyclists, in road traffic, utilizing the past trajectory and 3D poses as input. The 3D poses represent the postures and movements of limbs and torso and contain early indicators for the transition between motion types, e.g. wait, start, move, and stop. VRUs often only become visible from the perspective of an approaching vehicle shortly before dangerous situations occur. Therefore, a network architecture is required which is able to forecast trajectories after short observation periods and to improve the forecasts in case of longer observations. This motivates us to use GRU networks, which can take time series of varying duration as inputs, and to investigate the effects of different observation periods on the forecasting results. Our approach is able to make reasonable forecasts even for short observation periods. The use of poses improves the forecasting accuracy, especially for short observation periods, compared to a solely head-trajectory-based approach. Different motion types benefit to different extents from the use of poses and longer observation periods.

Viktor Kress, Stefan Zernetsch, Konrad Doll, Bernhard Sick
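
A minimal sketch, under stated assumptions rather than the paper's actual architecture, of a GRU that accepts observation windows of varying length and regresses a fixed forecast horizon; pose dimensionality, hidden size, and horizon are illustrative.

```python
# Sketch: variable-length observation in, fixed forecast horizon out.
import torch
import torch.nn as nn

class TrajectoryGRU(nn.Module):
    def __init__(self, pose_dim=36, hidden=64, horizon=25):
        super().__init__()
        # input per time step: 2D position plus flattened 3D pose features
        self.gru = nn.GRU(input_size=2 + pose_dim, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, horizon * 2)   # (x, y) per future step
        self.horizon = horizon

    def forward(self, obs):                  # obs: (batch, T_obs, 2 + pose_dim)
        _, h = self.gru(obs)                 # final hidden state summarizes the observation
        return self.head(h[-1]).view(-1, self.horizon, 2)

model = TrajectoryGRU()
short_obs = torch.randn(4, 10, 38)           # 10 observed steps
long_obs = torch.randn(4, 40, 38)            # 40 observed steps: same network applies
print(model(short_obs).shape, model(long_obs).shape)  # both (4, 25, 2)
```
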
Towards Generalization of 3D Human Pose Estimation in the Wild

In this paper, we propose 3DBodyTex.Pose, a dataset that addresses the task of 3D human pose estimation in the wild. Generalization to in-the-wild images remains limited due to the lack of adequate datasets. Existing ones are usually collected in indoor controlled environments where motion capture systems are used to obtain the 3D ground-truth annotations of humans. 3DBodyTex.Pose offers high-quality and rich data containing 405 different real subjects in various clothing and poses, and 81k image samples with ground-truth 2D and 3D pose annotations. These images are generated from 200 viewpoints, 70 of which are challenging extreme viewpoints. This data was created starting from high-resolution textured 3D body scans and by incorporating various realistic backgrounds. Retraining a state-of-the-art 3D pose estimation approach using data augmented with 3DBodyTex.Pose showed promising improvement in the overall performance, and a noticeable decrease in the per-joint position error when testing on challenging viewpoints. 3DBodyTex.Pose is expected to offer the research community new possibilities for generalizing 3D pose estimation from monocular in-the-wild images.

Renato Baptista, Alexandre Saint, Kassem Al Ismaeil, Djamila Aouada
Space-Time Triplet Loss Network for Dynamic 3D Face Verification

In this paper, we propose a new approach for 3D dynamic face verification exploiting 3D facial deformations. First, 3D faces are encoded into low-dimensional representations describing the local deformations of the faces with respect to a mean face. Second, the encoded versions of the 3D faces along a sequence are stacked into 2D arrays for temporal modeling. The resulting 2D arrays are then fed to a triplet loss network for dynamic sequence embedding. Finally, the outputs of the triplet loss network are compared using the cosine similarity measure for face verification. By projecting the feature maps of the triplet loss network into attention maps on the 3D face sequences, we are able to detect the space-time patterns that contribute most to the pairwise similarity between different 3D facial expressions of the same person. The evaluation is conducted on the publicly available BU4D dataset, which contains dynamic 3D face sequences. The obtained results are promising with respect to baseline methods.

Anis Kacem, Hamza Ben Abdesslam, Kseniya Cherenkova, Djamila Aouada
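
A minimal sketch of the embedding-and-verification stage described above, with a toy stand-in network; the triplet sampling, shapes, and dimensions are illustrative assumptions, not the authors' design.

```python
# Sketch: sequence embedding trained with a triplet loss, verified by cosine similarity.
import torch
import torch.nn as nn
import torch.nn.functional as F

embed = nn.Sequential(                        # toy stand-in for the embedding network
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 64),
)
triplet = nn.TripletMarginLoss(margin=0.2)

# anchor/positive: same identity, negative: different identity (random toy data)
a, p, n = (torch.randn(8, 1, 30, 128) for _ in range(3))  # (batch, 1, frames, code_dim)
loss = triplet(embed(a), embed(p), embed(n))
loss.backward()

# verification: cosine similarity between two sequence embeddings
sim = F.cosine_similarity(embed(a), embed(p))
print(loss.item(), sim.shape)                 # sim: one score per pair in the batch
```
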

AIDP - Artificial Intelligence for Digital Pathology

Frontmatter
Noise Robust Training of Segmentation Model Using Knowledge Distillation

Deep Neural Networks are susceptible to label noise, which can lead to poor generalization. Degradation of labels in a histopathology segmentation dataset can be caused in particular by the large inter-observer variability between expert annotators. Thus, obtaining a clean dataset may not be feasible. We address this by using Knowledge Distillation as a learned Label Smoothing Regularizer, which has a denoising effect when training on a noisy dataset. To show the effectiveness of our approach, an evaluation is performed on the Gleason Challenge dataset, which has high discordance between expert pathologists. Based on the reported experiments, we show that the distilled model achieves a significant performance gain when training on the noisy dataset.

Geetank Raipuria, Saikiran Bonthu, Nitin Singhal
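
A minimal sketch of knowledge distillation used as a label-smoothing regularizer, in the standard temperature-scaled formulation; the temperature and mixing weight are illustrative assumptions, and the paper's exact setup may differ.

```python
# Sketch: student matches softened teacher outputs plus (possibly noisy) hard labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, noisy_labels, T=4.0, alpha=0.7):
    # soft term: KL divergence between temperature-softened distributions
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # hard term: standard cross-entropy against the noisy annotations
    hard = F.cross_entropy(student_logits, noisy_labels)
    return alpha * soft + (1 - alpha) * hard

s = torch.randn(16, 4, requires_grad=True)    # e.g. 4 classes; per-pixel in segmentation
t = torch.randn(16, 4)                        # teacher logits
y = torch.randint(0, 4, (16,))                # noisy hard labels
distillation_loss(s, t, y).backward()
```
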
Semi-supervised Learning with a Teacher-Student Paradigm for Histopathology Classification: A Resource to Face Data Heterogeneity and Lack of Local Annotations

Training classification models in the medical domain is often difficult due to data heterogeneity (related to acquisition procedures) and due to the difficulty of getting sufficient amounts of annotations from specialized experts. This is particularly true in digital pathology, where models do not generalize easily. This paper presents a novel approach for the generalization of models in conditions where heterogeneity is high and annotations are few. The approach relies on a semi-supervised teacher/student paradigm applied to different datasets and annotations. The paradigm combines a small amount of strongly-annotated data with a large amount of unlabeled data for training two Convolutional Neural Networks (CNNs): the teacher and the student model. The teacher model is trained with strong labels and used to generate pseudo-labeled samples from the unlabeled data. The student model is trained combining the pseudo-labeled samples and a small amount of strongly-annotated data. The paradigm is evaluated on the student model performance for Gleason pattern and Gleason score classification in prostate cancer images and compared with a fully-supervised learning approach for training the student model. In order to evaluate the capability of the approach to generalize, the datasets used are highly heterogeneous in visual characteristics and are collected from two different medical institutions. The models trained with the teacher/student paradigm show an improvement in performance over fully-supervised training. The models generalize better on both datasets, despite the inter-dataset heterogeneity, alleviating overfitting. The classification performance shows an improvement both in the classification of Gleason pattern at patch level ($\kappa = 0.6129 \pm 0.0127$ from $\kappa = 0.5608 \pm 0.0308$) and in Gleason score classification, evaluated at WSI level ($\kappa = 0.4477 \pm 0.0460$ from $\kappa = 0.2814 \pm 0.1312$).

Niccolò Marini, Sebastian Otálora, Henning Müller, Manfredo Atzori
Self-attentive Adversarial Stain Normalization

Hematoxylin and Eosin (H&E) stained Whole Slide Images (WSIs) are utilized for biopsy visualization-based diagnostic and prognostic assessment of diseases. Variation in the H&E staining process across different lab sites can lead to important variations in biopsy image appearance. These variations introduce an undesirable bias when the slides are examined by pathologists or used for training deep learning models. Traditionally proposed stain normalization and color augmentation strategies can handle human-level bias, but deep learning models can easily disentangle the linear transformations used in these approaches, resulting in undesirable bias and a lack of generalization. To handle these limitations, we propose a Self-Attentive Adversarial Stain Normalization (SAASN) approach for the normalization of multiple stain appearances to a common domain. This unsupervised generative adversarial approach includes a self-attention mechanism for synthesizing images with finer detail while preserving the structural consistency of the biopsy features during translation. SAASN demonstrates consistent and superior performance compared to other popular stain normalization techniques on H&E stained duodenal biopsy image data.

Aman Shrivastava, William Adorno, Yash Sharma, Lubaina Ehsan, S. Asad Ali, Sean R. Moore, Beatrice Amadi, Paul Kelly, Sana Syed, Donald E. Brown
Certainty Pooling for Multiple Instance Learning

Multiple Instance Learning is a form of weakly supervised learning in which the data is arranged in sets of instances called bags, with one label assigned per bag. The bag-level class prediction is derived from the multiple instances through application of a permutation-invariant pooling operator on instance predictions or embeddings. We present a novel pooling operator called Certainty Pooling, which incorporates the model certainty into bag predictions, resulting in a more robust and explainable model. We compare our proposed method with other pooling operators in controlled experiments with low evidence ratio bags based on MNIST, as well as on a real-life histopathology dataset, Camelyon16. Our method outperforms other methods in both bag-level and instance-level prediction, especially when only small training sets are available. We discuss the rationale behind our approach and the reasons for its superiority for these types of datasets.

Jacob Gildenblat, Ido Ben-Shaul, Zvi Lapp, Eldad Klaiman
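
The exact Certainty Pooling operator is defined in the paper; the sketch below only illustrates one plausible reading of the idea, estimating per-instance certainty with Monte-Carlo dropout and letting certain instances dominate the bag-level score.

```python
# Sketch (assumed formulation, not the paper's formula): certainty-weighted MIL pooling.
import torch

def mc_certainty(instance_model, bag, n_samples=10):
    instance_model.train()                    # keep dropout active at inference
    preds = torch.stack([instance_model(bag) for _ in range(n_samples)])
    mean, std = preds.mean(0), preds.std(0)
    certainty = 1.0 / (std + 1e-6)            # low variance -> high certainty
    return mean, certainty

def certainty_pool(mean, certainty):
    weights = torch.softmax(certainty.squeeze(-1), dim=0)   # normalize over instances
    return (weights * mean.squeeze(-1)).sum()               # certainty-weighted bag score

model = torch.nn.Sequential(torch.nn.Linear(32, 16), torch.nn.ReLU(),
                            torch.nn.Dropout(0.5), torch.nn.Linear(16, 1))
bag = torch.randn(50, 32)                     # 50 instances (e.g. tiles) in one bag
print(certainty_pool(*mc_certainty(model, bag)).item())
```
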
Classification of Noisy Free-Text Prostate Cancer Pathology Reports Using Natural Language Processing

Free-text reporting has been the main approach in clinical pathology practice for decades. Pathology reports are an essential information source to guide the treatment of cancer patients and for cancer registries, which process high volumes of free-text reports annually. Information coding and extraction are usually performed manually, and this is an expensive and time-consuming process, since reports vary widely between institutions, usually contain noise, and do not have a standard structure. This paper presents strategies based on natural language processing (NLP) models to classify noisy free-text pathology reports of high- and low-grade prostate cancer from the open-source repository TCGA (The Cancer Genome Atlas). We used paragraph vectors to encode the reports and compared them with n-grams and TF-IDF representations. The best representation, based on the distributed bag of words of paragraph vectors, obtained an $f_1$-score of 0.858 and an AUC of 0.854 using a logistic regression classifier. We investigate the classifier's most relevant words in each case using the LIME interpretability tool, confirming its usefulness in selecting relevant diagnostic words. Our results show the feasibility of using paragraph embeddings to represent and classify pathology reports.

Anjani Dhrangadhariya, Sebastian Otálora, Manfredo Atzori, Henning Müller
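
A minimal sketch of the best-performing representation described above, distributed bag of words of paragraph vectors (DBOW, dm=0 in gensim) followed by logistic regression; the corpus and hyperparameters are toy stand-ins.

```python
# Sketch: Doc2Vec (DBOW) report embeddings + logistic regression classifier.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from sklearn.linear_model import LogisticRegression

reports = [
    ("prostate adenocarcinoma gleason score 9 extensive involvement".split(), 1),
    ("benign prostatic tissue no tumor seen in cores".split(), 0),
] * 50                                        # toy corpus; real reports are noisy free text
corpus = [TaggedDocument(words, [i]) for i, (words, _) in enumerate(reports)]

d2v = Doc2Vec(corpus, dm=0, vector_size=100, min_count=1, epochs=20)  # dm=0 -> DBOW
X = [d2v.infer_vector(words) for words, _ in reports]
y = [label for _, label in reports]

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([d2v.infer_vector("gleason pattern 4 plus 5".split())]))
```
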
AI Slipping on Tiles: Data Leakage in Digital Pathology

Reproducibility of AI models on biomedical data remains a major concern for their acceptance into clinical practice. Initiatives for reproducibility in the development of predictive biomarkers, such as the MAQC Consortium, have already underlined the importance of appropriate Data Analysis Plans (DAPs) to control for different types of bias, including data leakage from the training to the test set. In the context of digital pathology, leakage typically lurks in weakly designed experiments whose data partitioning schemes do not account for subjects. This issue is then exacerbated when fractions or subregions of slides (i.e. "tiles") are considered. Although this aspect is largely recognized by the community, we argue that it is often overlooked. In this study, we assess the impact of data leakage on the performance of machine learning models trained and validated on multiple histology data collections. We prove that, even with a properly designed DAP ($10 \times 5$ repeated cross-validation), predictive scores can be inflated by up to 41% when tiles from the same subject are used in both training and validation sets by deep learning models. We replicate the experiments for 4 classification tasks on 3 histopathological datasets, for a total of 374 subjects, 556 slides and more than 27,000 tiles. Also, we discuss the effects of data leakage on transfer learning strategies with models pre-trained on general-purpose datasets or off-task digital pathology collections. Finally, we propose a solution that automates the creation of leakage-free deep learning pipelines for digital pathology based on histolab, a novel Python package for histology data preprocessing. We validate the solution on two public datasets (TCGA and GTEx).

Nicole Bussola, Alessia Marcolini, Valerio Maggio, Giuseppe Jurman, Cesare Furlanello
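
A minimal sketch of the leakage-free principle the paper argues for, here using scikit-learn's GroupKFold as a generic stand-in for the histolab-based pipeline: tiles are split by subject, so no patient contributes tiles to both training and validation.

```python
# Sketch: subject-wise splitting of tiles to avoid data leakage (identifiers hypothetical).
import numpy as np
from sklearn.model_selection import GroupKFold

n_tiles = 1000
X = np.random.randn(n_tiles, 128)             # tile features
y = np.random.randint(0, 2, n_tiles)          # tile labels
subjects = np.random.randint(0, 40, n_tiles)  # subject ID for each tile

for train_idx, val_idx in GroupKFold(n_splits=5).split(X, y, groups=subjects):
    # GroupKFold guarantees the two folds share no subject
    assert not set(subjects[train_idx]) & set(subjects[val_idx])
```
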

AIHA 2020 - International Workshop on Artificial Intelligence for Healthcare Applications

Frontmatter
Predictive Medicine Using Interpretable Recurrent Neural Networks

Deep learning has been revolutionizing multiple aspects of our daily lives, thanks to its state-of-the-art results. However, the complexity of its models and the difficulty of interpreting its results have prevented it from being widely adopted in healthcare systems. This represents a missed opportunity, especially considering the growing volumes of Electronic Health Record (EHR) data, as hospitals and clinics increasingly collect information in digital databases. While there are studies addressing artificial neural networks applied to this type of data, the interpretability component tends to be approached lightly or even disregarded. Here we demonstrate the superior capability of recurrent neural network based models, outperforming multiple baselines with an average of 0.94 test AUC, when predicting the use of non-invasive ventilation by Amyotrophic Lateral Sclerosis (ALS) patients, while also presenting a comprehensive explainability solution. In order to interpret these complex, recurrent algorithms, the robust SHAP package was adapted, and a new instance importance score was defined, to highlight the effect of feature values and time-series samples on the output, respectively. These concepts were then combined in a dashboard, which serves as a proof of concept for an AI-enhanced detailed analysis tool for medical staff.

André Ferreira, Sara C. Madeira, Marta Gromicho, Mamede de Carvalho, Susana Vinga, Alexandra M. Carvalho
Automated Detection of Adverse Drug Events from Older Patients’ Electronic Medical Records Using Text Mining

The Swiss Monitoring of Adverse Drug Events (SwissMADE) project is part of the SNSF-funded Smarter Health Care initiative, which aims at improving health services for the public. Its goal is to use text mining on electronic patient reports to automatically detect adverse drug events in hospitalised elderly patients who received anti-thrombotic drugs. The project is the first of its kind in Switzerland: the data is provided by four hospitals from both the German- and French-speaking parts of Switzerland, none of which had previously released electronic patient records for research, making extraction and anonymisation of records one of the major challenges of the project. In this paper, we describe the part of the project concerned with the de-identification and annotation of German data obtained from one of the hospitals in the form of patient reports. All of these reports are automatically de-identified using a dictionary-based approach augmented with manually created rules, and then automatically annotated. For this, we employ our entity recognition pipeline called OGER (OntoGene Entity Recognizer), also a dictionary-based approach, augmented by an adapted transformer model to obtain state-of-the-art performance, to detect drug, disease and symptom mentions in these reports. Furthermore, a subset of reports is manually annotated for drugs and diagnoses by a medical expert, serving as a validation set for the automatic annotations.

Nicola Colic, Patrick Beeler, Chantal Csajka, Vasiliki Foufi, Frederic Gaspar, Marie-Annick Le Pogam, Angela Lisibach, Christian Lovis, Monika Lutters, Fabio Rinaldi
Length of Stay Prediction for Northern Italy COVID-19 Patients Based on Lab Tests and X-Ray Data

The recent spread of COVID-19 put a strain on hospitals all over the world. In this paper we address the problem of hospital overloads and present a tool based on machine learning to predict the length of stay of hospitalised patients affected by COVID-19. This tool was developed using Random Forests and Extra Trees regression algorithms and was trained and tested on the data from more than 1000 hospitalised patients from Northern Italy. These data contain demographics, several laboratory test results and a score that evaluates the severity of the pulmonary conditions. The experimental results show good performance for the length of stay prediction and, in particular, for identifying which patients will stay in hospital for a long period of time.

Mattia Chiari, Alfonso E. Gerevini, Roberto Maroldi, Matteo Olivato, Luca Putelli, Ivan Serina
Advanced Non-linear Generative Model with a Deep Classifier for Immunotherapy Outcome Prediction: A Bladder Cancer Case Study

Immunotherapy is one of the most interesting and promising cancer treatments. Encouraging results have confirmed the effectiveness of immunotherapy drugs for treating tumors in terms of long-term survival and a significant reduction in toxicity compared to more traditional chemotherapy approaches. However, the percentage of patients eligible for immunotherapy is rather small, and this is likely related to the limited knowledge of the physiological mechanisms by which certain subjects respond to the treatment while others have no benefit. To address this issue, the authors propose an innovative approach based on the use of a non-linear cellular architecture with a deep downstream classifier for selecting and properly augmenting 2D features from chest-abdomen CT images toward improving outcome prediction. The proposed pipeline has been designed to be usable on an innovative embedded Point of Care system. The authors report a case study of the proposed solution applied to a specific type of aggressive tumor, namely Metastatic Urothelial Carcinoma (mUC). The performance evaluation (overall accuracy close to 93%) confirms the effectiveness of the proposed approach.

Francesco Rundo, Giuseppe Luigi Banna, Francesca Trenta, Concetto Spampinato, Luc Bidaut, Xujiong Ye, Stefanos Kollias, Sebastiano Battiato
Multi-model Ensemble to Classify Acute Lymphoblastic Leukemia in Blood Smear Images

Acute Lymphoblastic Leukemia (ALL) is one of the most commonly occurring types of leukemia and poses a serious threat to life. It severely affects the White Blood Cells (WBCs) of the human body, which fight against any kind of infection or disease. Since there are no evident morphological changes and the signs are quite similar to those of other disorders, it is difficult to detect leukemia. Manual diagnosis of leukemia is time-consuming and even susceptible to errors. Thus, in this paper, a computer-assisted diagnosis method has been implemented to detect leukemia using deep learning models. Three models, namely VGG11, ResNet18 and ShuffleNetV2, have been trained and fine-tuned on the ISBI 2019 C-NMC dataset. Finally, an ensemble using a weighted averaging technique is formed and evaluated according to the criteria of binary classification. The proposed method gave an overall accuracy of 87.52% and an F1-score of 87.40%, thus outperforming most of the existing techniques for the same dataset.

Sabrina Dhalla, Ajay Mittal, Savita Gupta, Harleen Singh
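
A minimal sketch of the weighted-averaging fusion step described above; the per-model probabilities and weights are illustrative, not the paper's values.

```python
# Sketch: fusing softmax outputs of several CNNs by weighted averaging.
import numpy as np

def weighted_ensemble(prob_list, weights):
    w = np.asarray(weights, dtype=float)
    w /= w.sum()                              # normalize so the fusion stays a distribution
    return sum(wi * p for wi, p in zip(w, prob_list))

p_vgg = np.array([[0.62, 0.38]])              # per-model class probabilities for one image
p_resnet = np.array([[0.55, 0.45]])
p_shufflenet = np.array([[0.71, 0.29]])

fused = weighted_ensemble([p_vgg, p_resnet, p_shufflenet], weights=[0.4, 0.3, 0.3])
print(fused, fused.argmax(axis=1))            # fused probabilities and predicted class
```
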
MIINet: An Image Quality Improvement Framework for Supporting Medical Diagnosis

Medical images have been indispensable and useful tools for supporting medical experts in making diagnostic decisions. However, medical images, especially throat and endoscopy images, are often hazy, out of focus, or unevenly illuminated, which can complicate the diagnostic process for doctors. In this paper, we propose MIINet, a novel image-to-image translation network for improving the quality of medical images by translating low-quality images to their high-quality clean versions in an unsupervised manner. Our MIINet is not only capable of generating high-resolution clean images, but also preserves the attributes of the original images, making diagnosis more favorable for doctors. Experiments on dehazing 100 practical throat images show that our MIINet largely improves the mean doctor opinion score (MDOS), which assesses the quality and the reproducibility of the images, from the baseline of 2.36 to 4.11, while images dehazed by CycleGAN got a lower score of 3.83. MIINet was confirmed by three physicians to be satisfactory in supporting throat disease diagnosis from original low-quality images.

Quan Huu Cap, Hitoshi Iyatomi, Atsushi Fukuda
Medical Image Tampering Detection: A New Dataset and Baseline

The recent advances in algorithmic photo-editing and the vulnerability of hospitals to cyberattacks raise concerns about the tampering of medical images. This paper introduces a new large-scale dataset of tampered Computed Tomography (CT) scans generated by different methods, the LuNoTim-CT dataset, which can serve as the most comprehensive testbed for comparative studies of data security in healthcare. We further propose a deep learning-based framework, ConnectionNet, to automatically detect whether a medical image has been tampered with. The proposed ConnectionNet is able to handle small tampered regions, achieves promising results, and can be used as the baseline for studies of medical image tampering detection.

Benjamin Reichman, Longlong Jing, Oguz Akin, Yingli Tian
Deep Learning for Human Embryo Classification at the Cleavage Stage (Day 3)

To date, deep learning has assisted in classifying embryos as early as day 5 after insemination. We investigated whether deep neural networks could successfully predict the destiny of each embryo (discard or transfer) at an even earlier stage, namely at day 3. We first assessed whether the destiny of each embryo could be derived from technician scores, using a simple regression model. We then explored whether a deep neural network could make accurate predictions using images alone. We found that a simple 8-layer network was able to achieve 75.24% accuracy of destiny prediction, outperforming deeper, state-of-the-art models that reached 68.48% when applied to our middle slice images. Increasing focal points from a single (middle slice) to three slices per image did not improve accuracy. Instead, accounting for the “batch effect”, that is, predicting an embryo’s destiny in relation to other embryos from the same batch, greatly improved accuracy, to a level of 84.69% for unseen cases. Importantly, when analyzing cases of transferred embryos, we found that our lean, deep neural network predictions were correlated (0.65) with clinical outcomes.

Astrid Zeman, Anne-Sofie Maerten, Annemie Mengels, Lie Fong Sharon, Carl Spiessens, Hans Op de Beeck
Double Encoder-Decoder Networks for Gastrointestinal Polyp Segmentation

Polyps represent an early sign of the development of Colorectal Cancer. The standard procedure for their detection consists of colonoscopic examination of the gastrointestinal tract. However, the wide range of polyp shapes and visual appearances, as well as the reduced quality of this image modality, turn their automatic identification and segmentation with computational tools into a challenging computer vision task. In this work, we present a new strategy for the delineation of gastrointestinal polyps from endoscopic images based on a direct extension of common encoder-decoder networks for semantic segmentation. In our approach, two pretrained encoder-decoder networks are sequentially stacked: the second network takes as input the concatenation of the original frame and the initial prediction generated by the first network, which acts as an attention mechanism enabling the second network to focus on interesting areas within the image, thereby improving the quality of its predictions. Quantitative evaluation carried out on several polyp segmentation databases shows that double encoder-decoder networks clearly outperform their single encoder-decoder counterparts in all cases. In addition, our best double encoder-decoder combination attains excellent segmentation accuracy and reaches state-of-the-art performance results in all the considered datasets, with a remarkable boost of accuracy on images extracted from datasets not used for training.

Adrian Galdran, Gustavo Carneiro, Miguel A. González Ballester
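
A minimal sketch of the stacking idea described above, with tiny stand-in networks instead of the pretrained encoder-decoders used in the paper; shapes and channels are illustrative.

```python
# Sketch: second network sees the frame concatenated with the first prediction.
import torch
import torch.nn as nn

def tiny_unet(in_ch):                         # stand-in for a pretrained encoder-decoder
    return nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU(),
                         nn.Conv2d(16, 1, 3, padding=1))

class DoubleEncoderDecoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net1 = tiny_unet(3)              # sees the RGB frame only
        self.net2 = tiny_unet(4)              # sees RGB + first prediction

    def forward(self, x):
        p1 = torch.sigmoid(self.net1(x))      # initial polyp probability map
        p2 = self.net2(torch.cat([x, p1], dim=1))  # p1 acts as a spatial attention cue
        return p1, p2                         # both can be supervised during training

x = torch.randn(2, 3, 128, 128)
p1, p2 = DoubleEncoderDecoder()(x)
print(p1.shape, p2.shape)
```
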
A Superpixel-Wise Fully Convolutional Neural Network Approach for Diabetic Foot Ulcer Tissue Classification

Accurate assessment of diabetic foot ulcers (DFU) is essential to provide efficient treatment and to prevent amputation. Traditional DFU assessment methods used by clinicians are based on visual examination of the ulcer, estimating its surface and analyzing tissue conditions. These manual methods are subjective and make direct contact with the wound, resulting in high variability and risk of infection. In this research work, we propose a novel smartphone-based skin telemonitoring system to support medical diagnoses and decisions during DFU tissue examination. The database contains 219 images; for effective tissue identification and ground-truth annotation, a graphical interface based on a superpixel segmentation method has been used. Our method performs DFU assessment in an end-to-end style comprising automatic ulcer segmentation and tissue classification. The classification task is performed at patch level: superpixels extracted with SLIC are used as input for training the deep neural network. State-of-the-art deep learning models for semantic segmentation have been used to perform tissue differentiation within the ulcer area into three classes (Necrosis, Granulation and Slough) and have been compared to the proposed method. The proposed superpixel-based method outperforms classic fully convolutional network models while significantly improving performance on all metrics. Accuracy and Dice index are improved from 84.55% to 92.68% and from 54.31% to 75.74%, respectively, compared to FCN-32. The results reveal robust tissue classification effectiveness and the potential of our system to monitor DFU healing over time.

Rania Niri, Hassan Douzi, Yves Lucas, Sylvie Treuillet
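
A minimal sketch of the superpixel extraction step on which the patch-level classification is built, using SLIC from scikit-image; the image and parameters are illustrative stand-ins.

```python
# Sketch: SLIC superpixels as patch-level samples for tissue classification.
import numpy as np
from skimage.segmentation import slic

image = np.random.rand(256, 256, 3)           # stand-in for a DFU photograph
segments = slic(image, n_segments=200, compactness=10, start_label=1)

patches = []
for label in np.unique(segments):
    ys, xs = np.nonzero(segments == label)
    patch = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]  # bounding-box crop
    patches.append(patch)                     # later fed to the CNN classifier
print(len(patches), patches[0].shape)
```
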
Fully vs. Weakly Supervised Caries Localization in Smartphone Images with CNNs

While in developed countries routine dental consultations are often covered by insurance, access to prophylactic dental examinations is often expensive in developing countries. Therefore, sufficient oral health prevention, particularly early caries detection, is not yet accessible to many people in these countries. This observation is, however, contrary to the accessibility of smartphone technology, as smartphones have become available and affordable in most countries. Their technology can be utilized for low-cost initial caries inspection to determine the necessity for a subsequent dental examination. In this paper, we address the specific problem of caries detection in smartphone images. Fully supervised methods usually require tedious location annotations, whereas weakly supervised approaches manage to address the detection task with less complex labels. To this end, we propose a weakly supervised caries detection strategy with local constraints and investigate its caries localization capabilities compared to a superior fully supervised Faster R-CNN approach as an upper baseline. Our proposed strategy shows promising initial results on our in-house smartphone caries data set.

Duc Duy Pham, Jonas Müller, Piush Aggarwal, Amit Khatri, Mayank Sharma, Torsten Zesch, Josef Pauli
Organ Segmentation with Recursive Data Augmentation for Deep Models

The precise segmentation of organs from computed tomography is a fundamental and pivotal task for correct diagnosis and proper treatment of diseases. Neural network models are widely explored for their promising performance in the segmentation of medical images. However, the small size of available datasets significantly affects the biomedical imaging domain and has a huge impact on the training of deep learning models. In this paper, we try to address this issue by iteratively augmenting the dataset with auxiliary task-based information. This is obtained by introducing a recursive training approach, where a new set of segmented images is generated at each iteration and then concatenated with the original input data as organ attention maps. In the experimental evaluation, two different datasets were tested, and the results produced by the proposed approach show significant improvements in organ segmentation compared to a standard non-recursive approach.

Muhammad Usman Akbar, Muhammad Abubakar Yamin, Vittorio Murino, Diego Sona
Pollen Grain Microscopic Image Classification Using an Ensemble of Fine-Tuned Deep Convolutional Neural Networks

Pollen grain micrograph classification has multiple applications in medicine and biology. Automatic pollen grain image classification can alleviate the problems of manual categorisation such as subjectivity and time constraints. While a number of computer-based methods have been introduced in the literature to perform this task, classification performance needs to be improved for these methods to be useful in practice. In this paper, we present an ensemble approach for pollen grain microscopic image classification into four categories: Corylus Avellana well-developed pollen grain, Corylus Avellana anomalous pollen grain, Alnus well-developed pollen grain, and non-pollen (debris) instances. In our approach, we develop a classification strategy based on the fusion of four state-of-the-art fine-tuned convolutional neural networks, namely EfficientNetB0, EfficientNetB1, EfficientNetB2 and SeResNeXt-50. These models are trained with images of three fixed sizes (224 × 224, 240 × 240, and 260 × 260 pixels), and their prediction probability vectors are then fused in an ensemble method to form a final classification vector for a given pollen grain image. Our proposed method is shown to yield excellent classification performance, obtaining an accuracy of 94.48% and a weighted F1-score of 94.54% on the ICPR 2020 Pollen Grain Classification Challenge training dataset based on five-fold cross-validation. Evaluated on the test set of the challenge, our approach achieves a very competitive performance in comparison to the top-ranked approaches, with an accuracy and weighted F1-score of 96.28% and 96.30%, respectively.

Amirreza Mahbod, Gerald Schaefer, Rupert Ecker, Isabella Ellinger
Active Surface for Fully 3D Automatic Segmentation

For tumor delineation in Positron Emission Tomography (PET) images, it is of utmost importance to devise efficient and operator-independent segmentation methods capable of reconstructing the 3D tumor shape. In this paper, we present a fully 3D automatic system for brain tumor delineation in PET images. In previous work, we proposed a 2D segmentation system based on a two-step approach. The first step automatically identified the slice enclosing the maximum tracer uptake in the whole tumor volume and generated a rough contour surrounding the tumor itself. This contour was then used to initialize the second step, where the 3D shape of the tumor was obtained by separately segmenting 2D slices. In this paper, we migrate our system to fully 3D. In particular, the segmentation in the second step is performed by evolving an active surface directly in 3D space. The key points of this advancement are that it performs the shape reconstruction on the whole stack of slices simultaneously, leveraging useful cross-slice information, and that it does not require any specific stopping condition, as the active surface naturally reaches a stable topology once convergence is achieved. The performance of this approach is evaluated on the same dataset discussed in our previous work to assess whether any benefit is achieved by migrating the system from 2D to 3D. Results confirm an improvement in performance in terms of Dice similarity coefficient (89.89%) and Hausdorff distance (1.11 voxels).

Albert Comelli, Alessandro Stefano
Penalizing Small Errors Using an Adaptive Logarithmic Loss

Loss functions are error metrics that quantify the difference between a prediction and its corresponding ground truth. Fundamentally, they define a functional landscape for traversal by gradient descent. Although numerous loss functions have been proposed to date in order to handle various machine learning problems, little attention has been given to enhancing these functions to better traverse the loss landscape. In this paper, we simultaneously and significantly mitigate two prominent problems in medical image segmentation, namely: (i) class imbalance between foreground and background pixels, and (ii) poor loss function convergence. To this end, we propose an Adaptive Logarithmic Loss (ALL) function. We compare this loss function with the existing state of the art on the ISIC 2018 dataset, the nuclei segmentation dataset, as well as the DRIVE retinal vessel segmentation dataset. We measure the performance of our methodology on benchmark metrics and demonstrate state-of-the-art performance. More generally, we show that our system can be used as a framework for better training of deep neural networks.

Chaitanya Kaul, Nick Pears, Hang Dai, Roderick Murray-Smith, Suresh Manandhar
Exploiting Saliency in Attention Based Convolutional Neural Network for Classification of Vertical Root Fractures

Cone-beam computed tomography (CBCT) is widely used in the clinical diagnosis of vertical root fractures (VRFs), which present as cracks on the teeth. However, manually checking for VRFs in a large number of CBCT images is time-consuming and error-prone. Although Convolutional Neural Networks (CNNs) have achieved unprecedented progress in natural image recognition, end-to-end CNNs are unsuitable for identifying VRFs because cracks appear at multiple scales and have complex relationships with surrounding tissues. We propose a novel Feature Pyramids Attention Convolutional Neural Network (FPA-CNN), which incorporates saliency masks and multi-scale features to boost classification performance. The saliency map is viewed as a spatial probability map of where a person might look first to make a discriminative conclusion. Therefore, it plays the role of a high-level hint to guide the network toward focusing on the discriminative region. Experimental results demonstrate that our proposed FPA-CNN overcomes the challenges arising from multi-scale cracks and complex contextual relationships.

Zhenxing Xu, Peng Wan, Gulibire Aihemaiti, Daoqiang Zhang
UIP-Net: A Decoder-Encoder CNN for the Detection and Quantification of Usual Interstitial Pneumoniae Pattern in Lung CT Scan Images

A key step in the diagnosis of Idiopathic Pulmonary Fibrosis (IPF) is the examination of high-resolution computed tomography (HRCT) images. IPF exhibits a typical radiological pattern, named the Usual Interstitial Pneumoniae (UIP) pattern, which can be detected in non-invasive HRCT investigations, thus avoiding surgical lung biopsy. Unfortunately, the visual recognition and quantification of the UIP pattern can be challenging even for experienced radiologists due to poor inter- and intra-reader agreement. This study aimed to develop a tool for the semantic segmentation and quantification of the UIP pattern in patients with IPF using a deep-learning method based on a Convolutional Neural Network (CNN), called UIP-net. The proposed CNN, based on an encoder-decoder architecture, takes as input a thoracic HRCT image and outputs a binary mask for the automatic discrimination between the UIP pattern and healthy lung parenchyma. To train and evaluate the CNN, a dataset of 5000 images, derived from 20 CT scans of different patients, was used. The network performance yielded a 96.7% BF-score and 85.9% sensitivity. Once trained and tested, UIP-net was used to segment a further 60 CT scans of different patients to estimate the volume of lungs affected by the UIP pattern. The measurements were compared with those obtained using the reference software for the automatic detection of the UIP pattern, named Computer Aided Lungs Informatics for Pathology Evaluation and Rating (CALIPER), through the Bland-Altman plot. The network performance, assessed in terms of both BF-score and sensitivity on the test set and resulting from the comparison with CALIPER, demonstrated that CNNs have the potential to reliably detect and quantify pulmonary disease, in order to evaluate its progression and become a supportive tool for radiologists.

Rossana Buongiorno, Danila Germanese, Chiara Romei, Laura Tavanti, Annalisa De Liperi, Sara Colantonio
Don’t Tear Your Hair Out: Analysis of the Impact of Skin Hair on the Diagnosis of Microscopic Skin Lesions

Recent work on the classification of microscopic skin lesions does not consider how the presence of skin hair may affect diagnosis. In this work, we investigate how deep-learning models handle a varying amount of skin hair during their predictions. We present an automated processing pipeline that tests the performance of the classification model. We conclude that, under realistic conditions, modern-day classification models are robust to the presence of skin hair, and we investigate three architectural choices (ResNet50, InceptionV3, DenseNet121) that make them so.

Alessio Gallucci, Dmitry Znamenskiy, Nicola Pezzotti, Milan Petkovic
Deep Learning Based Segmentation of Breast Lesions in DCE-MRI

Dynamic Contrast Enhanced Magnetic Resonance Imaging (DCE-MRI) is a popular tool for the diagnosis of breast lesions due to its effectiveness, especially in a high-risk population. Accurate lesion segmentation is an important step for subsequent analysis, especially for computer aided diagnosis systems. However, manual breast lesion segmentation of (4D) MRI is time consuming, requires experience, and is prone to interobserver and intraobserver variability. This work proposes a deep learning (DL) framework for segmenting breast lesions in DCE-MRI using a 3D patch-based U-Net architecture. We perform different experiments to analyse the effects of class imbalance, different patch sizes, optimizers and loss functions in a cross-validation fashion, using 46 images from a subset of a challenging, publicly available dataset not reported to date, namely TCGA-BRCA. We also compare the proposed U-Net framework with another state-of-the-art approach used for breast lesion segmentation in DCE-MRI, and report better segmentation accuracy with the proposed framework. The results presented in this work have the potential to become a publicly available benchmark for this task.

Roa’a Khaled, Joel Vidal, Robert Martí
Fall Detection and Recognition from Egocentric Visual Data: A Case Study

Falling is among the most damaging events for elderly people and may sometimes result in significant injuries. Due to fear of falling, many elderly people choose to stay at home more in order to feel safer. In this work, we propose a new fall detection and recognition approach, which analyses egocentric videos collected by wearable cameras through a computer vision/machine learning pipeline. More specifically, we conduct a case study with one volunteer who collected video data from two cameras: one attached to the chest and the other attached to the waist. A total of 776 videos were collected, describing four types of falls and nine kinds of non-falls. Our method works as follows: it extracts several uniformly distributed frames from the videos, uses a pre-trained ConvNet model to describe each frame by a feature vector, and applies feature fusion followed by a classification model. Our proposed model demonstrates its suitability for the detection and recognition of falls from the data captured by the two cameras together. For this case study, we detect all falls with only one false positive, and reach a balanced accuracy of 93% in the recognition of the 13 types of activities. Similar results are obtained for videos of the two cameras when considered separately. Moreover, we observe better performance on videos collected in indoor scenes.

Xueyi Wang, Estefanía Talavera, Dimka Karastoyanova, George Azzopardi
Deep Attention Based Semi-supervised 2D-Pose Estimation for Surgical Instruments

For many practical problems and applications, it is not feasible to create a vast and accurately labeled dataset, which restricts the application of deep learning in many areas. Semi-supervised learning algorithms intend to improve performance by also leveraging unlabeled data. This is very valuable for the 2D-pose estimation task, where data labeling requires substantial time and is subject to noise. This work aims to investigate whether semi-supervised learning techniques can achieve an acceptable performance level that justifies using these algorithms during training. To this end, a lightweight network architecture is introduced, and mean teacher, virtual adversarial training and pseudo-labeling algorithms are evaluated on 2D-pose estimation for surgical instruments. For the applicability of the pseudo-labeling algorithm, we propose a novel confidence measure, total variation. Experimental results show that the utilization of semi-supervised learning improves the performance on unseen geometries drastically while maintaining high accuracy for seen geometries. On the RMIT benchmark, our lightweight architecture outperforms the state of the art with supervised learning. On the EndoVis benchmark, the pseudo-labeling algorithm improves the supervised baseline, achieving the new state-of-the-art performance.

Mert Kayhan, Okan Köpüklü, Mhd Hasan Sarhan, Mehmet Yigitsoy, Abouzar Eslami, Gerhard Rigoll
Development of an Augmented Reality System Based on Marker Tracking for Robotic Assisted Minimally Invasive Spine Surgery

Spine surgery is nowadays performed for a great number of spine pathologies; it is estimated that 4.83 million spine surgeries are carried out globally each year. This prevalence led to the evolution of spine surgery into an extremely specialized field, so that traditional open interventions on the spine have been integrated with and often replaced by minimally invasive approaches. Despite the several benefits associated with robotic minimally invasive surgery (RMIS), loss of depth perception, reduced field of view and the consequent difficulty in intraoperative identification of relevant anatomical structures are still unsolved issues. For these reasons, Augmented Reality (AR) was introduced to support the surgeon in surgical applications. However, even though the advent of AR has promised breakthrough changes in surgery, its adoption has been slower than expected, as there are still usability hurdles. The objective of this work is to introduce a client software with marker-based optical tracking capabilities, included in a client-server architecture that uses protocols to enable real-time streaming over the network, providing desktop rendering power to the head-mounted display (HMD). Tracking results are promising (Specificity = 0.98 ± 0.03; Precision = 0.94 ± 0.04; Dice = 0.80 ± 0.07), and real-time communication was successfully established.

Francesca Pia Villani, Mariachiara Di Cosmo, Álvaro Bertelsen Simonetti, Emanuele Frontoni, Sara Moccia
Towards Stroke Patients’ Upper-Limb Automatic Motor Assessment Using Smartwatches

Assessing the physical condition in rehabilitation scenarios is a challenging problem, since it involves Human Activity Recognition (HAR) and kinematic analysis methods. In addition, the difficulties increase in unconstrained rehabilitation scenarios, which are much closer to the real use cases. In particular, our aim is to design an upper-limb assessment pipeline for stroke patients using smartwatches. We focus on the HAR task, as it is the first part of the assessment pipeline. Our main target is to automatically detect and recognize four key movements inspired by the Fugl-Meyer assessment scale, which are performed in both constrained and unconstrained scenarios. In addition to the application protocol and dataset, we propose two detection and classification baseline methods. We believe that the proposed framework, dataset and baseline results will serve to foster this research field.

Asma Bensalah, Jialuo Chen, Alicia Fornés, Cristina Carmona-Duarte, Josep Lladós, Miguel Ángel Ferrer
Multimodal Detection of Tonic–Clonic Seizures Based on 3D Acceleration and Heart Rate Data from an In-Ear Sensor

Patients with epilepsy suffer from recurrently occurring seizures. To improve diagnosis and treatment as well as to increase patients’ safety and quality of life, it is of great interest to develop reliable methods for automated seizure detection. In this work, we evaluate a first trial of a multimodal approach combining 3D acceleration and heart rate data acquired with a mobile In-Ear sensor as part of the project EPItect. For the detection of tonic–clonic seizures (TCS), we train different classification models (Naïve Bayes, K-Nearest-Neighbor, linear Support Vector Machine and AdaBoost.M1) and evaluate cost-sensitive learning as a measure to address the problem of highly imbalanced data. To assess the performance of our multimodal approach, we compare it to a unimodal approach, which only uses the acceleration data. Experiments show that our method leads to a higher sensitivity, lower detection latency and lower false alarm rate compared to the unimodal method.

Jasmin Henze, Salima Houta, Rainer Surges, Johannes Kreuzer, Pinar Bisgin
Deep Learning Detection of Cardiac Akinesis in Echocardiograms

Heart diseases are still among the main causes of death in the world population. Tools able to discriminate this type of problem early, even when used by non-specialized medical personnel on an outpatient basis, would reduce the pressure on hospital centers and improve patient prognosis. This paper focuses on the problem of cardiac akinesis, a condition attributable to a very large number of pathologies and a possible serious complication for COVID-19 patients. In particular, we considered echocardiographic images of both akinetic and healthy patients. The dataset, containing echocardiograms of around 700 patients, was supplied by the Sacco hospital of Milan (Italy). We implemented a modified ResNet34 architecture and tested the model under various combinations of parameters. The final best-performing model was able to achieve an F1-score of 0.91 in the binary classification Akinetic vs. Normokinetic.

Alessandro Bitetto, Elena Bianchi, Piercarlo Dondi, Luca Bianchi, Janos Tolgyesi, Diego Ferri, Luca Lombardi, Paola Cerchiello, Azzurra Marceca, Alberto Barosi
Prediction of Minimally Conscious State Responder Patients to Non-invasive Brain Stimulation Using Machine Learning Algorithms

The right matching of patients to an intended treatment is routinely performed by doctors and physicians in healthcare. Improving doctors’ ability to choose the right treatment can greatly speed up patients’ recovery. In a clinical study on Disorders of Consciousness, patients in a Minimally Conscious State (MCS) went through transcranial Electrical Stimulation (tES) therapy to increase their consciousness level. We studied MCS patients’ responses to tES therapy using as input the EEG data collected before the intervention. Different Machine Learning approaches were applied to the Relative Band Power features extracted from the EEG. We aimed to predict the tES treatment outcome from the EEG data of 17 patients, 4 of whom sustainably showed further signs of consciousness after treatment. We were able to correctly classify the response of patients to tES therapy with 95% accuracy. In this paper, we present the methodology as well as a comparative evaluation of the different employed classification approaches. Hereby we demonstrate the feasibility of implementing a novel informed Decision Support System (DSS) based on this methodological approach for the correct prediction of patients’ response to tES therapy in MCS.

Andrés Rojas, Eleni Kroupi, Géraldine Martens, Aurore Thibaut, Alice Barra, Steven Laureys, Giulio Ruffini, Aureli Soria-Frisch
Sinc-Based Convolutional Neural Networks for EEG-BCI-Based Motor Imagery Classification

Brain-Computer Interfaces (BCI) based on motor imagery translate mental motor images recognized from the electroencephalogram (EEG) into control commands. EEG patterns of different imagination tasks, e.g. hand and foot movements, are effectively classified with machine learning techniques using band power features. Recently, Convolutional Neural Networks (CNNs), which learn effective features and classifiers simultaneously from raw EEG data, have also been applied. However, CNNs have two major drawbacks: (i) they have a very large number of parameters, which thus requires a very large number of training examples; and (ii) they are not designed to explicitly learn features in the frequency domain. To overcome these limitations, in this work we introduce Sinc-EEGNet, a lightweight CNN architecture that combines learnable band-pass and depthwise convolutional filters. Experimental results obtained on the publicly available BCI Competition IV Dataset 2a show that our approach outperforms reference methods in terms of classification accuracy.

Alessandro Bria, Claudio Marrocco, Francesco Tortorella
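
A minimal sketch of the learnable sinc band-pass idea behind such architectures (simplified from SincNet-style layers, not the exact Sinc-EEGNet code); the sampling rate, kernel size, and cutoff frequencies are illustrative assumptions.

```python
# Sketch: a band-pass FIR kernel parameterized by two learnable cutoff frequencies.
import torch

def sinc_bandpass(low_hz, high_hz, kernel_size=65, fs=250.0):
    t = (torch.arange(kernel_size, dtype=torch.float32) - (kernel_size - 1) / 2) / fs
    def lowpass(fc):                          # ideal low-pass via the sinc function
        return 2 * fc * torch.special.sinc(2 * fc * t)
    band = lowpass(high_hz) - lowpass(low_hz) # subtract two low-passes -> band-pass
    window = torch.hamming_window(kernel_size)
    return band * window                      # windowing reduces spectral leakage

low = torch.tensor(8.0, requires_grad=True)   # e.g. mu-band lower edge, learnable
high = torch.tensor(13.0, requires_grad=True)
kernel = sinc_bandpass(low, high)
eeg = torch.randn(1, 1, 1000)                 # (batch, channel, samples) toy EEG
out = torch.nn.functional.conv1d(eeg, kernel.view(1, 1, -1))
out.sum().backward()                          # gradients flow back to the two cutoffs
print(kernel.shape, low.grad, high.grad)
```
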
An Analysis of Tasks and Features for Neuro-Degenerative Disease Assessment by Handwriting

Neurodegenerative disease assessment through handwriting has been shown to be effective. In this exploratory analysis, several features are extracted and tested on different tasks of the novel HAND-UNIBA dataset. Results show which kinematic features are most important and which tasks are most significant for neurodegenerative disease assessment through handwriting.

Vincenzo Dentamaro, Donato Impedovo, Giuseppe Pirlo
A Comparative Study on Autism Spectrum Disorder Detection via 3D Convolutional Neural Networks

The prevalence of Autism Spectrum Disorder (ASD) in the United States increased by 178% from 2000 to 2016. However, due to the lack of well-trained specialists and the time-consuming diagnostic process, many children cannot be promptly diagnosed. Recently, several studies have taken steps to explore automatic video-based ASD detection systems with the help of machine learning and deep learning models, such as the support vector machine (SVM) and long short-term memory (LSTM) model. However, the models mentioned above cannot extract effective features directly from raw videos. In this study, we aim to take advantage of 3D convolution-based deep learning models to aid video-based ASD detection. We explore three representative 3D convolutional neural networks (CNNs): C3D, I3D and 3D ResNet. In addition, a new 3D convolutional model, called 3D ResNeSt, is also proposed based on ResNeSt. We evaluate these models on an ASD detection dataset. The experimental results show that, on average, all four 3D convolutional models obtain competitive results compared to the baseline using an LSTM model. Our proposed 3D ResNeSt model achieves the best performance, improving the average detection accuracy from 0.72 to 0.85.

Kaijie Zhang, Wei Wang, Yijun Guo, Caifeng Shan, Liang Wang
A Multi Classifier Approach for Supporting Alzheimer’s Diagnosis Based on Handwriting Analysis

Nowadays, the treatments of neurodegenerative diseases are increasingly sophisticated, mainly thanks to innovations in the medical field. As the effectiveness of care strategies is enhanced by early diagnosis, in recent years there has been an increasing interest in developing reliable, non-invasive, easy-to-administer, and cheap diagnostic tools to support clinicians in the diagnostic process. Among others, Alzheimer's disease (AD) has received special attention in that it is a severe and progressive neurodegenerative disease that heavily influences the patient's quality of life, as well as the social costs of proper care. In this context, a large variety of methods have been proposed that exploit handwriting and drawing tasks to discriminate between healthy subjects and AD patients. Most, if not all, of these methods adopt a single machine learning technique to achieve the final classification. We propose to tackle the problem by adopting a multi-classifier approach with as many classifiers as tasks, each of which produces a binary output; the outputs of the classifiers are then combined by majority vote to reach the final decision. Experiments on a dataset of 175 subjects executing 25 different handwriting and drawing tasks, using 6 machine learning techniques selected among the most widely used in the literature, show that the best results are achieved by selecting the subset of tasks on which each classifier performs best and then combining the per-task outputs, yielding an overall accuracy of 91% with a sensitivity of 83% and a specificity of 100%. Moreover, this strategy reduces the mean time to complete the test from 25 minutes to less than 10.
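A sketch of the per-task majority-vote scheme, with random placeholder data and a single base learner (the paper compares six techniques per task and selects the best-performing task subsets):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_subjects, n_tasks = 175, 25
tasks = [rng.standard_normal((n_subjects, 10)) for _ in range(n_tasks)]  # one feature matrix per task
y = rng.integers(0, 2, size=n_subjects)                                  # 1 = AD patient

# One binary classifier per task, combined by majority vote across tasks.
clfs = [RandomForestClassifier(random_state=0).fit(X_t, y) for X_t in tasks]
votes = np.stack([clf.predict(X_t) for clf, X_t in zip(clfs, tasks)])    # (n_tasks, n_subjects)
final = (votes.mean(axis=0) > 0.5).astype(int)
```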

Giuseppe De Gregorio, Domenico Desiato, Angelo Marcelli, Giuseppe Polese
A Lightweight Spatial Attention Module with Adaptive Receptive Fields in 3D Convolutional Neural Network for Alzheimer’s Disease Classification

The development of deep learning provides powerful support for disease classification from neuroimaging data. However, deep learning methods for classifying neuroimaging data often fail to fully exploit its spatial information. In this paper, we propose a lightweight 3D spatial attention module with adaptive receptive fields, which allows neurons to adaptively adjust their receptive field size according to multiple scales of the input. The attention module fuses spatial information of different scales across multiple branches, so that the 3D spatial information of neuroimaging data can be fully utilized. A 3D-ResNet18 based on the proposed attention module is trained to diagnose Alzheimer's disease (AD). Experiments are conducted on 521 subjects (254 patients with AD and 267 normal controls) from the Alzheimer's Disease Neuroimaging Initiative (ADNI) dataset of 3D structural MRI brain scans. Experimental results show the effectiveness and efficiency of the proposed approach for AD classification.
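One way such adaptive receptive fields can be realized is a selective-kernel-style block: parallel branches with different effective receptive fields fused by learned soft attention. This is a sketch inspired by the idea; the published module's exact design may differ.

```python
import torch
import torch.nn as nn

class AdaptiveRF3D(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.branch3 = nn.Conv3d(channels, channels, 3, padding=1)
        self.branch5 = nn.Conv3d(channels, channels, 3, padding=2, dilation=2)  # effective 5x5x5
        self.fc = nn.Sequential(nn.Linear(channels, channels // 2), nn.ReLU(),
                                nn.Linear(channels // 2, 2 * channels))

    def forward(self, x):                          # x: (B, C, D, H, W)
        u3, u5 = self.branch3(x), self.branch5(x)
        s = (u3 + u5).mean(dim=(2, 3, 4))          # global descriptor, (B, C)
        a = self.fc(s).view(x.size(0), 2, -1).softmax(dim=1)  # per-channel branch weights
        return a[:, 0, :, None, None, None] * u3 + a[:, 1, :, None, None, None] * u5

y = AdaptiveRF3D(16)(torch.randn(2, 16, 8, 24, 24))
```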

Fei Yu, Baoqi Zhao, Qingqing Ge, Zhijie Zhang, Junmei Sun, Xiumei Li
Handwriting-Based Classifier Combination for Cognitive Impairment Prediction

Cognitive impairments affect areas such as memory, learning, concentration, or decision making, and range from mild to severe. Impairments of this kind can be indicators of neurodegenerative diseases such as Alzheimer's, which affect millions of people worldwide and whose incidence is expected to increase in the near future. Handwriting is one of the daily activities affected by this kind of impairment, and its anomalies are already used for the diagnosis of neurodegenerative diseases, such as micrographia in Parkinson's patients. Classifier combination methods have proved to be an effective tool for increasing performance in pattern recognition applications. The rationale of this approach follows from the observation that appropriately diverse classifiers, especially when trained on different types of data, tend to make uncorrelated errors. In this paper, we present a study in which the responses of different classifiers, trained on data from graphic tasks, are combined to predict cognitive impairments. The proposed system was trained and tested on a dataset containing handwritten traits extracted from simple graphic tasks, e.g. joining two points or drawing circles. The results confirm that a simple combination rule, such as the majority vote, performs better than the single classifiers.

Nicole Dalia Cilia, Claudio De Stefano, Francesco Fontanella, Alessandra Scotto di Freca

CADL2020 - Workshop on Computational Aspects of Deep Learning

Frontmatter
WaveTF: A Fast 2D Wavelet Transform for Machine Learning in Keras

The wavelet transform is a powerful tool for multiscale analysis and a key subroutine in countless applications, from image processing to astronomy. Recently, its range of users has extended to the ever-growing machine learning community. For a wavelet library to be efficiently adopted in this context, it needs to provide transformations that integrate seamlessly into existing machine learning workflows and neural networks, leveraging the same libraries and running on the same hardware (e.g., CPU vs. GPU) as the rest of the machine learning pipeline, without impacting training and evaluation performance. In this paper we present WaveTF, a wavelet library available as a Keras layer, which leverages TensorFlow to exploit GPU parallelism and can be used to enrich existing machine learning workflows. To demonstrate its efficiency we compare its raw performance against alternative libraries, and we measure the overhead it adds to the learning process when integrated into an existing Convolutional Neural Network.
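As a minimal stand-in for what such a library provides, a one-level 2D Haar transform can itself be written as a Keras layer (this is a sketch, not WaveTF's actual API):

```python
import tensorflow as tf

class Haar2D(tf.keras.layers.Layer):
    """One-level 2D Haar transform; expects even H and W."""
    def call(self, x):                         # x: (batch, H, W, C)
        a = x[:, 0::2, 0::2, :]
        b = x[:, 0::2, 1::2, :]
        c = x[:, 1::2, 0::2, :]
        d = x[:, 1::2, 1::2, :]
        ll = (a + b + c + d) / 2.0             # orthonormal Haar sub-bands
        lh = (a + b - c - d) / 2.0
        hl = (a - b + c - d) / 2.0
        hh = (a - b - c + d) / 2.0
        return tf.concat([ll, lh, hl, hh], axis=-1)   # (batch, H/2, W/2, 4C)

y = Haar2D()(tf.random.normal([8, 32, 32, 3]))  # drops into any Sequential/Functional model
```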

Francesco Versaci
Convergence Dynamics of Generative Adversarial Networks: The Dual Metric Flows

Fitting neural networks often relies on stochastic (or similar) gradient descent, a noise-tolerant (and efficient) realization of gradient descent dynamics. It outputs a sequence of network parameters that evolves over the training steps; in the limit of small learning rate and infinite batch size, this sequence of increasingly optimal parameters follows the continuous gradient descent dynamics. In this contribution, we instead investigate convergence in the Generative Adversarial Networks (GANs) used in machine learning. We study the limit of small learning rate and show that, as in single-network training, the GAN learning dynamics tend, for vanishing learning rate, to a limit dynamics. This leads us to consider evolution equations in metric spaces (the natural framework for evolving probability laws), which we call dual flows. We give formal definitions of solutions and prove convergence. The theory is then applied to specific instances of GANs, and we discuss how this insight helps understand and mitigate mode collapse.
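As a standard illustration of the vanishing-learning-rate limit invoked above (a simplified Euclidean picture; the paper works in the more general setting of flows on metric spaces of probability laws): the simultaneous updates

$$\theta_G^{k+1} = \theta_G^k - \eta\,\nabla_{\theta_G} L_G(\theta_G^k, \theta_D^k), \qquad \theta_D^{k+1} = \theta_D^k - \eta\,\nabla_{\theta_D} L_D(\theta_G^k, \theta_D^k)$$

converge, as $$\eta \to 0$$, to the coupled flow

$$\dot\theta_G = -\nabla_{\theta_G} L_G(\theta_G, \theta_D), \qquad \dot\theta_D = -\nabla_{\theta_D} L_D(\theta_G, \theta_D).$$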

Gabriel Turinici
Biomedical Named Entity Recognition at Scale

Named entity recognition (NER) is a widely applicable natural language processing task and building block of question answering, topic modeling, information retrieval, etc. In the medical domain, NER plays a crucial role by extracting meaningful chunks from clinical notes and reports, which are then fed to downstream tasks like assertion status detection, entity resolution, relation extraction, and de-identification. Reimplementing a Bi-LSTM-CNN-Char deep learning architecture on top of Apache Spark, we present a single trainable NER model that obtains new state-of-the-art results on seven public biomedical benchmarks without using heavy contextual embeddings like BERT. This includes improving BC4CHEMD to 93.72% (4.1% gain), Species800 to 80.91% (4.6% gain), and JNLPBA to 81.29% (5.2% gain). In addition, this model is freely available within a production-grade code base as part of the open-source Spark NLP library; can scale up for training and inference in any Spark cluster; has GPU support and libraries for popular programming languages such as Python, R, Scala and Java; and can be extended to support other human languages with no code changes.
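A hedged sketch of serving a pretrained NER pipeline with the open-source Spark NLP library mentioned above; the pipeline name and output key below are illustrative assumptions, and the paper's biomedical models are published under their own names in the Spark NLP model hub.

```python
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()                       # spins up a local Spark session
pipeline = PretrainedPipeline("explain_document_dl", lang="en")
result = pipeline.annotate("The patient was prescribed 50 mg of metformin.")
print(result["entities"])                      # extracted entity chunks
```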

Veysel Kocaman, David Talby
PyraD-DCNN: A Fully Convolutional Neural Network to Replace BLSTM in Offline Text Recognition Systems

We present in this paper a fast and efficient multi-task fully convolutional neural network (FCNN). The proposed architecture uses a multi-resolution Pyramid of Densely connected Dilated Convolutions (PyraD-DCNN). Our design also implements optimized convolutional building blocks that enable large-dimensional representations at a low computational cost. Besides its ability to perform semantic image segmentation by itself as an auto-encoder, it may also be coupled with a signal encoder to build an end-to-end signal-to-sequence system without recurrent layers (RNNs). In this work, we present the PyraD-DCNN through an application to Optical Character Recognition and show how it compares with Bidirectional Long Short-Term Memory (BLSTM) RNNs. The pyramid-like structure of dilated kernels provides short- and long-term context management without recurrence. We thus improved inference time on CPU by up to a factor of three on our own datasets compared to a classical CNN-LSTM, with slight accuracy improvements in addition to faster training cycles (up to 24 times faster). Furthermore, the lightness of this structure makes it naturally suited to mobile applications without any accuracy loss.
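A sketch of the pyramid-of-dilated-convolutions idea: stacking kernels with geometrically growing dilation widens the receptive field over the text line without recurrence. Layer sizes and the alphabet size are illustrative, not the PyraD-DCNN configuration.

```python
import tensorflow as tf

inp = tf.keras.Input(shape=(None, 64))          # (time, features) from a visual encoder
x = inp
for rate in (1, 2, 4, 8):
    x = tf.keras.layers.Conv1D(128, kernel_size=3, dilation_rate=rate,
                               padding="same", activation="relu")(x)
out = tf.keras.layers.Dense(96, activation="softmax")(x)  # per-step character posteriors (e.g. for CTC)
model = tf.keras.Model(inp, out)
```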

Jonathan Jouanne, Quentin Dauchy, Ahmad Montaser Awal
Learning Sparse Filters in Deep Convolutional Neural Networks with a Pseudo-Norm

While deep neural networks (DNNs) have proven to be efficient for numerous tasks, they come at a high memory and computation cost, thus making them impractical on resource-limited devices. However, these networks are known to contain a large number of parameters, and recent research has shown that their structure can be more compact without compromising their performance. In this paper, we present a sparsity-inducing regularization term based on the ratio $$l_1/l_2$$ pseudo-norm defined on the filter coefficients. By defining this pseudo-norm appropriately for the different filter kernels, and removing irrelevant filters, the number of kernels in each layer can be drastically reduced, leading to very compact Deep Convolutional Neural Network (DCNN) structures. Unlike numerous existing methods, our approach does not require an iterative retraining process: using this regularization term, it directly produces a sparse model during the training process. Furthermore, our approach is much easier and simpler to implement than existing methods. Experimental results on MNIST and CIFAR-10 show that our approach significantly reduces the number of filters of classical models such as LeNet and VGG while reaching the same or even better accuracy than the baseline models. Moreover, the trade-off between sparsity and accuracy is compared to other loss regularization terms based on the $$l_1$$ or $$l_2$$ norm, as well as to the SSL [1], NISP [2] and GAL [3] methods, and our approach outperforms them.
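The shape of such a regularizer is easy to sketch: the $$l_1/l_2$$ ratio of a vector is smallest when few of its coefficients are nonzero, so summing it over filters pushes each filter toward sparsity. Grouping and weighting below are assumptions of this sketch; the paper defines the pseudo-norm per kernel.

```python
import torch

def l1_over_l2(conv_weight, eps=1e-8):
    """Sum of the l1/l2 ratio over the filters of one conv layer."""
    w = conv_weight.flatten(start_dim=1)            # (n_filters, k*k*c_in)
    return (w.abs().sum(dim=1) / (w.norm(dim=1) + eps)).sum()

# Added to the task loss during training, e.g.:
# loss = criterion(model(x), y) + lam * sum(l1_over_l2(m.weight)
#                                           for m in model.modules()
#                                           if isinstance(m, torch.nn.Conv2d))
```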

Anthony Berthelier, Yongzhe Yan, Thierry Chateau, Christophe Blanc, Stefan Duffner, Christophe Garcia
Multi-node Training for StyleGAN2

StyleGAN2 is a TensorFlow-based Generative Adversarial Network (GAN) framework that represents the state of the art in generative image modelling. The current release of StyleGAN2 implements multi-GPU training via TensorFlow's device contexts, which limits data parallelism to a single node. In this work, a data-parallel multi-node training capability is implemented in StyleGAN2 via Horovod, which enables harnessing the compute capability of larger cluster architectures. We demonstrate that the new Horovod-based communication outperforms the previous context approach on a single node. Furthermore, we demonstrate that the multi-node training does not compromise the accuracy of StyleGAN2 for a constant effective batch size. Finally, we report strong and weak scaling of the new implementation up to 64 NVIDIA Tesla A100 GPUs distributed across eight NVIDIA DGX A100 nodes, demonstrating the utility of the approach at scale.
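The basic Horovod data-parallel pattern is shown below for a generic tf.keras model; the paper hooks the same primitives (gradient allreduce, initial weight broadcast) into StyleGAN2's custom training loop, and the toy model here is an assumption.

```python
import horovod.tensorflow.keras as hvd
import tensorflow as tf

hvd.init()                                              # one process per GPU
gpus = tf.config.list_physical_devices("GPU")
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], "GPU")

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
opt = hvd.DistributedOptimizer(                         # allreduce of gradients
    tf.keras.optimizers.SGD(0.01 * hvd.size()))         # lr scaled with worker count
model.compile(loss="mse", optimizer=opt)
callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]  # sync initial weights
```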

Niki A. Loppi, Tuomas Kynkäänniemi
Flow R-CNN: Flow-Enhanced Object Detection

This work addresses the problem of multi-task object detection in an efficient, generic, and at the same time simple way, following recent and highly promising studies in the computer vision field, and more specifically the Region-based Convolutional Neural Network (R-CNN) approach. A flow-enhanced methodology for object detection is proposed by adding a new branch that predicts an object-level flow field. Following a scheme grounded in neuroscience, a pseudo-temporal motion stream is integrated in parallel to the classification, bounding box regression and segmentation mask prediction branches of Mask R-CNN. Extensive experiments and thorough comparative evaluation provide a detailed analysis of the problem at hand and demonstrate the added value of the object-level flow branch. The overall proposed approach achieves improved performance on the six currently broadest and most challenging publicly available semantic urban scene understanding datasets, surpassing the region-based baseline method.

Athanasios Psaltis, Anastasios Dimou, Federico Alvarez, Petros Daras
Compressed Video Action Recognition Using Motion Vector Representation

Action recognition is an important task in video understanding. Due to their high computational cost, conventional approaches employing optical flow are difficult to use in real time. Recently, the Motion Vector (MV), which can be directly extracted from compressed video, has been introduced for action recognition. In this paper, we propose a novel approach that utilizes the motion vector representation for action recognition. On the one hand, we use the motion vector information to select key information sequences for recognition. On the other hand, we further use the motion vectors to form the representation of the selected sequences. We evaluate the proposed approach on the UCF101 and HMDB51 datasets. The experimental results demonstrate that the proposed approach achieves competitive recognition performance while maintaining an end-to-end processing rate of 461.5 fps.

Chenghui Zhou, Xiaolei Chen, Pei Sun, Guanwen Zhang, Wei Zhou
Introducing Region Pooling Learning

In recent years, convolutional neural network (CNN) topologies have been evolving at an increasingly fast pace, with novel proposals like Inception, ResNet, MobileNet, etc., pushing the performance on available benchmarks and optimizing towards smaller models with fewer trainable parameters that achieve results comparable to the known state of the art. To this day, most of these approaches rely on either pooling layers or strided convolutions as one of the main building blocks of a CNN, i.e., the traditional max-pooling and average-pooling layers, as a way to progressively reduce the spatial dimensionality of features through the network, producing a value that does not necessarily preserve information about the position of the object of interest. The choice between these two layers is typically based on the experience of the neural network architect, where several training procedures have to be evaluated to guarantee the best accuracy. In this work, we introduce the concept of region pooling learning, in which an optimal pooling behavior is learned through training; additionally, the knowledge of the region pooling layers can be leveraged by deeper layers. A region pooling layer can learn to behave as a max-pooling or an average-pooling, or it might learn to pool the most convenient value based on training. The experimental results presented in this work on two available image datasets suggest that the use of the region pooling layer improves the performance of a ResNet18 CNN, in some cases outperforming a typical ResNet110 CNN.
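One simple way a pooling layer can learn to behave as max-pooling, average-pooling, or something in between is a learnable blend of the two. This is a sketch of the idea only; the paper's region pooling layer is more general.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedPool2d(nn.Module):
    """Per-channel learnable interpolation between max- and average-pooling."""
    def __init__(self, channels, kernel_size=2):
        super().__init__()
        self.k = kernel_size
        self.alpha = nn.Parameter(torch.zeros(channels))    # mixing logit per channel

    def forward(self, x):                                   # x: (B, C, H, W)
        a = torch.sigmoid(self.alpha)[None, :, None, None]  # 0 -> average, 1 -> max
        return a * F.max_pool2d(x, self.k) + (1 - a) * F.avg_pool2d(x, self.k)

y = MixedPool2d(16)(torch.randn(2, 16, 32, 32))             # (2, 16, 16, 16)
```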

Jesus Adan Cruz Vargas, Julio Zamora Esquivel, Omesh Tickoo
Second Order Bifurcating Methodology for Neural Network Training and Topology Optimization

This work proposes a second-order methodology that minimizes the global loss error over the training process of a neural network (NN) with fully connected layers, based on the use of vertical and horizontal tangent parabolas to the derivative of the error. This methodology expands the search area for zero-crossings of the error derivative function without restrictions, thereby quantifying whether a larger or a smaller number of neurons is needed in a given fully connected layer to optimally converge to the solution for a given training database. During training, the number of neurons in a layer converges to the number of roots of the derivative of the error function: when two neurons converge to the same root, they merge into a single neuron; additionally, every neuron improves its position to better cover the training data distribution, or it is split into two neurons if needed, depending on its derivative function at each training iteration. The proposed routine removes neurons that are not in the neighborhood of the error's minimum value, reducing computational costs and optimizing the model architecture, since it adjusts the NN topology within the same training process, without the cost of training multiple topologies in an exhaustive trial-and-error approach.

Julio Zamora Esquivel, Jesus Adan Cruz Vargas, Omesh Tickoo
Backmatter
Metadata
Title
Pattern Recognition. ICPR International Workshops and Challenges
edited by
Prof. Alberto Del Bimbo
Prof. Rita Cucchiara
Prof. Stan Sclaroff
Dr. Giovanni Maria Farinella
Tao Mei
Prof. Dr. Marco Bertini
Hugo Jair Escalante
Dr. Roberto Vezzani
Copyright Year
2021
Electronic ISBN
978-3-030-68763-2
Print ISBN
978-3-030-68762-5
DOI
https://doi.org/10.1007/978-3-030-68763-2