
2022 | Book

Image Analysis and Processing – ICIAP 2022

21st International Conference, Lecce, Italy, May 23–27, 2022, Proceedings, Part I

Editors: Prof. Stan Sclaroff, Cosimo Distante, Marco Leo, Dr. Giovanni M. Farinella, Prof. Dr. Federico Tombari

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

The proceedings set LNCS 13231, 13232, and 13233 constitutes the refereed proceedings of the 21st International Conference on Image Analysis and Processing, ICIAP 2022, which was held during May 23-27, 2022, in Lecce, Italy.

The 168 papers included in the proceedings were carefully reviewed and selected from 307 submissions. They deal with video analysis and understanding; pattern recognition and machine learning; deep learning; multi-view geometry and 3D computer vision; image analysis, detection and recognition; multimedia; biomedical and assistive technology; digital forensics and biometrics; image processing for cultural heritage; robot vision; etc.

Table of Contents

Frontmatter
Correction to: Improving Colon Carcinoma Grading by Advanced CNN Models

In the originally published version of chapter 20, the name of the author Pierluigi Carcagnì contained a spelling mistake. This has been corrected.

Marco Leo, Pierluigi Carcagnì, Luca Signore, Giulio Benincasa, Mikko O. Laukkanen, Cosimo Distante
Correction to: Improving Autoencoder Training Performance for Hyperspectral Unmixing with Network Reinitialisation

In the originally published version of chapter 33, table 2 included an error. This has been corrected.

Kamil Książek, Przemysław Głomb, Michał Romaszewski, Michał Cholewa, Bartosz Grabowski, Krisztián Búza

Brave New Ideas

Frontmatter
A Lightweight Model for Satellite Pose Estimation

In this work, a study on computer vision techniques for automating rendezvous manoeuvres in space has been carried out. A lightweight algorithm pipeline for 6 degrees of freedom (DOF) object pose estimation, i.e. the relative position and attitude of a spacecraft in a non-cooperative context, using a monocular camera has been studied. In particular, the considered lite architecture has never been exploited for space operations, and it complies with the operational constraints, in terms of payload and power, of small satellite platforms. Experiments were performed on a benchmark Satellite Pose Estimation Dataset of synthetic and real spacecraft imagery specifically introduced for the challenging task of 6DOF object pose estimation in space. Extensive comparisons with existing approaches are provided both in terms of reliability/accuracy and in terms of model size, which inevitably affects resource requirements for deployment on space vehicles.

Pierluigi Carcagnì, Marco Leo, Paolo Spagnolo, Pier Luigi Mazzeo, Cosimo Distante
Imitation Learning for Autonomous Vehicle Driving: How Does the Representation Matter?

Autonomous vehicle driving is gaining ground, receiving increasing attention from the academic and industrial communities. Despite this considerable effort, there is a lack of a systematic and fair analysis of the input representations by means of a careful experimental evaluation on the same framework. To this aim, this work proposes the first comprehensive, comparative analysis of the most common inputs that can be processed by a conditional imitation learning (CIL) approach. More specifically, we considered combinations of raw and processed data—namely RGB images, depth (D) images and semantic segmentation (S)—to be assessed as inputs of the well-established Conditional Imitation Learning with ResNet and Speed prediction (CILRS) architecture. We performed a benchmark analysis, endorsed by statistical tests, on the CARLA simulator to compare the considered configurations. The achieved results showed that RGB outperformed the other monomodal inputs in terms of success rate on the most popular benchmark, NoCrash. However, RGB did not generalize well when tested on different weather conditions; overall, the best multimodal configuration was the combination of RGB images and semantic segmentation inputs (i.e., RGBS), especially in regular and dense traffic scenarios. This confirms that an appropriate fusion of multimodal sensors is an effective approach in autonomous vehicle driving.

Antonio Greco, Leonardo Rundo, Alessia Saggese, Mario Vento, Antonio Vicinanza
LessonAble: Leveraging Deep Fakes in MOOC Content Creation

This paper introduces LessonAble, a pipelined methodology leveraging the concept of Deep Fakes for generating MOOC (Massive Online Open Course) visual contents directly from a lesson narrative. To achieve this, the proposed pipeline consists of three main modules: audio generation, video generation and lip-syncing. In this work, we use the NVIDIA Tacotron2 text-to-speech model to generate custom speech from text, adapt the well-known First Order Motion Model to generate the video sequence from different driving sequences and target images, and modify the Wav2Lip model to deal with lip-syncing. Moreover, we introduce some novel strategies to support the use of markdown-like formatting to guide the pipeline in the generation of expression-aware (i.e. curious, happy, etc.) contents. Despite the use and adaptation of third-party modules, developing such a pipeline presented interesting challenges, all analysed and reported in this work. The result is an extremely intuitive tool to support MOOC content generation.

Ciro Sannino, Michela Gravina, Stefano Marrone, Giuseppe Fiameni, Carlo Sansone
An Intelligent Scanning Vehicle for Waste Collection Monitoring

While many industries have adopted digital solutions to improve ecological footprints and optimize services, new technologies have not yet found broad acceptance in waste management. In addition, past efforts to motivate households to improve waste separation have shown limited success. Institutions like the European Union (EU) are undertaking strong efforts to reduce greenhouse gas emissions as part of a greater plan for fighting climate change. In this context, developing intelligent digital technologies for waste management helps to increase the recycling rate and, as a consequence, reduce greenhouse gas emissions. Within this work, we propose an innovative computer vision system that is able to assess residential waste in real-time and deliver individual feedback to households and waste management companies with the aim of increasing recycling rates and thus reducing emissions. It consists of two core components: compact scanning hardware designed specifically for rugged environments like the innards of a garbage truck, and intelligent software that applies a convolutional neural network (CNN) to automatically identify the composition of the waste dumped into the truck and subsequently delivers the results to a web portal for further analysis and communication. We show that our system can impact household separation behavior and result in higher recycling rates, leading to a noticeable reduction of CO2 emissions in the long term.

Georg Waltner, Malte Jaschik, Alfred Rinnhofer, Horst Possegger, Horst Bischof
Morphological Galaxies Classification According to Hubble-de Vaucouleurs Diagram Using CNNs

Galaxy morphology classification is a crucial task for studying galaxies' physical properties, formation and evolutionary histories. Large-scale surveys of the universe have boosted the need to develop techniques for automated galaxy morphological classification. This paper proposes a system able to automatically classify galaxies according to the Hubble-de Vaucouleurs diagram. We introduce a novel CNN architecture that, for the first time, was trained to automatically classify galaxies according to the 26-class Hubble-de Vaucouleurs scheme. We use the Galaxy Zoo dataset and its decision tree to extract labeled examples containing an even number of images for each of the 26 classes. We also compared different CNN backbones in order to assess the obtained galaxy classification results. We obtain a balanced multi-class accuracy (BCA) of more than 80% in classifying all 26 Hubble-de Vaucouleurs galaxy categories.
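
The BCA reported above is the mean of per-class recalls, often called balanced accuracy. A minimal sketch of how such a score can be computed (toy labels, not the paper's data):

```python
# Balanced multi-class accuracy (BCA) as the mean per-class recall; this is
# what sklearn's balanced_accuracy_score computes.
from sklearn.metrics import balanced_accuracy_score

y_true = [0, 0, 1, 1, 1, 2, 2, 2, 2]   # toy labels for 3 of the 26 classes
y_pred = [0, 1, 1, 1, 2, 2, 2, 2, 2]

bca = balanced_accuracy_score(y_true, y_pred)
print(f"BCA = {bca:.3f}")   # (1/2 + 2/3 + 4/4) / 3 ≈ 0.722
```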

Pier Luigi Mazzeo, Antonio Rizzo, Cosimo Distante

Biomedical and Assistive Technology

Frontmatter
Pulmonary-Restricted COVID-19 Informative Visual Screening Using Chest X-ray Images from Portable Devices

In the recent COVID-19 outbreak, chest X-rays were the main tool for diagnosing and monitoring the pathology. To prevent further spread of this disease, special circuits had to be implemented in the healthcare services. For this reason, these chest X-rays were captured with portable X-ray devices that compensate for their lower quality and limitations with greater deployment flexibility. However, most of the proposed computer-aided diagnosis methodologies were designed to work with traditional fixed X-ray machines, and their performance is diminished when faced with these portable images. Additionally, given that the equipment needed to properly treat the disease (such as for life support and monitoring of vital signs) is often visible in the images, most of these systems learnt to identify these artifacts instead of real clinically significant variables. In this work, we present the first methodology that is forced to extract features exclusively from the pulmonary region of interest and is specially designed to work with these difficult portable images. Additionally, we generate a class activation map so the methodology also provides explainability for the results returned to the clinician. To ensure the robustness of our proposal, we tested the methodology with chest radiographs from patients diagnosed with COVID-19, pathologies similar to COVID-19 (such as other types of viral pneumonia) and healthy patients, in different combinations with three convolutional networks from the state of the art (for a total of 9 studied scenarios). The experimentation confirms that our proposal is able to separate COVID-19 cases, reaching an accuracy of 94.7% ± 1.34%.
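
The class activation map mentioned above can be illustrated with the classic CAM formulation for a CNN ending in global average pooling and a linear classifier; the shapes and weights below are placeholders, not the networks used in the paper:

```python
# Generic class activation map (CAM): the class-weighted sum of the last
# convolutional feature maps, highlighting regions that drive the decision.
import torch

feature_maps = torch.randn(1, 512, 7, 7)   # last conv features of one CXR
fc_weights = torch.randn(2, 512)           # linear classifier weights (2 classes)
target_class = 1                           # e.g. the COVID-19 class

cam = torch.einsum("c,chw->hw", fc_weights[target_class], feature_maps[0])
cam = torch.relu(cam)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)   # normalise to [0, 1]
print(cam.shape)   # upsampled to the image size when overlaid on the X-ray
```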

Plácido L. Vidal, Joaquim de Moura, Jorge Novo, Marcos Ortega
Comparison of Different Supervised and Self-supervised Learning Techniques in Skin Disease Classification

For years now, the International Skin Imaging Collaboration has been providing datasets of dermoscopic images. Several studies show that dermoscopy provides improved diagnostic accuracy in comparison to standard photography, and excellent results have been obtained that even exceed human performance. In this paper we improve on the state of the art for the dataset provided for the ISIC 2019 challenge. We compared the performance of various convolutional networks, data augmentations, cost functions and optimizers. Results obtained using transfer learning from ImageNet were compared with the performance obtained using BYOL (bootstrap your own latent), a self-supervised technique. Moreover, we demonstrate that self-supervised learning techniques can be used in this field, improving the performance of the network compared to training from scratch. We obtained a balanced multiclass accuracy (BCA) of 87% on the test and validation datasets, with a top-2 accuracy of 97%.

Loris Cino, Pier Luigi Mazzeo, Cosimo Distante
Unsupervised Deformable Image Registration in a Landmark Scarcity Scenario: Choroid OCTA

Recent advances in OCTA allow the imaging of blood flow deeper than the retinal layers, at the level of the choriocapillaris (CC), where a pattern of small dark areas, called flow voids, represents the absence of flow. The distribution of flow voids can be used as a biomarker to diagnose and monitor the progression of relevant pathologies or the efficacy of applied treatments. A pixel-to-pixel comparison can help to carry out this monitoring effectively, although for this comparison the images used must be perfectly aligned. CC images are characterized by their granularity, presenting numerous and complex local deformations, so a deformable registration is necessary to carry out a reliable comparison. However, CC OCTA images also present a characteristic absence of visually significant anatomical structures. This landmark scarcity drastically hardens the identification of points of interest needed to achieve an accurate registration. In this context, we designed a methodology to accurately perform this deformable registration in this challenging scenario. Hence, we propose a convolutional neural network model trained by unsupervised learning to register images in a real clinical scenario, the images being obtained at different time instants from patients with central serous chorioretinopathy (CSC) treated with photodynamic therapy. Our methodology produces superior alignment to that achieved with other proven methods, helping to improve the monitoring of the efficacy of photodynamic therapy applied to patients with CSC. Our robust and adaptable methodology can also be exploited in other similar scenarios of complex registration with anatomical landmark scarcity.

Emilio López-Varela, Jorge Novo, José Ignacio Fernández-Vigo, Francisco Javier Moreno-Morillo, Marcos Ortega
Leveraging CycleGAN in Lung CT Sinogram-free Kernel Conversion

Cancer screening guidelines recommend annual screening with low-dose Computed Tomography (CT) for high-risk groups to reduce lung cancer mortality. Unfortunately, lung CT effectiveness can be strongly impacted by the chosen reconstruction kernel. This selection is (almost) final, implying that it is no longer possible to change the reconstruction kernel once applied, unless a sinogram for the conversion is available. The aim of this paper is to introduce a new sinogram-free kernel conversion in the context of lung CT imaging. In particular, we wanted to define a procedure able to deal with different acquisition protocols and usable in an unpaired-images scenario. To this aim, we leveraged a CycleGAN, treating the CT kernel conversion task as a style transfer problem. Results show that CT kernel conversion can be effectively addressed as a style transfer problem.
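
At the core of a CycleGAN is the cycle-consistency objective, which makes unpaired conversion possible. A minimal sketch (tiny placeholder generators and random tensors, not the paper's model):

```python
# Cycle consistency: mapping a patch to the other kernel and back should
# recover the input, constraining the translation without paired data.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(16, 1, 3, padding=1))   # kernel A -> kernel B
F = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                  nn.Conv2d(16, 1, 3, padding=1))   # kernel B -> kernel A
l1 = nn.L1Loss()

x_a = torch.randn(4, 1, 64, 64)   # CT patches reconstructed with kernel A
x_b = torch.randn(4, 1, 64, 64)   # unpaired patches with kernel B

loss_cycle = l1(F(G(x_a)), x_a) + l1(G(F(x_b)), x_b)
loss_cycle.backward()
```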

Michela Gravina, Stefano Marrone, Ludovico Docimo, Mario Santini, Alfonso Fiorelli, Domenico Parmeggiani, Carlo Sansone
Investigating One-Class Classifiers to Diagnose Alzheimer’s Disease from Handwriting

The analysis of handwriting and drawing has been adopted since early studies to help diagnose neurodegenerative diseases, such as Alzheimer's and Parkinson's. Departing from the current state-of-the-art methods, which approach the problem of discriminating between healthy subjects and patients by using two- or multi-class classifiers, we propose to adopt one-class classifier models, as they require only data from healthy subjects to build the classifier, thus avoiding the collection of patient data required by competing techniques. In this framework, we evaluated the performance of three one-class classifier models, namely the Negative Selection Algorithm, the Isolation Forest and the One-Class Support Vector Machine, on the DARWIN dataset, which includes 174 subjects performing 25 handwriting/drawing tasks. The comparison shows that these methods achieve state-of-the-art performance, and therefore may represent a viable alternative to the dominant approach.
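
The one-class setting described here is easy to reproduce with off-the-shelf tools: the classifier is fitted on healthy-subject features only and then flags outliers. A sketch with synthetic stand-ins for handwriting features (not the DARWIN data):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
healthy_train = rng.normal(0.0, 1.0, size=(100, 8))   # healthy subjects only
test_samples = rng.normal(2.5, 1.0, size=(10, 8))     # unseen, shifted samples

for model in (OneClassSVM(nu=0.1), IsolationForest(random_state=0)):
    model.fit(healthy_train)
    # predict() returns +1 for inliers ("healthy-like") and -1 for outliers
    print(type(model).__name__, model.predict(test_samples))
```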

Antonio Parziale, Antonio Della Cioppa, Angelo Marcelli
Learning Unrolling-Based Neural Network for Magnetic Resonance Imaging Reconstruction

Accelerated magnetic resonance imaging (MRI) based on neural networks is an effective solution for fast MRI reconstruction, producing competitive performance in restoring the image domain from undersampled measurements. However, most existing works rely on convolutional neural networks (CNNs), which are limited by their inherent locality in capturing long-distance dependencies. In this work, we propose a UNet-like Transformer network (UTrans) that is capable of mapping the measurements back to the image domain, resulting in an efficient MRI reconstruction. To better capture non-local features, window-based self-attention operators are adopted to replace the convolutional layers in both the encoder and decoder branches of UTrans. Inspired by unrolled optimization approaches, we apply a recurrent block to integrate the forward measurement operator and UTrans to unroll the iterative reconstruction. In the unrolling framework, UTrans serves as a regularizer for image reconstruction with limited data. Finally, we replace the feed-forward network (FFN) module of the window-based self-attention operators with a layer-fixed FFN (LF-FFN), whose parameters in the first hidden layer are obtained by random initialization and kept fixed, while those in the second layer are updated in the usual fashion. Experiments on fastMRI indicate that the proposed method attains improved reconstruction results on limited measurements with fewer network parameters.
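
The LF-FFN idea, as described, freezes the randomly initialised first layer and trains only the second. A hedged PyTorch sketch (dimensions and activation are illustrative assumptions):

```python
import torch.nn as nn

class LFFFN(nn.Module):
    """Layer-fixed FFN: frozen random first layer, trainable second layer."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden)
        self.fc1.weight.requires_grad_(False)   # keep the random initialisation
        self.fc1.bias.requires_grad_(False)
        self.act = nn.GELU()                    # activation assumed, not specified
        self.fc2 = nn.Linear(hidden, dim)       # updated in the usual fashion

    def forward(self, x):
        return self.fc2(self.act(self.fc1(x)))
```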

Qiunv Yan, Li Liu, Lanyin Mei
Machine Learning to Predict Cognitive Decline of Patients with Alzheimer’s Disease Using EEG Markers: A Preliminary Study

Alzheimer's disease causes most dementia cases. Although there is currently no cure for this disease, predicting the cognitive decline of people at the first stage of the disease allows clinicians to alleviate its burden. Clinicians evaluate individuals' cognitive decline by using neuropsychological tests consisting of different sections, each devoted to testing a specific set of cognitive skills. In this paper, we present the results of a preliminary study aimed at assessing the ability of machine learning tools to predict the cognitive decline of Alzheimer's patients using features extracted from EEG records at resting state. We tested seven classification schemes in predicting nine scores, provided by different sections of four neuropsychological tests. The experimental results demonstrated that, for at least three of these scores, EEG-based features are effective in predicting the cognitive decline of Alzheimer's patients using machine learning tools.

Francesco Fontanella, Sonia Pinelli, Claudio Babiloni, Roberta Lizio, Claudio Del Percio, Susanna Lopez, Giuseppe Noce, Franco Giubilei, Fabrizio Stocchi, Giovanni B. Frisoni, Flavio Nobili, Raffaele Ferri, Tiziana D’Alessandro, Nicole Dalia Cilia, Claudio De Stefano
Improving AMD Diagnosis by the Simultaneous Identification of Associated Retinal Lesions

Age-related Macular Degeneration (AMD) is the predominant cause of blindness in developed countries, especially in elderly people. Moreover, its prevalence is increasing due to global population ageing. In this scenario, early detection is crucial to avert later vision impairment. Nonetheless, implementing large-scale screening programmes is usually not viable, since the population at risk is large and the analysis must be performed by expert clinicians. Also, the diagnosis of AMD is considered to be particularly difficult, as it is characterized by many different lesions that, in many cases, resemble those of other macular diseases. To overcome these issues, several works have proposed automatic methods for the detection of AMD in retinography images, the most widely used modality for the screening of the disease. Nowadays, most of these works use Convolutional Neural Networks (CNNs) for the binary classification of images into AMD and non-AMD classes. In this work, we propose a novel approach based on CNNs that simultaneously performs AMD diagnosis and the classification of its potential lesions. This latter secondary task has not yet been addressed in this domain, and provides complementary useful information that improves the diagnosis performance and helps in understanding the decision. A CNN model is trained using retinography images with image-level labels for both AMD and lesion presence, which are relatively easy to obtain. The experiments conducted on several public datasets show that the proposed approach improves the detection of AMD, while achieving satisfactory results in the identification of most lesions.

José Morano, Álvaro S. Hervella, José Rouco, Jorge Novo, José Ignacio Fernández-Vigo, Marcos Ortega
Eye Diseases Classification Using Deep Learning

Eye disease recognition is a challenging task, which usually requires years of medical experience. In this work, we conducted research that can serve as a starting point for a more versatile solution. We tried to solve the problem of classifying different eye diseases using neural networks. The first step of this work consisted of gathering all publicly available eye disease datasets and preprocessing them to make the experiments as generalized as possible. This led to the creation of a dataset composed of over 30,000 images. The aim was to teach the model the actual symptoms of the diseases instead of adjusting the results to a given part of the dataset. Several deep convolutional neural networks were used as feature extractors and combined with the Synergic Deep Learning model. We conducted experiments on the data and were able to achieve promising results.

Patrycja Haraburda, Łukasz Dabała
A Two-Step Radiologist-Like Approach for Covid-19 Computer-Aided Diagnosis from Chest X-Ray Images

Thanks to the rapid increase in computational capability during recent years, traditional and more explainable methods have been gradually replaced by more complex deep-learning-based approaches, which have in fact reached new state-of-the-art results for a variety of tasks. However, for certain kinds of applications performance alone is not enough. A prime example is the medical field, in which building trust between physicians and AI models is fundamental. Providing an explainable or trustworthy model, however, is not a trivial task, considering the black-box nature of deep-learning-based methods. While some existing methods, such as gradient or saliency maps, try to provide insights about the functioning of deep neural networks, they often provide limited information with regard to clinical needs. We propose a two-step diagnostic approach for the detection of Covid-19 infection from Chest X-Ray images. Our approach is designed to mimic the diagnosis process of human radiologists: it detects objective radiological findings in the lungs, which are then employed for making a final Covid-19 diagnosis. We believe that this kind of structural explainability can be preferable in this context. The proposed approach achieves promising performance in Covid-19 detection, compatible with expert human radiologists. Moreover, despite this work being focused on Covid-19, we believe that this approach could be employed for many different CXR-based diagnoses.

Carlo Alberto Barbano, Enzo Tartaglione, Claudio Berzovini, Marco Calandri, Marco Grangetto
UniToChest: A Lung Image Dataset for Segmentation of Cancerous Nodules on CT Scans

Lung cancer has emerged as a major cause of death, and early detection of lung nodules is the key to early cancer diagnosis and treatment effectiveness assessment. Deep neural networks achieve outstanding results in tasks such as lung nodule detection, segmentation and classification; however, their performance depends on the quality of the training images and on the training procedure. This paper introduces UniToChest, a dataset consisting of Computed Tomography (CT) scans of 623 patients. We then propose a lung nodule segmentation scheme relying on a convolutional neural architecture that we also re-purpose for a nodule detection task. The experimental results show accurate segmentation of lung nodules across a wide diameter range and better detection accuracy than a traditional detection approach. The dataset and the code used in this paper are made publicly available as a baseline reference.

Hafiza Ayesha Hoor Chaudhry, Riccardo Renzulli, Daniele Perlo, Francesca Santinelli, Stefano Tibaldi, Carmen Cristiano, Marco Grosso, Giorgio Limerutti, Attilio Fiandrotti, Marco Grangetto, Paolo Fonio
Optimized Fusion of CNNs to Diagnose Pulmonary Diseases on Chest X-Rays

Since the beginning of the COVID-19 pandemic, more than 350 million cases and 5 million deaths have occurred. From day one, multiple methods have been proposed to diagnose infected patients. Alongside the gold standard of laboratory analyses, deep learning algorithms on chest X-rays (CXR) have been developed to support the COVID-19 diagnosis. The literature reports that convolutional neural networks (CNNs) have obtained excellent results on image datasets when the tests are performed in cross-validation, but such models fail to generalize to unseen data. To overcome this limitation, we exploit the strength of multiple CNNs by building an ensemble of classifiers via an optimized late fusion approach. To demonstrate the system's robustness, we present different experiments on open source CXR datasets to simulate a real-world scenario, where scans of patients affected by various lung pathologies and coming from external datasets are tested. Promising performance is obtained both in cross-validation and in external validation, with an average accuracy of 93.02% and 91.02%, respectively.
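
A generic late-fusion step of the kind described (the paper optimises the fusion; the weights below are placeholders one would tune on validation data):

```python
import numpy as np

def late_fusion(prob_list, weights):
    """Weighted average of per-model class-probability matrices."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                       # normalise the fusion weights
    return sum(wi * p for wi, p in zip(w, prob_list))

p_cnn1 = np.array([[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]])   # softmax outputs, CNN 1
p_cnn2 = np.array([[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]])   # softmax outputs, CNN 2

fused = late_fusion([p_cnn1, p_cnn2], weights=[0.6, 0.4])
print(fused.argmax(axis=1))               # fused class decisions
```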

Valerio Guarrasi, Paolo Soda
High/Low Quality Style Transfer for Mutual Conversion of OCT Images Using Contrastive Unpaired Translation Generative Adversarial Networks

Recent advances in artificial intelligence and deep learning models are contributing to the development of advanced computer-aided diagnosis (CAD) systems. In the context of medical imaging, Optical Coherence Tomography (OCT) is a valuable technique that is able to provide cross-sectional visualisations of the ocular tissue. However, OCT is constrained by a trade-off between the quality of the visualisations that it can produce and the overall amount of tissue that can be analysed at once. This trade-off leads to a scarcity of high-quality data, a problem that is very prevalent when developing machine-learning-based CAD systems intended for medical imaging. To mitigate this problem, we present a novel methodology for the unpaired conversion of OCT images acquired with a low quality extensive scanning preset into the visual style of those taken with a high quality intensive scan, and vice versa. This is achieved by employing contrastive unpaired translation generative adversarial networks to convert between the visual styles of the different acquisition presets. The results obtained in the validation experiments show that these synthetically generated images can mirror the visual features of the original ones while preserving the natural tissue texture, effectively increasing the total number of available samples that can be used to train robust machine-learning-based CAD systems.

Mateo Gende, Joaquim de Moura, Jorge Novo, Marcos Ortega
Real-Time Respiration Monitoring of Neonates from Thermography Images Using Deep Learning

In this work, we present an approach for non-contact automatic extraction of respiration in infants using infrared thermography video sequences, which were recorded in a neonatal intensive care unit. The respiratory signal was extracted in real-time on low-cost embedded GPUs by analyzing breathing-related temperature fluctuations in the nasal region. The automatic detection of the patient's nose was performed using the Deep Learning-based YOLOv4-Tiny object detector. Additionally, the head was detected for movement tracking. A leave-one-out cross validation showed a mean intersection over union of 79% and a mean average precision of 93% for the detection algorithm. Since no clinical reference was provided, the extracted respiratory activity was validated for video sequences without motion artifacts using Farnebäck's Optical Flow algorithm. A mean MAE of 8.5 breaths per minute and a mean F1-score of 80% for respiration detection were achieved. The model inference on NVIDIA Jetson modules showed a performance of 32 fps on the Xavier NX and 62 fps on the Xavier AGX. These outcomes showed promising results for the real-time extraction of respiratory activity from thermography recordings of neonates using Deep Learning-based techniques on embedded GPUs.
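
Once the nasal ROI is tracked, a respiratory rate can be read off the dominant frequency of the mean ROI temperature signal. A self-contained sketch with synthetic data (the band limits are assumptions, not the paper's settings):

```python
import numpy as np

fps = 30.0
t = np.arange(0, 60, 1 / fps)                     # one minute of video
roi_mean_temp = np.sin(2 * np.pi * 0.5 * t)       # 0.5 Hz = 30 breaths/min
roi_mean_temp += 0.1 * np.random.randn(t.size)    # sensor noise

signal = roi_mean_temp - roi_mean_temp.mean()
spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1 / fps)

band = (freqs >= 0.3) & (freqs <= 1.5)            # plausible breathing band
rate_hz = freqs[band][spectrum[band].argmax()]
print(f"estimated rate: {rate_hz * 60:.1f} breaths per minute")
```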

Simon Lyra, Ines Groß-Weege, Steffen Leonhardt, Markus Lüken
Improving Colon Carcinoma Grading by Advanced CNN Models

Cancer ranks as a leading cause of death and an important barrier to increasing life expectancy in every country of the world. For this reason, there is a great need for computer-aided approaches to accurate cancer diagnosis and grading that can overcome the problem of intra- and inter-observer inconsistency and thereby improve the accuracy and consistency of cancer detection and treatment planning. In particular, studies exploiting deep learning for the automatic grading of colon carcinoma are still in their infancy, since the works in the literature have not exploited the most advanced models and methodologies of machine learning, and a systematic exploration of the most promising available convolutional networks is missing. To fill this gap, in this work the best-performing convolutional architectures in classification tasks are exploited to improve colon carcinoma grading in histological images. The experimental results on the largest publicly available dataset demonstrate marked improvement with respect to the leading state-of-the-art approaches.

Marco Leo, Pierluigi Carcagnì, Luca Signore, Giulio Benincasa, Mikko O. Laukkanen, Cosimo Distante

Multimedia

Frontmatter
Frame Adaptive Rate Control Scheme for Video Compressive Sensing

Measurement coding compresses the output of compressive image sensors to improve image/video transmission efficiency. In these coding systems, rate control plays a vital role. The major purpose of rate control is to determine the quantization parameters (or quantization stepsizes) that keep the bitrate under the available bandwidth (bit limitation) while maximizing the image/video quality. However, most existing rate control algorithms apply iterations to find the best quantization parameters, so they suffer from long processing times and cannot efficiently support video processing. This paper presents a frame-adaptive rate control scheme for measurement coding. Firstly, the initial quantization parameter (QP) of the first frame is determined by the triangle quantization method. Moreover, frame-adaptive QP adjustment is proposed to refine the QP for each frame. As a result, this work improves the video quality by up to 1.56 dB PSNR and reduces the processing time by up to 53% compared to the state of the art.

Fuma Kimishima, Jian Yang, Thuy T. T. Tran, Jinjia Zhou
Shot-Based Hybrid Fusion for Movie Genre Classification

Multi-modal fusion methods for movie genre classification have been shown to be superior to their single-modality counterparts. However, it is still challenging to design a fusion strategy for real-world scenarios where missing data and weak labeling are common. Considering the heterogeneity of different modalities, most existing works design late fusion strategies that process and train models per modality and combine the results at the decision level. A major drawback of such strategies is the potential loss of across-modality dependencies, which are important for understanding audiovisual contents. In this paper, we introduce a Shot-based Hybrid Fusion Network (SHFN) for movie genre classification. It consists of single-modal feature fusion networks for video and audio, a multi-modal feature fusion network working on a shot basis, and finally a late fusion part for video-level decisions. An ablation study indicates the major contribution comes from video, with a performance gain from the additional modality, audio. The experimental results on the LMTD-9 dataset demonstrate the effectiveness of our proposed method in movie genre classification. Our best model outperforms the state-of-the-art method by 5.7% on AUPRC (micro).

Tianyu Bi, Dimitri Jarnikov, Johan Lukkien
Landmark-Guided Conditional GANs for Face Aging

Face aging, which alters a person’s facial photo to the appearance at a different age, is a popular topic in multimedia applications. Recently, conditional Generative Adversarial Networks (cGANs) have achieved visually impressive progress in this area. However, generating a convincing aging appearance while preserving the person’s identity is still a challenging task. In this paper, we propose a novel Landmark-Guided cGAN (LGcGAN), which not only synthesizes texture changes related to aging, but also alters facial structures accordingly. We adapt a built-in attention mechanism to emphasize the most discriminative regions relevant to aging and minimize changes that affect personal identity and background. Conditioned with age vectors, the primal cGAN in our symmetric network converts input faces to target ages, and the dual cGAN inverts the previous task, which feeds synthesized target faces back to the original input age scope for enhancing age consistency. Both qualitative and quantitative experiments show that our method can generate appealing results in terms of image quality, personal identity, and age accuracy.

Xin Huang, Minglun Gong
Introducing AV1 Codec-Level Video Steganography

Steganography is the ancient art of concealing messages within data. Research interest has grown over the last years; however, techniques in the literature focus only on standard and in some ways legacy multimedia formats (e.g., H.264). Moreover, most video steganography techniques are based on concealing data in the contents of each frame employing various strategies. In this paper, a codec-level video steganography technique is presented for the novel AV1: a royalty-free video compression format proposed by the Alliance for Open Media (AOM). The proposed method is based on the alteration of intra-prediction angles and, differently from other solutions, it works within the compression process, allowing the encoder to reduce possible distortions caused by the messages to be hidden. The effectiveness of the technique was demonstrated by hiding up to 1024 characters in a highly compressed 40-second video while maintaining an average Peak Signal-to-Noise Ratio of 37.53 dB.
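
To convey the flavour of angle-based embedding, here is a deliberately simplified toy (parity embedding on a list of angles); the actual method operates inside the AV1 encoder's mode decisions, which is not reproduced here:

```python
# Toy parity embedding: hide one bit per block by nudging its
# intra-prediction angle so the angle's parity matches the payload bit.
def embed_bits(angles, bits):
    stego = list(angles)
    for i, bit in enumerate(bits):
        if stego[i] % 2 != bit:      # parity mismatch: adjust by one step
            stego[i] += 1
    return stego

def extract_bits(angles, n):
    return [a % 2 for a in angles[:n]]

carrier = [90, 45, 135, 180, 67, 113]    # per-block angles (illustrative)
payload = [1, 0, 1, 1, 0, 0]
stego = embed_bits(carrier, payload)
assert extract_bits(stego, len(payload)) == payload
```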

Lorenzo Catania, Dario Allegra, Oliver Giudice, Filippo Stanco, Sebastiano Battiato

Deep Learning

Frontmatter
Efficient Transfer Learning for Visual Tasks via Continuous Optimization of Prompts

Traditional methods for adapting pre-trained vision models to downstream tasks involve fine-tuning some or all of the model’s parameters. There are a number of trade-offs with this approach. When too many parameters are fine-tuned, the model may lose the benefits associated with pre-training, such as the ability to generalize to out-of-distribution data. But, if instead too few parameters are fine-tuned, the model may be unable to adapt effectively for the tasks downstream. In this paper, we propose Visual Prompt Tuning (VPT) as an alternative to fine-tuning for Transformer-based vision models. Our method is closely related to, and inspired by, prefix-tuning of language models [22]. We find that, by adding additional parameters to a pre-trained model, VPT offers similar performance to fine-tuning the final layer. In addition, for low-data settings and for specialized tasks, such as traffic sign recognition, satellite photo recognition and handwriting classification, the performance of Transformer-based vision models is improved with the use of VPT.
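
A minimal sketch of the prompt-tuning idea for Transformer-based vision models: a few learnable prompt tokens are prepended to the (frozen) patch embeddings, and only the prompts and a task head are trained. Shapes are illustrative, not the paper's configuration:

```python
import torch
import torch.nn as nn

class PromptedEmbeddings(nn.Module):
    def __init__(self, embed_dim: int = 768, n_prompts: int = 8):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(1, n_prompts, embed_dim) * 0.02)

    def forward(self, patch_tokens):            # (batch, n_patches, embed_dim)
        b = patch_tokens.size(0)
        prompts = self.prompts.expand(b, -1, -1)
        # The extended sequence is then fed to the frozen Transformer encoder.
        return torch.cat([prompts, patch_tokens], dim=1)

tokens = torch.randn(2, 196, 768)               # e.g. 14x14 patches from a ViT
print(PromptedEmbeddings()(tokens).shape)       # torch.Size([2, 204, 768])
```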

Jonathan Conder, Josephine Jefferson, Nathan Pages, Khurram Jawed, Alireza Nejati, Mark Sagar
Continual Learning with Neuron Activation Importance

Continual learning is a concept of online learning with multiple sequential tasks. One of the critical barriers to continual learning is that a network should learn a new task while keeping the knowledge of old tasks, without access to any data of the old tasks. We propose a neuron activation importance-based regularization method for stable continual learning regardless of the order of tasks. We conduct comprehensive experiments on existing benchmark data sets to evaluate not only the stability and plasticity of our method, with improved classification accuracy, but also the robustness of its performance across changes in task order.
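
Importance-based regularization of this kind typically penalises drift of parameters that were important for old tasks. A generic sketch (the paper derives importance from neuron activations; here `importance` is simply a given per-parameter weight):

```python
import torch

def importance_penalty(model, old_params, importance, lam=1.0):
    """Penalise moving important parameters away from their old-task values."""
    loss = 0.0
    for name, p in model.named_parameters():
        loss = loss + (importance[name] * (p - old_params[name]) ** 2).sum()
    return lam * loss

model = torch.nn.Linear(4, 2)
old = {n: p.detach().clone() for n, p in model.named_parameters()}
imp = {n: torch.ones_like(p) for n, p in model.named_parameters()}
print(importance_penalty(model, old, imp).item())   # 0.0 before any update
```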

Sohee Kim, Seungkyu Lee
AD-CGAN: Contrastive Generative Adversarial Network for Anomaly Detection

Anomaly detection (AD), a fundamental challenge in machine learning, aims to find samples that do not belong to the distribution of the training data. Among unsupervised anomaly detection models, those based on generative adversarial networks show promising results. These models mainly rely on the rich representations learned from normal training data to find anomalies. However, their performance is bounded by a limitation of GANs, known as mode collapse, in learning complex training distributions. This work presents a new GAN-based anomaly detection model combined with contrastive learning to mitigate the negative effect of mode collapse on more complex distributions. Our unsupervised Anomaly Detection model based on a Contrastive Generative Adversarial Network, AD-CGAN, contrasts a sample with local feature maps of itself instead of only contrasting the given sample with other instances, as in conventional contrastive learning approaches. The contrastive loss in AD-CGAN helps the model learn more discriminative representations of normal samples. Furthermore, we introduce a new normality score to target anomalous samples, defined on the encoded representations of samples obtained from the model. Extensive experiments showed that AD-CGAN outperforms its counterparts on multiple benchmarks, with a significant improvement in ROC-AUC over previously proposed reconstruction-based approaches.

Laya Rafiee Sevyeri, Thomas Fevens
Analyzing EEG Data with Machine and Deep Learning: A Benchmark

Nowadays, machine and deep learning techniques are widely used in different areas, ranging from economics to biology. In general, these techniques can be used in two ways: trying to adapt well-known models and architectures to the available data, or designing custom architectures. In both cases, to speed up the research process, it is useful to know which type of model works best for a specific problem and/or data type. Focusing on EEG signal analysis, and for the first time in the literature, in this paper a benchmark of machine and deep learning for EEG signal classification is proposed. For our experiments we used the four most widespread models, i.e., multilayer perceptron, convolutional neural network, long short-term memory, and gated recurrent unit, highlighting which one can be a good starting point for developing EEG classification models.

Danilo Avola, Marco Cascio, Luigi Cinque, Alessio Fagioli, Gian Luca Foresti, Marco Raoul Marini, Daniele Pannone
A Two-Stage U-Net to Estimate the Cultivated Area of Plantations

In order to reduce tax evasion in agribusiness, it is possible to estimate crop production through the monitoring and analysis of satellite images and compare it with the values declared by the taxpayer. For this, deep learning techniques can be applied to satellite images to segment the cultivated area of plantations, and the segmented area can be used to estimate crop yields. As an initial step, this work aims to analyze satellite images of plantations to estimate their cultivated area using semantic segmentation. For this, we created a dataset of planting areas and proposed a network architecture for image segmentation, a two-stage U-Net. The proposed methodology returned average IoU results above 80% in both stages.
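
The IoU figure quoted above compares predicted and ground-truth masks. A minimal sketch of the metric on toy binary masks:

```python
import numpy as np

def iou(pred: np.ndarray, target: np.ndarray) -> float:
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    return inter / union if union else 1.0   # two empty masks count as a match

pred = np.array([[1, 1, 0], [0, 1, 0]])
gt   = np.array([[1, 1, 0], [0, 0, 1]])
print(f"IoU = {iou(pred, gt):.2f}")          # 2 / 4 = 0.50
```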

Walysson Carlos dos Santos Oliveira, Geraldo Braz Junior, Daniel Lima Gomes Junior, Anselmo Cardoso de Paiva, Joao Dallyson Sousa de Almeida
An Explainable Medical Imaging Framework for Modality Classifications Trained Using Small Datasets

With the huge expansion of artificial intelligence in medical imaging, many clinical warehouses, medical centres and research communities have organized patients' data into well-structured datasets. These datasets are one of the key elements for training AI-enabled solutions. Additionally, the value of such datasets depends on the quality of the underlying data. To maintain the desired high-quality standard, these datasets are actively cleaned and continuously expanded. This labelling process is time-consuming and requires clinical expertise even when a simple classification task must be performed. Therefore, in this work, we propose to tackle this problem by developing a new pipeline for the modality classification of medical images. The purpose of our pipeline is to provide an initial step in organizing a large collection of data and grouping it by modality, thus reducing the involvement of costly human raters. In our experiments, we consider 4 popular deep neural networks as the core engine of the proposed system. The results show that, when limited datasets are available, simpler pre-trained networks achieve better results than more complex and sophisticated architectures. We demonstrate this by comparing the considered networks on the ADNI dataset and by exploiting explainable AI techniques that help us to understand our hypothesis. Still today, many medical imaging studies make use of limited datasets; therefore, we believe that our contribution is particularly relevant for driving future developments of new medical imaging technologies when limited data are available.

Francesca Trenta, Sebastiano Battiato, Daniele Ravì
Fusion of Periocular Deep Features in a Dual-Input CNN for Biometric Recognition

Periocular recognition has attracted attention in recent times. The advent of the COVID-19 pandemic and the consequent obligation to wear facial masks made face recognition problematic due to the significant occlusion of the lower part of the face. In this work, a dual-input Neural Network architecture is proposed. The structure is a Siamese-like model, with two identical parallel streams (called base models) that process the two inputs separately. The input is represented by RGB images of the right eye and the left eye belonging to the same subject. The outputs of the two base models are merged through a fusion layer. The aim is to investigate how deep feature aggregation affects periocular recognition. The experimentation is performed on the Masked Face Recognition Database (M²FRED), which includes videos of 46 participants with and without masks. Three different fusion layers are applied to understand which type of merging technique is most suitable for data aggregation. Experimental results show promising performance for almost all experimental configurations, with a worst-case accuracy of 90% and a best-case accuracy of 97%.
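
A hedged sketch of a dual-input model with a fusion layer; for brevity the two streams share weights here, and concatenation stands in for the three fusion variants studied in the paper:

```python
import torch
import torch.nn as nn

class DualInputNet(nn.Module):
    def __init__(self, embed_dim: int = 128, n_classes: int = 46):
        super().__init__()
        self.base = nn.Sequential(                 # toy base model
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, embed_dim),
        )
        self.head = nn.Linear(2 * embed_dim, n_classes)

    def forward(self, left_eye, right_eye):
        fused = torch.cat([self.base(left_eye), self.base(right_eye)], dim=1)
        return self.head(fused)

logits = DualInputNet()(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64))
print(logits.shape)   # torch.Size([2, 46])
```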

Andrea Abate, Lucia Cimmino, Michele Nappi, Fabio Narducci
Improve Convolutional Neural Network Pruning by Maximizing Filter Variety

Neural network pruning is a widely used strategy for reducing model storage and computing requirements. It lowers the complexity of the network by introducing sparsity in the weights. Because taking advantage of sparse matrices is still challenging, pruning is often performed in a structured way, i.e. removing entire convolution filters in the case of ConvNets, according to a chosen pruning criterion. Common pruning criteria, such as the l1-norm or movement, usually do not consider the individual utility of filters, which may lead to: (1) the removal of filters exhibiting rare, thus important and discriminative, behaviour, and (2) the retention of filters with redundant information. In this paper, we present a technique solving those two issues, which can be appended to any pruning criterion. This technique ensures that the selection criterion focuses on redundant filters while retaining the rare ones, thus maximizing the variety of remaining filters. The experimental results, carried out on different datasets (CIFAR-10, CIFAR-100 and CALTECH-101) and using different architectures (VGG-16 and ResNet-18), demonstrate that it is possible to achieve similar sparsity levels while maintaining higher performance when appending our filter selection technique to pruning criteria. Moreover, we assess the quality of the found sparse subnetworks by applying the Lottery Ticket Hypothesis and find that the addition of our method allows the discovery of better-performing tickets in most cases.
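
A toy sketch of the variety idea (a generic illustration, not the paper's exact criterion): among the usual low-norm candidates, prefer to prune filters that are most similar to another filter, so rare ones survive:

```python
import torch
import torch.nn.functional as F

def variety_aware_prune(conv_weight: torch.Tensor, n_prune: int):
    """conv_weight: (out_ch, in_ch, k, k); returns indices of filters to prune."""
    flat = conv_weight.flatten(1)
    norms = flat.norm(p=1, dim=1)
    norms = norms / norms.max()                 # scale the l1 criterion to [0, 1]
    sim = F.cosine_similarity(flat.unsqueeze(1), flat.unsqueeze(0), dim=2)
    sim.fill_diagonal_(-1.0)
    redundancy = sim.max(dim=1).values          # similarity to the nearest filter
    score = norms - redundancy                  # low norm + redundant => prune first
    return score.argsort()[:n_prune]

w = torch.randn(16, 8, 3, 3)
print(variety_aware_prune(w, n_prune=4))
```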

Nathan Hubens, Matei Mancas, Bernard Gosselin, Marius Preda, Titus Zaharia
Improving Autoencoder Training Performance for Hyperspectral Unmixing with Network Reinitialisation

Neural networks, in particular autoencoders, are one of the most promising solutions for unmixing hyperspectral data, i.e. reconstructing the spectra of observed substances (endmembers) and their relative mixing fractions (abundances), which is needed for effective hyperspectral analysis and classification. However, as we show in this paper, the training of autoencoders for unmixing is highly dependent on weight initialisation; some sets of weights lead to degenerate or low-performance solutions, introducing a negative bias in the expected performance. In this work, we experimentally investigate autoencoder stability as well as network reinitialisation methods based on the coefficients of neurons' dead activations. We demonstrate that the proposed techniques have a positive effect on autoencoder training in terms of reconstruction, abundance and endmember errors.
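
A sketch of the kind of reinitialisation trigger described above: if too many of a layer's ReLU units are dead on a probe batch, draw a fresh initialisation. The threshold and layer are toy choices, not the paper's coefficients:

```python
import torch
import torch.nn as nn

def dead_fraction(layer: nn.Linear, act: nn.ReLU, probe: torch.Tensor) -> float:
    with torch.no_grad():
        out = act(layer(probe))            # (batch, units)
        dead = (out == 0).all(dim=0)       # unit never fires on the probe batch
    return dead.float().mean().item()

layer, act = nn.Linear(32, 64), nn.ReLU()
probe = torch.randn(256, 32)
if dead_fraction(layer, act, probe) > 0.25:   # too many dead units:
    nn.init.kaiming_uniform_(layer.weight)    # reinitialise the layer
    nn.init.zeros_(layer.bias)
```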

Kamil Książek, Przemysław Głomb, Michał Romaszewski, Michał Cholewa, Bartosz Grabowski, Krisztián Búza
Cluster Centers Provide Good First Labels for Object Detection

Learning object detection models with few labels is possible thanks to ingenious few-shot techniques and clever selection of images to be labeled. Few-shot techniques work with as few as 1 to 10 randomized labels per object class. We are curious whether the performance of randomized label selection can be improved by selecting 1 to 10 labels per object class in a non-random manner. Several active learning techniques have been proposed to select object labels, but all started with a minimum of several tens of labels. We explore an effective and simple label selection strategy for the case of 1 to 10 labels per object class. First, the full unlabeled dataset is clustered into N clusters, where N is the desired number of labels. Clustering is based on k-means on embedding vectors from a state-of-the-art pretrained image classification model (SimCLR v2). The image closest to each cluster center is selected to be labeled. This is effective: on Pascal VOC we validate that it improves over randomized selection by over 25%, with large improvements especially when having 1 label per object class. This simple strategy has several benefits: it is easy to implement, it is effective, and it is relevant in practice, where one often starts with a dataset without any labels.
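
The selection strategy is compact enough to sketch end-to-end; the random embeddings below stand in for SimCLR v2 features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 64))   # one embedding per unlabeled image
n_labels = 10                             # desired number of first labels

km = KMeans(n_clusters=n_labels, n_init=10, random_state=0).fit(embeddings)
# For each cluster center, pick the closest image: these get labeled first.
to_label = pairwise_distances_argmin(km.cluster_centers_, embeddings)
print("images to send to the annotator:", sorted(to_label))
```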

Gertjan J. Burghouts, Maarten Kruithof, Wyke Huizinga, Klamer Schutte
Unsupervised Detection of Dynamic Hand Gestures from Leap Motion Data

The effective and reliable detection and classification of dynamic hand gestures is a key element for building Natural User Interfaces, systems that allow the users to interact using free movements of their body instead of traditional mechanical tools. However, methods that temporally segment and classify dynamic gestures usually rely on a great amount of labeled data, including annotations regarding the class and the temporal segmentation of each gesture. In this paper, we propose an unsupervised approach to train a Transformer-based architecture that learns to detect dynamic hand gestures in a continuous temporal sequence. The input data is represented by the 3D position of the hand joints, along with their speed and acceleration, collected through a Leap Motion device. Experimental results show a promising accuracy on both the detection and the classification task and that only limited computational power is required, confirming that the proposed method can be applied in real-world applications.

Andrea D’Eusanio, Stefano Pini, Guido Borghi, Alessandro Simoni, Roberto Vezzani
SCAF: Skip-Connections in Auto-encoder for Face Alignment with Few Annotated Data

Supervised face alignment methods need large amounts of training data to achieve good performance in terms of accuracy and generalization. However, face alignment datasets rarely exceed a few thousand samples, making these methods prone to overfitting on the specific training dataset. Semi-supervised methods like TS³ or 3FabRec have emerged to alleviate this issue by using labeled and unlabeled data during training. In this paper we propose Skip-Connections in Auto-encoder for Face alignment (SCAF); we build on 3FabRec by adding skip-connections between the encoder and the decoder. These skip-connections lead to better landmark predictions, especially on challenging examples. We also apply active learning to the face alignment task for the first time and introduce a new acquisition function, the Negative Neighborhood Magnitude, specially designed to assess the quality of heatmaps. These two proposals show their effectiveness on several face alignment datasets when training with limited data.

Martin Dornier, Philippe-Henri Gosselin, Christian Raymond, Yann Ricquebourg, Bertrand Coüasnon
Full Motion Focus: Convolutional Module for Improved Left Ventricle Segmentation Over 4D MRI

Magnetic Resonance Imaging (MRI) is a widely known medical imaging technique used to assess heart function. On Cardiac MRI (CMR) images, Deep Learning (DL) models perform several tasks with good efficacy, such as segmentation, estimation, and detection of diseases. Such models can produce even better results when their input is a Region of Interest (RoI), that is, a segment of the image with more analytical potential for diagnosis. Accordingly, we describe Full Motion Focus (FMF), an image processing technique sensitive to the heart motion in a 4D MRI sequence (video), whose principle is to combine static and dynamic image features with a Radial Basis Function (RBF) to highlight the RoI found in the motion field. We tested FMF with the U-Net convolutional DL architecture on three CMR datasets for the task of Left Ventricle segmentation; we achieved a detection rate (Recall score) of 99.7% for the RoIs, improved the U-Net segmentation (mean Dice score) by 1.7 (p < .001), and improved the overall training speed by 2.5 times (+150%).

Daniel M. Lima, Catharine V. Graves, Marco A. Gutierrez, Bruno Brandoli, Jose F. Rodrigues Jr.
Super-Resolution of Solar Active Region Patches Using Generative Adversarial Networks

Monitoring solar active region patches from Helioseismic and Magnetic Imager (HMI) instruments is essential for space weather forecasting. However, recovering small bipolar details in HMI patches requires additional pre-processing steps to obtain better quality. This work uses a generative adversarial network, with transposed convolution and super-pixel convolution up-sampling layers, to generate higher-quality HMI patches. The network is trained and validated with binary cross-entropy, mean absolute error and multi-scale dice-coefficient functions. The performance of the generative method is illustrated on two image types (magnetogram and continuum intensity patches) from two instruments (SDO/HMI and SOT/NET) and compared with state-of-the-art methods. The results demonstrate that the generative method produces high-quality images by increasing polarity contrast and retrieving smaller structures.

Rasha Alshehhi
Avoiding Shortcuts in Unpaired Image-to-Image Translation

Image-to-image translation is a very popular task in deep learning. In particular, one of the most effective and popular approaches to solving it when a paired dataset of examples is not available is to use a cycle consistency loss. This means forcing an inverse mapping in order to reverse the output of the network back to the source domain and reduce the space of all possible mappings. Nevertheless, the network could learn to take shortcuts and only softly apply the target domain in order to make the reverse translation easier, therefore producing unsatisfactory results. For this reason, in this paper an additional constraint is introduced during the training phase of an unpaired image-to-image translation network; this forces the model to have the same attention both when applying the target domain and when reversing the translation. This approach has been tested on different datasets, showing a consistent improvement over the generated results.

Tomaso Fontanini, Filippo Botti, Massimo Bertozzi, Andrea Prati
Towards Efficient and Data Agnostic Image Classification Training Pipeline for Embedded Systems

Nowadays deep-learning-based methods have achieved remarkable progress on the image classification task across a wide range of commonly used datasets (ImageNet, CIFAR, SVHN, Caltech 101, SUN397, etc.). SOTA performance on each of the mentioned datasets is obtained by careful tuning of the model architecture and training tricks according to the properties of the target data. Although this approach allows setting academic records, it is unrealistic that an average data scientist would have enough resources to build a sophisticated training pipeline for every image classification task they meet in practice. This work focuses on reviewing the latest augmentation and regularization methods for image classification and exploring ways to automatically choose some of the most important hyperparameters: the total number of epochs, the initial learning rate value and its schedule. Having a training procedure equipped with a lightweight modern CNN architecture (like MobileNetV3 or EfficientNet), a sufficient level of regularization and a data-adaptive learning rate schedule, we can achieve reasonable performance on a variety of downstream image classification tasks without manual tuning of parameters for each particular task. The resulting models are computationally efficient and can be deployed to CPU using the OpenVINO™ toolkit. Source code is available as a part of the OpenVINO™ Training Extensions ( https://github.com/openvinotoolkit/training_extensions ).
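
One common way to make the schedule adaptive to the data, shown here as an illustration rather than the paper's exact heuristic, is to shrink the learning rate on a validation plateau and stop early, so the epoch budget need not be fixed in advance:

```python
import torch

model = torch.nn.Linear(10, 2)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt, factor=0.5, patience=3)

best, patience, bad_epochs = float("inf"), 8, 0
for epoch in range(200):                  # upper bound, rarely reached
    val_loss = 1.0 / (epoch + 1)          # stand-in for a real validation loss
    sched.step(val_loss)                  # reduce LR when progress stalls
    bad_epochs = 0 if val_loss < best else bad_epochs + 1
    best = min(best, val_loss)
    if bad_epochs >= patience:            # early stopping ends training
        break
```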

Kirill Prokofiev, Vladislav Sovrasov
Medicinal Boxes Recognition on a Deep Transfer Learning Augmented Reality Mobile Application

Taking medicines is fundamental to curing illnesses. However, studies have shown that it can be hard for patients to remember the correct posology. More aggravating, a wrong dosage generally causes the disease to worsen. Although all relevant instructions for a medicine are summarized in the corresponding patient information leaflet, the latter is generally difficult to navigate and understand. To address this problem and help patients with their medication, in this paper we introduce an augmented reality mobile application that can present to the user important details on the framed medicine. In particular, the app implements an inference engine based on a deep neural network, i.e., a DenseNet, fine-tuned to recognize a medicine from its package. Subsequently, relevant information, such as the posology or a simplified leaflet, is overlaid on the camera feed to help a patient when taking a medicine. Extensive experiments to select the best hyperparameters were performed on a dataset specifically collected for this task, ultimately obtaining up to 91.30% accuracy as well as real-time capabilities.

Danilo Avola, Luigi Cinque, Alessio Fagioli, Gian Luca Foresti, Marco Raoul Marini, Alessio Mecca, Daniele Pannone
Consistency Regularization for Unsupervised Domain Adaptation in Semantic Segmentation

Unsupervised domain adaptation is a promising technique for computer vision tasks, especially when annotating large amounts of data is very costly and time-consuming, as in semantic segmentation. Here it is attractive to train neural networks on simulated data and fit them to real data on which the models are to be used. In this paper, we propose a consistency regularization method for domain adaptation in semantic segmentation that combines pseudo-labels and strong perturbations. We analyse the impact of two simple perturbations, dropout and image mixing, and show how they contribute enormously to the final performance. Experiments and ablation studies demonstrate that our simple approach achieves strong results on relevant synthetic-to-real domain adaptation benchmarks.
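
A minimal sketch of the consistency mechanism: predictions on one view provide pseudo-labels that supervise a strongly perturbed view (dropout and image mixing in the paper; a generic strong view here). The tensors and confidence threshold are toy placeholders:

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_weak, logits_strong, threshold=0.9):
    probs = logits_weak.detach().softmax(dim=1)    # teacher predictions
    conf, pseudo = probs.max(dim=1)                # per-pixel pseudo-labels
    loss = F.cross_entropy(logits_strong, pseudo, reduction="none")
    return (loss * (conf >= threshold)).mean()     # keep confident pixels only

weak = torch.randn(2, 19, 32, 32)     # (batch, classes, H, W), weak view
strong = torch.randn(2, 19, 32, 32)   # same images, strongly perturbed
print(consistency_loss(weak, strong).item())
```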

Sebastian Scherer, Stephan Brehm, Rainer Lienhart
Towards an Efficient Facial Image Compression with Neural Networks

Digital images are an ever larger part of everyday life. Efficient compression methods are needed to reduce the disk space required for their storage and the bandwidth for their transmission, while keeping the resolution and visual quality of the reconstructed images as close to the originals as possible. Not all images have the same importance. Facial images are used extensively in many applications (e.g., law enforcement, social networks) and require highly efficient compression schemes so as not to compromise face recognition and identification (e.g., in surveillance and security scenarios). For this reason, we propose a promising approach consisting of a custom loss that combines the two tasks of image compression and face recognition. The results show that our method compresses face images efficiently while guaranteeing high perceptual quality and face verification accuracy.

Maria Ausilia Napoli Spatafora, Alessandro Ortis, Sebastiano Battiato
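
A combined objective of the type the abstract describes can be hedged into a short sketch: a pixel-level distortion term plus an identity-preservation term computed from a frozen face-embedding network. The weighting, the embedding model, and the specific terms below are assumptions; the paper's actual loss may differ.

```python
# Hedged sketch of a compression-plus-recognition loss: reconstruction
# quality (MSE) combined with alignment of face embeddings so that face
# verification survives compression. alpha and face_embedder are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def compression_identity_loss(original, reconstructed, face_embedder, alpha=0.5):
    # Distortion: how far the reconstruction is from the original image.
    distortion = F.mse_loss(reconstructed, original)
    # Identity: embeddings of original and reconstruction should align.
    with torch.no_grad():
        target_emb = face_embedder(original)   # frozen recognition network
    recon_emb = face_embedder(reconstructed)
    identity = 1.0 - F.cosine_similarity(recon_emb, target_emb).mean()
    return distortion + alpha * identity
```
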
Avalanche RL: A Continual Reinforcement Learning Library

Continual Reinforcement Learning (CRL) is a challenging setting in which an agent learns to interact with an environment that is constantly changing over time (the stream of experiences). In this paper, we describe Avalanche RL, a library for Continual Reinforcement Learning that allows users to easily train agents on a continuous stream of tasks. Avalanche RL is based on PyTorch [23] and supports any OpenAI Gym [4] environment. Its design is based on Avalanche [16], one of the most popular continual learning libraries, which allows us to reuse a large number of continual learning strategies and improve the interaction between reinforcement learning and continual learning researchers. Additionally, we propose Continual Habitat-Lab, a novel benchmark and high-level library that enables the use of the photorealistic simulator Habitat-Sim [28] for CRL research. Overall, Avalanche RL attempts to unify continual reinforcement learning applications under a common framework, which we hope will foster the growth of the field.

Nicoló Lucchesi, Antonio Carta, Vincenzo Lomonaco, Davide Bacciu
CVGAN: Image Generation with Capsule Vector-VAE

In unsupervised learning, the extraction of a representational learning space is an open challenge in machine learning. Important contributions in this field are the Variational Auto-Encoder (VAE), with a continuous latent representation, and the Vector Quantized VAE (VQ-VAE), with a discrete latent representation. VQ-VAE is a discrete latent variable model that has been demonstrated to learn nontrivial feature representations of images in unsupervised learning, and it is a viable alternative to continuous latent variable models such as the VAE. However, training deep discrete variable models is challenging, due to the inherent non-differentiability of the discretization operation. In this paper, we propose the Capsule Vector VAE (CV-VAE), a new model based on the VQ-VAE architecture in which the discrete bottleneck represented by the quantization code-book is replaced with a capsule layer. We demonstrate that capsules can be successfully applied to the clustering procedure, reintroducing the differentiability of the bottleneck into the model. The capsule layer clusters the encoder outputs by considering the agreement among capsules. The CV-VAE is trained within the generative adversarial paradigm (GAN), CVGAN for short. Our model is shown to perform on par with the original VQGAN (a VQ-VAE within a GAN), and CVGAN obtains higher-quality images after fewer epochs of training. We present results on the ImageNet, COCO-Stuff, and FFHQ datasets and compare the generated images with those of VQGAN. The interpretability of the training process for the latent representation is significantly increased while maintaining the structured bottleneck idea. This has practical benefits, for instance in unsupervised representation learning, where a large number of capsules may lead to the disentanglement of latent representations.

Rita Pucci, Christian Micheloni, Gian Luca Foresti, Niki Martinel
Self-Adaptive Logit Balancing for Deep Learning Robustness in Computer Vision

With the wide application of machine learning algorithms, machine learning security has become a significant issue. Vulnerability to adversarial perturbations exists in most machine learning algorithms, including cutting-edge deep neural networks. Standard defence techniques based on adversarial training need to generate adversarial examples during the training process, which requires high computational costs. This paper proposes a novel defence method using self-adaptive logit balancing and Gaussian noise boost training. The method improves the robustness of deep neural networks without high computational cost and achieves competitive results compared with adversarial training methods. Meanwhile, it enables deep learning systems to mount both proactive and reactive defences during operation: a sub-classifier is trained to determine whether the system is under attack and to detect the attack algorithm via the patterns of the Log-Softmax values. It achieves high accuracy in detecting both clean inputs and adversarial examples created by seven attack methods.

Jiefei Wei, Qinggang Meng, Luyan Yao
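
The Gaussian-noise component, as we read it from the abstract, amounts to training on noise-perturbed copies of each batch rather than on expensively generated adversarial examples. The sketch below is our hedged reading, not the authors' implementation; the noise scale and the joint clean-plus-noisy objective are assumptions.

```python
# Hedged sketch of a Gaussian noise boost training step: each batch is
# augmented with a Gaussian-perturbed copy, avoiding the cost of crafting
# adversarial examples. sigma is an illustrative assumption.
import torch
import torch.nn.functional as F

def noise_boost_step(model, optimizer, images, labels, sigma=0.1):
    noisy = images + sigma * torch.randn_like(images)
    optimizer.zero_grad()
    # Train jointly on clean and noise-boosted inputs.
    loss = F.cross_entropy(model(images), labels) \
         + F.cross_entropy(model(noisy), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```
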
Don’t Wait Until the Accident Happens: Few-Shot Classification Framework for Car Accident Inspection in a Real World

Car accident inspection is a binary classification task: recognizing whether a given car image includes a damaged surface or not. While prior studies utilized various computer vision algorithms under a fully supervised, high-data-availability regime, these studies have several limitations in real-world application. First, acquiring a large number of car accident images is challenging due to their scarcity. Second, a supervised classifier fails to recognize samples not seen a priori. To overcome these drawbacks, we propose a few-shot classification framework for the accident inspection task and illustrate several takeaways for practitioners. First, we designed a few-shot classification framework and validated that our approach precisely identifies accidents even when the practitioner has only a few accident images. Second, we analyzed the fine-grained discriminative characteristics between normal and accident images, showing that a fine-grained feature extractor architecture is adequate for the accident inspection task. Third, we found that the optimal image resizing strategy varies with the feature extractor architecture; we therefore recommend that practitioners be cautious when handling real-world car images. Lastly, we showed that acquiring a larger number of accident images is advantageous in few-shot classification. Based on these contributions, we expect further studies to realize the benefits of automated car part recognition in the real world shortly.

Kyung Ho Park, Hyunhee Chung
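
The abstract does not name the few-shot algorithm, so the sketch below uses a standard prototypical-network episode as a generic stand-in: class prototypes are the mean embeddings of the few support images (e.g., the practitioner's handful of accident photos), and queries are classified by nearest prototype. The embedder and shapes are assumptions.

```python
# Generic prototypical-network inference episode (a stand-in, not the
# paper's method). Assumes integer labels 0..C-1, each present in the
# support set, and an embedder mapping images to feature vectors.
import torch

def prototypical_predict(embedder, support_images, support_labels, query_images):
    z_support = embedder(support_images)          # (n_support, d)
    z_query = embedder(query_images)              # (n_query, d)
    prototypes = torch.stack([
        z_support[support_labels == c].mean(dim=0)
        for c in torch.unique(support_labels)
    ])                                            # (n_classes, d)
    # Negative Euclidean distance acts as the class score.
    distances = torch.cdist(z_query, prototypes)  # (n_query, n_classes)
    return (-distances).argmax(dim=1)
```
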
Robust Object Detection with Multi-input Multi-output Faster R-CNN

Recent years have seen impressive progress in visual recognition on many benchmarks; however, generalization to the out-of-distribution setting remains a significant challenge. A state-of-the-art method for robust visual recognition is model ensembling. However, it was recently shown that similarly competitive results can be achieved at a much smaller cost by using a multi-input multi-output (MIMO) architecture. In this work, a generalization of the MIMO approach is applied to the task of object detection using the general-purpose Faster R-CNN model. We show that the MIMO framework allows building strong feature representations and obtains very competitive accuracy when using just two input/output pairs. Furthermore, it adds just 0.5% additional model parameters and increases the inference time by only 15.9% compared to the standard Faster R-CNN. It also matches or outperforms the Deep Ensemble approach in terms of model accuracy, robustness to the out-of-distribution setting, and uncertainty calibration when the same number of predictions is used. This work opens up avenues for applying the MIMO approach to other high-level tasks such as semantic segmentation and depth estimation.

Sebastian Cygert, Andrzej Czyżewski
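
The core MIMO idea is easiest to see on a plain classifier before generalizing it to detection: several images share one backbone by being stacked along the channel axis, and a separate head predicts the label of each. The toy module below illustrates that structure only; the backbone, shapes, and two-member setup are assumptions, not the paper's Faster R-CNN variant.

```python
# Toy sketch of the MIMO principle (the paper extends it to Faster R-CNN):
# two inputs are concatenated channel-wise, processed by one shared
# backbone, and read out by two independent heads.
import torch
import torch.nn as nn

class MimoClassifier(nn.Module):
    def __init__(self, num_classes, num_members=2):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3 * num_members, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # One lightweight head per ensemble member.
        self.heads = nn.ModuleList(
            [nn.Linear(64, num_classes) for _ in range(num_members)])

    def forward(self, image_list):
        # image_list: num_members tensors of shape (B, 3, H, W).
        features = self.backbone(torch.cat(image_list, dim=1))
        return [head(features) for head in self.heads]

# At test time the SAME image is fed to every input slot and the member
# predictions are averaged, giving an ensemble at near single-model cost.
```
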
Multiple Input Branches Shift Graph Convolutional Network with DropEdge for Skeleton-Based Action Recognition

Graph Convolutional Networks (GCNs) achieve remarkable success in skeleton-based action recognition tasks. However, recent state-of-the-art (SOTA) methods for this task usually have a large model size and heavy computational complexity. In this work, we propose an early-fused model, the Multiple Input Branches Shift Graph Convolutional Network with DropEdge (MIBSD-GCN). First, to reduce the complexity of the multi-stream model, we introduce a lightweight Shift Graph Convolutional Network (Shift-GCN) block. It is embedded into an early-fused architecture, Multiple Input Branches (MIB), which enriches the input features and suppresses model redundancy. Then, a novel spherical coordinate representation is added as one of the input branches to enhance recognition. Finally, we design the Shift Graph Convolutional Network with DropEdge (SD-GCN) to prevent over-fitting and over-smoothing while maintaining model accuracy. Extensive experiments on two large-scale datasets, NTU RGB+D 60 and NTU RGB+D 120, show that the proposed model outperforms previous SOTA methods. We achieve 96.6% accuracy on the cross-view benchmark of NTU RGB+D 60 while requiring 3.4–16.5 times fewer FLOPs than other SOTA models.

Yan Liu, Yuelin Deng, Jinping Su, Ruonan Wang, Chi Li
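
DropEdge, the regularizer the SD-GCN block relies on, can be stated in a few lines: during training, a random subset of edges is zeroed in the adjacency matrix before graph convolution. The sketch below is a generic, simplified version (real implementations usually drop edges symmetrically and renormalize); the drop rate is an assumption.

```python
# Minimal, generic DropEdge on a skeleton-graph adjacency matrix. Real
# implementations typically drop edges symmetrically and renormalize the
# adjacency; this sketch omits that for brevity.
import torch

def drop_edge(adjacency, drop_rate=0.1, training=True):
    # adjacency: (V, V) weighted adjacency of the skeleton graph.
    if not training or drop_rate == 0.0:
        return adjacency
    mask = (torch.rand_like(adjacency) >= drop_rate).float()
    return adjacency * mask
```
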
Contrastive Supervised Distillation for Continual Representation Learning

In this paper, we propose a novel training procedure for the continual representation learning problem, in which a neural network model is sequentially learned to alleviate catastrophic forgetting in visual search tasks. Our method, called Contrastive Supervised Distillation (CSD), reduces feature forgetting while learning discriminative features. This is achieved by leveraging label information in a distillation setting in which the student model is contrastively learned from the teacher model. Extensive experiments show that CSD performs favorably in mitigating catastrophic forgetting, outperforming current state-of-the-art methods. Our results also provide further evidence that feature forgetting evaluated in visual retrieval tasks is not as catastrophic as in classification tasks. Code at: https://github.com/NiccoBiondi/ContrastiveSupervisedDistillation .

Tommaso Barletti, Niccoló Biondi, Federico Pernici, Matteo Bruni, Alberto Del Bimbo
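
One plausible reading of "the student is contrastively learned from the teacher using labels" is a supervised contrastive term between student and teacher embeddings: same-class teacher embeddings act as positives, different-class ones as negatives. The sketch below is that hedged reading, not the authors' exact CSD loss; the temperature and normalization are assumptions.

```python
# Hedged sketch of a supervised contrastive distillation term: student
# embeddings are pulled toward teacher embeddings of the same class and
# pushed away from the rest. Not the paper's exact formulation.
import torch
import torch.nn.functional as F

def contrastive_distillation_loss(student_emb, teacher_emb, labels, tau=0.1):
    s = F.normalize(student_emb, dim=1)           # (B, d)
    t = F.normalize(teacher_emb.detach(), dim=1)  # (B, d), teacher frozen
    logits = s @ t.T / tau                        # (B, B) similarities
    # Positives: teacher embeddings sharing the anchor's label.
    positives = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Average log-likelihood of the positive pairs for each anchor.
    return -(log_prob * positives).sum(1).div(positives.sum(1)).mean()
```
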
A Comparison of Deep Learning Methods for Inebriation Recognition in Humans

Excessive alcohol consumption leads to inebriation. Driving under the influence of alcohol is a criminal offence in many countries: it involves operating a motor vehicle while inebriated to a level that makes doing so safely extremely difficult. Studies show that traffic accidents will become the fifth most significant cause of death if inebriated driving is not mitigated; conversely, 70% of the world population could be protected by mitigating it. Short-term effects of inebriation include impaired balance, inhibition and fine motor coordination, dilated pupils, and a slow heart rate. An ideal inebriation recognition method operates in real time and is unintrusive, convenient, and efficient. Deep learning has been used to solve object detection, object recognition, object tracking, and image segmentation problems. In this paper, we compare deep learning methods for inebriation recognition. We implemented Faster R-CNN and YOLO pipelines for our experiment and created a dataset of sober and inebriated individuals, which we have made available to the public. Six thousand four hundred and forty-three (6443) face images were used, and our best-performing pipeline was YOLO, with a 99.6% accuracy rate.

Zibusiso Bhango, Dustin van der Haar
Enhanced Data-Recalibration: Utilizing Validation Data to Mitigate Instance-Dependent Noise in Classification

This paper proposes a practical approach to dealing with instance-dependent noise in classification. Supervised learning with noisy labels is one of the major research topics in the deep learning community. While older works typically assume class-conditional, instance-independent noise, recent works provide theoretical and empirical proof that the noise in real-world cases is instance-dependent. Current state-of-the-art methods for dealing with instance-dependent noise focus on data-recalibrating strategies that iteratively correct labels while training the network. While some methods provide theoretical analysis to prove that each iteration results in a cleaner dataset and a better-performing network, their limiting assumptions and dependency on knowledge about the noise for hyperparameter tuning often contradict their claims. The method proposed in this paper is a two-stage data-recalibration algorithm that utilizes validation data to correct noisy labels and refine the model iteratively. The algorithm works by training the network on the latest cleansed training set to obtain better performance on a small, clean validation set, while using the best-performing model to cleanse the training set for the next iteration. The intuition behind the method is that a network with decent performance on the clean validation set can be utilized as an oracle network to generate less noisy labels for the training set. While no theoretical guarantee is attached, the method’s effectiveness is demonstrated with extensive experiments on synthetic and real-world benchmark datasets. The empirical evaluation suggests that the proposed method performs better than current state-of-the-art works. The implementation is available at https://github.com/Sbakhshigermi/EDR .

Saeed Bakhshi Germi, Esa Rahtu
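
The two-stage loop the abstract describes reduces to a compact control flow, sketched below. The callbacks train_fn (ordinary supervised training), eval_fn (accuracy on the small, clean validation set), and relabel_fn (replacing training labels with the model's predictions) are hypothetical stand-ins for the paper's components.

```python
# Sketch of the two-stage data-recalibration loop described above. The
# three callbacks are hypothetical placeholders, not the repository's API.
def enhanced_data_recalibration(model, train_set, val_set,
                                train_fn, eval_fn, relabel_fn,
                                n_iterations=5):
    best_model, best_score = model, float("-inf")
    for _ in range(n_iterations):
        # Stage 1: fit the network on the latest cleansed training set.
        model = train_fn(model, train_set)
        score = eval_fn(model, val_set)
        if score > best_score:
            best_model, best_score = model, score
        # Stage 2: the best model so far acts as an oracle and produces
        # less noisy labels for the next iteration's training set.
        train_set = relabel_fn(best_model, train_set)
    return best_model
```
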
DMSANet: Dual Multi Scale Attention Network

Attention mechanisms have lately become quite popular in the computer vision community. A lot of work has been done to improve network performance, although it almost always results in increased computational complexity. In this paper, we propose a new attention module that not only achieves the best performance but also has fewer parameters than most existing models. Our attention module can easily be integrated with other convolutional neural networks because of its lightweight nature. The proposed network, named Dual Multi Scale Attention Network (DMSANet), comprises two parts: the first extracts features at various scales and aggregates them; the second uses spatial and channel attention modules in parallel to adaptively integrate local features with their global dependencies. We benchmark our network on image classification with the ImageNet dataset and on object detection and instance segmentation with the MS COCO dataset.

Abhinav Sagar
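
The "spatial and channel attention in parallel" structure of the second part can be illustrated with a generic block. The module below is for intuition only, not the DMSANet design: the reduction ratio, kernel size, and the additive fusion are assumptions.

```python
# Generic parallel channel/spatial attention block, illustrating the
# structure described above. Not the DMSANet module; all shapes and the
# fusion rule are illustrative assumptions.
import torch
import torch.nn as nn

class ParallelAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        # Channel attention: squeeze spatially, excite per channel.
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid(),
        )
        # Spatial attention: one gating map over the H x W locations.
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid(),
        )

    def forward(self, x):
        # Both branches run in parallel on the same features; their gated
        # outputs are summed, mixing local and global dependencies.
        return x * self.channel(x) + x * self.spatial(x)
```
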
Towards Latent Space Optimization of GANs Using Meta-Learning

The necessity of very large datasets for training Generative Adversarial Networks (GANs) has limited their use in cases where the available data are scarce or poorly labelled (e.g., in real-life applications). Recently, meta-learning has proved effective for few-shot classification problems, but its use in noise-to-image generation has only been partially explored. In this paper, we take a first step toward applying a meta-learning algorithm (Reptile) to the discriminator of a GAN and to a mapping network, in order to optimize the random noise z so as to guide the generator network into producing images belonging to specific classes. By doing so, we show that the latent space distribution is crucial for the generation of sharp samples when few training data are available, and we manage to generate samples of previously unseen classes just by optimizing the latent space, without changing any parameter in the generator network. Finally, we show several experiments with two widely used datasets: MNIST and Omniglot.

Tomaso Fontanini, Claudio Praticò, Andrea Prati
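
Reptile itself is simple enough to state in code: run a few inner SGD steps on a task, then move the initial weights a fraction of the way toward the adapted weights. The sketch below is the generic algorithm applied to an arbitrary module (here it would be the mapping network or discriminator); inner_loss_fn and the step sizes are assumptions.

```python
# Generic Reptile outer update (Nichol et al.), sketched for any module.
# inner_loss_fn(model, batch) -> scalar loss is a hypothetical callback;
# learning rates and step counts are illustrative assumptions.
import copy
import torch

def reptile_step(model, task_batch, inner_loss_fn,
                 inner_lr=0.01, inner_steps=5, meta_lr=0.1):
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for _ in range(inner_steps):
        opt.zero_grad()
        inner_loss_fn(adapted, task_batch).backward()
        opt.step()
    # Reptile update: theta <- theta + meta_lr * (theta_adapted - theta)
    with torch.no_grad():
        for p, p_adapted in zip(model.parameters(), adapted.parameters()):
            p.add_(meta_lr * (p_adapted - p))
```
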
Pruning in the Face of Adversaries

The vulnerability of deep neural networks to adversarial examples – inputs with small, imperceptible perturbations – has recently gained a lot of attention in the research community. Simultaneously, the number of parameters of state-of-the-art deep learning models has been growing massively, with implications for the memory and computational resources required to train and deploy such models. One approach to controlling the size of neural networks is retrospectively reducing the number of parameters, so-called neural network pruning. Available research on the impact of neural network pruning on adversarial robustness is fragmentary and often does not adhere to established principles of robustness evaluation. We close this gap by evaluating the robustness of pruned models against $\ell^0$, $\ell^2$, and $\ell^\infty$ attacks for a wide range of attack strengths, several architectures, datasets, pruning methods, and compression rates. Our results confirm that neural network pruning and adversarial robustness are not mutually exclusive. Instead, sweet spots can be found that are favorable in terms of both model size and adversarial robustness. Furthermore, we extend our analysis to situations that incorporate additional assumptions on the adversarial scenario and show that, depending on the situation, different strategies are optimal.

Florian Merkle, Maximilian Samsinger, Pascal Schöttle
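
To make "pruning method and compression rate" concrete, the snippet below produces the kind of pruned model such a study evaluates: global L1 magnitude pruning via torch.nn.utils.prune at a chosen rate. The backbone and the 50% rate are illustrative; the attack evaluation itself is omitted.

```python
# Example of preparing a pruned model for robustness evaluation: global
# unstructured L1 pruning of all conv weights. Model and rate are
# illustrative assumptions.
import torch.nn as nn
import torch.nn.utils.prune as prune
import torchvision

model = torchvision.models.resnet18(num_classes=10)
# Prune 50% of all conv weights globally by L1 magnitude.
conv_params = [(m, "weight") for m in model.modules()
               if isinstance(m, nn.Conv2d)]
prune.global_unstructured(conv_params,
                          pruning_method=prune.L1Unstructured, amount=0.5)
# Make the pruning permanent (removes the reparametrization masks).
for module, name in conv_params:
    prune.remove(module, name)
```
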
Grad2VAE: An Explainable Variational Autoencoder Model Based on Online Attentions Preserving Curvatures of Representations

Unsupervised learning (UL) is a class of machine learning (ML) that learns from data, reduces dimensionality, and visualizes decisions without labels. Among UL models, the variational autoencoder (VAE) is regulated by variational inference to approximate the posterior distribution of large datasets. In this paper, we propose a novel explainable artificial intelligence (XAI) method to visually explain VAE behavior based on the second-order derivative of the latent space with respect to the encoding layers, which reflects the amount of acceleration required from the encoding to the decoding space. Our model is termed Grad$_2$VAE, and it captures the local curvatures of the representations to build online attention maps that visually explain the model’s behavior. Besides the VAE explanation, we employ our method for anomaly detection, where our model outperforms recent UL deep models when generalized to large-scale anomaly data.

Mohanad Abukmeil, Stefano Ferrari, Angelo Genovese, Vincenzo Piuri, Fabio Scotti

Image Processing for Cultural Heritage

Frontmatter
The AIRES-CH Project: Artificial Intelligence for Digital REStoration of Cultural Heritages Using Nuclear Imaging and Multidimensional Adversarial Neural Networks

Artificial Intelligence for digital REStoration of Cultural Heritage (AIRES-CH) aims at building a web-based app for the digital restoration of pictorial artworks through computer vision technologies applied to physical imaging raw data. Physical imaging techniques, such as XRF, PIXE, PIGE, and FTIR, are capable of exploring a wide range of wavelengths, providing spectra that are used to infer the chemical composition of the pigments. A multidimensional neural network, specifically designed to automatically restore damaged or hidden pictorial works, will be deployed on the INFN-CHNet Cloud as a web service, freely available to authenticated researchers. In this contribution, we report the status of the project, its current results, its development plans, and future prospects.

Alessandro Bombini, Lucio Anderlini, Luca dell’Agnello, Francesco Giacomini, Chiara Ruberto, Francesco Taccetti
Automatic Classification of Fresco Fragments: A Machine and Deep Learning Study

The reconstruction of destroyed frescoes is a complex task: very small fragments, irregular shapes, color alterations, and missing pieces are only some of the possible problems to deal with. An important preliminary step involves the separation of mixed fragments. In a real scenario, such as a church destroyed by an earthquake, pieces of different frescoes that were close together on the same wall are likely to end up mixed, making their reconstruction more complex. Their separation may be especially difficult if there are many of them and if there are no (or only very old) reference images of the original frescoes. A possible way to separate the fragments is to treat this problem as a stylistic classification task, in which we have only parts of an artwork instead of a complete one. In this work, we tested various machine and deep learning solutions on the DAFNE dataset (to date the largest open-access collection of artificially fragmented fresco images). The experiments showed promising results, with good performance in both binary and multi-class classification.

Lucia Cascone, Piercarlo Dondi, Luca Lombardi, Fabio Narducci
Unsupervised Multi-camera Domain Adaptation for Object Detection in Cultural Sites

Domain adaptation approaches can be used to efficiently train object detectors by leveraging labeled synthetic images, inexpensively generated from 3D models, together with unlabeled real images, which are cheaper to obtain than labeled ones. Most state-of-the-art techniques consider only one source and one target domain for the adaptation task. However, real-world scenarios, such as applications in cultural sites, naturally involve many target domains, which arise from the use of different cameras at inference time (e.g., different wearable devices and different smartphones on which the algorithm will be deployed). In this work, we investigate whether the availability of multiple unlabeled target domains can improve domain adaptive object detection algorithms. To study the problem, we propose a new dataset comprising images of 16 different objects rendered from a 3D model as well as images collected in the real environment using two different cameras. We experimentally show that current domain adaptive object detectors can improve their performance by leveraging the multiple targets. As evidence of the usefulness of explicitly considering multiple target domains, we propose a new unsupervised multi-camera domain adaptation approach for object detection which outperforms current methods. Code and dataset are available at https://iplab.dmi.unict.it/OBJ-MDA/ .

Giovanni Pasqualino, Antonino Furnari, Giovanni Maria Farinella

Robot Vision

Frontmatter
Leveraging Road Area Semantic Segmentation with Auxiliary Steering Task

The robustness of different pattern recognition methods is one of the key challenges in autonomous driving, especially across the wide variety of road environments and weather conditions, such as gravel roads and snowfall. Although data from these adverse conditions can be collected using cars equipped with sensors, annotating the data for training is quite tedious. In this work, we address this limitation and propose a CNN-based method that leverages steering wheel angle information to improve road-area semantic segmentation. As steering wheel angle data can easily be acquired together with the associated images, one can improve the accuracy of road-area semantic segmentation by collecting data in new road environments without manual annotation. We demonstrate the effectiveness of the proposed approach on two challenging autonomous driving datasets and show that using the steering task in our segmentation model training leads to a 0.1–2.9% gain in road-area mIoU (mean Intersection over Union) compared to the corresponding reference transfer learning model.

Jyri Maanpää, Iaroslav Melekhov, Josef Taher, Petri Manninen, Juha Hyyppä
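
The auxiliary-task setup described above is a standard multi-task pattern: a shared encoder feeds both a segmentation decoder and a small steering-angle regression head, and the two losses are combined. The toy model below illustrates only that structure; all module shapes and the loss weight are assumptions, not the paper's architecture.

```python
# Toy sketch of segmentation with an auxiliary steering task: one shared
# encoder, two heads, one combined loss. Shapes and aux_w are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RoadSegWithSteering(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Decoder predicts a per-pixel road / not-road map.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )
        # Auxiliary head regresses the steering wheel angle.
        self.steering = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))

    def forward(self, x):
        features = self.encoder(x)
        return self.decoder(features), self.steering(features)

def multitask_loss(seg_logits, seg_target, angle_pred, angle_target, aux_w=0.3):
    seg = F.binary_cross_entropy_with_logits(seg_logits, seg_target)
    aux = F.mse_loss(angle_pred.squeeze(1), angle_target)
    return seg + aux_w * aux  # steering supervision needs no manual labels
```
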
Embodied Navigation at the Art Gallery

Embodied agents, trained to explore and navigate indoor photorealistic environments, have achieved impressive results on standard datasets and benchmarks. So far, experiments and evaluations have involved domestic and working scenes such as offices, flats, and houses. In this paper, we build and release a new 3D space with unique characteristics: that of a complete art museum. We name this environment ArtGallery3D (AG3D). Compared with existing 3D scenes, the collected space is ampler, richer in visual features, and provides very sparse occupancy information. This is challenging for occupancy-based agents, which are usually trained in crowded domestic environments with plenty of occupancy information. Additionally, we annotate the coordinates of the main points of interest inside the museum, such as paintings, statues, and other items. Thanks to this manual process, we deliver a new benchmark for PointGoal navigation inside this new space. Trajectories in this dataset are far more complex and lengthy than existing ground-truth paths for navigation in Gibson and Matterport3D. We carry out an extensive experimental evaluation using our new space and show that existing methods hardly adapt to this scenario. As such, we believe that the availability of this 3D model will foster future research and help improve existing solutions.

Roberto Bigazzi, Federico Landi, Silvia Cascianelli, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara
Relaxing the Forget Constraints in Open World Recognition

In the last few years, deep neural networks have significantly improved the state of the art in robotic vision. However, they are mainly trained to recognize only the categories provided in the training set (the closed-world assumption), leaving them ill-equipped to operate in the real world, where new unknown objects may appear over time. In this work, we investigate the open world recognition (OWR) problem, which presents two challenges: (i) learning new concepts over time (incremental learning) and (ii) discerning between known and unknown categories (open set recognition). Current state-of-the-art OWR methods address incremental learning by employing a knowledge distillation loss that forces the model to keep the same predictions across training steps in order to maintain the acquired knowledge. This behaviour may induce the model to mimic uncertain predictions, preventing it from reaching an optimal representation of the new classes. To overcome this limitation, we propose the Poly loss, which penalizes changes in the predictions less for uncertain samples while forcing the same output on confident ones. Moreover, we introduce a forget-constraint relaxation strategy that allows the model to obtain a better representation of new classes by randomly zeroing the contribution of some old classes in the distillation loss. Finally, while current methods rely on metric learning to detect unknown samples, we propose a new rejection strategy that sidesteps it and directly uses the model classifier to estimate whether a sample is known or not. Experiments on three datasets demonstrate that our method outperforms the state of the art.

Dario Fontanel, Fabio Cermelli, Antonino Geraci, Mauro Musarra, Matteo Tarantino, Barbara Caputo
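
The forget-constraint relaxation can be sketched as a standard distillation term in which a random subset of old classes is dropped from the loss at every step. The snippet below is that hedged reading only; the drop probability and temperature are assumptions, and this is not the authors' exact Poly loss.

```python
# Hedged sketch of relaxed distillation: a KL-style distillation term over
# the old classes, with a random subset of old classes zeroed out each
# step. drop_p and tau are illustrative assumptions.
import torch
import torch.nn.functional as F

def relaxed_distillation_loss(student_logits, teacher_logits,
                              num_old_classes, drop_p=0.2, tau=2.0):
    log_p = F.log_softmax(student_logits[:, :num_old_classes] / tau, dim=1)
    q = F.softmax(teacher_logits[:, :num_old_classes] / tau, dim=1)
    per_class = -(q * log_p)                  # (B, num_old_classes)
    # Randomly zero the contribution of some old classes this step.
    keep = torch.rand(num_old_classes, device=per_class.device) >= drop_p
    return per_class[:, keep].sum(dim=1).mean()
```
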
Memory Guided Road Segmentation

In self-driving car applications, the location of the road must be predicted from an input RGB front-facing image. We propose a framework that interleaves large and small feature extractors, assisted by a propagated shared feature space, allowing us to realize speed gains of over 2.5X with a negligible loss in prediction accuracy. By utilizing the gist of previously observed frames, we train the network to predict the current road with greater accuracy and less deviation from previous frames.

Praveen Venkatesh, Rwik Rana, Varun Jain
Learning Visual Landmarks for Localization with Minimal Supervision

Camera localization is one of the fundamental requirements for vision-based mobile robots, self-driving cars, and augmented reality applications. In this context, learning spatial representations relative to unique regions in a scene with Slow Feature Analysis (SFA) has demonstrated large-scale localization. However, it relies on hand-labeled data to train a CNN to recognize unique regions. We propose a new approach that uses pre-trained, CNN-detectable objects as anchors to label and learn new landmark objects or regions in a scene with minimal supervision. The method bootstraps the landmark learning process and removes the need to manually label large amounts of data. The anchor objects are required only to learn the new landmarks and become obsolete in the unsupervised mapping and localization phases. We present localization results with the learned landmarks in simulated and real-world outdoor environments and compare them to SFA on complete images and to PoseNet. Landmark-based localization shows similar or better accuracy than the baseline methods in challenging scenarios. Our results further suggest that the approach scales well and achieves even higher localization accuracy as the number of learned landmarks increases, without increasing the number of anchors.

Muhammad Haris, Mathias Franzius, Ute Bauer-Wersing
Backmatter
Metadata
Title
Image Analysis and Processing – ICIAP 2022
Editors
Prof. Stan Sclaroff
Cosimo Distante
Marco Leo
Dr. Giovanni M. Farinella
Prof. Dr. Federico Tombari
Copyright Year
2022
Electronic ISBN
978-3-031-06427-2
Print ISBN
978-3-031-06426-5
DOI
https://doi.org/10.1007/978-3-031-06427-2
