2021 | Book

Pattern Recognition. ICPR International Workshops and Challenges

Virtual Event, January 10–15, 2021, Proceedings, Part VI

Editors: Prof. Alberto Del Bimbo, Prof. Rita Cucchiara, Prof. Stan Sclaroff, Dr. Giovanni Maria Farinella, Tao Mei, Prof. Dr. Marco Bertini, Hugo Jair Escalante, Dr. Roberto Vezzani

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science

About this book

This 8-volume set constitutes the refereed proceedings of the 25th International Conference on Pattern Recognition Workshops, ICPR 2020, held virtually in Milan, Italy, and rescheduled to January 10–15, 2021, due to the Covid-19 pandemic. The 416 full papers presented in these 8 volumes were carefully reviewed and selected from about 700 submissions. The 46 workshops cover a wide range of areas, including machine learning, pattern analysis, healthcare, human behavior, environment, surveillance, forensics and biometrics, robotics and egovision, cultural heritage and document analysis, retrieval, and women at ICPR 2020.

Table of Contents

Frontmatter

MAES - Machine Learning Advances Environmental Science

Frontmatter
Finding Relevant Flood Images on Twitter Using Content-Based Filters

The analysis of natural disasters such as floods in a timely manner often suffers from limited data due to coarsely distributed sensors or sensor failures. At the same time, a plethora of information is buried in an abundance of images of the event posted on social media platforms such as Twitter. These images could be used to document and rapidly assess the situation and derive proxy-data not available from sensors, e.g., the degree of water pollution. However, not all images posted online are suitable or informative enough for this purpose. Therefore, we propose an automatic filtering approach using machine learning techniques for finding Twitter images that are relevant for one of the following information objectives: assessing the flooded area, the inundation depth, and the degree of water pollution. Instead of relying on textual information present in the tweet, the filter analyzes the image contents directly. We evaluate the performance of two different approaches and various features in a case study of two major flooding events. Our image-based filter is able to enhance the quality of the results substantially compared with a keyword-based filter, improving the mean average precision from 23% to 53% on average.

Björn Barz, Kai Schröter, Ann-Christin Kra, Joachim Denzler
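
As a rough illustration of the kind of content-based relevance filter and ranking metric described in the abstract above, the sketch below trains a simple classifier on pre-computed image features and scores images by predicted relevance. The feature matrix, the labels, and the use of logistic regression are placeholders for illustration, not the authors' actual pipeline.

```python
# Hypothetical sketch: rank candidate images by predicted relevance and report
# average precision. Features could come from any pretrained CNN; here they are
# only placeholder NumPy arrays.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 512))   # placeholder CNN features
y_train = rng.integers(0, 2, size=200)  # 1 = relevant for the chosen objective
X_test = rng.normal(size=(100, 512))
y_test = rng.integers(0, 2, size=100)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = clf.predict_proba(X_test)[:, 1]          # relevance score per image
print("AP:", average_precision_score(y_test, scores))
```
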
Natural Disaster Classification Using Aerial Photography Explainable for Typhoon Damaged Feature

In recent years, typhoon damage has become a social problem owing to climate change. On 9 September 2019, Typhoon Faxai passed over Chiba in Japan; its damage included power outages caused by strong winds with a recorded maximum of 45 m/s. A large number of trees fell, and neighbouring electric poles fell with them. Because of these disaster features, recovery took 18 days, longer than for past events. Immediate responses are important for faster recovery, so aerial surveys for global screening of the devastated region are required to support decisions on where to recover first. This paper proposes a practical method to visualize the damaged areas, focused on the typhoon disaster features, using aerial photography. The method classifies eight classes, covering both undamaged land covers and damaged areas. Using the target feature class probabilities, we can visualize a disaster feature map scaled over a colour range. Furthermore, we can produce an explainable map for each unit grid image by computing the convolutional activation map using Grad-CAM. We demonstrate case studies applied to aerial photographs recorded in the Chiba region after the typhoon.

Takato Yasuno, Masazumi Amakata, Masahiro Okano
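
The explainable map described above relies on Grad-CAM. The sketch below shows one common way to compute a Grad-CAM heatmap in PyTorch; the backbone, the target layer, and the input are placeholders (assuming a recent torchvision API), not the authors' implementation.

```python
# Minimal Grad-CAM sketch: gradient-weighted activation map of a conv layer.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()      # placeholder backbone
x = torch.randn(1, 3, 224, 224)            # placeholder aerial image tile

# run the backbone manually so the target conv activation stays in the graph
feats = model.conv1(x)
feats = model.maxpool(model.relu(model.bn1(feats)))
feats = model.layer4(model.layer3(model.layer2(model.layer1(feats))))
logits = model.fc(torch.flatten(model.avgpool(feats), 1))

score = logits[0, logits.argmax()]                      # top-class score
grads = torch.autograd.grad(score, feats)[0]            # d(score)/d(activation)

weights = grads.mean(dim=(2, 3), keepdim=True)          # global-average-pooled gradients
cam = F.relu((weights * feats).sum(dim=1, keepdim=True))
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # heatmap normalised to [0, 1]
```
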
Environmental Time Series Prediction with Missing Data by Machine Learning and Dynamics Reconstruction

Environmental time series are often affected by missing data, namely data unavailability at certain time points. In this paper, an Iterated Prediction and Imputation algorithm is presented that makes time series prediction possible in the presence of missing data. The algorithm uses Dynamics Reconstruction and Machine Learning methods for estimating the model order and the skeleton of the time series, respectively. Experimental validation of the algorithm on an environmental time series with missing data, expressing the concentration of Ozone at a European site, shows an average percentage prediction error of 0.45% on the test set.

Francesco Camastra, Vincenzo Capone, Angelo Ciaramella, Tony Christian Landi, Angelo Riccio, Antonino Staiano
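
A rough sketch of an iterated prediction-and-imputation loop in the spirit of the abstract above: missing values are first filled by interpolation, a regressor is trained on delay-embedded windows of the observed points, and the missing points are re-predicted and re-imputed until the filled values stabilise. The embedding order, the regressor, and the convergence criterion are illustrative assumptions, not the authors' algorithm.

```python
# Hypothetical iterated prediction/imputation sketch for a univariate series.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def iterated_imputation(x, order=5, n_iter=10):
    x = x.astype(float).copy()
    missing = np.isnan(x)
    idx = np.arange(len(x))
    # initial fill: linear interpolation between observed points
    x[missing] = np.interp(idx[missing], idx[~missing], x[~missing])
    for _ in range(n_iter):
        # delay-embedded (window -> next value) pairs; train only on observed targets
        X = np.array([x[i - order:i] for i in range(order, len(x))])
        y = x[order:]
        observed = ~missing[order:]
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(X[observed], y[observed])
        # re-predict and re-impute the originally missing points
        preds = model.predict(X)
        new_x = x.copy()
        new_x[order:][missing[order:]] = preds[missing[order:]]
        if np.max(np.abs(new_x - x)) < 1e-6:
            break
        x = new_x
    return x

series = np.sin(np.linspace(0, 20, 300)) + 0.05 * np.random.randn(300)
series[np.random.rand(300) < 0.1] = np.nan   # simulate 10% missing data
filled = iterated_imputation(series)
```
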
Semi-Supervised Learning for Grain Size Distribution Interpolation

High-resolution grain size distribution maps for geographical regions are used to model soil-hydrological processes that can be used in climate models. However, measurements are expensive or impossible, which is why interpolation methods are used to fill the gaps between known samples. Common interpolation methods can handle such tasks with few data points since they make strong modeling assumptions regarding soil properties and environmental factors. Neural networks potentially achieve better results as they do not rely on these assumptions and approximate non-linear relationships from data. However, their performance is often severely limited for tasks like grain size distribution interpolation due to their requirement for many training examples. Semi-supervised learning may improve their performance on this task by taking widely available unlabeled auxiliary data (e.g. altitude) into account. We propose a novel semi-supervised training strategy for spatial interpolation tasks that pre-trains a neural network on weak labels obtained by methods with stronger assumptions and then fine-tunes the network on the small labeled dataset. In our research area, our proposed strategy improves the performance of a supervised neural network and outperforms other commonly used interpolation methods.

Konstantin Kobs, Christian Schäfer, Michael Steininger, Anna Krause, Roland Baumhauer, Heiko Paeth, Andreas Hotho
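
The pre-train-on-weak-labels, then fine-tune strategy described above can be sketched roughly as follows. The network, the weak-label source (e.g., a classical interpolator), and all hyperparameters below are placeholders for illustration.

```python
# Hypothetical sketch of "pre-train on weak labels, fine-tune on few true labels".
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 3))  # 3 grain-size fractions
loss_fn = nn.MSELoss()

# Stage 1: weak labels produced by a classical interpolator at many auxiliary points.
X_weak = torch.randn(5000, 8)          # auxiliary features (e.g., coordinates, altitude)
y_weak = torch.rand(5000, 3)           # weak labels from the classical method
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    loss_fn(net(X_weak), y_weak).backward()
    opt.step()

# Stage 2: fine-tune on the small set of measured samples with a lower learning rate.
X_true = torch.randn(100, 8)
y_true = torch.rand(100, 3)
opt = torch.optim.Adam(net.parameters(), lr=1e-4)
for _ in range(200):
    opt.zero_grad()
    loss_fn(net(X_true), y_true).backward()
    opt.step()
```
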
Location-Specific vs Location-Agnostic Machine Learning Metamodels for Predicting Pasture Nitrogen Response Rate

In this work we compare the performance of a location-specific and a location-agnostic machine learning metamodel for crop nitrogen response rate prediction. We conduct a case study for grass-only pasture in several locations in New Zealand. We generate a large dataset of APSIM simulation outputs and train machine learning models based on that data. Initially, we examine how the models perform at the location where the location-specific model was trained. We then perform the Mann–Whitney U test to see if the difference in the predictions of the two models (i.e. location-specific and location-agnostic) is significant. We expand this procedure to other locations to investigate the generalization capability of the models. We find that there is no statistically significant difference in the predictions of the two models. This is both interesting and useful because the location-agnostic model generalizes better than the location-specific model which means that it can be applied to virgin sites with similar confidence to experienced sites.

Christos Pylianidis, Val Snow, Dean Holzworth, Jeremy Bryant, Ioannis N. Athanasiadis
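
A minimal example of the significance test mentioned above, comparing two models' predictions with the Mann–Whitney U test from SciPy; the prediction arrays here are placeholders.

```python
# Hypothetical sketch: test whether two models' prediction sets differ significantly.
import numpy as np
from scipy.stats import mannwhitneyu

preds_specific = np.random.normal(50.0, 5, size=500)  # location-specific model outputs
preds_agnostic = np.random.normal(50.2, 5, size=500)  # location-agnostic model outputs

stat, p_value = mannwhitneyu(preds_specific, preds_agnostic, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.3f}")
if p_value > 0.05:
    print("No statistically significant difference between the two prediction sets.")
```
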
Pattern Classification from Multi-beam Acoustic Data Acquired in Kongsfjorden

Climate change is causing a structural change in Arctic ecosystems, decreasing the effectiveness that the polar regions have in cooling water masses, with inevitable repercussions on the climate and an impact on marine biodiversity. The Svalbard islands under study are an area greatly influenced by Atlantic waters. This area is undergoing changes that are modifying the composition and distribution of the species present. The aim of this work is to provide a method for the classification of acoustic patterns acquired in the Kongsfjorden, Svalbard, Arctic Circle, using multibeam technology. The general objective is therefore the implementation of a methodology for identifying the acoustically reflective 3D patterns in the water column near the Kronebreen glacier. For each pattern identified, characteristic morphological and energetic quantities were extracted. All the information describing each of the patterns was divided into more or less homogeneous groupings by means of a K-means partitioning algorithm. The results obtained from clustering suggest that the most plausible interpretation divides the data set into 3 distinct clusters, relating to schools of fish. The presence of 3 different schools of fish does not allow us to state that they belong to 3 different species. The method developed and implemented in this work discriminates well among the patterns present in the water column, obtained from multibeam data, in restricted contexts similar to the study area.

Giovanni Giacalone, Giosué Lo Bosco, Marco Barra, Angelo Bonanno, Giuseppa Buscaino, Riko Noormets, Christopher Nuth, Monica Calabrò, Gualtiero Basilone, Simona Genovese, Ignazio Fontana, Salvatore Mazzola, Riccardo Rizzo, Salvatore Aronica
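
A small sketch of how the number of clusters can be chosen for pattern features with K-means, in the spirit of the clustering step described above; the feature matrix and the use of the silhouette score as an internal validation index are illustrative assumptions.

```python
# Hypothetical sketch: K-means over morphological/energetic pattern features,
# with an internal validation index used to pick the number of clusters.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

features = np.random.rand(150, 6)   # placeholder: one row per acoustic pattern

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    print(k, round(silhouette_score(features, labels), 3))
# the k with the best index value is retained (3 clusters in the paper's data)
```
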
Unsupervised Classification of Acoustic Echoes from Two Krill Species in the Southern Ocean (Ross Sea)

This work presents a computational methodology able to automatically classify the echoes of two krill species recorded in the Ross Sea, employing a scientific echo-sounder at three different frequencies (38, 120 and 200 kHz). Classifying these gregarious species is a time-consuming task, usually accomplished by using differences and/or thresholds estimated on the energy features of the insonified targets. Conversely, our methodology takes into account energy, morphological and depth features of the echo data acquired at the different frequencies. Internal clustering validation indices were used to verify the ability of the clustering to recognize the correct number of species. The proposed approach leads to the characterization of the two krill species (Euphausia superba and Euphausia crystallorophias), providing reliable indications about the species' spatial distribution and relative abundance.

Ignazio Fontana, Giovanni Giacalone, Riccardo Rizzo, Marco Barra, Olga Mangoni, Angelo Bonanno, Gualtiero Basilone, Simona Genovese, Salvatore Mazzola, Giosuè Lo Bosco, Salvatore Aronica
Multi-Input ConvLSTM for Flood Extent Prediction

Flooding is among the most destructive natural disasters in the world. The destruction that floods cause has led to an urgency in developing accurate prediction models. One aspect of flood prediction which has yet to benefit from machine learning techniques is the prediction of flood extent. However, due to the many factors that can cause flooding, developing predictive models that generalise to other potential flooding locations has proven to be a difficult task. This paper shows that a Multi-Input ConvLSTM can exploit several flood conditioning factors to effectively model flood extent while generalising well to other flood locations under certain conditions. Furthermore, this study compares the sub-components of the system to demonstrate their efficacy when applied to various flood types.

Leo Muckley, James Garforth
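
For readers unfamiliar with the ConvLSTM building block referenced above, a minimal PyTorch cell is sketched below; this is a generic ConvLSTM cell under assumed dimensions, not the paper's Multi-Input architecture.

```python
# Generic ConvLSTM cell sketch (not the Multi-Input ConvLSTM from the paper).
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # a single convolution produces all four gates at once
        self.conv = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size, padding=padding)

    def forward(self, x, state):
        h, c = state
        gates = self.conv(torch.cat([x, h], dim=1))
        i, f, o, g = torch.chunk(gates, 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g
        h = o * torch.tanh(c)
        return h, c

cell = ConvLSTMCell(in_ch=4, hid_ch=16)          # e.g., 4 flood conditioning factor rasters
x = torch.randn(2, 4, 64, 64)                    # batch of spatial inputs for one time step
h = torch.zeros(2, 16, 64, 64)
c = torch.zeros(2, 16, 64, 64)
h, c = cell(x, (h, c))                           # one recurrent step over the grid
```
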
Developing a Segmentation Model for Microscopic Images of Microplastics Isolated from Clams

Microplastics (MP) have become a major concern, given the threat they pose to marine-derived food and human health. One way to investigate this threat is to quantify MP found in marine organisms, for instance making use of image analysis to identify ingested MP in fluorescent microscopic images. In this study, we propose a deep learning-based segmentation model to generate binarized images (masks) that make it possible to clearly separate MP from other background elements in the aforementioned type of images. Specifically, we created three variants of the U-Net model with a ResNet-101 encoder, training these variants with 99 high-resolution fluorescent images containing MP, each having a mask that was generated by experts using manual color threshold adjustments in ImageJ. To that end, we leveraged a sliding window and random selection to extract patches from the high-resolution images, making it possible to adhere to input constraints and to increase the number of labeled examples. When measuring effectiveness in terms of accuracy, recall, and F2-score, all segmentation models exhibited low scores. However, compared to two ImageJ baseline methods, the effectiveness of our segmentation models was better in terms of precision, F0.5-score, F1-score, and mIoU: U-Net (1) obtained the highest mIoU of 0.559, U-Net (2) achieved the highest F1-score of 0.682, and U-Net (3) had the highest precision and F0.5-score of 0.594 and 0.626, respectively, with our segmentation models, in general, detecting fewer false positives in the predicted masks. In addition, U-Net (1), which used binary cross-entropy loss and stochastic gradient descent, and U-Net (2), which used dice loss and Adam, were most effective in discriminating MP from other background elements. Overall, our experimental results suggest that U-Net (1) and U-Net (2) allow for more effective MP identification and measurement than the macros currently available in ImageJ.

Ji Yeon Baek, Maria Krishna de Guzman, Ho-min Park, Sanghyeon Park, Boyeon Shin, Tanja Cirkovic Velickovic, Arnout Van Messem, Wesley De Neve
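
The segmentation metrics quoted above (precision, recall, F-beta scores, IoU) can be computed from binary masks as in the short sketch below; the masks here are random placeholders.

```python
# Sketch: precision, recall, F-beta and IoU for a binary predicted mask vs. ground truth.
import numpy as np

def mask_metrics(pred, gt, beta=1.0, eps=1e-8):
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    fbeta = (1 + beta**2) * precision * recall / (beta**2 * precision + recall + eps)
    iou = tp / (tp + fp + fn + eps)
    return precision, recall, fbeta, iou

pred = np.random.rand(256, 256) > 0.5   # placeholder predicted mask
gt = np.random.rand(256, 256) > 0.5     # placeholder ground-truth mask
print(mask_metrics(pred, gt, beta=0.5)) # F0.5 weighs precision more heavily than recall
```
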
A Machine Learning Approach to Chlorophyll a Time Series Analysis in the Mediterranean Sea

Understanding the dynamics of natural systems is a crucial task in ecology, especially when climate change is taken into account. In this context, assessing the evolution of marine ecosystems is pivotal since they cover a large portion of the biosphere. For these reasons, we developed an approach aimed at evaluating the temporal and spatial dynamics of remotely-sensed chlorophyll a concentration. The concentrations of this pigment are linked with phytoplankton biomass and production, which in turn play a central role in the marine environment. Machine learning techniques have proved to be valuable tools in dealing with satellite data since they need neither assumptions on data distribution nor explicit mathematical formulations. Accordingly, we exploited the Self Organizing Map (SOM) algorithm, firstly to reconstruct missing data from satellite time series of chlorophyll a and secondly to classify them. The missing data reconstruction task was performed using a large SOM and made it possible to enhance the available information by filling the gaps caused by cloud coverage. The second part of the procedure involved a much smaller SOM used as a classification tool. This dimensionality reduction enabled the analysis and visualization of over 37 000 chlorophyll a time series. The proposed approach provided insights into both temporal and spatial chlorophyll a dynamics in the Mediterranean Basin.

F. Mattei, M. Scardi
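
A rough illustration of the gap-filling idea described above: given a codebook of prototype time series (for example, the weight vectors of a trained SOM), each incomplete series is matched to its best matching unit using only the observed entries, and the gaps are filled from that prototype. The codebook below is a random placeholder rather than a trained map.

```python
# Hypothetical sketch: fill gaps in a time series from its best-matching SOM prototype.
import numpy as np

codebook = np.random.rand(100, 52)        # placeholder: 100 prototype weekly series
series = np.random.rand(52)
series[10:15] = np.nan                    # gap caused e.g. by cloud coverage

observed = ~np.isnan(series)
# masked Euclidean distance to every prototype, using observed weeks only
dists = np.linalg.norm(codebook[:, observed] - series[observed], axis=1)
bmu = codebook[np.argmin(dists)]          # best matching unit

filled = series.copy()
filled[~observed] = bmu[~observed]        # impute missing weeks from the prototype
```
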
Plankton Recognition in Images with Varying Size

Monitoring plankton is important as they are an essential part of the aquatic food web as well as producers of oxygen. Modern imaging devices produce a massive amount of plankton image data which calls for automatic solutions. These images are characterized by a very large variation in both the size and the aspect ratio. Convolutional neural network (CNN) based classification methods, on the other hand, typically require a fixed size input. Simple scaling of the images into a common size contains several drawbacks. First, the information about the size of the plankton is lost. For human experts, the size information is one of the most important cues for identifying the species. Second, downscaling the images leads to the loss of fine details such as flagella essential for species recognition. Third, upscaling the images increases the size of the network. In this work, extensive experiments on various approaches to address the varying image dimensions are carried out on a challenging phytoplankton image dataset. A novel combination of methods is proposed, showing improvement over the baseline CNN.

Jaroslav Bureš, Tuomas Eerola, Lasse Lensu, Heikki Kälviäinen, Pavel Zemčík
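
One common way to let a CNN accept inputs of varying size, related to the approaches compared above, is global adaptive pooling before the classifier, optionally concatenating the (log) image dimensions as extra features so the size information is not lost. The sketch below is a generic illustration, not the paper's proposed combination of methods.

```python
# Generic sketch: a CNN that accepts variable-sized images via adaptive pooling
# and keeps the (log) image size as an explicit feature.
import torch
import torch.nn as nn

class VarSizeNet(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool2d(1)             # works for any spatial size
        self.classifier = nn.Linear(64 + 2, n_classes)  # +2 for log height/width

    def forward(self, x):
        h, w = x.shape[-2:]
        feat = self.pool(self.features(x)).flatten(1)
        size_feat = torch.log(torch.tensor([[h, w]], dtype=torch.float32)).repeat(x.size(0), 1)
        return self.classifier(torch.cat([feat, size_feat], dim=1))

net = VarSizeNet()
print(net(torch.randn(1, 1, 120, 300)).shape)  # different sizes and aspect ratios work
print(net(torch.randn(1, 1, 64, 64)).shape)
```
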
Environment Object Detection for Marine ARGO Drone by Deep Learning

The aim of this work is to implement an environment object detection system for a marine drone. A Deep Learning based model for object detection is embedded on the ARGO drone, which is equipped with geophysical sensors and several on-board cameras. The marine drone, developed at the iMTG laboratory in partnership with the NEPTUN-IA laboratory, was designed to obtain high-resolution mapping of nearshore-to-foreshore sectors and is equipped with a system able to detect and identify Ground Control Points (GCPs) in real time. A Deep Neural Network embedded on a Raspberry Pi platform is adopted for the object detection module. Real experiments and comparisons are conducted to identify GCPs among the rough terrain and vegetation present on the seabed.

Angelo Ciaramella, Francesco Perrotta, Gerardo Pappone, Pietro Aucelli, Francesco Peluso, Gaia Mattei
Unsupervised Learning of High Dimensional Environmental Data Using Local Fractality Concept

The research deals with the exploration of high dimensional environmental data using unsupervised learning algorithms and the concept of local fractality. The proposed methodology is applied to geospatial data used for wind speed prediction in a complex mountainous region. It is shown that the approach provides important additional information on the data manifold, useful in data analysis, data visualisation and predictive modelling.

Mikhail Kanevski, Mohamed Laib
Spatiotemporal Air Quality Inference of Low-Cost Sensor Data; Application on a Cycling Monitoring Network

Air quality monitoring in heterogeneous cities is challenging, as a high resolution in both space and time is required to accurately assess population exposure. As regulatory monitoring networks are sparse due to high investment and maintenance costs, recent advances in sensor and IoT technologies have resulted in innovative sensing approaches like mobile sensing to increase the spatial monitoring resolution. An example of such an opportunistic mobile monitoring network is “Snuffelfiets”, a project where air quality data is collected from mobile sensors attached to bicycles in Utrecht (NL). The collected data results in a sparse spatiotemporal matrix of measurements which can be completed using data-driven techniques. This work reports on the potential of two machine learning approaches to infer the collected air quality measurements in both space and time: a deep learning model based on Variational Graph Autoencoders (AVGAE) and a Geographical Random Forest model (GRF). A temporal validation exercise is performed at two regulatory monitoring stations following the FAIRMODE modelling quality objectives protocol. This work demonstrates the potential of data-driven techniques for spatiotemporal air quality inference from sensor data, as the considered models performed well in terms of accuracy and correlation. The observed performance metrics approach those of current state-of-the-art physical models while requiring much lower resources, computational power, infrastructure and processing time.

Jelle Hofman, Tien Huu Do, Xuening Qin, Esther Rodrigo, Martha E. Nikolaou, Wilfried Philips, Nikos Deligiannis, Valerio Panzica La Manna
How Do Deep Convolutional SDM Trained on Satellite Images Unravel Vegetation Ecology?

Species distribution models (SDM) assess and predict how species' spatial distributions depend on the environment, due to species' ecological preferences. These models are used in many different scenarios such as conservation plans or monitoring of invasive species. The choice of a model and of environmental data has a strong impact on the model's ability to capture important ecological information. Specifically, state-of-the-art models generally rely on local, punctual environmental information and do not take into account environmental variation in the surrounding landscape. Here we use a convolutional neural network model to analyze and predict species distributions depending on high resolution data including remote sensing images, land cover and altitude. We show that the model unravels the functional response of vegetation to both local and large-scale environmental variation. To demonstrate the ecological significance of the results, we propose an original statistical analysis of t-SNE nonlinear dimension reduction. We illustrate and test the traits-species-environment relationships learned by the model and expressed in the t-SNE dimensions.

Benjamin Deneu, Alexis Joly, Pierre Bonnet, Maximilien Servajean, François Munoz

ManifLearn - Manifold Learning in Machine Learning, from Euclid to Riemann

Frontmatter
Latent Space Geometric Statistics

Deep generative models, e.g., variational autoencoders and generative adversarial networks, result in latent representations of observed data. The low dimensionality of the latent space provides an ideal setting for analysing high-dimensional data that would otherwise often be infeasible to handle statistically. The linear Euclidean geometry of the high-dimensional data space pulls back to a nonlinear Riemannian geometry on the latent space, where classical linear statistical techniques are no longer applicable. We show how analysis of data in their latent space representation can be performed using techniques from the field of geometric statistics. Geometric statistics provides generalisations of Euclidean statistical notions including means, principal component analysis, and maximum likelihood estimation of parametric distributions. Estimation procedures on the latent space are introduced, and the computational complexity of using geometric algorithms with high-dimensional data is addressed by training a separate neural network to approximate the Riemannian metric and cometric tensor capturing the shape of the learned data manifold.

Line Kühnel, Tom Fletcher, Sarang Joshi, Stefan Sommer
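
The nonlinear geometry referred to above comes from pulling the Euclidean metric of the data space back through the decoder: for a decoder f with Jacobian J(z), the latent metric is G(z) = J(z)ᵀ J(z). The sketch below computes this pullback metric for a toy decoder; the decoder itself and the latent point are placeholders.

```python
# Sketch: pullback Riemannian metric G(z) = J(z)^T J(z) of a toy decoder.
import torch
from torch.autograd.functional import jacobian

decoder = torch.nn.Sequential(            # placeholder decoder: R^2 -> R^784
    torch.nn.Linear(2, 128), torch.nn.Tanh(), torch.nn.Linear(128, 784)
)

z = torch.zeros(2)                                 # a point in latent space
J = jacobian(lambda latent: decoder(latent), z)    # Jacobian, shape (784, 2)
G = J.T @ J                                        # 2x2 metric tensor at z

# length of a small latent step under this metric: sqrt(dz^T G dz)
dz = torch.tensor([1e-2, 0.0])
print(torch.sqrt(dz @ G @ dz))
```
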
Improving Neural Network Robustness Through Neighborhood Preserving Layers

One major source of vulnerability of neural nets in classification tasks is the overparameterized fully connected layers near the end of the network. In this paper, we propose a new neighborhood preserving layer which can replace these fully connected layers to improve the network robustness. Networks including these neighborhood preserving layers can be trained efficiently. We theoretically prove that our proposed layers are more robust against distortion because they effectively control the magnitude of gradients. Finally, we empirically show that networks with our proposed layers are more robust against state-of-the-art gradient descent based attacks, such as the PGD attack, on the benchmark image classification datasets MNIST and CIFAR-10.

Bingyuan Liu, Christopher Malon, Lingzhou Xue, Erik Kruus
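
For reference, the PGD attack mentioned above iteratively takes signed-gradient steps and projects back into an L-infinity ball around the input. A generic sketch follows (standard PGD for context, not the paper's defense).

```python
# Standard L-infinity PGD attack sketch.
import torch

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    x_adv = (x.clone().detach() + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = torch.nn.functional.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv + alpha * grad.sign()               # ascend the loss
            x_adv = x.clone() + (x_adv - x).clamp(-eps, eps)  # project into the eps-ball
            x_adv = x_adv.clamp(0, 1)                         # keep a valid image
    return x_adv.detach()
```
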
Metric Learning on the Manifold of Oriented Ellipses: Application to Facial Expression Recognition

In this paper we propose a new family of metrics on the manifold of oriented ellipses centered at the origin in Euclidean n-space, the double cover of the manifold of positive semi-definite matrices of rank two, in order to measure similarities between landmark representations. The metrics, whose distance functions are remarkably simple, are parametrized by the choice of an n-by-n positive semi-definite matrix P. This allows us to learn the parameter P from the training data and increase the efficiency of the metric. We evaluate the proposed metric on facial expression recognition from 2D facial landmarks. The conducted experiments demonstrate the effectiveness of the learned metric in classifying facial shapes under different expressions.

Mohamed Daoudi, Naima Otberdout, Juan-Carlos Álvarez Paiva

MANPU - The 4th International Workshop on coMics ANalysis, Processing and Understanding

Frontmatter
An OCR Pipeline and Semantic Text Analysis for Comics

Optical character recognition has remained a challenge for comics, given the high variability of text placement on the page, the wide variety of frequently handwritten fonts, and the limited availability and small size of datasets. This paper reports on ongoing work on an OCR pipeline that includes text spotting with the help of a U-Net based fully convolutional neural network and OCR training with the open-source software Calamari, performed on the “Graphic Narrative Corpus” of book-length graphic novels written in English. Based on the results of the OCR training, we then present an analysis of the textual properties of 129 graphic novels correlated with page length, historical development, and genre affiliation.

Rita Hartel, Alexander Dunst
Manga Vocabulometer, A New Support System for Extensive Reading with Japanese Manga Translated into English

Extensive Reading, called “Tadoku” in Japan, is a method of learning a second language that improves reading speed and fluency. Japanese comics (manga) translated into English are used as one type of material for extensive reading. Using manga to learn English is considered effective because the content can be inferred from the pictures. However, some learners cannot memorize and learn all the words they encounter when they read many books. A function that automatically saves unknown words from the books they read, or creates flashcards, would therefore let them learn English more efficiently. In this paper, we introduce Manga Vocabulometer, a support system for extensive reading. It is a web-based system that allows students to choose their favorite manga to read. It also checks for unknown words, so the system can present flashcards to learners. To confirm the advantage of the proposed system, we compare two memorization methods: one using Manga Vocabulometer and the other being the traditional simple memorization method.

Jin Kato, Motoi Iwata, Koichi Kise
Automatic Landmark-Guided Face Image Generation for Anime Characters Using CGAN

Recently, comics with animations, called motion comics, have appeared. However, due to the time and effort required to create the animations, only a few famous comics have been converted to motion comics. In this study, we propose a method for generating landmark-based face images of anime characters using C²GAN, with the aim of automatically generating animations of facial expressions; C²GAN, proposed by Hao et al., is a framework for generating keypoint-guided images. In addition, this paper explains how to create datasets almost automatically from anime videos. In the experiments, we first trained C²GAN with a dataset created from anime videos, and then tried to improve performance by changing the representation of the facial landmarks.

Junki Oshiba, Motoi Iwata, Koichi Kise
Text Block Segmentation in Comic Speech Bubbles

Comics and manga text recognition is attracting increasing research and industrial interest. Moreover, state-of-the-art text detection and OCR performance is starting to be mature enough to provide automatic text recognition for a variety of comics and manga writing styles. However, comics text layout sometimes prevents the usual text line detection from being applied successfully, even within speech bubbles. In this paper, we propose a domain-specific text block detection method able to detect single and multiple text block regions inside speech bubbles, in order to enhance OCR transcription and further post-processing. This approach presents very satisfactory results on all tested bubble styles from Latin and non-Latin scripts.

Christophe Rigaud, Nhu-Van Nguyen, Jean-Christophe Burie

MMDLCA - Multi-modal Deep Learning: Challenges and Applications

Frontmatter
Hierarchical Consistency and Refinement for Semi-supervised Medical Segmentation

Semi-supervised learning exploits unlabeled data to improve generalization ability when annotations are insufficient. In recent years, the Mean Teacher method (MT) has obtained impressive performance using prediction consistency as regularization. However, severe ambiguity in medical images makes the targets of the teacher model highly unreliable in obscure regions, thereby limiting the model's capability. To address this problem, we propose a novel multi-task semi-supervised learning framework that gains hierarchical consistency through the training process. Specifically, we introduce region and shape predictions as subtasks to obtain coarse-grained location and fine-grained boundary information. We then predict pixel-level segmentation by fusing the hierarchical features. Since calculating the consistency loss in looser regions typically alleviates the degradation caused by learning from unreliable targets, our teacher model generates guidance from each of the subtasks. Moreover, we focus on the geometrical correlations between the different tasks and propose a constraint method to refine the segmentation for accurate guidance. Experiments on the left atrium segmentation dataset show that our algorithm achieves state-of-the-art performance compared with other semi-supervised methods.

Zixiao Wang, Hai Xu, Youliang Tian, Hongtao Xie
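
The Mean Teacher regularization referenced above keeps a teacher whose weights are an exponential moving average (EMA) of the student's and penalises disagreement between their predictions on unlabeled data. A generic sketch follows (the plain MT scheme, not the paper's hierarchical variant); model, data, and hyperparameters are placeholders.

```python
# Generic Mean Teacher sketch: EMA teacher update + consistency loss on unlabeled data.
import copy
import torch
import torch.nn.functional as F

student = torch.nn.Sequential(torch.nn.Linear(32, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
teacher = copy.deepcopy(student)
for p in teacher.parameters():
    p.requires_grad_(False)

opt = torch.optim.Adam(student.parameters(), lr=1e-3)
x_lab, y_lab = torch.randn(16, 32), torch.randint(0, 2, (16,))
x_unlab = torch.randn(64, 32)

for step in range(100):
    sup = F.cross_entropy(student(x_lab), y_lab)
    noisy = x_unlab + 0.1 * torch.randn_like(x_unlab)         # perturbed student input
    cons = F.mse_loss(student(noisy).softmax(-1), teacher(x_unlab).softmax(-1))
    loss = sup + 0.1 * cons
    opt.zero_grad()
    loss.backward()
    opt.step()
    with torch.no_grad():                                     # EMA update of the teacher
        for tp, sp in zip(teacher.parameters(), student.parameters()):
            tp.mul_(0.99).add_(0.01 * sp)
```
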
BVTNet: Multi-label Multi-class Fusion of Visible and Thermal Camera for Free Space and Pedestrian Segmentation

Deep learning-based visible camera semantic segmentation reports state-of-the-art segmentation accuracy. However, this approach is limited by the visible camera's susceptibility to varying illumination and environmental conditions. One approach to address this limitation is visible and thermal camera-based sensor fusion. Existing literature utilizes this sensor fusion approach for object segmentation, but its application to free space segmentation has not been reported. Here, a multi-label multi-class visible-thermal camera learning framework, termed BVTNet, is proposed for the semantic segmentation of pedestrians and free space. BVTNet estimates the pedestrians and the free space in an individual multi-class output branch. Additionally, the network separately estimates the free space and pedestrian boundaries in another multi-class output branch. The boundary semantic segmentation is integrated with the full semantic segmentation framework in a post-processing step. The proposed framework is validated on the public MFNet dataset. A comparative analysis with baseline algorithms and ablation studies with BVTNet variants show that the proposed framework reports state-of-the-art segmentation accuracy in real time in challenging environmental conditions.

Vijay John, Ali Boyali, Simon Thompson, Seiichi Mita
Multimodal Emotion Recognition Based on Speech and Physiological Signals Using Deep Neural Networks

A suitable combination of data in a multimodal emotion recognition model allows conveying and combining each channel's information to achieve a better recognition of the encoded emotion than would be possible using only a single modality and channel. In this paper, we focus on combining speech and physiological signals to predict the arousal and valence levels of the emotional states of a person. We designed a neural network that can use the information from raw audio signals, electrocardiograms, heart rate variability, electro-dermal activity, and skin conductance levels to predict emotional states. The proposed deep neural network architecture works as an end-to-end process, meaning that neither pre-processing of the input data nor post-processing of the network's predictions was applied. Using the data of the modalities available in the publicly accessible part of the RECOLA database, we achieved results comparable to other state-of-the-art approaches.

Ali Bakhshi, Stephan Chalup
Cross-modal Deep Learning Applications: Audio-Visual Retrieval

Recently, deep neural networks have emerged as a powerful architecture for capturing the nonlinear distribution of high-dimensional multimedia data such as images, video, text and audio, and naturally also of multi-modal data. How can we make full use of multimedia data? This question leads to an important research direction: cross-modal learning. In this paper, we introduce a method based on the content of the audio and video modalities, implemented with a novel two-branch neural network that learns joint embeddings in a shared subspace for computing the similarity between the two modalities. The contribution of the proposed method is mainly manifested in three aspects: i) a feature selection model is used to choose the top-k audio and visual feature representations; ii) a novel combination of training loss functions concerning inter-modal similarity and intra-modal invariance is used; iii) due to the lack of a video-music paired dataset, we construct a dataset of video-music pairs from the YouTube-8M and MER31K datasets. Experiments show that our proposed model performs better than other methods.

Cong Jin, Tian Zhang, Shouxun Liu, Yun Tie, Xin Lv, Jianguang Li, Wencai Yan, Ming Yan, Qian Xu, Yicong Guan, Zhenggougou Yang
Exploiting Word Embeddings for Recognition of Previously Unseen Objects

A notable characteristic of human cognition is its ability to derive reliable hypotheses in situations characterized by extreme uncertainty. Even in the absence of relevant knowledge to make a correct inference, humans are able to draw upon related knowledge to make an approximate inference that is semantically close to the correct inference. In the context of object recognition, this ability amounts to being able to hypothesize the identity of an object in an image without previously having ever seen any visual training examples of that object. The paradigm of low-shot (i.e., zero-shot and few-shot) classification has been traditionally used to address these situations. However, traditional zero-shot and few-shot approaches entail the training of classifiers in situations where a majority of classes are previously seen or visually observed whereas a minority of classes are previously unseen, in which case the classifiers for the unseen classes are learned by expressing them in terms of the classifiers for the seen classes. In this paper, we address the related but different problem of object recognition in situations where only a few object classes are visually observed whereas a majority of the object classes are previously unseen. Specifically, we pose the following questions: (a) Is it possible to hypothesize the identity of an object in an image without previously having seen any visual training examples for that object? and (b) Could the visual training examples of a few seen object classes provide reliable priors for hypothesizing the identities of objects in an image that belong to the majority unseen object classes? We propose a model for recognition of objects in an image in situations where visual classifiers are available for only a limited number of object classes. To this end, we leverage word embeddings trained on publicly available text corpora and use them as natural language priors for hypothesizing the identities of objects that belong to the unseen classes. Experimental results on the Microsoft Common Objects in Context (MS-COCO) data set show that it is possible to come up with reliable hypotheses with regard to object identities by exploiting word embeddings trained on the Wikipedia text corpus even in the absence of explicit visual classifiers for those object classes. To bolster our hypothesis, we conduct additional experiments on a larger dataset of concepts (themes) that we created from the Conceptual Captions dataset. Even on this extremely challenging dataset, our results, though not entirely impressive, serve to provide an important proof-of-concept for the proposed model.

Karan Sharma, Hemanth Dandu, Arun C. S. Kumar, Vinay Kumar, Suchendra M. Bhandarkar
Automated Segmentation of Lateral Ventricle in MR Images Using Multi-scale Feature Fusion Convolutional Neural Network

Studies have shown that the expansion of the lateral ventricle is closely related to many neurodegenerative diseases, so the segmentation of the lateral ventricle plays an important role in the diagnosis of related diseases. However, traditional segmentation methods are subjective, laborious, and time-consuming. Furthermore, due to the uneven magnetic field, irregular, small, and discontinuous shape of every single slice, the segmentation of the lateral ventricle is still a great challenge. In this paper, we propose an efficient and automatic lateral ventricle segmentation method in magnetic resonance (MR) images using a multi-scale feature fusion convolutional neural network (MFF-Net). First, we create a multi-center clinical dataset with a total of 117 patient MR scans. This dataset comes from two different hospitals and the images have different sampling intervals, different ages, and distinct image dimensions. Second, we present a new multi-scale feature fusion module (MSM) to capture different levels of feature information of lateral ventricles through various receptive fields. In particular, MSM can also extract the multi-scale lateral ventricle region feature information to solve the problem of insufficient feature extraction of small object regions with the deepening of network structure. Finally, extensive experiments have been conducted to evaluate the performance of the proposed MFF-Net. In addition, to verify the performance of the proposed method, we compare MFF-Net with seven state-of-the-art segmentation models. Both quantitative results and visual effects show that our MFF-Net outperforms other models and can achieve more accurate segmentation performance. The results also indicate that our model can be applied in clinical practice and is a feasible method for lateral ventricle segmentation.

Fei Ye, Zhiqiang Wang, Kai Hu, Sheng Zhu, Xieping Gao
Visual Word Embedding for Text Classification

The question we answer with this paper is: ‘can we convert a text document into an image to take advantage of image neural models to classify text documents?’ To answer this question we present a novel text classification method that converts a document into an encoded image, using word embedding. The proposed approach computes the Word2Vec word embedding of a text document, quantizes the embedding, and arranges it into a 2D visual representation, as an RGB image. Finally, visual embedding is categorized with state-of-the-art image classification models. We achieved competitive performance on well-known benchmark text classification datasets. In addition, we evaluated our proposed approach in a multimodal setting that allows text and image information in the same feature space.

Ignazio Gallo, Shah Nawaz, Nicola Landro, Riccardo La Grassa
CC-LSTM: Cross and Conditional Long-Short Time Memory for Video Captioning

Automatically generating natural language descriptions for in-the-wild videos is a challenging task. Most recent progress in this field has been made through the combination of Convolutional Neural Networks (CNNs) and Encoder-Decoder Recurrent Neural Networks (RNNs). However, the existing Encoder-Decoder RNN framework has difficulty in capturing a large number of long-range dependencies as the number of LSTM units increases. This brings a vast information loss and leads to poor performance on our task. To explore this problem, in this paper we propose a novel framework, namely Cross and Conditional Long Short-Term Memory (CC-LSTM). It is composed of a novel Cross Long Short-Term Memory (Cr-LSTM) for the encoding module and a Conditional Long Short-Term Memory (Co-LSTM) for the decoding module. In the encoding module, the Cr-LSTM encodes the visual input into a richly informative representation by a cross-input method. In the decoding module, the Co-LSTM feeds visual features, which are based on the generated sentence and contain the global information of the visual content, into the LSTM unit as an extra visual feature. For the task of video captioning, extensive experiments are conducted on two public datasets, i.e., MSVD and MSR-VTT. Along with visualizing the results and how our model works, these experiments quantitatively demonstrate the effectiveness of the proposed CC-LSTM in translating videos to sentences with rich semantics.

Jiangbo Ai, Yang Yang, Xing Xu, Jie Zhou, Heng Tao Shen
An Overview of Image-to-Image Translation Using Generative Adversarial Networks

Image-to-image translation is an important and challenging problem in computer vision. It aims to learn the mapping between two different domains, with applications ranging from data augmentation and style transfer to super-resolution. With the success of deep learning methods in visual generative tasks, researchers have applied deep generative models, especially generative adversarial networks (GANs), to image-to-image translation since 2016 and have made fruitful progress. In this survey, we conduct a comprehensive review of the literature in this field, covering supervised and unsupervised methods, where the unsupervised approaches include one-to-one, one-to-many and many-to-many categories as well as some recent theories. We highlight the innovative aspects of these methods and analyze the different models employed and their components. We also summarize some commonly used normalization techniques and evaluation metrics, and finally present several challenges and future research directions in this area.

Xin Chen, Caiyan Jia
Fusion Models for Improved Image Captioning

Visual captioning aims to generate textual descriptions given images or videos. Traditionally, image captioning models are trained on human annotated datasets such as Flickr30k and MS-COCO, which are limited in size and diversity. This limitation hinders the generalization capabilities of these models while also rendering them liable to making mistakes. Language models can, however, be trained on vast amounts of freely available unlabelled data and have recently emerged as successful language encoders [10] and coherent text generators [4]. Meanwhile, several unimodal and multimodal fusion techniques have been proven to work well for natural language generation [11] and automatic speech recognition [30]. Building on these recent developments, and with the aim of improving the quality of generated captions, the contribution of our work in this paper is two-fold: First, we propose a generic multimodal model fusion framework for caption generation as well as emendation where we utilize different fusion strategies to integrate a pretrained Auxiliary Language Model (AuxLM) within the traditional encoder-decoder visual captioning frameworks. Next, we employ the same fusion strategies to integrate a pretrained Masked Language Model (MLM), namely BERT, with a visual captioning model, viz. Show, Attend, and Tell, for emending both syntactic and semantic errors in captions. Our caption emendation experiments on three benchmark image captioning datasets, viz. Flickr8k, Flickr30k, and MSCOCO, show improvements over the baseline, indicating the usefulness of our proposed multimodal fusion strategies. Further, we perform a preliminary qualitative analysis on the emended captions and identify error categories based on the type of corrections.

Marimuthu Kalimuthu, Aditya Mogadala, Marius Mosbach, Dietrich Klakow
From Bottom to Top: A Coordinated Feature Representation Method for Speech Recognition

This article introduces a novel coordinated representation method, termed MFCC-aided sparse representation (MSR), for speech recognition. The proposed MSR combines a top-level sparse representation feature with the conventional MFCC, i.e., a bottom-level feature of speech, so that complex information about various hidden attributes of the speech can be captured. A neural network architecture with an attention mechanism has also been designed to validate the effectiveness of the proposed MSR for speech recognition. Experiments on the TIMIT database show that significant performance improvements, in terms of recognition accuracy, can be obtained by the proposed MSR compared with scenarios that adopt the MFCC or the sparse representation alone.

Lixia Zhou, Jun Zhang

MMForWild2020 - MultiMedia FORensics in the WILD 2020

Frontmatter
Increased-Confidence Adversarial Examples for Deep Learning Counter-Forensics

Transferability of adversarial examples is a key issue to apply this kind of attacks against multimedia forensics (MMF) techniques based on Deep Learning (DL) in a real-life setting. Adversarial example transferability, in fact, would open the way to the deployment of successful counter forensics attacks also in cases where the attacker does not have a full knowledge of the to-be-attacked system. Some preliminary works have shown that adversarial examples against CNN-based image forensics detectors are in general non-transferrable, at least when the basic versions of the attacks implemented in the most popular libraries are adopted. In this paper, we introduce a general strategy to increase the strength of the attacks and evaluate their transferability when such a strength varies. We experimentally show that, in this way, attack transferability can be largely increased, at the expense of a larger distortion. Our research confirms the security threats posed by the existence of adversarial examples even in multimedia forensics scenarios, thus calling for new defense strategies to improve the security of DL-based MMF techniques.

Wenjie Li, Benedetta Tondi, Rongrong Ni, Mauro Barni
Defending Neural ODE Image Classifiers from Adversarial Attacks with Tolerance Randomization

Deep learned models are now largely adopted in different fields, and they generally provide superior performance with respect to classical signal-based approaches. Notwithstanding this, their actual reliability when working in an unprotected environment is still far from proven. In this work, we consider a novel deep neural network architecture, named Neural Ordinary Differential Equations (N-ODE), that is getting particular attention due to an attractive property: a test-time tunable trade-off between accuracy and efficiency. This paper analyzes the robustness of N-ODE image classifiers when faced with a strong adversarial attack and how its effectiveness changes when varying such a tunable trade-off. We show that adversarial robustness is increased when the networks operate in different tolerance regimes during test time and training time. On this basis, we propose a novel adversarial detection strategy for N-ODE nets based on the randomization of the adaptive ODE solver tolerance. Our evaluation performed on standard image classification benchmarks shows that our detection technique provides high rejection of adversarial examples while maintaining most of the original samples under white-box attacks and zero-knowledge adversaries.

Fabio Carrara, Roberto Caldelli, Fabrizio Falchi, Giuseppe Amato
Analysis of the Scalability of a Deep-Learning Network for Steganography “Into the Wild”

Since the emergence of deep learning and its adoption in steganalysis, most reference articles have kept using small to medium-sized CNNs and learning them on relatively small databases. Benchmarks and comparisons between different deep learning-based steganalysis algorithms, more precisely CNNs, are thus made on small to medium databases. This is done without knowing: 1. whether the ranking, with a criterion such as accuracy, stays the same when the database is larger; 2. whether the efficiency of CNNs will collapse or not if the training database is orders of magnitude larger; 3. the minimum size required for a database or a CNN in order to obtain a better result than a random guesser. In this paper, after a solid discussion of the observed behaviour of CNNs as a function of their size and the database size, we confirm that the error's power law also holds in steganalysis, and this in a border case, i.e. with a medium-size network, on a big, constrained and very diverse database.

Hugo Ruiz, Marc Chaumont, Mehdi Yedroudj, Ahmed Oulad Amara, Frédéric Comby, Gérard Subsol
Forensics Through Stega Glasses: The Case of Adversarial Images

This paper explores the connection between forensics, counter-forensics, steganography and adversarial images. On the one hand, forensics-based and steganalysis-based detectors help in detecting adversarial perturbations. On the other hand, steganography can be used as a counter-forensics strategy and helps in forging adversarial perturbations that are not only invisible to the human eye but also less statistically detectable. This work explains how to use these information hiding tools for attacking or defending computer vision image classification. We play this cat and mouse game using both recent deep-learning content-based classifiers, forensics detectors derived from steganalysis, and steganographic distortions dedicated to color quantized images. It turns out that crafting adversarial perturbations relying on steganographic perturbations is an effective counter-forensics strategy.

Benoît Bonnet, Teddy Furon, Patrick Bas
LSSD: A Controlled Large JPEG Image Database for Deep-Learning-Based Steganalysis “Into the Wild”

For many years, the image databases used in steganalysis have been relatively small, i.e. about ten thousand images. This limits the diversity of images and thus prevents large-scale analysis of steganalysis algorithms. In this paper, we describe a large JPEG database composed of 2 million colour and grey-scale images. This database, named LSSD for Large Scale Steganalysis Database, was obtained thanks to the intensive use of “controlled” development procedures. LSSD has been made publicly available, and we hope it will be used by the steganalysis community for large-scale experiments. We introduce the pipeline used for building the various image database versions. We detail the general methodology that can be used to redevelop the entire database and increase the diversity even further. We also discuss the computational and storage costs involved in developing the images.

Hugo Ruiz, Mehdi Yedroudj, Marc Chaumont, Frédéric Comby, Gérard Subsol
Neural Network for Denoising and Reading Degraded License Plates

The denoising and interpretation of severely degraded license plates is one of the main problems that law enforcement agencies worldwide face every day. In this paper, we present a system made by coupling two convolutional neural networks. The first one produces a denoised version of the input image; the second one takes the denoised and original images to estimate a prediction of each character in the plate. Considering the complexity of gathering training data for this task, we propose a way of creating and augmenting an artificial dataset, which also allows tailoring the training to the specific license plate format of a given country at little cost. The system is designed as a tool to aid law enforcement investigations when dealing with low-resolution, corrupted license plates. Compared to existing methods, our system provides both a denoised license plate and a prediction of the characters to enable a visual inspection and an accurate validation of the final result. We validated the system on a dataset of real license plates, yielding a sensible perceptual improvement and an average character classification accuracy of 93%.

Gianmaria Rossi, Marco Fontani, Simone Milani
The Forchheim Image Database for Camera Identification in the Wild

Image provenance can represent crucial knowledge in criminal investigation and journalistic fact checking. In the last two decades, numerous algorithms have been proposed for obtaining information on the source camera and distribution history of an image. For a fair ranking of these techniques, it is important to rigorously assess their performance on practically relevant test cases. To this end, a number of datasets have been proposed. However, we argue that there is a gap in existing databases: to our knowledge, there is currently no dataset that simultaneously satisfies two goals, namely a) to cleanly separate scene content and forensic traces, and b) to support realistic post-processing like social media recompression. In this work, we propose the Forchheim Image Database (FODB) to close this gap. It consists of more than 23,000 images of 143 scenes by 27 smartphone cameras, and it allows image content to be cleanly separated from forensic artifacts. Each image is provided in 6 different qualities: the original camera-native version, and five copies from social networks. We demonstrate the usefulness of FODB in an evaluation of methods for camera identification. We report three findings. First, the recently proposed general-purpose EfficientNet remarkably outperforms several dedicated forensic CNNs both on clean and compressed images. Second, classifiers obtain a performance boost even on unknown post-processing after augmentation by artificial degradations. Third, FODB's clean separation of scene content and forensic traces imposes important, rigorous boundary conditions for algorithm benchmarking.

Benjamin Hadwiger, Christian Riess
Nested Attention U-Net: A Splicing Detection Method for Satellite Images

Satellite imagery is becoming increasingly available due to a large number of commercial satellite companies. Many fields use satellite images, including meteorology, forestry, natural disaster analysis, and agriculture. These images can be changed or tampered with using image manipulation tools, causing issues in applications that rely on them. Manipulation detection techniques designed for images captured by “consumer cameras” tend to fail when used on satellite images. In this paper we propose a supervised method, known as Nested Attention U-Net, to detect spliced areas in satellite images. We introduce three datasets of manipulated satellite images that contain objects generated by a generative adversarial network (GAN). We test our approach, compare it to existing supervised splicing detection and segmentation techniques, and show that our proposed approach performs well in detection and localization.

János Horváth, Daniel Mas Montserrat, Edward J. Delp
Fingerprint Adversarial Presentation Attack in the Physical Domain

With the advent of the deep learning era, Fingerprint-based Authentication Systems (FAS) equipped with Fingerprint Presentation Attack Detection (FPAD) modules managed to avoid attacks on the sensor through artificial replicas of fingerprints. Previous works highlighted the vulnerability of FPADs to digital adversarial attacks. However, in a realistic scenario, the attackers may not have the possibility to directly feed a digitally perturbed image to the deep learning based FPAD, since the channel between the sensor and the FPAD is usually protected. In this paper we thus investigate the threat level associated with adversarial attacks against FPADs in the physical domain. By materially realising fakes from the adversarial images, we were able to insert them into the system directly from the “exposed” part, the sensor. To the best of our knowledge, this represents the first proof-of-concept of a fingerprint adversarial presentation attack. We evaluated how much the liveness score changed by feeding the system with attacks based on digital and printed adversarial images. To measure what portion of this increase is due to the printing itself, we also re-printed the original spoof images, without injecting any perturbation. Experiments conducted on the LivDet 2015 dataset demonstrate that the printed adversarial images achieve a ~100% attack success rate against an FPAD if the attacker has the ability to make multiple attacks on the sensor (10), and a fairly good result (~28%) in a one-shot scenario. Although this work must be considered a proof-of-concept, it constitutes a promising pioneering attempt confirming that an adversarial presentation attack is feasible and dangerous.

Stefano Marrone, Roberto Casula, Giulia Orrù, Gian Luca Marcialis, Carlo Sansone
Learning to Decipher License Plates in Severely Degraded Images

License plate recognition is instrumental in many forensic investigations involving organized crime and gang crime, burglaries, and trafficking of illicit goods or persons. After an incident, recordings are collected by police officers from cameras in-the-wild at gas stations or public facilities. In such an uncontrolled environment, a generally low image quality and strong compression oftentimes make it impossible to read license plates. Recent works showed that characters from US license plates can be reconstructed from noisy, low resolution pictures using convolutional neural networks (CNN). However, these studies do not involve compression, which is arguably the most prevalent image degradation in real investigations. In this paper, we present work toward closing this gap and investigate the impact of JPEG compression on license plate recognition from strongly degraded images. We show the efficacy of the CNN on a real-world dataset of Czech license plates. Using only synthetic data for training, we show that license plates with a width larger than 30 pixels, an SNR above −3 dB, and a JPEG quality factor down to 15 can at least partially be reconstructed. Additional analyses investigate the influence of the position of the character in the license plate and the similarity of characters.

Paula Kaiser, Franziska Schirrmacher, Benedikt Lorch, Christian Riess
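The degradation factors examined (plate width in pixels, SNR, JPEG quality factor) can be emulated with a simple synthetic pipeline; the sketch below, using Pillow and NumPy, is an illustrative approximation and not the authors' data-generation code.

    import io
    import numpy as np
    from PIL import Image

    def degrade_plate(img, width=30, snr_db=-3.0, jpeg_quality=15):
        """Downscale, add Gaussian noise at a target SNR, then JPEG-compress."""
        h = max(1, round(img.height * width / img.width))
        small = img.convert("L").resize((width, h), Image.BILINEAR)
        x = np.asarray(small, dtype=np.float32)
        noise_power = np.mean(x ** 2) / (10 ** (snr_db / 10.0))
        x = x + np.random.normal(0.0, np.sqrt(noise_power), x.shape)
        noisy = Image.fromarray(np.clip(x, 0, 255).astype(np.uint8))
        buf = io.BytesIO()
        noisy.save(buf, format="JPEG", quality=jpeg_quality)
        return Image.open(buf)
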
Differential Morphed Face Detection Using Deep Siamese Networks

Although biometric facial recognition systems are fast becoming part of security applications, these systems are still vulnerable to morphing attacks, in which a facial reference image can be verified as two or more separate identities. In border control scenarios, a successful morphing attack allows two or more people to use the same passport to cross borders. In this paper, we propose a novel differential morph attack detection framework using a deep Siamese network. To the best of our knowledge, this is the first research work that makes use of a Siamese network architecture for morph attack detection. We compare our model with other classical and deep learning models using two distinct morph datasets, VISAPP17 and MorGAN. We explore the embedding space generated by the contrastive loss using three decision-making frameworks: Euclidean distance, feature difference with a support vector machine classifier, and feature concatenation with a support vector machine classifier.

Sobhan Soleymani, Baaria Chaudhary, Ali Dabouei, Jeremy Dawson, Nasser M. Nasrabadi
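For readers unfamiliar with the contrastive loss that shapes the Siamese embedding space, a minimal PyTorch sketch follows; the margin value and the pair-label convention are assumptions for illustration, not taken from the paper.

    import torch
    import torch.nn.functional as F

    def contrastive_loss(emb_a, emb_b, label, margin=1.0):
        """Contrastive loss over a Siamese embedding pair.
        label = 0 for a bona fide pair, 1 for a morphed pair (assumed convention)."""
        d = F.pairwise_distance(emb_a, emb_b)            # Euclidean distance
        loss_same = (1 - label) * d.pow(2)               # pull genuine pairs together
        loss_diff = label * F.relu(margin - d).pow(2)    # push morph pairs beyond the margin
        return 0.5 * (loss_same + loss_diff).mean()
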
In-Depth DCT Coefficient Distribution Analysis for First Quantization Estimation

The exploitation of traces in JPEG double compressed images is of utmost importance for investigations. By properly exploiting such insights, First Quantization Estimation (FQE) can be performed in order to obtain source camera model identification (CMI) and therefore reconstruct the history of a digital image. In this paper, a method able to estimate the first quantization factors for JPEG double compressed images is presented, employing a mixed statistical and Machine Learning approach. The presented solution is demonstrated to work without any a-priori assumptions about the quantization matrices. Experimental results and comparisons with the state-of-the-art confirm the effectiveness of the proposed technique.

Sebastiano Battiato, Oliver Giudice, Francesco Guarnera, Giovanni Puglisi

MOI2QDN - Workshop on Metrification and Optimization of Input Image Quality in Deep Networks

Frontmatter
On the Impact of Rain over Semantic Segmentation of Street Scenes

We investigate the negative effects of rain streaks on the performance of a neural network for real-time semantic segmentation of street scenes. This is done by synthetically augmenting the CityScapes dataset with artificial rain. We then define and train a generative adversarial network for rain removal, and quantify the benefits of its application as a pre-processing step to both rainy and “clean” images. Finally, we show that by retraining the semantic segmentation network on images processed for rain removal, it is possible to gain even more accuracy, with a model that produces stable results in all analyzed atmospheric conditions. For our experiments, we present a per-class analysis in order to provide deeper insights into the impact of rain on semantic segmentation.

Simone Zini, Marco Buzzelli
The Impact of Linear Motion Blur on the Object Recognition Efficiency of Deep Convolutional Neural Networks

Noise that can appear in images affects the classification performance of Convolutional Neural Networks (CNNs). In this work, we assess the impact of linear motion blur, one such possible degradation, on the classification performance of CNNs. A realistic vision sensor model is proposed to produce a linear motion blur effect in input images. This methodology allows analyzing how the performance of several state-of-the-art CNNs is affected. The experiments carried out indicate that accuracy is heavily degraded by a large displacement length, while the displacement angle deteriorates the performance to a lesser extent.

José A. Rodríguez-Rodríguez, Miguel A. Molina-Cabello, Rafaela Benítez-Rochel, Ezequiel López-Rubio
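A linear motion blur parameterised by displacement length and angle, as studied here, can be approximated by convolving the image with a line-shaped kernel; the NumPy/SciPy sketch below is a simplification of the paper's sensor model, with arbitrary default parameters.

    import numpy as np
    from scipy.ndimage import convolve

    def motion_blur_kernel(length=9, angle_deg=0.0):
        """Normalised line kernel defined by displacement length and angle."""
        k = np.zeros((length, length), dtype=np.float32)
        c = (length - 1) / 2.0
        theta = np.deg2rad(angle_deg)
        for t in np.linspace(-c, c, length * 4):
            row = int(round(c + t * np.sin(theta)))
            col = int(round(c + t * np.cos(theta)))
            k[row, col] = 1.0
        return k / k.sum()

    def apply_motion_blur(gray_image, length=9, angle_deg=30.0):
        """Blur a single-channel image with the linear motion kernel."""
        return convolve(gray_image.astype(np.float32), motion_blur_kernel(length, angle_deg))
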
Performance of Deep Learning and Traditional Techniques in Single Image Super-Resolution of Noisy Images

The improvement of the spatial resolution of natural images is an important task for many practical applications of current image processing. Applying single-image super-resolution, which uses only one low-resolution input image, is frequently hampered by the presence of noise or artifacts in that input, so denoising and super-resolution algorithms are usually applied together to obtain a noiseless high-resolution image. In this work, several traditional and deep learning methods for denoising and super-resolution are hybridized to ascertain which combinations of them yield the best possible quality. The experimental design includes the introduction of Gaussian noise, Poisson noise, salt-and-pepper noise, and uniform noise into the low-resolution inputs. It is found that denoising must be carried out before super-resolution for the best results. Moreover, the obtained results indicate that deep learning techniques clearly outperform the traditional ones.

Karl Thurnhofer-Hemsi, Guillermo Ruiz-Álvarez, Rafael Marcos Luque-Baena, Miguel A. Molina-Cabello, Ezequiel López-Rubio
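The four noise types introduced into the low-resolution inputs can be injected, for instance, as in the NumPy sketch below; images are assumed to be floats in [0, 1] and the parameterisation is illustrative, not the one used in the experiments.

    import numpy as np

    def add_noise(img, kind="gaussian", amount=0.05, rng=np.random.default_rng()):
        """Inject one of the studied noise types into a float image in [0, 1]."""
        if kind == "gaussian":
            out = img + rng.normal(0.0, amount, img.shape)
        elif kind == "poisson":
            peak = 1.0 / max(amount, 1e-6)          # smaller amount -> weaker shot noise
            out = rng.poisson(img * peak) / peak
        elif kind == "salt_and_pepper":
            out = img.copy()
            mask = rng.random(img.shape)
            out[mask < amount / 2] = 0.0
            out[mask > 1 - amount / 2] = 1.0
        elif kind == "uniform":
            out = img + rng.uniform(-amount, amount, img.shape)
        else:
            raise ValueError(kind)
        return np.clip(out, 0.0, 1.0)

In such a study, the denoiser would be applied to the noisy low-resolution image before the super-resolution model, reflecting the ordering the authors found to work best.
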
The Effect of Noise and Brightness on Convolutional Deep Neural Networks

The classification performance of Convolutional Neural Networks (CNNs) can be hampered by several factors, and sensor noise is one of these nuisances. In this work, a study of the effect of noise on these networks is presented. The methodological framework includes two realistic noise models for present-day CMOS vision sensors. The models allow Poisson, Gaussian, salt & pepper, speckle and uniform noise to be included separately as sources of defects in image acquisition sensors. Synthetic noise can then be added to images using this methodology in order to simulate common sources of image distortion. Additionally, the impact of brightness in conjunction with each selected kind of noise is also addressed: the proposed methodology incorporates a brightness scale factor to emulate images captured under low illumination conditions. Based on these models, experiments are carried out for a selection of state-of-the-art CNNs. The results of the study demonstrate that Poisson noise has a small impact on the performance of CNNs, while speckle and salt & pepper noise, together with the global illumination level, can substantially degrade the classification accuracy. Gaussian and uniform noise have a moderate effect on the CNNs.

José A. Rodríguez-Rodríguez, Miguel A. Molina-Cabello, Rafaela Benítez-Rochel, Ezequiel López-Rubio
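The brightness scale factor described above can be emulated by attenuating the clean image before injecting signal-dependent sensor noise; the minimal sketch below (Poisson shot noise only, arbitrary parameters) illustrates the idea and is not the paper's sensor model.

    import numpy as np

    def simulate_low_light(img, brightness=0.3, peak=255.0, rng=np.random.default_rng()):
        """Scale brightness down first, then add signal-dependent Poisson noise."""
        dark = np.clip(img * brightness, 0.0, 1.0)   # emulate low illumination
        noisy = rng.poisson(dark * peak) / peak      # shot noise grows relative to the signal
        return np.clip(noisy, 0.0, 1.0)
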
Exploring the Contributions of Low-Light Image Enhancement to Network-Based Object Detection

Low-light is a challenging environment for both human and computer vision to perform tasks such as object classification and detection. Recent works have shown the potential of enhancement algorithms to support and improve such tasks in low-light, however there has not been any focused analysis to understand the direct effects that low-light enhancement has on an object detector. This work aims to quantify and visualize such effects on the multi-level abstractions involved in network-based object detection. First, low-light image enhancement algorithms are employed to enhance real low-light images, followed by deploying an object detection network on both the low-light images and their enhanced counterparts. A comparison of the activations in different layers, representing the detection features, is used to generate statistics that quantify the enhancements’ contribution to detection. Finally, this framework is used to analyze several low-light image enhancement algorithms and identify their impact on the detection model and task. The framework can also be easily generalized to any convolutional neural network-based model for the analysis of different enhancement algorithms and tasks.

Yuen Peng Loh
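The layer-wise activation comparison can be reproduced in spirit with PyTorch forward hooks; the sketch below records a single summary statistic (mean absolute activation) per convolutional layer, whereas the paper uses its own detector and statistics.

    import torch
    import torch.nn as nn

    def layer_activation_means(model, x):
        """Return the mean absolute activation of every Conv2d layer for one input."""
        stats, handles = {}, []
        for name, module in model.named_modules():
            if isinstance(module, nn.Conv2d):
                handles.append(module.register_forward_hook(
                    lambda m, inp, out, name=name: stats.__setitem__(name, out.abs().mean().item())))
        with torch.no_grad():
            model(x)
        for h in handles:
            h.remove()
        return stats

Comparing the dictionaries produced for a low-light image and for its enhanced counterpart then quantifies, layer by layer, how much the enhancement changes the detector's internal responses.
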
Multi-level Fusion Based Deep Convolutional Network for Image Quality Assessment

Image quality assessment aims to design effective models that automatically predict the perceptual quality score of a given image consistently with human cognition. In this paper, we propose a novel end-to-end multi-level fusion based deep convolutional neural network for full-reference image quality assessment (FR-IQA), codenamed MF-IQA. In MF-IQA, we first extract features with the help of edge feature fusion for both distorted images and the corresponding reference images. Afterwards, we apply multi-level feature fusion to evaluate a number of local quality indices, which are then pooled into a global quality score. With the proposed multi-level fusion and edge feature fusion strategy, the input images and the corresponding feature maps can be better learned, helping to produce more accurate and meaningful visual perceptual predictions. The experimental results and statistical comparisons on three IQA datasets demonstrate that our framework achieves state-of-the-art prediction accuracy compared with existing algorithms.

Qianyu Guo, Jing Wen
CNN Based Predictor of Face Image Quality

We propose a novel method for training a Convolutional Neural Network, named CNN-FQ, which takes a face image and outputs a scalar summary of the image quality. The CNN-FQ is trained from triplets of faces that are automatically labeled based on responses of a pre-trained face matcher. The quality scores extracted by the CNN-FQ are directly linked to the probability that the face matcher incorrectly ranks a randomly selected triplet of faces. We applied the proposed CNN-FQ, trained on the CASIA database, to the selection of the best-quality image from a collection of face images capturing the same identity. The quality of the single face representation was evaluated on 1:1 Verification and 1:N Identification tasks defined by the challenging IJB-B protocol. We show that the recognition performance obtained when using faces selected based on the CNN-FQ scores is significantly higher than what can be achieved by competing state-of-the-art image quality extractors.

Andrii Yermakov, Vojtech Franc
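One plausible reading of the automatic triplet labelling is sketched below: for a triplet in which two images share an identity, the label records whether a pre-trained matcher ranks the pair correctly; matcher_score is a hypothetical placeholder for that matcher, not the interface used by the authors.

    def label_triplet(anchor, positive, negative, matcher_score):
        """Label a face triplet by whether a pre-trained matcher ranks it correctly.
        anchor and positive share an identity; negative does not.
        Returns 1 if the matcher ranks the triplet incorrectly, else 0."""
        same_id = matcher_score(anchor, positive)
        diff_id = matcher_score(anchor, negative)
        return int(diff_id >= same_id)
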

MPRSS - 6th IAPR Workshop on Multimodal Pattern Recognition for Social Signal Processing in Human Computer Interaction

Frontmatter
Explainable Model Selection of a Convolutional Neural Network for Driver’s Facial Emotion Identification

Road accidents have a significant impact on increasing death rates. In addition to weather, roads, and vehicles, human error is a main cause of these accidents, so driver-safety technology is an active research area in both academia and industry. The driver’s behavior is influenced by feelings such as anger or sadness, as well as by physical distraction factors such as using a mobile phone or drinking. Recognizing the driver’s emotions is therefore crucial for anticipating the driver’s behavior and dealing with it. In this work, a Convolutional Neural Network (CNN) model is employed to implement a Facial Expression Recognition (FER) approach that identifies the driver’s emotions. The proposed CNN model achieves considerable performance in prediction and classification tasks. However, like other deep learning approaches, it lacks transparency and interpretability. To address this shortcoming, we use Explainable Artificial Intelligence (XAI) techniques that generate interpretations for decisions and provide human-explainable representations. We utilize two visualization methods from XAI approaches to support our choice of architecture for the proposed FER model. Our model achieves accuracies of 92.85%, 99.28%, 88.88%, and 100% on the JAFFE, CK+, KDEF, and KMU-FED datasets, respectively.

Amany A. Kandeel, Hazem M. Abbas, Hossam S. Hassanein
Artificial Kindness: The Italian Case of Google Mini Recognition Patterns

Kindness is a pro-social virtue that evokes mixed feelings in the modern world. In Western culture, for example, kindness is always positive, but elitist. The Digital Revolution and the birth of artificial intelligence assistants, such as the Google Mini, allowed the transition from an elitist experience to a mass experience. This exploratory study starts from human–AI interaction in a provocative and oppositional conversational context created by the human being. We hypothesize that the synthesis of the artificial voice does not allow all facets of the tone and emotionality of kindness to be characterized. The Artificial Intelligence is programmed to always respond in a “gentle” way, deploying different facets of kindness that are well detected by the emotionality and prosodic analysis, above all in the recognition of the “pitch” speech pattern, while the emotional tone analysis confirms that the Artificial Intelligence “understands” the communicative context. As a future perspective, the study of these vocal patterns of artificial kindness could be a springboard for research on bullying using the Google Mini Kindness tool.

Concetta Papapicco
Fingerspelling Recognition with Two-Steps Cascade Process of Spotting and Classification

In this paper, we propose a framework for fingerspelling recognition based on a two-step cascade process of spotting and classification. This two-step process is motivated by the human cognitive function in fingerspelling recognition. In the spotting process, an image sequence corresponding to a fingerspelling is extracted from an input video by classifying partial sequences into two categories, fingerspelling and others. At this stage, how to deal with the temporal dynamic information is a key point. The extracted fingerspelling is then classified in the classification process. Here, the temporal dynamic information is not necessarily required; rather, how to classify the static hand shape using multi-view images is more important. In our framework, we employ temporal regularized canonical correlation analysis (TRCCA) for the spotting, considering that it can effectively handle an image sequence’s temporal information. For the classification, we employ the orthogonal mutual subspace method (OMSM), since it can effectively use the information from multi-view images to classify the hand shape fast and accurately. We demonstrate the effectiveness of our framework, based on a complementary combination of TRCCA and OMSM, compared to conventional methods on a private Japanese fingerspelling dataset.

Masanori Muroi, Naoya Sogi, Nobuko Kato, Kazuhiro Fukui
CNN Depression Severity Level Estimation from Upper Body vs. Face-Only Images

Upper body gestures have been shown to provide more information about a person’s depressive state when added to facial expressions. While several studies on automatic depression analysis have looked into this effect, little is known about how a convolutional neural network (CNN) uses such information for predicting depression severity levels. This study investigates the performance of various CNN models when looking at facial images alone versus including the upper body when estimating depression severity levels on a regression scale. To assess the generalisability of CNN model performance, two vastly different datasets were used, one collected by the Black Dog Institute and the other being the 2013 Audio/Visual Emotion Challenge (AVEC). Results show that the differences in model performance between face and upper body inputs are slight: performance across multiple architectures is very similar but varies when different datasets are introduced.

Dua’a Ahmad, Roland Goecke, James Ireland
Range-Doppler Hand Gesture Recognition Using Deep Residual-3DCNN with Transformer Network

Recently, hand gesture recognition via millimeter-wave radar has attracted a lot of research attention for human-computer interaction. Encouraged by the ability of deep learning models to successfully tackle hand gesture recognition tasks, we propose a deep neural network (DNN) model, namely Res3DTENet, that aims to classify dynamic hand gestures using radio frequency (RF) signals. We propose a scheme that improves the convolutional process of 3DCNNs with residual skip connections (Res3D) to emphasize local-global information and enrich the intra-frame spatio-temporal feature representation. A multi-head attention transformer encoder (TE) network is trained over the spatio-temporal features to refine the inter-frame temporal dependencies of range-Doppler sequences. The experiments are carried out on the publicly available Soli hand gesture data set. Based on our extensive experiments, we show that the proposed network achieves higher gesture recognition accuracy than state-of-the-art hand gesture recognition methods.

Gaurav Jaswal, Seshan Srirangarajan, Sumantra Dutta Roy
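A much-reduced PyTorch sketch of the general architecture idea (a residual 3D convolution block feeding a transformer encoder over the frame axis) is given below; layer sizes, depths and the pooling scheme are arbitrary assumptions and do not reproduce the published Res3DTENet configuration.

    import torch
    import torch.nn as nn

    class Res3DBlock(nn.Module):
        """3D convolution block with a residual skip connection."""
        def __init__(self, channels):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv3d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm3d(channels), nn.ReLU(),
                nn.Conv3d(channels, channels, kernel_size=3, padding=1),
                nn.BatchNorm3d(channels))
            self.act = nn.ReLU()

        def forward(self, x):
            return self.act(self.body(x) + x)            # residual skip connection

    class Res3DTESketch(nn.Module):
        """Residual 3D CNN features refined by a transformer encoder over frames."""
        def __init__(self, channels=16, d_model=128, n_classes=11):
            super().__init__()
            self.stem = nn.Conv3d(1, channels, kernel_size=3, padding=1)
            self.res = Res3DBlock(channels)
            self.pool = nn.AdaptiveAvgPool3d((None, 4, 4))   # keep the temporal axis
            self.proj = nn.Linear(channels * 4 * 4, d_model)
            layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=2)
            self.head = nn.Linear(d_model, n_classes)

        def forward(self, x):                  # x: (batch, 1, frames, height, width)
            f = self.pool(self.res(self.stem(x)))        # (batch, C, frames, 4, 4)
            f = f.permute(0, 2, 1, 3, 4).flatten(2)      # (batch, frames, C*4*4)
            tokens = self.encoder(self.proj(f))
            return self.head(tokens.mean(dim=1))         # pool over frames, classify
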
Introducing Bidirectional Ordinal Classifier Cascades Based on a Pain Intensity Recognition Scenario

Ordinal classifier cascades (OCCs) are popular machine learning tools in the area of ordinal classification. OCCs constitute specific classification ensemble schemes that work in a sequential manner. Each of the ensemble’s members either provides the architecture’s final prediction or passes the current input to the next ensemble member. In the current study, we first confirm that the direction of an OCC can have a strong impact on the distribution of its predictions. Subsequently, we introduce and analyse our proposed bidirectional combination of OCCs. More precisely, based on a person-independent pain intensity scenario, we provide an ablation study that includes the evaluation of different OCCs as well as different popular error correcting output codes (ECOC) models. The outcomes show that our straightforward approach significantly outperforms common OCCs with respect to the accuracy and mean absolute error performance measures. Moreover, our results indicate that, while the proposed bidirectional OCCs are less complex in general, they are able to compete with and even outperform most of the analysed ECOC models.

Peter Bellmann, Ludwig Lausser, Hans A. Kestler, Friedhelm Schwenker
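A bare-bones sketch of an ordinal classifier cascade run in both label directions follows; the binary members are assumed to be scikit-learn-style classifiers, and averaging the two ranks is only one illustrative combination rule, not necessarily the scheme analysed in the paper.

    def occ_predict(x, members, labels):
        """Ordinal cascade: each member either emits its label or defers to the next."""
        for label, member in zip(labels[:-1], members):
            if member.predict([x])[0] == 1:      # member claims this intensity level
                return label
        return labels[-1]                        # fall through to the last class

    def bidirectional_occ_predict(x, asc_members, desc_members, labels):
        """Combine an ascending and a descending cascade by averaging their ranks."""
        up = labels.index(occ_predict(x, asc_members, labels))
        down = labels.index(occ_predict(x, desc_members, list(reversed(labels))))
        return labels[round((up + down) / 2)]
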
Personalized k-fold Cross-Validation Analysis with Transfer from Phasic to Tonic Pain Recognition on X-ITE Pain Database

Automatic pain recognition is currently one of the most interesting challenges in affective computing, as it has great potential to improve pain management for patients in a clinical environment. In this work, we analyse automatic pain recognition using binary classification with a personalized k-fold cross-validation analysis, an approach that trains on the labels of specific subjects and validates on the labels of other subjects in the Experimentally Induced Thermal and Electrical (X-ITE) Pain Database, using both a random forest and a dense neural network model. The effectiveness of each approach is inspected separately on each of the phasic electro, phasic heat, tonic electro, and tonic heat subsets of the X-ITE dataset. Afterwards, a phasic-to-tonic transfer is made by training models on the phasic electro dataset and testing them on the tonic electro dataset. Our outcomes and evaluations indicate that the electro datasets always perform better than the heat datasets, and that personalized scores outperform normal scores. Moreover, dense neural networks performed better than random forests in the transfer from phasic electro to tonic electro and showed promising performance in the personalized transfer.

Youssef Wally, Yara Samaha, Ziad Yasser, Steffen Walter, Friedhelm Schwenker
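One plausible implementation of the subject-based split described above (training on some subjects, validating on others) is scikit-learn's GroupKFold with subject identifiers as groups; the random forest configuration below is arbitrary and the sketch is illustrative only, not the authors' protocol.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GroupKFold

    def subject_wise_cv(X, y, subject_ids, n_splits=5):
        """Cross-validation in which every fold holds out whole subjects."""
        scores = []
        for train_idx, val_idx in GroupKFold(n_splits=n_splits).split(X, y, groups=subject_ids):
            clf = RandomForestClassifier(n_estimators=100).fit(X[train_idx], y[train_idx])
            scores.append(clf.score(X[val_idx], y[val_idx]))
        return float(np.mean(scores))
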
ODANet: Online Deep Appearance Network for Identity-Consistent Multi-person Tracking

The analysis of affective states over time in multi-person scenarios is very challenging, because it requires consistently tracking all persons over time. This in turn requires a robust visual appearance model capable of re-identifying people already tracked in the past, as well as spotting newcomers. In real-world applications, the appearance of the persons to be tracked is unknown in advance, and therefore one must devise methods that are both discriminative and flexible. Previous work in the literature proposed different tracking methods with fixed appearance models. These models allowed, up to a certain extent, discriminating between appearance samples of two different people. We propose the online deep appearance network (ODANet), a method able to track people and simultaneously update the appearance model with newly gathered annotation-free images. Since this task is especially relevant for autonomous systems, we also describe a platform-independent robotic implementation of ODANet. Our experiments show the superiority of the proposed method with respect to the state of the art, and demonstrate the ability of ODANet to adapt to sudden changes in appearance, to integrate new appearances into the tracking system and to provide more identity-consistent tracks.

Guillaume Delorme, Yutong Ban, Guillaume Sarrazin, Xavier Alameda-Pineda
Backmatter
Metadata
Title
Pattern Recognition. ICPR International Workshops and Challenges
Editors
Prof. Alberto Del Bimbo
Prof. Rita Cucchiara
Prof. Stan Sclaroff
Dr. Giovanni Maria Farinella
Tao Mei
Prof. Dr. Marco Bertini
Hugo Jair Escalante
Dr. Roberto Vezzani
Copyright Year
2021
Electronic ISBN
978-3-030-68780-9
Print ISBN
978-3-030-68779-3
DOI
https://doi.org/10.1007/978-3-030-68780-9
