
## About this book

This 8-volume set constitutes the refereed proceedings of the 25th International Conference on Pattern Recognition Workshops, ICPR 2020, held virtually in Milan, Italy, and rescheduled to January 10-11, 2021 due to the Covid-19 pandemic. The 416 full papers presented in these 8 volumes were carefully reviewed and selected from about 700 submissions. The 46 workshops cover a wide range of areas including machine learning, pattern analysis, healthcare, human behavior, environment, surveillance, forensics and biometrics, robotics and egovision, cultural heritage and document analysis, retrieval, and women at ICPR2020.

## Table of contents

### A Multi-layered Approach for Tailored Black-Box Explanations

Explanations for algorithmic decision systems can take different forms and can target different types of users with different goals. One of the main challenges in this area is therefore to devise explanation methods that can accommodate this variety of situations. A first step to address this challenge is to allow explainees to express their needs in the most convenient way, depending on their level of expertise and motivation. In this paper, we present a solution to this problem based on a multi-layered approach allowing users to express their requests for explanations at different levels of abstraction. We illustrate the approach with the application of a proof-of-concept system called IBEX to two case studies.

Clément Henin, Daniel Le Métayer

### Post-hoc Explanation Options for XAI in Deep Learning: The Insight Centre for Data Analytics Perspective

This paper profiles the recent research work on eXplainable AI (XAI) at the Insight Centre for Data Analytics. This work concentrates on post-hoc explanation-by-example solutions to XAI as one approach to explaining black box deep-learning systems. Three different methods of post-hoc explanation are outlined for image and time-series datasets: factual, counterfactual, and semi-factual methods. The future landscape for XAI solutions is discussed.

Eoin M. Kenny, Eoin D. Delaney, Derek Greene, Mark T. Keane

### Expert Level Evaluations for Explainable AI (XAI) Methods in the Medical Domain

The recently emerged field of explainable artificial intelligence (XAI) attempts to shed light on ‘black box’ Machine Learning (ML) models in terms understandable to humans. As several explanation methods are developed alongside different applications for a black box model, the need for expert-level evaluation in inspecting their effectiveness becomes inevitable. This is particularly important for sensitive domains such as medical applications, where evaluation by experts is essential to better understand how accurate the results of complex ML are and to debug the models if necessary. The aim of this study is to experimentally show how the expert-level evaluation of XAI methods in a medical application can be utilized and aligned with the actual explanations generated by the clinician. To this end, we collect annotations from expert subjects equipped with an eye-tracker while they classify medical images and devise an approach for comparing the results with those obtained from XAI methods. We demonstrate the effectiveness of our approach in several experiments.

Satya M. Muddamsetty, Mohammad N. S. Jahromi, Thomas B. Moeslund

### Samples Classification Analysis Across DNN Layers with Fractal Curves

Deep Neural Networks are becoming the prominent solution when using machine learning models. However, they suffer from a black-box effect that complicates the interpretation of their inner workings and thus the understanding of their successes and failures. Information visualization is one way among others to help in their interpretability and hypothesis deduction. This paper presents a novel way to visualize a trained DNN to depict at the same time its architecture and its way of treating the classes of a test dataset at the layer level. In this way, it is possible to visually detect where the DNN starts to be able to discriminate the classes or where it could decrease its separation ability (and thus detect an oversized network). We have implemented the approach and validated it using several well-known datasets and networks. Results show the approach is promising and deserves further studies.

Adrien Halnaut, Romain Giot, Romain Bourqui, David Auber

### Random Forest Model and Sample Explainer for Non-experts in Machine Learning – Two Case Studies

Machine Learning (ML) is becoming an increasingly critical technology in many areas such as health and business, but also in everyday applications of significant societal importance. However, the lack of explainability, that is, the ability of ML systems to offer explanations of how they work, at both the model level (related to the whole data) and the sample level (related to specific samples), poses significant challenges in their adoption, verification, and in ensuring trust among users and the general public. We present a novel integrated Random Forest Model and Sample Explainer, RFEX. RFEX is specifically designed for an important class of users who are not ML experts but are often the domain experts and key decision makers. RFEX provides easy-to-analyze one-page Model and Sample explainability summaries in tabular format with a wealth of explainability information including classification confidence, the tradeoff between accuracy and features used, as well as the ability to identify potential outlier samples and features. We demonstrate RFEX on two case studies: mortality prediction for COVID-19 patients from the data obtained from Huazhong University of Science and Technology, Wuhan, China, and classification of cell type clusters for the human nervous system based on the data from the J. Craig Venter Institute. We show that RFEX offers a simple yet powerful means of explaining RF classification at model, sample and feature levels, as well as providing guidance for testing and developing explainable and cost-effective operational prediction models.

D. Petkovic, A. Alavi, D. Cai, M. Wong

### Jointly Optimize Positive and Negative Saliencies for Black Box Classifiers

Neural networks are increasingly applied to high-stakes tasks, such as autonomous driving and medical applications. For these tasks, it is important to explain the contributions of data components to a model prediction. In recent times, mask-based methods have been proposed for visual explanations. They optimize masks that maximally affect the model output. In this work, we propose a novel mask-based saliency method for given black box classifiers. We jointly optimize positive and negative masks to achieve a faithful feature importance map. To optimize them effectively, we define a distance between them in terms of the selected activation maps; the corresponding kernels express the important features of an input image. Then, we add the distance to the objective function of each mask as a regularizer. By forcing both masks to be dissimilar in terms of the influential features, they can focus on essential parts of an object while alleviating noise.

Hyungsik Jung, Youngrock Oh, Jeonghyung Park, Min Soo Kim

### Low Dimensional Visual Attributes: An Interpretable Image Encoding

Deep convolutional networks (DCNs) as black-boxes make many computer vision models hard to interpret. In this paper, we present an interpretable encoding for images that represents the objects as a composition of parts and the parts themselves as a mixture of learned prototypes. We found that this representation is well suited for low-label image recognition problems such as few-shot learning (FSL), zero-shot learning (ZSL) and domain adaptation (DA). Our image encoding model with simple task predictors performs favorably against state of the art approaches in each of these tasks. Via crowdsourced results, we also show that this image encoding using parts and prototypes is interpretable to humans and agrees with their visual perception.

Pengkai Zhu, Ruizhao Zhu, Samarth Mishra, Venkatesh Saligrama

### Explainable 3D-CNN for Multiple Sclerosis Patients Stratification

The growing availability of novel interpretation techniques opened the way to the application of deep learning models in the clinical field, including neuroimaging, where their use is still largely underexploited. In this framework, we focus on the stratification of Multiple Sclerosis (MS) patients in the Primary Progressive versus the Relapsing-Remitting state of the disease using a 3D Convolutional Neural Network trained on structural MRI data. Within this task, the application of Layer-wise Relevance Propagation visualization allowed detecting the voxels of the input data mostly involved in the classification decision, potentially bringing to light brain regions which might reveal disease state.

Federica Cruciani, Lorenza Brusini, Mauro Zucchelli, Gustavo Retuci Pinheiro, Francesco Setti, Ilaria Boscolo Galazzo, Rachid Deriche, Leticia Rittner, Massimiliano Calabrese, Gloria Menegaz

### Visualizing the Effect of Semantic Classes in the Attribution of Scene Recognition Models

Alejandro López-Cifuentes, Marcos Escudero-Viñolo, Andrija Gajić, Jesús Bescós

### The Impact of Activation Sparsity on Overfitting in Convolutional Neural Networks

Overfitting is one of the fundamental challenges when training convolutional neural networks and is usually identified by a diverging training and test loss. The underlying dynamics of how the flow of activations induces overfitting is however poorly understood. In this study we introduce a perplexity-based sparsity definition to derive and visualise layer-wise activation measures. These novel explainable AI strategies reveal a surprising relationship between activation sparsity and overfitting, namely an increase in sparsity in the feature extraction layers shortly before the test loss starts rising. This tendency is preserved across network architectures and regularisation strategies so that our measures can be used as a reliable indicator for overfitting while decoupling the network’s generalisation capabilities from its loss-based definition. Moreover, our differentiable sparsity formulation can be used to explicitly penalise the emergence of sparsity during training so that the impact of reduced sparsity on overfitting can be studied in real-time. Applying this penalty and analysing activation sparsity for well known regularisers and in common network architectures supports the hypothesis that reduced activation sparsity can effectively improve the generalisation and classification performance. In line with other recent work on this topic, our methods reveal novel insights into the contradicting concepts of activation sparsity and network capacity by demonstrating that dense activations can enable discriminative feature learning while efficiently exploiting the capacity of deep models without suffering from overfitting, even when trained excessively.

Karim Huesmann, Luis Garcia Rodriguez, Lars Linsen, Benjamin Risse

### Remove to Improve?

The workhorses of CNNs are their filters, located at different layers and tuned to different features. Their responses are combined using weights obtained via network training. Training is aimed at optimal results for the entire training data, e.g., highest average classification accuracy. In this paper, we are interested in extending the current understanding of the roles played by the filters, their mutual interactions, and their relationship to classification accuracy. This is motivated by observations that the classification accuracy for some classes increases, instead of decreasing, when some filters are pruned from a CNN. We are interested in experimentally addressing the following question: Under what conditions does filter pruning increase classification accuracy? We show that improvement of classification accuracy occurs for certain classes. These classes are placed during learning into a space (spanned by filter usage) populated with semantically related neighbors. The neighborhood structure of such classes is however sparse enough so that during pruning, the resulting compression bringing all classes together brings sample data closer together and thus increases the accuracy of classification.

Kamila Abdiyeva, Martin Lukac, Narendra Ahuja

### Explaining How Deep Neural Networks Forget by Deep Visualization

Explaining the behaviors of deep neural networks, usually considered as black boxes, is critical, especially now that they are being adopted across diverse aspects of human life. Taking advantage of interpretable machine learning (interpretable ML), this paper proposes a novel tool called Catastrophic Forgetting Dissector (or CFD) to explain catastrophic forgetting in continual learning settings. We also introduce a new method called Critical Freezing based on the observations of our tool. Experiments on ResNet-50 articulate how catastrophic forgetting happens, particularly showing which components of this famous network are forgetting. Our new continual learning algorithm defeats various recent techniques by a significant margin, proving the capability of the investigation. Critical freezing not only attacks catastrophic forgetting but also exposes explainability.

Giang Nguyen, Shuan Chen, Tae Joon Jun, Daeyoung Kim

### Deep Learning for Astrophysics, Understanding the Impact of Attention on Variability Induced by Parameter Initialization

In the astrophysics domain, the detection and description of gamma rays is a research direction for our understanding of the universe. Gamma-ray reconstruction from Cherenkov telescope data is multi-task by nature. The image recorded in the Cherenkov camera pixels relates to the type, energy, incoming direction and distance of a particle from a telescope observation. We propose γ-PhysNet, a physically inspired multi-task deep neural network for gamma/proton particle classification, and gamma energy and direction reconstruction. As ground truth does not exist for real data, γ-PhysNet is trained and evaluated on large-scale Monte Carlo simulations. Robustness is then crucial for the transfer of the performance to real data. Relying on a visual explanation method, we evaluate the influence of attention on the variability due to weight initialization, and how it helps improve the robustness of the model. All the experiments are conducted in the context of single telescope analysis for the Cherenkov Telescope Array simulated data analysis.

Mikaël Jacquemont, Thomas Vuillaume, Alexandre Benoit, Gilles Maurin, Patrick Lambert

### A General Approach to Compute the Relevance of Middle-Level Input Features

This work proposes a novel general framework, in the context of eXplainable Artificial Intelligence (XAI), to construct explanations for the behaviour of Machine Learning (ML) models in terms of middle-level features which represent perceptually salient input parts. One can isolate two different ways to provide explanations in the context of XAI: low and middle-level explanations. Middle-level explanations have been introduced for alleviating some deficiencies of low-level explanations such as, in the context of image classification, the fact that human users are left with a significant interpretive burden: starting from low-level explanations, one has to identify properties of the overall input that are perceptually salient for the human visual system. However, a general approach to correctly evaluate the elements of middle-level explanations with respect to ML model responses has never been proposed in the literature. We experimentally evaluate the proposed approach to explain the decisions made by an Imagenet pre-trained VGG16 model on STL-10 images and by a customised model trained on the JAFFE dataset, using two different computational definitions of middle-level features, and compare it with two different XAI middle-level methods. The results show that our approach can be used successfully in different computational definitions of middle-level explanations.

Andrea Apicella, Salvatore Giugliano, Francesco Isgrò, Roberto Prevete

### Evaluation of Interpretable Association Rule Mining Methods on Time-Series in the Maritime Domain

In decision-critical domains, the results generated by black box models such as state-of-the-art deep learning based classifiers raise questions regarding their explainability. In order to ensure the trust of operators in these systems, an explanation of the reasons behind the predictions is crucial. As rule-based approaches rely on simple if-then statements which can easily be understood by a human operator, they are considered interpretable prediction models. Therefore, association rule mining methods are applied for explaining a time-series classifier in the maritime domain. Three rule mining algorithms are evaluated on the classification of vessel types trained on a real-world dataset. Each one is a surrogate model which mimics the behavior of the underlying neural network. In the experiments the GiniReg method performs the best, resulting in a less complex model which is easier to interpret. The SBRL method works well in terms of classification performance but, due to an increase in complexity, it is more challenging to explain. Furthermore, during the evaluation the impact of hyper-parameters on the performance of the model along with the execution time of all three approaches is analyzed.

Manjunatha Veerappa, Mathias Anneken, Nadia Burkart

### Anchors vs Attention: Comparing XAI on a Real-Life Use Case

Recent advances in eXplainable Artificial Intelligence (XAI) have led to many different methods for improving the explainability of deep learning algorithms. With many options at hand, and possibly the need to adapt existing ones to new problems, one may struggle to choose the right method to generate explanations. This paper presents an objective approach to compare two different existing XAI methods. These methods are applied to a use case from the literature and to a real use case of a French administration.

Gaëlle Jouis, Harold Mouchère, Fabien Picarougne, Alexandre Hardouin

### Explanation-Driven Characterization of Android Ransomware

Machine learning is currently successfully used for addressing several cybersecurity detection and classification tasks. Typically, such detectors are modeled through complex learning algorithms employing a wide variety of features. Although these settings allow achieving considerable performance, gaining insights on the learned knowledge turns out to be a hard task. To address this issue, research efforts on the interpretability of machine learning approaches to cybersecurity tasks are currently rising. In particular, relying on explanations could improve prevention and detection capabilities since they could help human experts to find out the distinctive features that truly characterize malware attacks. In this perspective, Android ransomware represents a serious threat. Leveraging state-of-the-art explanation techniques, we present a first approach that enables the identification of the most influential discriminative features for ransomware characterization. We propose strategies to adopt explanation techniques appropriately and describe ransomware families and their evolution over time. Reported results suggest that our proposal can help cyber threat intelligence teams in the early detection of new ransomware families, and could be applicable to other malware detection systems through the identification of their distinctive features.

Michele Scalas, Konrad Rieck, Giorgio Giacinto

### Reliability of eXplainable Artificial Intelligence in Adversarial Perturbation Scenarios

Nowadays, Deep Neural Networks (DNNs) are widely adopted in several fields, including critical systems, medicine, self-guided vehicles etc. Among the reasons sustaining this spread are the higher generalisation ability and performance levels that DNNs usually obtain when compared to classical machine learning models. Nonetheless, their black-box nature raises ethical and judicial concerns that lead to a lack of confidence in their use in sensitive applications. For this reason, recently there has been a growing interest in eXplainable Artificial Intelligence (XAI), a field providing tools, techniques and algorithms designed to generate interpretable explanations, comprehensible to humans, for the decisions made by a machine learning model. However, it has been demonstrated that DNNs are susceptible to Adversarial Perturbations (APs), namely procedures intended to mislead a target model by means of an almost imperceptible noise. The relation between XAI and AP is of great interest since it can help improve trustworthiness in AI-based systems. To this aim, it is important to increase awareness of the risks associated with the use of XAI in critical contexts, in a world where APs are present and easy to perform. On this line, we quantitatively analyse the impact that APs have on XAI in terms of differences in the explainability maps. Since this work is intended only as an intuitive proof-of-concept, the aforementioned experiments are run in a fashion that is easy to understand and to quantify, using publicly available datasets and algorithms. Results show that AP can strongly affect the XAI outcomes, even in the case of a failed attack, highlighting the need for further research in this field.

Antonio Galli, Stefano Marrone, Vincenzo Moscato, Carlo Sansone

### AI Explainability. A Bridge Between Machine Vision and Natural Language Processing

This paper attempts to present an appraisal review of explainable Artificial Intelligence research, with a focus on building a bridge between the image processing community and the natural language processing (NLP) community. The paper highlights the implicit link between the two disciplines as exemplified by the emergence of automatic image annotation systems, visual question-answer systems, text-to-image generation and multimedia analytics. Next, the paper identifies a set of natural language processing fields where visual-based explainability can boost the local NLP task. These include sentiment analysis, automatic text summarization, system argumentation, and topical analysis, among others, which are highly expected to fuel prominent future research in the field.

### Recursive Division of Image for Explanation of Shallow CNN Models

In this paper, we investigate a recursive division approach for explaining a particular decision of a shallow black box model. The core of the proposed method is the division of the image being classified into separate rectangular parts, followed by the analysis of their influence on the classification result. Such divisions are repeated recursively until the explanation of the classification result is found, or the size of the parts is too small. As a result, a pair of images with complementary hidden parts is discovered. The first image preserves both the most valuable parts and the classification result of the initial image. The second image represents the result of hiding the most valuable parts of the initial image, which leads to a different classification for the binary classification problem. Experimental research (applied to the Food-5K and concrete crack image datasets) showed that the quality of the proposed method might be close to LIME or even better, while the performance of recursive division is better.

Oleksii Gorokhovatskyi, Olena Peredrii
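
The recursive division idea described in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the quadrant split, the fill value used for hiding, the stopping rule, and the names `hide`, `split`, `recursive_division`, and `predict` are all assumptions made for illustration. `predict` stands in for any black box classifier that maps an image array to a label.

```python
import numpy as np


def hide(image, box, fill=0.0):
    """Return a copy of `image` with the rectangle box=(y0, y1, x0, x1) filled."""
    y0, y1, x0, x1 = box
    out = image.copy()
    out[y0:y1, x0:x1] = fill
    return out


def split(box):
    """Split a rectangle into four quadrants (an assumed division scheme)."""
    y0, y1, x0, x1 = box
    ym, xm = (y0 + y1) // 2, (x0 + x1) // 2
    return [(y0, ym, x0, xm), (y0, ym, xm, x1),
            (ym, y1, x0, xm), (ym, y1, xm, x1)]


def recursive_division(image, predict, min_size=2, fill=0.0):
    """Narrow down a rectangle whose hiding changes the predicted label.

    Returns the box (y0, y1, x0, x1), or None if hiding the whole image
    does not change the label.
    """
    label = predict(image)
    box = (0, image.shape[0], 0, image.shape[1])
    if predict(hide(image, box, fill)) == label:
        return None
    while True:
        y0, y1, x0, x1 = box
        # Stop when a further split would produce parts below min_size.
        if min(y1 - y0, x1 - x0) < 2 * min_size:
            return box
        # Keep only the sub-rectangles whose hiding flips the label.
        flipping = [b for b in split(box)
                    if predict(hide(image, b, fill)) != label]
        if not flipping:
            return box
        box = flipping[0]


# Toy usage: a "classifier" that fires whenever any pixel is non-zero.
toy_image = np.zeros((16, 16))
toy_image[2:4, 2:4] = 1.0
toy_predict = lambda img: int(img.sum() > 0)
found_box = recursive_division(toy_image, toy_predict)
```

Given the returned box, the complementary pair mentioned in the abstract could be formed by hiding the box in one copy of the image and hiding everything outside it in another.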

### Camera Ego-Positioning Using Sensor Fusion and Complementary Method

Visual simultaneous localization and mapping (SLAM) is a common solution for camera ego-positioning. However, SLAM sometimes loses tracking, for instance due to fast camera motion or featureless or repetitive environments. To account for the limitations of visual SLAM, we fuse the visual positioning results with inertial measurement unit (IMU) data using filter-based, loosely-coupled sensor fusion methods, and further combine feature-based SLAM with direct SLAM via the proposed complementary fusion to retain the advantages of both methods; i.e., we not only keep the accurate positioning of feature-based SLAM but also account for its difficulty with featureless scenes by using direct SLAM. Experimental results show that the proposed complementary method improves the positioning accuracy of conventional vision-only SLAM and leads to more robust positioning results.

Peng-Yuan Kao, Kuan-Wei Tseng, Tian-Yi Shen, Yan-Bin Song, Kuan-Wen Chen, Shih-Wei Hu, Sheng-Wen Shih, Yi-Ping Hung

### ATSal: An Attention Based Architecture for Saliency Prediction in 360 Videos

The spherical domain representation of 360° video/image presents many challenges related to the storage, processing, transmission and rendering of omnidirectional videos (ODV). Models of human visual attention can be used so that only a single viewport is rendered at a time, which is important when developing systems that allow users to explore ODV with head mounted displays (HMD). Accordingly, researchers have proposed various saliency models for 360° video/images. This paper proposes ATSal, a novel attention based (head-eye) saliency model for 360° videos. The attention mechanism explicitly encodes global static visual attention allowing expert models to focus on learning the saliency on local patches throughout consecutive frames. We compare the proposed approach to other state-of-the-art saliency models on two datasets: Salient360! and VR-EyeTracking. Experimental results on over 80 ODV videos (75K+ frames) show that the proposed method outperforms the existing state-of-the-art.

Yasser Dahou, Marouane Tliba, Kevin McGuinness, Noel O’Connor

### Rescue Dog Action Recognition by Integrating Ego-Centric Video, Sound and Sensor Information

A dog which assists rescue activity in the scene of disasters such as earthquakes and landslides is called a “disaster rescue dog” or just a “rescue dog”. In Japan where earthquakes happen frequently, a research project on “Cyber-Rescue” is being organized for more efficient rescue activities. In the project, to analyze the activities of rescue dogs in the scene of disasters, “Cyber Dog Suits” equipped with sensors, a camera and a GPS were developed. In this work, we recognize dog activities in the ego-centric dog videos taken by the camera mounted on the cyber-dog suits. To do that, we propose an image/sound/sensor-based four-stream CNN for dog activity recognition which integrates sound and sensor signals as well as motion and appearance. We conducted some experiments for multi-class activity categorization using the proposed method. As a result, the proposed method which integrates appearance, motion, sound and sensor information achieved the highest accuracy, 48.05%. This result is relatively high as a recognition result of ego-centric videos.

Yuta Ide, Tsuyohito Araki, Ryunosuke Hamada, Kazunori Ohno, Keiji Yanai

### Understanding Event Boundaries for Egocentric Activity Recognition from Photo-Streams

The recognition of human activities captured by a wearable photo-camera is especially suited for understanding the behavior of a person. However, it has received comparatively little attention with respect to activity recognition from fixed cameras. In this work, we propose to use segmented events from photo-streams as temporal boundaries to improve the performance of activity recognition. Furthermore, we robustly measure its effectiveness when images of the evaluated person have been seen during training, and when the person is completely unknown during testing. Experimental results show that leveraging temporal boundary information on pictures of seen people improves all classification metrics; in particular, it improves the classification accuracy up to 85.73%.

Alejandro Cartas, Estefania Talavera, Petia Radeva, Mariella Dimiccoli

### Egomap: Hierarchical First-Person Semantic Mapping

We consider unsupervised learning of semantic, user-specific maps from first-person video. The task we address can be thought of as a semantic, non-geometric form of simultaneous localisation and mapping, differing in significant ways from formulations typical in robotics. Locations, termed stations, typically correspond to rooms or areas in which a user spends time, places to which they might refer in spoken conversation. Our maps are modeled as a hierarchy of probabilistic station graphs and view graphs. View graphs capture an aspect of user behaviour within stations. Visits are temporally segmented based on qualitative visual motion and used to update the map, either by updating an existing map station or adding a new map station. We contribute a labelled dataset suitable for evaluation of this novel SLAM task. Experiments compare mapping performance with and without the use of view graphs and demonstrate better online mapping than when using offline clustering.

Tamas Suveges, Stephen McKenna

### Ultrasound for Gaze Estimation

Most eye tracking methods are light-based. As such they can suffer from ambient light changes when used outdoors. It has been suggested that ultrasound could provide a low power, fast, light-insensitive alternative to camera based sensors for eye tracking. We designed a bench top experimental setup to investigate the utility of ultrasound for eye tracking, and collected time of flight and amplitude data for a range of gaze angles of a model eye. We used this data as input for a machine learning model and demonstrate that we can effectively estimate gaze (gaze RMSE error of 1.021 ± 0.189° with an adjusted R² score of 89.92 ± 4.9).

Andre Golard, Sachin S. Talathi
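
The two figures quoted in the abstract above are standard regression metrics. As a minimal sketch of how they are usually computed (the abstract gives no implementation details, so the function names and the `n_features` argument are assumptions, and the toy numbers below are made up purely to exercise the formulas):

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root-mean-square error between true and predicted gaze angles."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def adjusted_r2(y_true, y_pred, n_features):
    """Adjusted R^2: R^2 penalised for the number of predictors used."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = y_true.size
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return float(1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1))


# Illustrative values only, not data from the paper.
angles_true = [0.0, 1.0, 2.0, 3.0, 4.0]
angles_pred = [0.0, 1.0, 2.0, 3.0, 5.0]
example_rmse = rmse(angles_true, angles_pred)
example_adj_r2 = adjusted_r2(angles_true, angles_pred, n_features=2)
```

The adjusted score falls as `n_features` grows for a fixed fit quality, which is why it is often preferred over plain R² when a model takes several input signals.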

### Synthetic Gaze Data Augmentation for Improved User Calibration

In this paper, we focus on the calibration possibilities of a deep learning based gaze estimation process applying transfer learning, comparing its performance when using a general dataset versus when using a gaze-specific dataset in the pretrained model. Subject calibration has been shown to improve gaze accuracy in high-performance eye trackers. Hence, we examine the potential of a deep learning gaze estimation model for subject calibration employing fine-tuning procedures. A pretrained Resnet-18 network, which has great performance in many computer vision tasks, is fine-tuned using user-specific data in a few-shot adaptive gaze estimation approach. We study the impact of pretraining a model with a synthetic dataset, U2Eyes, before addressing the gaze estimation calibration in a real dataset, I2Head. The results of the work show that the success of the individual calibration largely depends on the balance between fine-tuning and the standard supervised learning procedures and that using a gaze-specific dataset to pretrain the model improves the accuracy when few images are available for calibration. This paper shows that calibration is feasible in low resolution scenarios, providing outstanding accuracies below 1.5° of error.

Gonzalo Garde, Andoni Larumbe-Bergera, Sonia Porta, Rafael Cabeza, Arantxa Villanueva

### Eye Movement Classification with Temporal Convolutional Networks

Recently, deep learning approaches have been proposed to detect eye movements such as fixations, saccades, and smooth pursuits from eye tracking data. These are end-to-end methods that have been shown to surpass traditional ones, requiring no ad hoc parameters. In this work we propose the use of temporal convolutional networks (TCNs) for automated eye movement classification and investigate the influence of feature space, scale, and context window sizes on the classification results. We evaluated the performance of TCNs against a state-of-the-art 1D-CNN-BLSTM model using GazeCom, a publicly available dataset. Our results show that TCNs can outperform the 1D-CNN-BLSTM, achieving an F-score of 94.2% for fixations, 89.9% for saccades, and 73.7% for smooth pursuits on sample level, and 89.6%, 94.3%, and 60.2% on event level. We also state the advantages of TCNs over sequential networks for this problem, and how these scores can be further improved by feature space extension.

Carlos Elmadjian, Candy Gonzales, Carlos H. Morimoto

### A Web-Based Eye Tracking Data Visualization Tool

Visualizing eye tracking data can provide insights in many research fields. However, visualizing such data efficiently and cost-effectively is challenging without well-designed tools. Easily accessible web-based approaches equipped with intuitive and interactive visualizations offer a promising solution. Many such tools already exist; however, they mostly use one specific visualization technique. In this paper, we describe a web application which uses a combination of different visualization methods for eye tracking data. The visualization techniques are interactively linked to provide several perspectives on the eye tracking data. We conclude the paper by discussing challenges, limitations, and future work.

Hristo Bakardzhiev, Marloes van der Burgt, Eduardo Martins, Bart van den Dool, Chyara Jansen, David van Scheppingen, Günter Wallner, Michael Burch

### Influence of Peripheral Vibration Stimulus on Viewing and Response Actions

Changes in perceptual performance and attention levels in response to a vibration motion stimulus in the peripheral field of vision were observed experimentally. Viewers were asked to perform a dual task: detecting a single peripheral vibration while viewing a consequence task in the central field of vision. A hierarchical Bayesian model was employed to extract the features of viewing behaviour from observed response data. The estimated parameters showed the correct answer rate tendency, vibration frequency dependence, and time series for covert attention. Also, the estimated frequency of microsaccades was an indicator of the temporal change in latent attention and the suppression of eye movement.

Takahiro Ueno, Minoru Nakayama

### Judging Qualification, Gender, and Age of the Observer Based on Gaze Patterns When Looking at Faces

The research aimed to compare eye movement patterns of people looking at faces with different but subtle teeth imperfections. Both non-specialists and dental experts took part in the experiment. The research outcome includes the analysis of eye movement patterns depending on the specialization, gender, age, face gender, and level of teeth deformation. The study was performed using novel, not widely explored features of eye movements, derived from recurrence plots and Gaze Self Similarity Plots. It turned out that most features are significantly different for laypeople and specialists. Significant differences were also found for gender and age among the observers. No differences were found when comparing the gender of the face being observed and levels of imperfection. Interestingly, it was possible to define which features are sensitive to gender and which to qualification.

Pawel Kasprowski, Katarzyna Harezlak, Piotr Fudalej, Pawel Fudalej

### Gaze Stability During Ocular Proton Therapy: Quantitative Evaluation Based on Eye Surface Surveillance Videos

Ocular proton therapy (OPT) is acknowledged as a therapeutic option for the treatment of ocular melanomas. The OPT clinical workflow relies heavily on x-ray image guidance procedures, both for treatment planning and for patient setup verification. An optimized eye orientation relative to the proton beam axis is determined during treatment planning, and it is reproduced during treatment by focusing the patient's gaze on a fixation light conveniently positioned in space. Treatment geometry verification is routinely performed through stereoscopic radiographic images, while real-time patient gaze reproducibility is qualitatively monitored by visual control of eye surface images acquired by dedicated optical cameras. We describe an approach to quantitatively evaluate the stability of patients’ gaze direction over an OPT treatment course at the National Centre of Oncological Hadrontherapy (Centro Nazionale di Adroterapia Oncologica, CNAO, Pavia, Italy). An automatic pupil segmentation procedure was implemented on eye surveillance videos of five patients recorded during OPT. Automatic pupil detection performance was benchmarked against manual pupil contours of four different clinical operators. Stability of patients’ gaze direction was quantified. 2D distances were expressed as a percentage of the reference pupil radius. Good agreement between circular fitting and manual contours was observed. Inter-operator manual contour 2D distances were in median (interquartile range) 3.3% (3.6%) of the reference pupil radius. The median (interquartile range) of 2D distances between the automatic segmentations and the manual contours was 5.0% (5.3%) of the reference pupil radius. Stability of gaze direction varied across patients, with median values ranging between 6.6% and 16.5% of the reference pupil radius. The measured pupil displacements in the camera field of view were clinically acceptable. Further developments are necessary to reach a real-time clip-less quantification of the eye during OPT.
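
The reporting convention used above (2D distances as a percentage of the reference pupil radius, summarized by median and interquartile range) can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: each pupil is assumed to be summarized by the centre of a fitted circle, the helper names and coordinates are hypothetical, and the quartile method is the default of Python's `statistics.quantiles`, which may differ from the one used in the study.

```python
import math
import statistics


def distance_percent(center, reference_center, reference_radius):
    """2D distance between two pupil centres, as a percentage of the reference pupil radius."""
    return 100.0 * math.dist(center, reference_center) / reference_radius


def median_and_iqr(values):
    """Median and interquartile range (Q3 - Q1) of a list of values."""
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return q2, q3 - q1


# Hypothetical reference pupil (pixels) and per-frame detected centres.
reference_center = (100.0, 100.0)
reference_radius = 50.0
frame_centers = [(103.0, 104.0), (100.0, 100.0), (106.0, 108.0), (100.0, 105.0), (109.0, 112.0)]

percent_distances = [
    distance_percent(c, reference_center, reference_radius) for c in frame_centers
]
median_pct, iqr_pct = median_and_iqr(percent_distances)
```

Normalizing by the reference pupil radius makes the values comparable across patients and camera setups, since the result no longer depends on pixel size.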

Rosalinda Ricotti, Andrea Pella, Giovanni Elisei, Barbara Tagaste, Federico Bello, Giulia Fontana, Maria Rosaria Fiore, Mario Ciocca, Edoardo Mastella, Ester Orlandi, Guido Baroni

### Predicting Reading Speed from Eye-Movement Measures

Ádám Nárai, Kathleen Kay Amora, Zoltán Vidnyánszky, Béla Weiss

### Investigating the Effect of Inter-letter Spacing Modulation on Data-Driven Detection of Developmental Dyslexia Based on Eye-Movement Correlates of Reading: A Machine Learning Approach

János Szalma, Kathleen Kay Amora, Zoltán Vidnyánszky, Béla Weiss

### A Brief Overview of Deep Learning Approaches to Pattern Extraction and Recognition in Paintings and Drawings

This paper provides a brief overview of some of the most relevant deep learning approaches to visual art pattern extraction and recognition, particularly painting and drawing. Indeed, recent advances in deep learning and computer vision, coupled with the growing availability of large digitized visual art collections, have opened new opportunities for computer science researchers to assist the art community with automatic tools to analyze and further understand visual arts. Among other benefits, a deeper understanding of visual arts has the potential to make them more accessible to a wider population, both in terms of fruition and creation, thus supporting the spread of culture.

Giovanna Castellano, Gennaro Vessio

### Iconographic Image Captioning for Artworks

Image captioning involves automatically generating textual descriptions of images based only on the visual input. Although this has been an extensively addressed research topic in recent years, not many contributions have been made in the domain of art historical data. In this particular context, the task of image captioning is confronted with various challenges such as the lack of large-scale datasets of image-text pairs, the complexity of meaning associated with describing artworks and the need for expert-level annotations. This work aims to address some of those challenges by utilizing a novel large-scale dataset of artwork images annotated with concepts from the Iconclass classification system designed for art and iconography. The annotations are processed into clean textual descriptions to create a dataset suitable for training a deep neural network model on the image captioning task. Motivated by the state-of-the-art results achieved in generating captions for natural images, a transformer-based vision-language pre-trained model is fine-tuned using the artwork image dataset. Quantitative evaluation of the results is performed using standard image captioning metrics. The quality of the generated captions and the model’s capacity to generalize to new data are explored by applying the model to a new collection of paintings and analyzing the relation between commonly generated captions and the artistic genre. The overall results suggest that the model can generate meaningful captions that exhibit a stronger relevance to the art historical context, particularly in comparison to captions obtained from models trained only on natural image datasets.

Eva Cetinic

### Semantic Analysis of Cultural Heritage Data: Aligning Paintings and Descriptions in Art-Historic Collections

Art-historic documents often contain multimodal data in terms of images of artworks and metadata, descriptions, or interpretations thereof. Most research efforts have focused either on image analysis or text analysis independently since the associations between the two modes are usually lost during digitization. In this work, we focus on the task of alignment of images and textual descriptions in art-historic digital collections. To this end, we reproduce an existing approach that learns alignments in a semi-supervised fashion. We identify several challenges while automatically aligning images and texts, specifically for the cultural heritage domain, which limit the scalability of previous works. To improve the performance of alignment, we introduce various enhancements to extend the existing approach that show promising results.

Nitisha Jain, Christian Bartz, Tobias Bredow, Emanuel Metzenthin, Jona Otholt, Ralf Krestel

### Insights from a Large-Scale Database of Material Depictions in Paintings

Deep learning has paved the way for strong recognition systems which are often both trained on and applied to natural images. In this paper, we examine the give-and-take relationship between such visual recognition systems and the rich information available in the fine arts. First, we find that visual recognition systems designed for natural images can work surprisingly well on paintings. In particular, we find that interactive segmentation tools can be used to cleanly annotate polygonal segments within paintings, a task which is time consuming to undertake by hand. We also find that FasterRCNN, a model which has been designed for object recognition in natural scenes, can be quickly repurposed for detection of materials in paintings. Second, we show that learning from paintings can be beneficial for neural networks that are intended to be used on natural images. We find that training on paintings instead of natural images can improve the quality of learned features and we further find that a large number of paintings can be a valuable source of test data for evaluating domain adaptation algorithms. Our experiments are based on a novel large-scale annotated database of material depictions in paintings which we detail in a separate manuscript.

Hubert Lin, Mitchell Van Zuijlen, Maarten W. A. Wijntjes, Sylvia C. Pont, Kavita Bala

### An Analysis of the Transfer Learning of Convolutional Neural Networks for Artistic Images

Transfer learning from huge natural image datasets, fine-tuning of deep neural networks and the use of the corresponding pre-trained networks have become de facto the core of art analysis applications. Nevertheless, the effects of transfer learning are still poorly understood. In this paper, we first use techniques for visualizing the network internal representations in order to provide clues to the understanding of what the network has learned on artistic images. Then, we provide a quantitative analysis of the changes introduced by the learning process thanks to metrics in both the feature and parameter spaces, as well as metrics computed on the set of maximal activation images. These analyses are performed on several variations of the transfer learning procedure. In particular, we observed that the network could specialize some pre-trained filters to the new image modality and also that higher layers tend to concentrate classes. Finally, we have shown that a double fine-tuning involving a medium-size artistic dataset can improve the classification on smaller datasets, even when the task changes.

Nicolas Gonthier, Yann Gousseau, Saïd Ladjal

### Handwriting Classification for the Analysis of Art-Historical Documents

Digitized archives contain and preserve the knowledge of generations of scholars in millions of documents. The size of these archives calls for automatic analysis since a manual analysis by specialists is often too expensive. In this paper, we focus on the analysis of handwriting in scanned documents from the art-historic archive of the Wildenstein Plattner Institute. Since the archive consists of documents written in several languages and lacks annotated training data for the creation of recognition models, we propose the task of handwriting classification as a new step for a handwriting OCR pipeline. We propose a handwriting classification model that labels extracted text fragments, e.g., numbers, dates, or words, based on their visual structure. Such a classification supports historians by highlighting documents that contain a specific class of text without the need to read the entire content. To this end, we develop and compare several deep learning-based models for text classification. In extensive experiments, we show the advantages and disadvantages of our proposed approach and discuss possible usage scenarios on a real-world dataset.

Christian Bartz, Hendrik Rätz, Christoph Meinel

### Color Space Exploration of Paintings Using a Novel Probabilistic Divergence

It is strange to think of a world without color. Color adds profound richness and spectacular variety to the phenomenon of visual experience. This paper explores the latent color space used by artists through various well-known paintings. We compare the paintings’ color spaces by introducing a novel probabilistic divergence to evaluate the separation (divergences) between two paintings in their color spaces. Our results show that there is a significant divergence in color spaces of the works created at different periods of history.

Shounak Roychowdhury

### Identifying Centres of Interest in Paintings Using Alignment and Edge Detection

Case Studies on Works by Luc Tuymans

What is the creative process through which an artist goes from an original image to a painting? Can we examine this process using techniques from computer vision and pattern recognition? Here we set the first preliminary steps to algorithmically deconstruct some of the transformations that an artist applies to an original image in order to establish centres of interest, which are focal areas of a painting that carry meaning. We introduce a comparative methodology that first cuts out the minimal segment from the original image on which the painting is based, then aligns the painting with this source, investigates micro-differences to identify centres of interest and attempts to understand their role. In this paper we focus exclusively on micro-differences with respect to edges. We believe that research into where and how artists create centres of interest in paintings is valuable for curators, art historians, viewers, and art educators, and might even help artists to understand and refine their own artistic method.

Sinem Aslan, Luc Steels

### Attention-Based Multi-modal Emotion Recognition from Art

Emotions are very important in dealing with human decisions, interactions, and cognitive processes. Art is an imaginative human creation that should be appreciated, be thought-provoking, and elicit an emotional response. The automatic recognition of emotions triggered by art is of considerable importance. It can be used to categorize artworks according to the emotions they evoke, recommend paintings that accentuate or balance a particular mood, and search for paintings of a particular style or genre that represent custom content in a custom state of impact. In this paper, we propose an attention-based multi-modal approach to emotion recognition that aims to use information from both the painting and title channels to achieve more accurate emotion recognition. Experimental results on the WikiArt emotion dataset showed the efficiency of the proposed model and the usefulness of image and text modalities in emotion recognition.

Tsegaye Misikir Tashu, Tomáš Horváth

### Machines Learning for Mixed Reality

The Milan Cathedral from Survey to Holograms

In recent years, a complete 3D mapping of Cultural Heritage (CH) has become fundamental before any other action can follow. The outputs of different survey techniques can be combined in a 3D point cloud that completely describes the geometry of even the most complex object. These data, very rich in metric quality, can be used to extract 2D technical elaborations and advanced 3D representations to support conservation interventions and maintenance planning. The case of Milan Cathedral is outstanding. In the last 12 years, a multi-technique omni-comprehensive survey has been carried out to extract the technical representations that are used by the Veneranda Fabbrica (VF) del Duomo di Milano to plan its maintenance and conservation activities. Nevertheless, point cloud data lack structured information such as semantics and hierarchy among parts, which are fundamental for 3D model interaction and database (DB) retrieval. In this context, the introduction of point cloud classification methods could improve data usage, model definition and analysis. In this paper, a Multi-level Multi-resolution (MLMR) classification approach is presented and tested on the large dataset of Milan Cathedral. The 3D point model, so structured, is used directly for the first time in a Mixed Reality (MR) environment to develop an application that could benefit professional work by allowing the use of 3D survey data on-site, supporting VF activities.

Simone Teruggi, Francesco Fassi

### From Fully Supervised to Blind Digital Anastylosis on DAFNE Dataset

Anastylosis is an archaeological term for a reconstruction technique whereby an artefact is restored using the original architectural elements. Experts, relying on their own expertise, can sometimes take months or years to carry out this task. Software procedures can provide valid support, but several challenges arise when dealing with practical scenarios. This paper starts from the results of the DAFNE challenge, where a traditional template matching approach won third place in the competition, and goes on to discuss the critical issues that make the unsupervised version, blind digital anastylosis, a hard problem to solve. A preliminary solution supported by experimental results is presented.

Paola Barra, Silvio Barra, Fabio Narducci

### Restoration and Enhancement of Historical Stereo Photos Through Optical Flow

Restoration of digital visual media acquired from repositories of historical photographic and cinematographic material is of key importance for the preservation, study and transmission of the legacy of past cultures to the coming generations. In this paper, a fully automatic approach to the digital restoration of historical stereo photographs is proposed. The approach exploits the content redundancy in stereo pairs for detecting and fixing scratches, dust, dirt spots and many other defects in the original images, as well as improving contrast and illumination. This is done by estimating the optical flow between the images, and using it to register one view onto the other both geometrically and photometrically. Restoration is then accomplished by data fusion according to the stacked median, followed by gradient adjustment and iterative visual consistency checking. The obtained output is fully consistent with the original content, thus improving over the methods based on image hallucination. Comparative results on three different datasets of historical stereograms show the effectiveness of the proposed approach, and its superiority over single-image denoising and super-resolution methods.
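
The data fusion step can be illustrated with a pixel-wise median over a stack of images. The sketch below assumes the images are grayscale, equally sized, and already registered to a common view (in the paper this registration comes from optical flow, which is not reproduced here). How the stack is actually composed is not stated in the abstract, so the three-image stack and the function name `stacked_median` are hypothetical.

```python
import statistics


def stacked_median(images):
    """Pixel-wise median over a stack of equally sized, registered grayscale images.

    Each image is a list of rows; each row is a list of intensity values.
    """
    height, width = len(images[0]), len(images[0][0])
    return [
        [statistics.median(img[r][c] for img in images) for c in range(width)]
        for r in range(height)
    ]


# Hypothetical 2x2 stack: the second image has a bright defect at row 0, column 1.
stack = [
    [[10, 20], [30, 40]],
    [[10, 255], [30, 40]],
    [[12, 20], [30, 38]],
]
fused = stacked_median(stack)
```

Because a defect such as a scratch or dust spot typically affects only one image at a given pixel, the median discards it as an outlier while keeping content that is consistent across the stack.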

Marco Fanfani, Carlo Colombo, Fabio Bellavia

### Automatic Chain Line Segmentation in Historical Prints

The analysis of chain line patterns in historical prints can provide valuable information about the origin of the paper. For this task, we propose a method to automatically detect chain lines in transmitted light images of prints from the 16th century. As motifs and writing on the paper partially occlude the paper structure, we utilize a convolutional neural network in combination with further postprocessing steps to segment and parametrize the chain lines. We compare the number of parametrized lines, as well as the distances between them, with reference lines and values. The proposed method is effective, showing an error of less than 1 mm in comparison to the manually measured chain line distances.

Meike Biendl, Aline Sindel, Thomas Klinke, Andreas Maier, Vincent Christlein

### Documenting the State of Preservation of Historical Stone Sculptures in Three Dimensions with Digital Tools

Protection of stone heritage requires detailed records of the state-of-preservation to ensure accurate decision-making for conservation interventions. This short paper explores the topic of using digital tools to better visualize and map in three-dimensional (3D) representations the deterioration state of stone statues. Technical photography, geomatics techniques, and 3D visualization approaches are combined to propose reproducible and adaptable solutions that can support the investigation of historical materials’ degradation. The short paper reports on the application of these multi-technique approaches regarding a bust sculpture from the Accademia Carrara in Bergamo (Italy).

### Motion Attention Deep Transfer Network for Cross-database Micro-expression Recognition

Cross-database micro-expression recognition is a very challenging problem due to the short duration and low intensity of micro-expressions collected under different conditions. In this paper, we present a Motion Attention Deep Transfer Network (MADTN) that can focus on the most discriminative movement regions of the face and reduce the database bias. Specifically, we first combine the motion information and facial appearance information to obtain a discriminative representation by merging the optical flow fields between three key-frames (the onset frame, the middle frame, the offset frame) and the facial appearance of the middle frame. Then, the deep network architecture extracts cross-domain features using the maximum mean discrepancy (MMD) loss so that the source and target domains have a similar distribution. Results on benchmark cross-database micro-expression experiments demonstrate that the MADTN achieves remarkable performance in many micro-expression transfer tasks and exceeds the state-of-the-art results, which shows the robustness and superiority of our approach.
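
As an illustration of the MMD loss mentioned above, the following is a minimal sketch of the standard biased estimate of squared MMD between two sets of feature vectors. It assumes a single Gaussian kernel with a hypothetical bandwidth `sigma`; the kernel choice, bandwidth, and feature layer used in MADTN are not given in the abstract, and the toy feature values are made up.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors of equal length."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def mmd_squared(source, target, sigma=1.0):
    """Biased estimate of squared MMD: E[k(s,s')] + E[k(t,t')] - 2 E[k(s,t)]."""

    def mean_kernel(xs, ys):
        total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(source, source)
        + mean_kernel(target, target)
        - 2.0 * mean_kernel(source, target)
    )


# Hypothetical 2D features from a source and a (shifted) target database.
source_features = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
target_features = [(3.0, 3.0), (4.0, 3.0), (3.0, 4.0)]

same = mmd_squared(source_features, source_features)
shifted = mmd_squared(source_features, target_features)
```

The value is close to zero when both sets follow the same distribution and grows as they drift apart, which is why minimizing it during training pulls the source and target feature distributions together.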

Wanchuang Xia, Wenming Zheng, Yuan Zong, Xingxun Jiang

### Spatial Temporal Transformer Network for Skeleton-Based Action Recognition

Skeleton-based human action recognition has attracted great interest in recent years, as skeleton data has been demonstrated to be robust to illumination changes, body scales, dynamic camera views, and complex backgrounds. Nevertheless, an effective encoding of the latent information underlying the 3D skeleton is still an open problem. In this work, we propose a novel Spatial-Temporal Transformer network (ST-TR) which models dependencies between joints using the Transformer self-attention operator. In our ST-TR model, a Spatial Self-Attention module (SSA) is used to understand intra-frame interactions between different body parts, and a Temporal Self-Attention module (TSA) to model inter-frame correlations. The two are combined in a two-stream network which outperforms state-of-the-art models using the same input data on both NTU-RGB+D 60 and NTU-RGB+D 120.

Chiara Plizzari, Marco Cannici, Matteo Matteucci

### Slow Feature Subspace for Action Recognition

This paper proposes a framework for human action recognition using a combination of subspace-based methods and slow feature analysis (SFA). Subspace-based methods can compactly model the distribution of multiple images from a video by a low-dimensional subspace even when little data is available. However, the temporal information of the video is lost after generating the subspace using principal component analysis (PCA). In contrast, PCA-SFA, which is a variant of SFA, can produce a valid video descriptor as a basis of a slow feature space from a given image sequence. In the proposed framework, we extract a valid video descriptor from an input video by conducting PCA-SFA, and then transform the descriptor into a subspace by using PCA. This new representation, the slow feature subspace, includes temporal dynamic information. Thus, we can compare two sequences and perform classification by simply calculating the similarity between their slow feature subspaces. The effectiveness of our framework is demonstrated through extensive experiments with two publicly available datasets, KTH action and the Chinese sign language dataset (isolated SLR500).

Suzana R. A. Beleza, Kazuhiro Fukui

### Classification Mechanism of Convolutional Neural Network for Facial Expression Recognition

With the development of deep learning, the structures of convolutional neural networks (CNNs) are becoming more complex and the performance of expression recognition is getting better. However, the classification mechanism of CNNs is still a black box. The main problem is that CNNs have a great number of parameters, which makes it difficult to analyze them clearly. In this paper, we explain the essence of deep learning from the perspective of manifold geometry. The main purpose of deep learning, especially of CNNs, is to learn the probability distributions on manifolds. We also design a neural network for facial expression recognition to explore the classification mechanism of CNNs. By using the deconvolution visualization method, we qualitatively verify that the trained CNN forms a detector for a specific facial action unit (FAU) and that each neuron of the CNN is a specific manifold feature extractor for facial images. Moreover, we design a distance function to measure the differences of activation value distributions on the same feature map of an FAU. The greater the distance, the more sensitive the feature map is to the FAU. The results show that the mapping relationship between FAUs and feature maps of the CNN is determined: the trained CNN has generated an internal detector for each FAU to extract the facial manifold feature.

Yongpei Zhu, Hongwei Fan, Kehong Yuan

### Applying Delaunay Triangulation Augmentation for Deep Learning Facial Expression Generation and Recognition

Generating and recognizing facial expressions has numerous applications; however, these are limited by the scarcity of datasets containing labeled nuanced expressions. In this paper, we describe the use of Delaunay triangulation combined with simple morphing techniques to blend images of faces, which allows us to create and automatically label facial expressions portraying controllable intensities of emotion. We have applied this approach to the RafD dataset, consisting of 67 participants and 8 categorical emotions, and evaluated the augmentation in facial expression generation and recognition tasks using deep learning models. For the generation task, we used a deconvolution neural network which learns to encode the input images in a high-dimensional feature space and generate realistic expressions at varying intensities. The augmentation significantly improves the quality of images compared to previous comparable experiments, and it allows images to be created at a higher resolution. For the recognition task, we evaluated pre-trained Densenet121 and Resnet50 networks with either the original or augmented dataset. Our results indicate that the augmentation alone has a similar or better performance compared to the original. Implications of this method and its role in improving existing facial expression generation and recognition approaches are discussed.

Hristo Valev, Alessio Gallucci, Tim Leufkens, Joyce Westerink, Corina Sas

### Deformable Convolutional LSTM for Human Body Emotion Recognition

People represent their emotions in a myriad of ways. Among the most important are whole-body expressions, which have many applications in different fields such as human-computer interaction (HCI). One of the most important challenges in human emotion recognition is that people express the same feeling in various ways using their face and their body. Recently, many methods have tried to overcome these challenges using Deep Neural Networks (DNNs). However, most of these methods were based on images or on facial expressions only and did not consider deformations that may occur in the images, such as scaling and rotation, which can adversely affect the recognition accuracy. In this work, motivated by recent research on deformable convolutions, we incorporate the deformable behavior into the core of convolutional long short-term memory (ConvLSTM) to improve robustness to these deformations in the image and, consequently, improve its accuracy on the emotion recognition task from videos of arbitrary length. We conducted experiments on the GEMEP dataset and achieved state-of-the-art accuracy of 98.8% on the task of whole human body emotion recognition on the validation set.

Peyman Tahghighi, Abbas Koochari, Masoume Jalali

### Nonlinear Temporal Correlation Based Network for Action Recognition

Action recognition, a trending topic in current research, is important for human behavior analysis, virtual reality, and human-computer interaction. Recently, some works have achieved impressive results in action recognition by decomposing 3D convolutions into temporal and spatial convolutions. Modelling the temporal features is important for action recognition. In this paper, we reconsider the decomposition of convolution operations. In previous temporal convolution operations, the temporal features are extracted by a simple linear transformation, and the temporal relations among adjacent frames are not fully considered. Therefore, we propose a novel temporal structure, namely Nonlinear Temporal Extractors, to replace the existing 1D temporal convolutions. On the one hand, this operation can extract temporal features by considering the relation along the time dimension. On the other hand, it enhances the network’s representation ability by increasing the nonlinearity of the network. Finally, we perform experiments on common action classification datasets, including UCF-101, HMDB-51, and mini-Kinetics-200. Experimental results show the effectiveness of our proposed structure.

Hongsheng Li, WeiWei Zhang, Guangming Zhu, Liang Zhang, Peiyi Shen, Juan Song

### Backmatter
