
2024 | Book

Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications

26th Iberoamerican Congress, CIARP 2023, Coimbra, Portugal, November 27–30, 2023, Proceedings, Part I


About this book

This 2-volume set, LNCS 14469 and 14470, constitutes the proceedings of the 26th Iberoamerican Congress on Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications, CIARP 2023, which took place in Coimbra, Portugal, in November 2023.
The 61 papers presented were carefully reviewed and selected from 106 submissions. They present research in the fields of pattern recognition, artificial intelligence, and related areas.

Table of Contents

Frontmatter
Deblur Capsule Networks

Blur is often caused by physical limitations of the image acquisition sensor or by unsuitable environmental conditions. Blind image deblurring recovers the underlying sharp image from its blurry counterpart without further knowledge of the blur kernel or the sharp image itself. Traditional deconvolution filters are highly dependent on specific kernels or prior knowledge to guide the deblurring process. This work proposes an end-to-end deep learning approach that addresses blind image deconvolution in three stages: (i) it first predicts the blur type, (ii) it then deconvolves the blurry image with the identified and reconstructed blur kernel, and (iii) it deep-regularizes the output image. Our proposed approach, called Deblur Capsule Networks, explores the capsule structure in the context of image deblurring. Such a versatile structure showed promising results for synthetic uniform camera-motion deblurring and multi-domain blind deblurring of general-purpose and remote sensing image datasets, compared to some state-of-the-art techniques.

Daniel Felipe S. Santos, Rafael G. Pires, João P. Papa
Graph Embedding of Almost Constant Large Graphs

In some machine learning applications, graphs tend to be composed of a large number of tiny, almost constant sub-structures. Current embedding methods are not prepared for this type of graph and thus their representational power tends to be very low. Our aim is to define a new graph embedding, called GraphFingerprint, that considers this specific type of graph. The three-dimensional characterisation of a chemical metal-oxide nanocompound easily fits this type of graph, whose nodes are atoms and whose edges are their bonds. Our graph embedding method has been used to predict the toxicity of these nanocompounds, achieving a high accuracy compared to other embedding methods.
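As an illustrative sketch only (not the authors' GraphFingerprint method), an embedding for graphs made of tiny, near-constant sub-structures can be built by counting each node's local environment; the data layout and function name below are hypothetical:

```python
from collections import Counter

def local_fingerprint(nodes, edges):
    """Toy graph fingerprint: count each node's local environment,
    i.e. its own label plus the sorted multiset of neighbour labels.
    `nodes` maps node id -> label; `edges` is a list of (u, v) pairs."""
    neighbours = {u: [] for u in nodes}
    for u, v in edges:
        neighbours[u].append(nodes[v])
        neighbours[v].append(nodes[u])
    envs = ((nodes[u], tuple(sorted(neighbours[u]))) for u in nodes)
    return Counter(envs)

# A hypothetical fragment of a metal-oxide chain: Ti and O atoms with bonds.
nodes = {0: "Ti", 1: "O", 2: "Ti", 3: "O"}
edges = [(0, 1), (1, 2), (2, 3)]
print(local_fingerprint(nodes, edges))
```

Two graphs with similar counts of these local environments would then receive similar embedding vectors, regardless of graph size.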

Francesc Serratosa
Feature Importance for Clustering

The literature on cluster analysis methods evaluating the contribution of features to the emergence of the cluster structure for a given clustering partition is sparse. Despite advances in explainable supervised methods, explaining the outcomes of unsupervised algorithms is a less explored area. This paper proposes two post-hoc algorithms to determine feature importance for prototype-based clustering methods. The first approach assumes that the variation in the distance among cluster prototypes after marginalizing a feature can be used as a proxy for feature importance. The second approach, inspired by cooperative game theory, determines the contribution of each feature to the cluster structure by analyzing all possible feature coalitions. Multiple experiments using real-world datasets confirm the effectiveness of the proposed methods for both hard and fuzzy clustering settings.
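As a hedged illustration of the game-theoretic idea (not the authors' algorithm), the exact Shapley value averages each feature's marginal contribution over all possible coalitions; the toy value function below is an assumption for demonstration only:

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values: weighted average of each feature's marginal
    contribution over all coalitions (exponential cost, fine for few features)."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for coal in combinations(others, k):
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value(set(coal) | {f}) - value(set(coal)))
        phi[f] = total
    return phi

# Hypothetical additive value function: feature "a" fully determines the
# cluster structure, "b" contributes half as much, "c" is pure noise.
scores = {"a": 1.0, "b": 0.5, "c": 0.0}
value = lambda coalition: sum(scores[f] for f in coalition)
print(shapley_values(["a", "b", "c"], value))
```

For an additive game like this one, the Shapley value of each feature equals its individual score, which makes the toy example easy to verify by hand.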

Gonzalo Nápoles, Niels Griffioen, Samaneh Khoshrou, Çiçek Güven
Uncovering Manipulated Files Using Mathematical Natural Laws

The data exchange between different sectors of society has led to the development of electronic documents supported by different reading formats, namely the portable PDF format. These documents have characteristics similar to those used in programming languages, allowing the incorporation of potentially malicious code, which makes them a vector for cyberattacks. Thus, detecting anomalies in digital documents, such as PDF files, has become crucial in several domains, such as finance, digital forensic analysis and law enforcement. Current detection methods are mostly based on machine learning and are characterised by being complex, slow and largely inefficient in detecting zero-day attacks. This paper proposes a Benford Law (BL) based model to uncover manipulated PDF documents by analysing potential anomalies in the first digits extracted from the PDF document's characteristics. The proposed model was evaluated using the CIC Evasive PDFMAL2022 dataset, consisting of 1191 documents (278 benign and 918 malicious). To classify the PDF documents, based on BL, into malicious or benign, three statistical models were used in conjunction with the mean absolute deviation: the parametric Pearson model and the non-parametric Spearman and Cramér-von Mises models. The results show a maximum F1 score of 87.63% in detecting malicious documents using Pearson's model, demonstrating the suitability and effectiveness of applying Benford's Law to detect anomalies in digital documents, maintaining the accuracy and integrity of information and promoting trust in systems and institutions.
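As a minimal sketch of the first-digit test (not the paper's full model), one can compare the observed leading-digit distribution of a document's numeric characteristics against Benford's Law expectation using the mean absolute deviation:

```python
import math
from collections import Counter

def first_digit(n: int) -> int:
    """Return the leading digit of a positive integer."""
    while n >= 10:
        n //= 10
    return n

def benford_mad(values):
    """Mean absolute deviation between the observed first-digit
    distribution and the Benford's Law expectation log10(1 + 1/d)."""
    digits = [first_digit(v) for v in values if v > 0]
    counts = Counter(digits)
    n = len(digits)
    mad = 0.0
    for d in range(1, 10):
        observed = counts.get(d, 0) / n
        expected = math.log10(1 + 1 / d)  # Benford probability of digit d
        mad += abs(observed - expected)
    return mad / 9

# Powers of 2 famously follow Benford's Law; a uniform sample does not,
# so it yields a larger deviation and would look more "anomalous".
benford_like = [2 ** k for k in range(1, 200)]
uniform_like = list(range(100, 1000))
print(benford_mad(benford_like) < benford_mad(uniform_like))
```

A larger deviation from the Benford expectation would then push a document toward the "manipulated" class under whichever statistical test is applied.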

Pedro Fernandes, Séamus Ó Ciardhuáin, Mário Antunes
History Based Incremental Singular Value Decomposition for Background Initialization and Foreground Segmentation

Background initialization is an essential step for both hand-crafted and deep learning foreground segmentation approaches. In this paper, we propose a low-rank approximation algorithm that effectively handles the challenge caused by Stationary Foreground Objects (SFOs) on both offline and online bases. The proposed algorithm employs different incremental decomposition mechanisms that control the contribution of earlier and current frames in the overall covariance of the processed video. The proposed algorithm is able to identify the type of the detected SFO, whether it is an abandoned or removed object. Moreover, a background-updating mechanism is introduced to feed the proper background to learning models that are pretrained for foreground segmentation. The experimental results demonstrate the effectiveness of both proposed mechanisms: the SFO identification and the background initialization.

Ibrahim Kajo, Yassine Ruichek, Nidal Kamel
Vehicle Re-Identification Based on Unsupervised Domain Adaptation by Incremental Generation of Pseudo-Labels

The main goal of vehicle re-identification (ReID) is to associate the same vehicle identity across different cameras. This is a challenging task due to variations in lighting, viewpoints or occlusions; in particular, vehicles present a large intra-class variability and a small inter-class variability. In ReID, the samples in the test sets belong to identities that have not been seen during training. To reduce the domain gap between the train and test sets, this work explores unsupervised domain adaptation (UDA), automatically generating pseudo-labels from the test data, which are used to fine-tune the ReID models. Specifically, the pseudo-labels are obtained by clustering with different hyperparameters, and incrementally, by retraining the model several times per hyperparameter with the generated pseudo-labels. The ReID system is evaluated on the CityFlow-ReID-v2 dataset.

Paula Moral, Álvaro García-Martín, José M. Martínez
How to Turn Your Camera into a Perfect Pinhole Model

Camera calibration is a first and fundamental step in various computer vision applications. Despite being an active field of research, Zhang's method remains widely used for camera calibration due to its implementation in popular toolboxes like MATLAB and OpenCV. However, this method initially assumes a pinhole model with oversimplified distortion models. In this work, we propose a novel approach that involves a pre-processing step to remove distortions from images by means of Gaussian processes. Our method does not need to assume any distortion model and can be applied to severely warped images, even in the case of multiple distortion sources, e.g., a fisheye image of a curved mirror reflection. The Gaussian processes capture all distortions and camera imperfections, resulting in virtual images as though taken by an ideal pinhole camera with square pixels. Furthermore, this ideal GP-camera only needs one image of a square grid calibration pattern. This model allows for a serious upgrade of many algorithms and applications that are designed in a pure projective geometry setting but whose performance is very sensitive to non-linear lens distortions. We demonstrate the effectiveness of our method by simplifying Zhang's calibration method, reducing the number of parameters and getting rid of the distortion parameters and iterative optimization. We validate by means of synthetic data and real-world images. The contributions of this work include the construction of a virtual ideal pinhole camera using Gaussian processes, a simplified calibration method and lens distortion removal.

Ivan De Boi, Stuti Pathak, Marina Oliveira, Rudi Penne
Single Image HDR Synthesis with Histogram Learning

High dynamic range imaging aims for a more accurate representation of the scene, providing luminance coverage large enough to match the range of human perception. In this paper, we present a technique to synthesize an HDR image from an LDR input. The proposed two-stage approach expands the dynamic range and predicts the resulting histogram with cumulative histogram learning. Histogram matching is then carried out to reallocate the pixel intensities. In the second stage, HDR images are constructed using reinforcement learning with pixel-wise rewards for local consistency adjustment. Experiments are conducted on the HDR-Real and HDR-EYE datasets. Quantitative evaluation on HDR-VDP-2, PSNR, and SSIM demonstrates the effectiveness of the approach compared to state-of-the-art techniques.
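As a hedged sketch of plain histogram matching (one classical ingredient of such a pipeline, not the authors' learned variant), intensities can be reallocated by aligning cumulative histograms:

```python
def match_histogram(values, reference):
    """Reallocate pixel intensities of `values` so their empirical
    distribution matches that of `reference` (both lists of ints in 0..255)."""
    def cdf(pixels):
        hist = [0] * 256
        for p in pixels:
            hist[p] += 1
        total, acc, out = len(pixels), 0, []
        for h in hist:
            acc += h
            out.append(acc / total)
        return out

    src_cdf, ref_cdf = cdf(values), cdf(reference)
    # For each grey level, pick the reference level with the closest CDF value.
    lut = []
    for level in range(256):
        target = src_cdf[level]
        best = min(range(256), key=lambda r: abs(ref_cdf[r] - target))
        lut.append(best)
    return [lut[p] for p in values]

dark = [10, 10, 20, 30]        # hypothetical low-dynamic-range input
bright = [10, 100, 200, 250]   # hypothetical target distribution
print(match_histogram(dark, bright))  # -> [100, 100, 200, 250]
```

In the paper's setting, the target histogram would come from the learned cumulative-histogram predictor rather than from a fixed reference image.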

Yi-Rung Lin, Huei-Yung Lin, Wen-Chieh Lin
But That’s Not Why: Inference Adjustment by Interactive Prototype Revision

Prototypical part networks predict not only the class of an image but also explain why it was chosen. In some cases, however, the detected features do not relate to the depicted objects. This is especially relevant in prototypical part networks, as prototypes are meant to code for high-level concepts such as semantic parts of objects. This raises the question of how the inference of such networks can be improved. Here we suggest enabling the user to give hints and interactively correct the model's reasoning. We show that even correct classifications can rely on unreasonable or spurious prototypes that result from confounding variables in a dataset. Hence, we propose simple yet effective interaction schemes for inference adjustment that enable the user to interactively revise the prototypes chosen by the model. With the suggested mode of training, spurious prototypes can be removed or altered to become sensitive to object features. Interactive prototype revision allows machine-learning-naïve users to adjust the logic of reasoning and change the way prototypical part networks make decisions.

Michael Gerstenberger, Thomas Wiegand, Peter Eisert, Sebastian Bosse
Teaching Practices Analysis Through Audio Signal Processing

Remote teaching has been used successfully thanks to the evolution of videoconference solutions and broadband internet availability. Even several years before the global COVID-19 pandemic, Ceibal used this approach for different educational programs in Uruguay. As in face-to-face lessons, teaching evaluation is a relevant task in this context, and it requires much time and many human resources for classroom observation. In this work we propose automatic tools for the analysis of teaching practices, taking advantage of the lesson recordings provided by the videoconference system. We show that it is possible to detect, with a high level of accuracy, lesson metrics relevant to the analysis, such as the teacher's talking time or the language used in English lessons.

Braulio Ríos, Emilio Martínez, Diego Silvera, Pablo Cancela, Germán Capdehourat
Time Distributed Multiview Representation for Speech Emotion Recognition

In recent years, speech-emotion recognition (SER) techniques have gained importance, mainly in human-computer interaction studies and applications. This research area has different challenges, including developing new and efficient detection methods, efficient extraction of audio features, and time preprocessing strategies. This paper proposes a new multiview model to detect speech emotion in raw audio data. The proposed method uses mel-spectrogram features optimized from audio files and combines deep learning algorithms to improve the detection performance. This combination relied on the following algorithms: CNN (Convolutional Neural Network), VGG (Visual Geometry Group), ResNet (Residual neural network), and LSTM (Long Short-Term Memory). The role of the CNN algorithm is to extract the characteristics present in the images of the mel-spectrograms applied as input to the method. These characteristics are combined with the VGG and ResNet networks, which are pre-trained algorithms. Finally, the LSTM algorithm receives all this combined information to identify the predefined emotions. The proposed method was developed using the RAVDESS database and considering eight emotions. The results show an increase of up to 12% in accuracy compared to strategies in the literature that use raw data processing.

Flavia Letícia de Mattos, Marcelo E. Pellenz, Alceu de S. Britto
Detection of Covid-19 in Chest X-Ray Images Using Percolation Features and Hermite Polynomial Classification

Covid-19 is a serious disease caused by the Sars-CoV-2 virus, first reported in China in late 2019, which rapidly spread around the world. As the virus affects mostly the lungs, chest X-rays are one of the safest and most accessible ways of diagnosing the infection. In this paper, we propose an approach for detecting Covid-19 in chest X-ray images through the extraction and classification of local and global percolation-based features. The method was applied to two datasets: one containing 2,002 segmented samples split into two classes (Covid-19 and Healthy), and another containing 1,125 non-segmented samples split into three classes (Covid-19, Healthy and Pneumonia). The 48 obtained percolation features were given as input to six different classifiers, and AUC and accuracy values were evaluated. We employed 10-fold cross-validation and evaluated the lesion sub-types with binary and multiclass classification using the Hermite Polynomial classifier, which had never been employed in this context. This classifier provided the best overall results when compared to the other five machine learning algorithms. These results, based on the association of percolation features and Hermite polynomials, can contribute to the detection of lesions by supporting specialists in clinical practice.

Guilherme F. Roberto, Danilo C. Pereira, Alessandro S. Martins, Thaína A. A. Tosta, Carlos Soares, Alessandra Lumini, Guilherme B. Rozendo, Leandro A. Neves, Marcelo Z. Nascimento
Abandoned Object Detection Using Persistent Homology

The automatic detection of suspicious abandoned objects has become a priority in video surveillance in recent years. Terrorist attacks, improperly parked vehicles, abandoned drug packages and many other events endorse the interest in automating this task. Detecting such objects is challenging due to the many issues that public spaces present for video-sequence processing, such as occlusions, illumination changes and crowded environments. On the other hand, using deep learning can be difficult, since it is more successful in perceptual tasks, generally so-called system 1 tasks. In this work we propose to use topological features to describe the scene objects. These features have been used on objects with dynamic shape and remain stable under perturbations. The objects (foreground) are the result of applying a background subtraction algorithm. We propose the concept of surveillance points: a set of points uniformly distributed over the scene. We then keep track of the changes in a cubic region centered at each surveillance point. To do so, we construct a simplicial complex (topological space) from the k foreground frames. We obtain the topological features (using persistent homology) of the sub-complexes for each cubical region, which represent the activity around the surveillance points. Finally, for each surveillance point we track the changes of its associated topological signature over time in order to detect abandoned objects. The accuracy of our method is tested on the PETS2006 dataset with promising results.

Javier Lamar Leon, Raúl Alonso Baryolo, Edel Garcia Reyes, Rocio Gonzalez Diaz, Pedro Salgueiro
Interactive Segmentation with Incremental Watershed Cuts

In this article, we propose an incremental method for computing seeded watershed cuts for interactive image segmentation. We propose an algorithm based on a hierarchical image representation, the binary partition tree, to compute a seeded watershed cut. We show that this algorithm fits perfectly into an interactive segmentation process by handling user interactions, seed addition or removal, in time linear with respect to the number of affected pixels. Run-time comparisons with several state-of-the-art interactive and non-interactive watershed methods show that the proposed method handles user interactions much faster than previous methods, achieving significant speedups from 15× to 90×, thus improving the user experience on large images.

Quentin Lebon, Josselin Lefèvre, Jean Cousty, Benjamin Perret
Supervised Learning of Hierarchical Image Segmentation

We study the problem of predicting hierarchical image segmentations using supervised deep learning. While deep learning methods are now widely used as contour detectors, the lack of image datasets with hierarchical annotations has prevented researchers from explicitly training models to predict hierarchical contours. Image segmentation has been widely studied, but it is limited by only proposing a segmentation at a single scale. Hierarchical image segmentation solves this problem by proposing segmentation at multiple scales, capturing objects and structures at different levels of detail. However, this area of research appears to be less explored and therefore no hierarchical image segmentation dataset exists. In this paper, we provide a hierarchical adaptation of the Pascal-Part dataset [2], and use it to train a neural network for hierarchical image segmentation prediction. We demonstrate the efficiency of the proposed method through three benchmarks: the precision-recall and F-score benchmarks for boundary location, the level recovery fraction for assessing hierarchy quality, and the false discovery fraction. We show that our method successfully learns hierarchical boundaries in the correct order, and achieves better performance than the state-of-the-art model trained on single-scale segmentations.

Raphael Lapertot, Giovanni Chierchia, Benjamin Perret
Unveiling the Influence of Image Super-Resolution on Aerial Scene Classification

Deep learning has made significant advances in recent years, and as a result, it is now in a stage where it can achieve outstanding results in tasks requiring visual understanding of scenes. However, its performance tends to decline when dealing with low-quality images. The advent of super-resolution (SR) techniques has started to have an impact on the field of remote sensing by enabling the restoration of fine details and enhancing image quality, which could help to increase performance in other vision tasks. However, in previous works, contradictory results for scene visual understanding were achieved when SR techniques were applied. In this paper, we present an experimental study on the impact of SR on enhancing aerial scene classification. Through the analysis of different state-of-the-art SR algorithms, including traditional methods and deep learning-based approaches, we unveil the transformative potential of SR in overcoming the limitations of low-resolution (LR) aerial imagery. By enhancing spatial resolution, more fine details are captured, opening the door for an improvement in scene understanding. We also discuss the effect of different image scales on the quality of SR and its effect on aerial scene classification. Our experimental work demonstrates the significant impact of SR on enhancing aerial scene classification compared to LR images, opening new avenues for improved remote sensing applications.

Mohamed Ramzy Ibrahim, Robert Benavente, Daniel Ponsa, Felipe Lumbreras
Weeds Classification with Deep Learning: An Investigation Using CNN, Vision Transformers, Pyramid Vision Transformers, and Ensemble Strategy

Weeds are a significant threat to agricultural production. Weed classification systems based on image analysis have offered innovative solutions to agricultural problems, with convolutional neural networks (CNNs) playing a pivotal role in this task. However, CNNs are limited in their ability to capture global relationships in images due to their localized convolutional operation. Vision Transformers (ViT) and Pyramid Vision Transformers (PVT) have emerged as viable solutions to overcome this limitation. Our study aims to determine the effectiveness of CNN, PVT, and ViT in classifying weeds in image datasets. We also examine if combining these methods in an ensemble can enhance classification performance. Our tests were conducted on significant agricultural datasets, including DeepWeeds and CottonWeedID15. The results indicate that a maximum of 3 methods in an ensemble, with only 15 epochs in training, can achieve high accuracy rates of up to 99.17%. This study demonstrates that high accuracies can be achieved with ease of implementation and only a few epochs.
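As a minimal sketch of the ensemble idea (an assumed soft-voting scheme, not necessarily the authors' exact combination strategy), per-class probabilities from several models can be averaged before taking the argmax:

```python
def ensemble_predict(prob_lists):
    """Soft-voting ensemble: average the per-class probabilities produced
    by several models and return the index of the winning class."""
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    avg = [
        sum(model[c] for model in prob_lists) / n_models
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=lambda c: avg[c])

# Hypothetical softmax outputs of a CNN, a ViT and a PVT on one weed image.
cnn = [0.6, 0.3, 0.1]
vit = [0.2, 0.5, 0.3]
pvt = [0.1, 0.7, 0.2]
print(ensemble_predict([cnn, vit, pvt]))  # class 1 wins on average
```

Soft voting lets a confident majority override a single model's mistake, which is one plausible reason small ensembles of diverse architectures can outperform their individual members.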

Guilherme Botazzo Rozendo, Guilherme Freire Roberto, Marcelo Zanchetta do Nascimento, Leandro Alves Neves, Alessandra Lumini
Leveraging Question Answering for Domain-Agnostic Information Extraction

Transformers gave a considerable boost to Natural Language Processing, but their application to specific scenarios still poses some practical issues. We present an approach for extracting information from technical documents on different domains with minimal effort. It leverages generic models for Question Answering and questions formulated with target properties in mind. These questions are posed to the specific sections where the answer, then used as the value for the property, should reside. We further describe how this approach was applied to documents of two very different domains: toxicology and finance. For both, results extracted from a sample of documents were assessed by domain experts, who also provided feedback on the benefits of this approach. F-scores of 0.73 and 0.90, respectively, in the toxicological and financial domains confirm the potential and flexibility of the approach, suggesting that, while it cannot yet be fully automated and replace human work, it can support expert decisions, thus reducing time and manual effort.

Bruno Carlos Luís Ferreira, Hugo Gonçalo Oliveira, Catarina Silva
Towards a Robust Solution for the Supermarket Shelf Audit Problem: Obsolete Price Tags in Shelves

Shelf auditing holds significant importance within the retail industry’s industrial sector. It encompasses various processes carried out by human operators. This article aims to address the issue of identifying outdated price tags on shelves, bridging the gap of an automated shelf audit. Our proposal introduces a minimum viable process that effectively detects, recognizes, and locates price tags using computer vision and deep learning techniques. The outcomes of this study demonstrate the robustness of our approach in generating a comprehensive list of price tags on shelves, which can be subsequently compared with a database to identify and flag obsolete ones.

Emmanuel F. Morán, Boris X. Vintimilla, Miguel A. Realpe
A Self-Organizing Map Clustering Approach to Support Territorial Zoning

This work aims to evaluate three strategies for analyzing clusters of ordinal categorical data (thematic maps) to support the territorial zoning of the Alto Taquari basin, MS/MT. We evaluated a model-based method, a method based on the segmentation of the multi-way contingency table, and a third based on the transformation of ordinal data into intervals and subsequent cluster analysis using a proposed segmentation of the Self-Organizing Map after the neural network training process. The results showed the adequacy of the methods based on the Self-Organizing Map and on the segmentation of the contingency table, as these techniques generated unimodal clusters with distinguishable groups.

Marcos A. S. da Silva, Pedro V. de A. Barreto, Leonardo N. Matos, Gastão F. Miranda Júnior, Márcia H. G. Dompieri, Fábio R. de Moura, Fabrícia K. S. Resende, Paulo Novais, Pedro Oliveira
Spatial-Temporal Graph Transformer for Surgical Skill Assessment in Simulation Sessions

Automatic surgical skill assessment has the capacity to bring a transformative shift in the assessment, development, and enhancement of surgical proficiency. It offers several advantages, including objectivity, precision, and real-time feedback. These benefits will greatly enhance the development of surgical skills for novice surgeons, enabling them to improve their abilities in a more effective and efficient manner. In this study, our primary objective was to explore the potential of hand skeleton dynamics as an effective means of evaluating surgical proficiency. Specifically, we aimed to discern between experienced surgeons and surgical residents by analyzing sequences of hand skeletons. To the best of our knowledge, this study represents a pioneering approach in using hand skeleton sequences for assessing surgical skills. To effectively capture the spatial-temporal correlations within sequences of hand skeletons for surgical skill assessment, we present STGFormer, a novel approach that combines the capabilities of Graph Convolutional Networks and Transformers. STGFormer is designed to learn advanced spatial-temporal representations and efficiently capture long-range dependencies. We evaluated our proposed approach on a dataset comprising experienced surgeons and surgical residents practicing surgical procedures in a simulated training environment. Our experimental results demonstrate that the proposed STGFormer outperforms all state-of-the-art models for the task of surgical skill assessment. More precisely, we achieve an accuracy of 83.29% and a weighted average F1-score of 81.41%. These results represent a significant improvement of 1.37% and 1.28% respectively when compared to the best state-of-the-art model.

Kevin Feghoul, Deise Santana Maia, Mehdi El Amrani, Mohamed Daoudi, Ali Amad
Deep Learning in the Identification of Psoriatic Skin Lesions

Psoriasis is a dermatological lesion that manifests in several regions of the body. Its late diagnosis can aggravate the disease itself, as well as the comorbidities associated with it. The proposed work presents a computational system for image classification on smartphones, based on deep convolutional neural networks, to assist in the diagnosis of psoriasis. The dataset and classification algorithms used revealed that the classification of psoriasis lesions was most accurate with unsegmented and unprocessed images, indicating that deep learning networks are able to perform good feature selection. Smaller models have lower accuracy, although they are more adequate for environments with power and memory restrictions, such as smartphones.

Gabriel Silva Lima, Carolina Pires, Arlete Teresinha Beuren, Rui Pedro Lopes
WildFruiP: Estimating Fruit Physicochemical Parameters from Images Captured in the Wild

The progress in computer vision has allowed the development of a diversity of precision agriculture systems, improving the efficiency and yield of several processes of farming. Among the different processes, crop monitoring has been extensively studied to decrease the resources consumed and increase the yield, where a myriad of computer vision strategies has been proposed for fruit analysis (e.g., fruit counting) or plant health estimation. Nevertheless, the problem of fruit ripeness estimation has received little attention, particularly when the fruits are still on the tree. As such, this paper introduces a strategy to estimate the maturation stage of fruits based on images acquired from handheld devices while the fruit is still on the tree. Our approach relies on an image segmentation strategy to crop and align fruit images, which a CNN subsequently processes to extract a compact visual descriptor of the fruit. A non-linear regression model is then used for learning a mapping between descriptors to a set of physicochemical parameters, acting as a proxy of the fruit maturation stage. The proposed method is robust to the variations in position, lighting, and complex backgrounds, being ideal for working in the wild with minimal image acquisition constraints. Source code is available at https://github.com/Diogo365/WildFruiP.

Diogo J. Paulo, Cláudia M. B. Neves, Dulcineia Ferreira Wessel, João C. Neves
Depression Detection Using Deep Learning and Natural Language Processing Techniques: A Comparative Study

Depression is a frequently underestimated illness that significantly impacts a substantial number of individuals worldwide, making it a significant mental disorder. Today's world is fully connected, with more than half of the world's population using social networks in their daily lives. If we can interpret and understand the feelings associated with a social media post, we can detect potential depression cases before they progress to a more severe state with serious consequences for the patient. This paper proposes the use of natural language processing (NLP) techniques to classify the sentiment associated with a post made on the Twitter social network as non-depressive, neutral, or depressive. The authors collected and validated the data, and performed pre-processing and feature generation using the TF-IDF and Word2Vec techniques. Various DL and ML models were evaluated on these features. The Extra Trees classifier combined with the TF-IDF technique emerged as the most successful combination for classifying potential depression sentiment in tweets, achieving an accuracy of 84.83%.
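As a hedged, self-contained sketch of the TF-IDF weighting mentioned above (not the authors' pipeline), each document becomes a vector of term frequency multiplied by inverse document frequency; the example tweets are invented:

```python
import math
from collections import Counter

def tfidf(docs):
    """Toy TF-IDF: each document (a list of tokens) becomes a dict of
    term -> tf * idf, with idf = log(N / document frequency)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each term once per document
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc)
        vectors.append({
            t: (c / total) * math.log(n / df[t]) for t, c in tf.items()
        })
    return vectors

tweets = [
    "i feel hopeless and tired".split(),
    "i feel great today".split(),
]
vecs = tfidf(tweets)
# "feel" appears in both tweets, so its idf (and hence its weight) is zero;
# "hopeless" is discriminative and receives a positive weight.
print(vecs[0]["feel"], vecs[0]["hopeless"])
```

Such vectors are then what a classifier like Extra Trees consumes to separate depressive from non-depressive posts.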

Francisco Mesquita, José Maurício, Gonçalo Marques
Impact of Synthetic Images on Morphing Attack Detection Using a Siamese Network

This paper evaluates the impact of synthetic images on Morphing Attack Detection (MAD) using a Siamese network with a semi-hard-loss function. Intra- and cross-dataset evaluations were performed to measure the generalisation capabilities gained from synthetic images. Three different pre-trained networks, MobileNetV2, MobileNetV3 and EfficientNetB0, were used as feature extractors. Our results show that a MAD system trained on EfficientNetB0 with FERET, FRGCv2, and FRLL can reach a lower error rate than the SOTA. Conversely, worse performance was obtained when the system was trained only with synthetic images. A mixed (synthetic + digital) database may help to improve MAD and reduce the error rate. This shows that we still need to continue the effort to include synthetic images in the training process.

Juan Tapia, Christoph Busch
Face Image Quality Estimation on Presentation Attack Detection

Non-referential Face Image Quality Assessment (FIQA) methods have gained popularity as a pre-filtering step in Face Recognition (FR) systems. In most of them, the quality score is designed with face comparison in mind; however, little work has been done on measuring their impact and usefulness in Presentation Attack Detection (PAD). In this paper, we study the effect of quality assessment methods on filtering bona fide and attack samples, their impact on PAD systems, and how the performance of such systems improves when training on a dataset filtered by quality. On a Vision Transformer PAD algorithm, reducing the training dataset by 20% by removing lower-quality samples allowed us to improve the Bona fide Presentation Classification Error Rate (BPCER) by 3% in a cross-dataset test.

Carlos Aravena, Diego Pasmiño, Juan Tapia, Christoph Busch
Knowledge Distillation of Vision Transformers and Convolutional Networks to Predict Inflammatory Bowel Disease

Inflammatory bowel disease is a chronic disease of unknown cause that can affect the entire gastrointestinal tract, from the mouth to the anus. For patients with this pathology, it is important that a good diagnosis is made as early as possible, so that the inflammation of the intestinal mucosa is controlled and the most severe symptoms are reduced, thus improving patients' quality of life. Through this comparative study, we seek to automate the diagnosis of these patients during the endoscopic examination, reducing the subjectivity inherent to the observation of a gastroenterologist, using six CNNs: AlexNet, ResNet50, VGG16, ResNet50-MobileNetV2 and a hybrid model. Five ViTs were also used in this study: ViT-B/32, ViT-S/32, ViT-B/16, ViT-S/16 and R26+S/32. The comparison also consists in applying knowledge distillation to build simpler models, with fewer parameters, based on the learning of architectures pre-trained on large volumes of data. We conclude that for ViTs it is possible to reduce the number of parameters by a factor of 25 while maintaining good performance and reducing the inference time by 5.32 s. For CNNs, the results show that it is possible to reduce the number of parameters by a factor of 107, consequently reducing the inference time by 3.84 s.

José Maurício, Inês Domingues
Analysis and Impact of Training Set Size in Cross-Subject Human Activity Recognition

The ubiquity of consumer devices with sensing and computational capabilities, such as smartphones and smartwatches, has increased interest in their use in human activity recognition for healthcare monitoring applications, among others. When developing such a system, researchers rely on input data to train recognition models. In the absence of openly available datasets that meet the model requirements, researchers face a hard and time-consuming process to decide which sensing device to use or how much data needs to be collected. In this paper, we explore the effect of the amount of training data on the performance (i.e., classification accuracy and activity-wise F1-scores) of a CNN model by performing an incremental cross-subject evaluation using data collected from a consumer smartphone and smartwatch. Systematically studying the incremental inclusion of subject data from a set of 22 training subjects, the results show that the model’s performance initially improves significantly with each addition, yet this improvement slows down the larger the number of included subjects. We compare the performance of models based on smartphone and smartwatch data. The latter option is significantly better with smaller sizes of training data, while the former outperforms with larger amounts of training data. In addition, gait-related activities show significantly better results with smartphone-collected data, while non-gait-related activities, such as standing up or sitting down, were better recognized with smartwatch-collected data.

Miguel Matey-Sanz, Joaquín Torres-Sospedra, Alberto González-Pérez, Sven Casteleyn, Carlos Granell
Efficient Brazilian Sign Language Recognition: A Study on Mobile Devices

Automatic Sign Language Recognition (SLR) is a critical step in facilitating communication between deaf and hearing people. An interesting application of such a technology is a real-time mobile sign language translator, since it could integrate both groups more easily. To this end, we introduce a new Brazilian sign language (LIBRAS) recognition approach, the first for a mobile environment, using an efficient 3D Convolutional Neural Network (CNN) to classify a sequence of frames extracted from a word being signed in a video. Results show that our model is approximately 24 to 81 times faster than recent works in the field, and it is tested on a mobile device to understand the trade-off between performance and accuracy. Although its accuracy is slightly lower, the model is significantly faster at inference time, and we discuss future points of improvement towards an efficient real-time sign language system that does not greatly sacrifice accuracy in LIBRAS classification.

Vitor Lopes Fabris, Felype de Castro Bastos, Ana Claudia Akemi Matsuki de Faria, José Victor Nogueira Alves da Silva, Pedro Augusto Luiz, Rafael Custódio Silva, Renata De Paris, Claudio Filipi Gonçalves dos Santos
Presumably Correct Undersampling

This paper presents a data pre-processing algorithm to tackle class imbalance in classification problems by undersampling the majority class. It relies on a formalism termed Presumably Correct Decision Sets aimed at isolating easy (presumably correct) and difficult (presumably incorrect) instances in a classification problem. The former are instances with neighbors that largely share their class label, while the latter have neighbors that mostly belong to a different decision class. The proposed algorithm replaces the presumably correct instances belonging to the majority decision class with prototypes, and it operates under the assumption that removing these instances does not change the boundaries of the decision space. Note that this strategy opposes other methods that remove pairs of instances from different classes that are each other’s closest neighbors. We argue that the training and test data should have similar distribution and complexity and that making the decision classes more separable in the training data would only increase the risks of overfitting. The experiments show that our method improves the generalization capabilities of a baseline classifier, while outperforming other undersampling algorithms reported in the literature.

Gonzalo Nápoles, Isel Grau
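The neighbor-agreement idea behind Presumably Correct Undersampling can be sketched as follows. This is our minimal illustration of the concept under stated assumptions (k-NN agreement threshold, a single centroid prototype), not the authors' exact algorithm or prototype construction.

```python
import numpy as np

def undersample_majority(X, y, majority_label, k=3, agreement=0.5):
    """Replace 'presumably correct' majority instances with a centroid prototype.

    An instance is presumably correct (easy) when more than `agreement` of its
    k nearest neighbors (excluding itself) share its class label.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)            # exclude self from neighbors
    nn = np.argsort(dists, axis=1)[:, :k]      # k nearest neighbors per instance
    same = (y[nn] == y[:, None]).mean(axis=1)  # fraction of agreeing neighbors
    easy = (y == majority_label) & (same > agreement)
    keep = ~easy
    X_new = np.vstack([X[keep], X[easy].mean(axis=0, keepdims=True)])
    y_new = np.concatenate([y[keep], [majority_label]])
    return X_new, y_new

# Toy imbalanced data: a tight majority cluster plus two minority points
X = [[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [3, 3], [3.1, 3]]
y = [0, 0, 0, 0, 1, 1]
X_new, y_new = undersample_majority(X, y, majority_label=0)
# The four easy majority points collapse into one prototype: 6 -> 3 instances
```

Note that only easy (presumably correct) majority instances are touched, which reflects the paper's assumption that removing them does not change the decision boundary.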
Leveraging Longitudinal Data for Cardiomegaly and Change Detection in Chest Radiography

Chest radiography has been widely used for automatic analysis through deep learning (DL) techniques. In the manual analysis of these scans, however, comparison with images from previous time points is commonly done in order to establish a longitudinal reference. The usage of longitudinal information in automatic analysis is not a common practice, but it might provide relevant information for the desired output. In this work, the application of longitudinal information to the detection of cardiomegaly and change in pairs of CXR images was studied. Multiple experiments were performed, where longitudinal information was included at the feature level and at the input level. The impact of aligning the image pairs (through a developed method) was also studied. The usage of aligned images was shown to improve the final metrics for both the detection of pathology and of change, in comparison to a standard multi-label classifier baseline. The model that uses concatenated image features outperformed the remaining ones, with an Area Under the Receiver Operating Characteristic Curve (AUC) of 0.858 for change detection and an AUC of 0.897 for the detection of pathology, showing that pathology features can be used to predict the comparison between images more efficiently. To further improve the developed methods, data augmentation techniques were studied. These showed that increasing the representation of minority classes leads to higher noise in the dataset, and that neglecting the temporal order of the images can be an advantageous augmentation technique in longitudinal change studies.

Raquel Belo, Joana Rocha, João Pedrosa
Self-supervised Monocular Depth Estimation on Unseen Synthetic Cameras

Monocular depth estimation is a critical task in computer vision, and self-supervised deep learning methods have achieved remarkable results in recent years. However, these models often struggle with camera generalization, i.e., with sequences captured by unseen cameras. To address this challenge, we present a new public custom dataset created using the CARLA simulator [4], consisting of three video sequences recorded by five different cameras with varying focal distances. This dataset was created due to the absence of public datasets containing identical sequences captured by different cameras. Additionally, this paper proposes the use of adversarial training to improve the models' robustness to changes in intrinsic camera parameters, enabling accurate depth estimation regardless of the recording camera. The results of our proposed architecture are compared with a baseline model, evaluating the effectiveness of adversarial training and demonstrating its potential benefits both on our synthetic dataset and on the KITTI benchmark [8], the reference dataset for evaluating depth estimation.

Cecilia Diana-Albelda, Juan Ignacio Bravo Pérez-Villar, Javier Montalvo, Álvaro García-Martín, Jesús Bescós Cano
Novelty Detection in Human-Machine Interaction Through a Multimodal Approach

As interest in robots continues to grow across various domains, including healthcare, construction and education, it becomes crucial to prioritize improving user experience and fostering seamless interaction. These human-machine interactions (HMI) are often impersonal. Our proposal, built upon previous work in the field, aims to use biometric data to detect whether a person has been encountered before. Since many models depend on a threshold, an optimization method using a genetic algorithm is proposed. Novelty detection is performed through a multimodal approach using both voice and facial images from the individuals, although unimodal approaches based on each single cue were also tested. To assess the effectiveness of the proposed system, we conducted comprehensive experiments on three diverse datasets, namely VoxCeleb, Mobio and AveRobot, each possessing distinct characteristics and complexities. By examining the impact of data quality on model performance, we gained valuable insights into the effectiveness of the proposed solution. Our approach outperformed several conventional novelty detection methods, yielding superior and therefore promising results.

José Salas-Cáceres, Javier Lorenzo-Navarro, David Freire-Obregón, Modesto Castrillón-Santana
Filtering Safe Temporal Motifs in Dynamic Graphs for Dissemination Purposes

In this paper, we address the challenges posed by dynamic networks in various domains, such as bioinformatics, social network analysis, and computer vision, where relationships between entities are represented by temporal graphs that respect a temporal order. To understand the structure and functionality of such systems, we focus on small subgraph patterns, called motifs, which play a crucial role in understanding dissemination processes in dynamic networks, such as the spread of fake news, infectious diseases or computer viruses. We propose a novel approach called temporal motif filtering for classifying dissemination processes in labeled temporal graphs. Our approach identifies and examines key temporal subgraph patterns, contributing significantly to our understanding of dynamic networks. To further enhance classification performance, we combine directed line transformations with temporal motif removal, and additionally integrate motif filtering, directed edge transformations, and transitive edge reduction. Experimental results demonstrate that our proposed approaches consistently improve classification accuracy across various datasets and tasks. These findings hold the potential to unlock deeper insights into diverse domains and enable the development of more accurate and efficient strategies to address challenges related to spreading processes in dynamic environments. Our work contributes significantly to the field of temporal graph analysis and classification, opening up new avenues for advancing our understanding and utilization of dynamic networks.

Carolina Jerônimo, Simon Malinowski, Zenilton K. G. Patrocínio Jr., Guillaume Gravier, Silvio Jamil F. Guimarães
Graph-Based Feature Learning from Image Markers

Deep learning methods have achieved impressive results for object detection, but they usually require powerful GPUs and large annotated datasets, and there is a lack of explainable networks in the literature. In contrast, Feature Learning from Image Markers (FLIM) is a feature extraction strategy for lightweight CNNs without backpropagation that requires only a few training images. In this work, we extend FLIM to general image graph modeling, allowing non-strict kernel shapes and taking advantage of the adjacency relation between nodes to extract feature vectors based on neighbors' features. To produce saliency maps by combining learned features, we propose a User-Guided Decoder (UGD) that does not require training and is suitable for any FLIM-based strategy. Our results indicate that the proposed graph-based FLIM, named GFLIM, not only outperforms FLIM but also produces detections competitive with deep models, even with an architecture thousands of times smaller in number of parameters. Our code is publicly available at https://github.com/IMScience-PPGINF-PucMinas/GFLIM .

Isabela Borlido Barcelos, Leonardo de Melo João, Zenilton K. G. Patrocínio Jr., Ewa Kijak, Alexandre X. Falcão, Silvio J. F. Guimarães
Seabream Freshness Classification Using Vision Transformers

Many different cultures and countries have fish as a central piece of their diet, particularly coastal countries such as Portugal, with the fishery and aquaculture sectors playing an increasingly important role in the provision of food and nutrition. As a consequence, fish-freshness evaluation is very important, although so far it has relied on human judgement, which may not always be the most reliable. This paper proposes an automated non-invasive system for fish-freshness classification, which takes fish images as input, as well as a seabream fish image dataset. The dataset will be made publicly available for academic and scientific purposes with the publication of this paper. It includes metadata, such as manually generated segmentation masks corresponding to the fish eye and body regions, as well as the time since capture. For fish-freshness classification, four freshness levels are considered: very-fresh, fresh, not-fresh and spoiled. The proposed system starts with an image segmentation stage, with the goal of automatically segmenting the fish eye region, followed by freshness classification based on the eye characteristics. The system employs transformers, for the first time in fish-freshness classification, both in the segmentation process, with the SegFormer, and in feature extraction and freshness classification, using the Vision Transformer (ViT). Encouraging results have been obtained, with the automatic fish eye region segmentation reaching a detection rate of 98.77%, an accuracy of 96.28% and an Intersection over Union (IoU) of 85.7%. The adopted ViT classification model, using a 5-fold cross-validation strategy, achieved a final classification accuracy of 80.8% and an F1 score of 81.0%, despite the relatively small dataset available for training purposes.

João Pedro Rodrigues, Osvaldo Rocha Pacheco, Paulo Lobato Correia
Explaining Semantic Text Similarity in Knowledge Graphs

In this paper we explore the application of text similarity to building text-rich knowledge graphs, where nodes describe concepts that relate semantically to each other. Semantic text similarity is a basic task in natural language processing (NLP) that aims at measuring the semantic relatedness of two texts. Transformer-based encoders like BERT, combined with techniques like contrastive learning, are currently the state-of-the-art methods in the literature. However, these methods act as black boxes where the similarity score between two texts cannot be directly explained from their components (e.g., words or sentences). In this work, we propose a method for explaining the similarity of texts that are semantically connected to each other in a knowledge graph. To demonstrate its usefulness, we use the 2030 Agenda, which consists of a graph of Sustainable Development Goals (SDGs), their subgoals and the indicators proposed for their achievement. Experiments carried out on this dataset show that the proposed explanations not only account for the computed similarity score but also allow us to improve the accuracy of the predicted links between concepts.

Rafael Berlanga, Mario Soriano
Active Supervision: Human in the Loop

After the learning process, certain types of images may not be modeled correctly because they were not well represented in the training set. These failures can be compensated for by collecting more images from the real world and incorporating them into the learning process, an expensive process known as "active learning". The proposed twist, called active supervision, uses the model itself to alter existing images in the direction where the boundary is less defined and requests feedback from the user on how the new image should be labeled. Experiments in the context of class imbalance show that the technique is able to increase model performance on rare classes. Active human supervision provides crucial information during training that the training set lacks.

Ricardo P. M. Cruz, A. S. M. Shihavuddin, Md. Hasan Maruf, Jaime S. Cardoso
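The central step, perturbing an existing sample toward the region where the boundary is least defined before asking the user for a label, can be illustrated with a linear toy model. This sketch is our simplified stand-in for the paper's learned model, with hypothetical weights.

```python
import numpy as np

def nudge_toward_boundary(x, w, b, step=0.25):
    """Move a sample toward the decision boundary of a linear classifier.

    The boundary is {x : w.x + b = 0}; we step against the signed margin so
    the perturbed sample lands where the model is least certain, i.e. where a
    human-provided label is most informative.
    """
    w = np.asarray(w, dtype=float)
    margin = float(np.dot(w, x) + b)
    return np.asarray(x, dtype=float) - step * np.sign(margin) * w / np.linalg.norm(w)

# Toy example with boundary x0 + x1 = 1 (hypothetical weights)
w, b = np.array([1.0, 1.0]), -1.0
x = np.array([2.0, 2.0])                 # confidently classified sample
for _ in range(8):
    x = nudge_toward_boundary(x, w, b)   # each step shrinks |w.x + b|
# x now sits near the boundary; in active supervision the resulting image
# would be shown to the user for labeling
```

With a deep model, the same idea would use the gradient of the classifier's margin with respect to the input instead of the fixed direction w.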
Condition Invariance for Autonomous Driving by Adversarial Learning

Object detection is a crucial task in autonomous driving, where domain shift between the training and test sets is one of the main reasons behind the poor performance of a deployed detector. Erroneous priors may be learned from the training set; therefore, a model must be invariant to conditions that might promote such priors. To tackle this problem, we propose an adversarial learning framework consisting of an encoder, an object detector, and a condition classifier. The encoder is trained to deceive the condition classifier and aid the object detector as much as possible throughout the learning stage, in order to obtain highly discriminative features. Experiments showed that this framework is not very competitive regarding the trade-off between precision and recall, but it does improve the model's ability to detect smaller objects and some object classes.

Diana Teixeira e Silva, Ricardo P. M. Cruz
YOLOMM – You Only Look Once for Multi-modal Multi-tasking

Autonomous driving can reduce the number of road accidents due to human error and result in safer roads. One important part of the system is the perception unit, which provides information about the environment surrounding the car. Currently, most manufacturers use not only RGB cameras, passive sensors that capture light already present in the environment, but also Lidar, a sensor that actively emits laser pulses towards a surface or object and measures reflection and time-of-flight. Previous work, YOLOP, already proposed a model for object detection and semantic segmentation, but using RGB only. This work extends it to Lidar and evaluates performance on KITTI, a public autonomous driving dataset. The implementation shows improved precision across objects of all sizes and is made entirely available at https://github.com/filipepcampos/yolomm .

Filipe Campos, Francisco Gonçalves Cerqueira, Ricardo P. M. Cruz, Jaime S. Cardoso
Classify NIR Iris Images Under Alcohol/Drugs/Sleepiness Conditions Using a Siamese Network

This paper proposes a biometric application for iris capture devices using a Siamese network based on EfficientNetV2 and a triplet loss function to classify iris NIR images captured under alcohol, drug and sleepiness conditions. The results show that our model can robustly detect the "Fit/Unfit" alertness condition from iris samples captured after alcohol and drug consumption and under sleepiness, with accuracies of 87.3% and 97.0% for Fit and Unfit, respectively. The sleepiness condition is the most challenging, with an accuracy of 72.4%. The Siamese model uses fewer parameters than standard deep learning networks. This work complements and improves the literature on biometric applications towards an automatic system that classifies "Fitness for Duty" using iris images and prevents accidents due to alcohol or drug consumption and sleepiness.

Juan Tapia, Christoph Busch
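The triplet objective used to train such a Siamese network can be written down compactly. The sketch below is the generic triplet loss on toy embeddings, not the authors' EfficientNetV2 pipeline; the margin value and vectors are illustrative assumptions.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """L = max(0, ||a - p||^2 - ||a - n||^2 + margin).

    The embedding of a "Fit" iris (anchor) is pulled toward another "Fit"
    sample (positive) and pushed away from an "Unfit" sample (negative).
    """
    d_ap = np.sum((anchor - positive) ** 2)   # squared anchor-positive distance
    d_an = np.sum((anchor - negative) ** 2)   # squared anchor-negative distance
    return max(0.0, d_ap - d_an + margin)

# Toy embeddings (hypothetical 3-D vectors; real embeddings are much larger)
a = np.array([1.0, 0.0, 0.0])
p = np.array([0.9, 0.1, 0.0])   # same condition as the anchor
n = np.array([0.0, 1.0, 0.0])   # different condition
loss = triplet_loss(a, p, n)    # zero: this triplet is already well separated
```

During training, only triplets that violate the margin (loss > 0) produce gradients, which is why mining semi-hard triplets matters in practice.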
Bipartite Graph Coarsening for Text Classification Using Graph Neural Networks

Text classification is a fundamental task in Text Mining (TM) with applications ranging from spam detection to sentiment analysis. One of the current approaches to this task is Graph Neural Network (GNN), primarily used to deal with complex and unstructured data. However, the scalability of GNNs is a significant challenge when dealing with large-scale graphs. Multilevel optimization is prominent among the methods proposed to tackle the issues that arise in such a scenario. This approach uses a hierarchical coarsening technique to reduce a graph, then applies a target algorithm to the coarsest graph and projects the output back to the original graph. Here, we propose a novel approach for text classification using GNN. We build a bipartite graph from the input corpus and then apply the coarsening technique of the multilevel optimization to generate ten contracted graphs to analyze the GNN’s performance, training time, and memory consumption as the graph is gradually reduced. Although we conducted experiments on text classification, we emphasize that the proposed method is not bound to a specific task and, thus, can be generalized to different problems modeled as bipartite graphs. Experiments on datasets from various domains and sizes show that our approach reduces memory consumption and training time without significantly losing performance.

Nícolas Roque dos Santos, Diego Minatel, Alan Demétrius Baria Valejo, Alneu de A. Lopes
Towards Robust Defect Detection in Casting Using Contrastive Learning

Defect detection plays a vital role in ensuring product quality and safety within industrial casting processes. In these dynamic environments, the occasional emergence of new defects in the production line poses a significant challenge for supervised methods. We present a defect detection framework that effectively detects novel defect patterns without prior exposure during training. Our method is based on contrastive learning applied to the Faster R-CNN model, enhanced with a contrastive head to obtain discriminative representations of different defects. By training on a diverse and comprehensive labeled dataset, our method achieves performance comparable to the supervised baseline model, showcasing commendable defect detection capabilities. To evaluate the robustness of our approach, we authentically replicate a real-world use case by deliberately excluding several defect types from the training data. Remarkably, in this new context, our proposed method significantly improves the detection performance of the baseline model, particularly in situations with very limited training data, achieving a remarkable 34.7% enhancement. Our research highlights the potential of the proposed method in real-world environments where the number of available images may be limited or nonexistent. By providing valuable insights into defect detection in challenging scenarios, our framework can contribute to ensuring efficient and reliable product quality and safety in industrial manufacturing processes.

Eneko Intxausti, Ekhi Zugasti, Carlos Cernuda, Ane Miren Leibar, Estibaliz Elizondo
Development and Testing of an MRI-Compatible Immobilization Device for Head and Neck Imaging

MRI imaging with long acquisition times is prone to motion artifacts that can compromise image quality and lead to misinterpretation. Aiming to address this challenge at the sub-millimeter level, we developed and evaluated a maxilla immobilization approach, which is known to perform better than other non-invasive techniques, using a personalized mouthpiece connected to an external MRI-compatible frame. The effectiveness of the device was evaluated by analyzing MRI imagery obtained under different immobilization conditions on a human volunteer. The SURF and Block Matching algorithms were assessed, supplemented by custom software. Compared with simple cushioning, the immobilizer reduced the amplitudes of involuntary slow-drift movements of the head by more than a factor of two in the axial plane, with final values of 0.25 mm and 0.060°. Faster involuntary motions, including those caused by breathing (which were identifiable), were also suppressed, with final standard deviation values below 0.045 mm and 0.025°. A strong restriction of intentional movements, both translational and angular, was also observed, by factors from 7.8 to 4.6, with final values of 0.5 mm and 0.2° for moderate forcing.

Francisco Zagalo, Susete Fetal, Paulo Fonte, Antero Abrunhosa, Sónia Afonso, Luís Lopes, Miguel Castelo-Branco
DIF-SR: A Differential Item Functioning-Based Sample Reweighting Method

In recent years, numerous machine learning-based systems have actively propagated discriminatory effects and harmed historically disadvantaged groups through their decision-making. This undesired behavior highlights the importance of research topics such as fairness in machine learning, whose primary goal is to include fairness notions into the training process to build fairer models. In parallel, Differential Item Functioning (DIF) is a mathematical tool often used to identify bias in test preparation for candidate selection; DIF detection assists in identifying test items that disproportionately favor or disadvantage candidates solely because they belong to a specific sociodemographic group. This paper argues that transposing DIF concepts into the machine learning domain can lead to promising approaches for developing fairer solutions. As such, we propose DIF-SR, the first DIF-based Sample Reweighting method for weighting samples so that the assigned values help build fairer classifiers. DIF-SR can be seen as a data preprocessor that imposes more importance on the most auspicious examples in achieving equity ideals. We experimentally evaluated our proposal against two baseline strategies by employing twelve datasets, five classification algorithms, four performance measures, one multicriteria measure, and one statistical significance test. Results indicate that the sample weight computed by DIF-SR can guide supervised machine learning methods to fit fairer models, simultaneously improving group fairness notions such as demographic parity, equal opportunity, and equalized odds.

Diego Minatel, Antonio R. S. Parmezan, Mariana Cúri, Alneu de A. Lopes
IR-Guided Energy Optimization Framework for Depth Enhancement in Time of Flight Imaging

This paper introduces an energy optimization framework based on infrared guidance to improve depth consistency in Time-of-Flight imaging systems. The primary objective is to formulate the problem as an image energy optimization task aimed at maximizing the coherence between the depth map and the corresponding infrared image, both captured simultaneously by the same Time-of-Flight sensor. The concept of depth consistency relies on the underlying hypothesis of a correlation between depth maps and their corresponding infrared images. The proposed optimization framework adopts a weighted approach, leveraging an iterative estimator. The image energy is characterized by introducing spatial conditional entropy as a correlation measure and spatial error as image regularization. To address the issue of missing depth values, a preprocessing step is first applied, using a depth completion method based on infrared-guided belief propagation proposed in a previous work. Subsequently, the proposed framework is employed to regularize and enhance the inpainted depth. The experimental results demonstrate a range of qualitative improvements in depth map reconstruction, with particular emphasis on the sharpness and continuity of edges.

Amina Achaibou, Filiberto Pla, Javier Calpe
Multi-conformation Approach of ENM-NMA Dynamic-Based Descriptors for HIV Drug Resistance Prediction

Drug resistance is a key factor in the failure of drug therapy, such as antiretroviral therapy against the human immunodeficiency virus (HIV). Due to the high costs of direct phenotypic assays, genotypic assays, based on sequencing the viral genome or part of it, are commonly used to infer drug resistance via in silico predictions. In these approaches, interpreting the sequence information constitutes the biggest challenge. The large amount of data linking genotype and phenotype information provides a framework for predicting drug resistance from genotype with machine learning methods. Sequence-based information is primarily used, but it largely fails to predict resistance in previously unobserved variants. The inclusion of structural and dynamic information is expected to improve the predictions but has been limited by its computational cost. This study shows the feasibility of dynamic descriptors derived from normal mode analysis of elastic network models of the HIV type 1 (HIV-1) protease for predicting drug resistance. We show that exploring the pre-configuration of dynamic information covering the intrinsic movement spectrum of the HIV-1 protease through multi-conformation descriptors improves the classification task.

Jorge A. Jimenez-Gari, Mario Pupo-Meriño, Héctor R. Gonzalez, Francesc J. Ferri
Replay-Based Online Adaptation for Unsupervised Deep Visual Odometry

Online adaptation is a promising paradigm that enables dynamic adaptation to new environments. In recent years, there has been a growing interest in exploring online adaptation for various problems, including visual odometry, a crucial task in robotics, autonomous systems, and driver assistance applications. In this work, we leverage experience replay, a potent technique for enhancing online adaptation, to explore the replay-based online adaptation for unsupervised deep visual odometry. Our experiments reveal a remarkable performance boost compared to the non-adapted model. Furthermore, we conduct a comparative analysis against established methods, demonstrating competitive results that showcase the potential of online adaptation in advancing visual odometry.

Yevhen Kuznietsov, Marc Proesmans, Luc Van Gool
Facial Point Graphs for Stroke Identification

Stroke can cause significant damage to neurons, resulting in various sequelae that negatively impact the patient’s ability to perform essential daily activities such as chewing, swallowing, and verbal communication. Therefore, it is important for patients with such difficulties to undergo a treatment process and be monitored during its execution to assess the improvement of their health condition. The use of computerized tools and algorithms that can quickly and affordably detect such sequelae proves helpful in aiding the patient’s recovery. Due to the death of internal brain cells, a stroke often leads to facial paralysis, resulting in certain asymmetry between the two sides of the face. This paper focuses on analyzing this asymmetry using a deep learning method without relying on handcrafted calculations, introducing the Facial Point Graphs (FPG) model, a novel approach that excels in learning geometric information and effectively handling variations beyond the scope of manual calculations. FPG allows the model to effectively detect orofacial impairment caused by a stroke using video data. The experimental findings on the Toronto Neuroface dataset revealed the proposed approach surpassed state-of-the-art results, promising substantial advancements in this domain.

Nicolas Barbosa Gomes, Arissa Yoshida, Guilherme Camargo de Oliveira, Mateus Roder, João Paulo Papa
Fast, Memory-Efficient Spectral Clustering with Cosine Similarity

Spectral clustering is a popular and effective method but is known to face two significant challenges: scalability and out-of-sample extension. In this paper, we extend the work of Chen (ICPR 2018) on the speed and scalability of spectral clustering in the setting of cosine similarity to deal with massive or online data that are too large to be fully loaded into computer memory. We start by assuming a small batch of data drawn from the full set and develop an efficient procedure that learns both the nonlinear embedding and the clustering map from the sample and extends them easily to the rest of the data as they are gradually loaded. We then introduce an automatic approach to selecting the optimal sample size. The combination of the two steps leads to a streamlined memory-efficient algorithm that only uses a small number of batches of data (as they become available), with memory and computational costs that are independent of the size of the data. Experiments are conducted on benchmark data sets to demonstrate the fast speed and excellent accuracy of the proposed algorithm. We conclude the paper by pointing out several future research directions.
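The key observation behind the cosine-similarity setting is that, for row-normalized data X, the similarity matrix is X Xᵀ, so the spectral embedding can be obtained from an SVD of the (much smaller) scaled data matrix without ever forming the n × n similarity matrix. A rough sketch of that idea (not the authors' streamlined online algorithm; for simplicity the degree computation here keeps the self-similarity term):

```python
import numpy as np

def cosine_spectral_embed(X, k):
    """Spectral embedding under cosine similarity without forming the n x n matrix."""
    # Unit-normalize rows, so X @ X.T is the cosine-similarity matrix.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Degrees d = (X X^T) 1, computed as X (X^T 1): O(nm) instead of O(n^2).
    d = X @ (X.T @ np.ones(X.shape[0]))
    # Top-k left singular vectors of D^{-1/2} X give the spectral embedding.
    Xs = X / np.sqrt(d)[:, None]
    U, _, _ = np.linalg.svd(Xs, full_matrices=False)
    return U[:, :k]

# Usage: two well-separated toy clusters in 5 dimensions.
rng = np.random.default_rng(0)
A = rng.normal(0, 0.05, (20, 5)) + np.array([1.0, 0, 0, 0, 0])
B = rng.normal(0, 0.05, (20, 5)) + np.array([0, 1.0, 0, 0, 0])
E = cosine_spectral_embed(np.vstack([A, B]), k=2)
print(E.shape)  # (40, 2)
```

Running k-means on the rows of the returned embedding completes the clustering; the cost is dominated by one SVD of an n × m matrix rather than an eigendecomposition of an n × n one.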

Ran Li, Guangliang Chen
An End-to-End Deep Learning Approach for Video Captioning Through Mobile Devices

Video captioning is a computer vision task that aims at generating a description for video content. This can be achieved using deep learning approaches that leverage image and audio data. In this work, we have developed two strategies to tackle this task in the context of resource-constrained devices: (i) generating one caption per frame combined with audio classification, and (ii) generating one caption for a set of frames combined with audio classification. In these strategies, we have utilized one architecture for the image data and another for the audio data. We have developed an application tailored for resource-constrained devices, where the image sensor captures images at a specific frame rate. The audio data is captured from a microphone for a predefined duration at a time. Our application combines the results from both modalities to create a comprehensive description. The main contribution of this work is the introduction of a new end-to-end application that can utilize the developed strategies and be beneficial for environment monitoring. Our method has been implemented on a low-resource computer, which poses a significant challenge.

Rafael J. Pezzuto Damaceno, Roberto M. Cesar Jr.
Stingless Bee Classification: A New Dataset and Baseline Results

Bees play an important role as pollinating agents, contributing to the reproduction of many plant species around the world. Brazil is home to different species of stingless bees, with around 200 registered species out of the more than 500 species classified worldwide. Each species constructs the entrance to its colony in a unique way that is nonetheless similar across colonies of the same species. In this work, we propose a new dataset, created in collaboration with stingless beekeepers from Brazil, for the exploration of stingless bee species classification. The dataset consists of 158 samples distributed unequally among 13 species: Boca de Sapo, Borá, Bugia, Iraí, Japurá, Jataí, Lambe Olhos, Mandaguari, Mirim Droryana, Mirim Preguiça, Moça Branca, Mandaçaia, and Tubuna. The results presented in this work were obtained using deep learning models (i.e. CNN architectures) such as VGG and DenseNet, which are commonly used for image classification tasks in different application domains. Pre-trained models from ImageNet were used along with transfer learning techniques, and due to the small size of the dataset, data augmentation techniques were applied, resulting in an expanded dataset of 1,106 samples. The experimental results demonstrated that the DenseNet model achieved the best results, reaching an accuracy of 95%. The dataset will also be made available as a contribution of this work. As far as we know, this is the first work to address stingless bee species identification based on the colony entrance.
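The dataset expansion described above relies on data augmentation, where each source image yields several transformed variants. A minimal NumPy sketch of simple geometric augmentations (the actual work uses pre-trained CNNs and its own augmentation pipeline; the function names and the exact set of transforms here are assumptions):

```python
import numpy as np

def augment(image):
    """Return simple geometric variants of one image (H x W x C array)."""
    return [
        image,                 # original
        np.fliplr(image),      # horizontal flip
        np.flipud(image),      # vertical flip
        np.rot90(image, k=1),  # 90-degree rotation
    ]

def expand_dataset(images):
    # Each source image yields four samples, enlarging a small dataset.
    out = []
    for img in images:
        out.extend(augment(img))
    return out

imgs = [np.zeros((8, 8, 3)) for _ in range(10)]
print(len(expand_dataset(imgs)))  # 40
```

Geometric transforms like these are label-preserving for colony-entrance photos, which is what makes augmentation a cheap way to grow a 158-sample dataset.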

Matheus H. C. Leme, Vinicius S. Simm, Douglas Rorie Tanno, Yandre M. G. Costa, Marcos Aurélio Domingues
Backmatter
Metadata
Title
Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications
Editors
Verónica Vasconcelos
Inês Domingues
Simão Paredes
Copyright Year
2024
Electronic ISBN
978-3-031-49018-7
Print ISBN
978-3-031-49017-0
DOI
https://doi.org/10.1007/978-3-031-49018-7
