
2022 | Book

Image Analysis and Processing – ICIAP 2022

21st International Conference, Lecce, Italy, May 23–27, 2022, Proceedings, Part III

Editors: Prof. Stan Sclaroff, Cosimo Distante, Marco Leo, Dr. Giovanni M. Farinella, Prof. Dr. Federico Tombari

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

The proceedings set LNCS 13231, 13232, and 13233 constitutes the refereed proceedings of the 21st International Conference on Image Analysis and Processing, ICIAP 2022, which was held during May 23-27, 2022, in Lecce, Italy.

The 168 papers included in the proceedings were carefully reviewed and selected from 307 submissions. They deal with video analysis and understanding; pattern recognition and machine learning; deep learning; multi-view geometry and 3D computer vision; image analysis, detection and recognition; multimedia; biomedical and assistive technology; digital forensics and biometrics; image processing for cultural heritage; robot vision; etc.

Table of Contents

Frontmatter

Pattern Recognition and Machine Learning

Frontmatter
Hangul Fonts Dataset: A Hierarchical and Compositional Dataset for Investigating Learned Representations

Hierarchy and compositionality are common latent properties in many natural and scientific image datasets. Determining when a deep network’s hidden activations represent hierarchy and compositionality is important both for understanding deep representation learning and for applying deep networks in domains where interpretability is crucial. However, current benchmark machine learning datasets either have little hierarchical or compositional structure, or the structure is not known. This gap impedes precise analysis of a network’s representations and thus hinders development of new methods that can learn such properties. To address this gap, we developed a new benchmark dataset with known hierarchical and compositional structure. The Hangul Fonts Dataset (HFD) comprises 35 fonts from the Korean writing system (Hangul), each with 11,172 blocks (syllables) composed from the product of initial, medial, and final glyphs. All blocks can be grouped into a few geometric types, which induces a hierarchy across blocks. In addition, each block is composed of individual glyphs with rotations, translations, scalings, and naturalistic style variation across fonts. We find that both shallow and deep unsupervised methods show only modest evidence of hierarchy and compositionality in their representations of the HFD compared to supervised deep networks. Thus, HFD enables the identification of shortcomings in existing methods, a critical first step toward developing new machine learning algorithms to extract hierarchical and compositional structure in the context of naturalistic variability.
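As a rough illustration of the compositional structure described above (an illustrative sketch, not material from the paper), the 11,172 blocks can be enumerated from initial, medial and final glyph indices using the standard Unicode composition rule:

```python
# Illustrative sketch: enumerate all precomposed Hangul blocks from
# (initial, medial, final) glyph indices via the Unicode composition formula.
N_INITIAL, N_MEDIAL, N_FINAL = 19, 21, 28  # the 28 finals include "no final"

def compose_block(initial: int, medial: int, final: int) -> str:
    """Map (initial, medial, final) indices to the precomposed Hangul syllable."""
    assert 0 <= initial < N_INITIAL and 0 <= medial < N_MEDIAL and 0 <= final < N_FINAL
    return chr(0xAC00 + (initial * N_MEDIAL + medial) * N_FINAL + final)

all_blocks = [compose_block(i, m, f)
              for i in range(N_INITIAL)
              for m in range(N_MEDIAL)
              for f in range(N_FINAL)]
print(len(all_blocks))  # 11172 = 19 * 21 * 28
```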

Jesse A. Livezey, Ahyeon Hwang, Jacob Yeung, Kristofer E. Bouchard
Out-of-Distribution Detection Using Outlier Detection Methods

Out-of-distribution detection (OOD) deals with anomalous input to neural networks. In the past, specialized methods have been proposed to identify anomalous input. Similarly, it was shown that feature extraction models in combination with outlier detection algorithms are well suited to detect anomalous input. We use outlier detection algorithms to detect anomalous input as reliably as specialized methods from the field of OOD. No neural network adaptation is required; detection is based on the model’s softmax score. Our approach works unsupervised using an Isolation Forest and can be further improved by using a supervised learning method such as Gradient Boosting.
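A minimal sketch of this idea (with assumed details, not the authors' exact pipeline) fits an Isolation Forest on the softmax outputs of in-distribution data and scores test inputs by how anomalous their softmax vectors look:

```python
# Sketch: unsupervised OOD detection from softmax outputs with an Isolation Forest.
import numpy as np
from sklearn.ensemble import IsolationForest

def fit_ood_detector(softmax_train: np.ndarray) -> IsolationForest:
    """softmax_train: (n_samples, n_classes) softmax outputs on in-distribution data."""
    det = IsolationForest(n_estimators=200, random_state=0)
    det.fit(softmax_train)
    return det

def ood_scores(det: IsolationForest, softmax_test: np.ndarray) -> np.ndarray:
    # Higher score_samples = more "normal"; negate so larger values mean "more OOD".
    return -det.score_samples(softmax_test)
```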

Jan Diers, Christian Pigorsch
Relaxation Labeling Meets GANs: Solving Jigsaw Puzzles with Missing Borders

This paper proposes JiGAN, a GAN-based method for solving Jigsaw puzzles with eroded or missing borders. Missing borders are a common real-world situation, for example, when dealing with the reconstruction of broken artifacts or ruined frescoes. In this particular condition, the puzzle’s pieces do not align perfectly due to the borders’ gaps; in this situation, the patches’ direct match is unfeasible due to the lack of color and line continuations. JiGAN is a two-step procedure that tackles this issue: first, we repair the eroded borders with a GAN-based image extension model and measure the alignment affinity between pieces; then, we solve the puzzle with the relaxation labeling algorithm to enforce consistency in piece positioning, hence reconstructing the puzzle. We test the method on a large dataset of small puzzles and on three commonly used benchmark datasets to demonstrate the feasibility of the proposed approach.

Marina Khoroshiltseva, Arianna Traviglia, Marcello Pelillo, Sebastiano Vascon
Computationally Efficient Rehearsal for Online Continual Learning

Continual learning is a crucial ability for learning systems that have to adapt to changing data distributions, without reducing their performance in what they have already learned. Rehearsal methods offer a simple countermeasure to help avoid this catastrophic forgetting, which frequently occurs in dynamic situations and is a major limitation of machine learning models. These methods continuously train neural networks using a mix of data both from the stream and from a rehearsal buffer, which maintains past training samples. Although the rehearsal approach is reasonable and simple to implement, its effectiveness and efficiency are significantly affected by several hyperparameters such as the number of training iterations performed at each step, the choice of learning rate, and the choice of whether to retrain the agent at each step. These options are especially important in resource-constrained environments commonly found in online continual learning for image analysis. This work evaluates several rehearsal training strategies for continual online learning and proposes the combined use of a drift detector that decides on (a) when to train using data from the buffer and the online stream, and (b) how to train, based on a combination of heuristics. Experiments on the MNIST and CIFAR-10 image classification datasets demonstrate the effectiveness of the proposed approach over baseline training strategies at a fraction of the computational cost.
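The rehearsal mechanism itself is simple; the sketch below (an illustration under assumptions, not the paper's exact strategy) keeps a reservoir buffer of past samples and trains each step on a mix of the incoming stream batch and rehearsed samples:

```python
# Sketch: rehearsal step for online continual learning with a reservoir buffer.
import random
import torch

class ReservoirBuffer:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = []
        self.seen = 0

    def add(self, sample):
        self.seen += 1
        if len(self.data) < self.capacity:
            self.data.append(sample)
        else:
            j = random.randint(0, self.seen - 1)
            if j < self.capacity:
                self.data[j] = sample  # reservoir sampling keeps a uniform subset of the stream

    def sample(self, k: int):
        return random.sample(self.data, min(k, len(self.data)))

def online_step(model, optimizer, loss_fn, stream_batch, buffer, rehearsal_size=32):
    """One training step mixing the current stream batch with rehearsal samples.
    stream_batch is a list of (x, y) pairs with x a tensor and y an int label."""
    batch = list(stream_batch) + buffer.sample(rehearsal_size)
    xs, ys = zip(*batch)
    optimizer.zero_grad()
    loss = loss_fn(model(torch.stack(xs)), torch.tensor(ys))
    loss.backward()
    optimizer.step()
    for s in stream_batch:
        buffer.add(s)
    return loss.item()
```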

Charalampos Davalas, Dimitrios Michail, Christos Diou, Iraklis Varlamis, Konstantinos Tserpes
Recurrent Vision Transformer for Solving Visual Reasoning Problems

Although convolutional neural networks (CNNs) have shown remarkable results in many vision tasks, they still struggle with simple yet challenging visual reasoning problems. Inspired by the recent success of the Transformer network in computer vision, in this paper we introduce the Recurrent Vision Transformer (RViT) model. Thanks to the impact of recurrent connections and spatial attention in reasoning tasks, this network achieves competitive results on the same-different visual reasoning problems from the SVRT dataset. The weight-sharing both in spatial and depth dimensions regularizes the model, allowing it to learn using far fewer free parameters, using only 28k training samples. A comprehensive ablation study confirms the importance of a hybrid CNN + Transformer architecture and the role of the feedback connections, which iteratively refine the internal representation until a stable prediction is obtained. Finally, this study can lay the basis for a deeper understanding of the role of attention and recurrent connections in solving visual abstract reasoning tasks. The code for reproducing our results is publicly available here: https://tinyurl.com/recvit .

Nicola Messina, Giuseppe Amato, Fabio Carrara, Claudio Gennaro, Fabrizio Falchi
Metric Learning-Based Unsupervised Domain Adaptation for 3D Skeleton Hand Activities Categorization

First-person hand activity recognition plays a significant role in the computer vision field, with various applications. Thanks to recent advances in depth sensors, several 3D skeleton-based hand activity recognition methods using supervised Deep Learning (DL) have been proposed and proven effective when a large amount of labeled data is available. However, the annotation of such data remains difficult and costly, which motivates the use of unsupervised methods. We propose in this paper a new approach based on unsupervised domain adaptation (UDA) for 3D skeleton hand activity clustering. It aims at exploiting the knowledge derived from labeled samples of the source domain to categorize the unlabeled ones of the target domain. To this end, we introduce a novel metric learning-based loss function to learn a highly discriminative representation while preserving good activity recognition accuracy on the source domain. The learned representation is used as a low-level manifold to cluster unlabeled samples. In addition, to ensure the best clustering results, we propose a statistical and consensus-clustering-based strategy. The proposed approach is evaluated on the real-world FPHA dataset.

Yasser Boutaleb, Catherine Soladié, Nam-duong Duong, Amine Kacete, Jérôme Royan, Renaud Seguier
Using Random Forest Distances for Outlier Detection

In recent years, a great variety of outlier detectors have been proposed in the literature, many of which are based on pairwise distances or derived concepts. However, in such methods, most of the effort has been devoted to the outlier detection mechanisms, not paying attention to the distance measure – in most cases the basic Euclidean distance is used. Instead, in the clustering field, data-dependent measures have been shown to be very useful, especially those based on Random Forests: actually, Random Forests are partitioners of the space able to naturally encode the relation between two objects. In the outlier detection field, these informative distances have received scarce attention. This manuscript is aimed at filling this gap, studying the suitability of these measures in the identification of outliers. In our scheme, we build an unsupervised Random Forest model, from which we extract pairwise distances; these distances are then input to an outlier detector. In particular, we study the impact of several Random Forest-based distances, including advanced and recent ones, on different outlier detectors. We thoroughly evaluate our methodology on nine benchmark datasets for outlier detection, focusing on different aspects of the pipeline, such as the parametrization of the forest, the type of distance-based outlier detector, and most importantly, the impact of the adopted distance.
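The pipeline can be sketched as follows (a simplified illustration using generic components, not the specific distances studied in the paper): an unsupervised forest induces a pairwise distance, which is then passed to a distance-based outlier detector:

```python
# Sketch: Random Forest-induced distances fed to a distance-based outlier detector.
import numpy as np
from sklearn.ensemble import RandomTreesEmbedding
from sklearn.neighbors import LocalOutlierFactor

def random_forest_distances(X: np.ndarray, n_trees: int = 100) -> np.ndarray:
    forest = RandomTreesEmbedding(n_estimators=n_trees, random_state=0).fit(X)
    leaves = forest.apply(X)                      # (n_samples, n_trees) leaf indices
    # Fraction of trees in which two samples share a leaf (O(n^2) memory; fine for small n).
    same = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)
    return 1.0 - same                             # 0 = always same leaf, 1 = never

def detect_outliers(X: np.ndarray, n_neighbors: int = 20) -> np.ndarray:
    D = random_forest_distances(X)
    lof = LocalOutlierFactor(n_neighbors=n_neighbors, metric="precomputed")
    return lof.fit_predict(D)                     # -1 marks detected outliers
```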

Antonella Mensi, Ferdinando Cicalese, Manuele Bicego
Case Study on the Use of the SafeML Approach in Training Autonomous Driving Vehicles

The development quality of the control software for autonomous vehicles is rapidly progressing, so that the control units in the field generally perform very reliably. Nevertheless, fatal misjudgments occasionally occur, putting people at risk, such as the recent accident in which a Tesla vehicle in Autopilot mode rammed a police vehicle. Since the object recognition software, which is part of the control software, is based on machine learning (ML) algorithms at its core, one can distinguish a training phase from a deployment phase of the software. In this paper we investigate to what extent the deployment phase has an impact on the robustness and reliability of the software, because, just like traditional software, software based on ML degrades over time. A widely known effect is the so-called concept drift: in this case, one finds that the deployment conditions in the field have changed and the software, based on the outdated training data, no longer responds adequately to the current field situation. In a previous research paper, we developed the SafeML approach with colleagues from the University of Hull, where datasets are compared using statistical distance measures. In doing so, we found that for simple benchmark data, the statistical distance correlates with the classification accuracy in the field. The contribution of this paper is to analyze the applicability of the SafeML approach to complex, multidimensional data used in autonomous driving. In our analysis, we found that the SafeML approach can be used for this data as well. In practice, this would mean that a vehicle could constantly check itself and detect concept drift situations early.
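The core idea can be sketched as follows (a simplified illustration with assumed parameter values, not the authors' implementation): compare the per-feature empirical distributions of the training data and the field data with an ECDF-based statistical distance and flag drift when the distance exceeds a threshold calibrated offline:

```python
# Sketch: dataset-level statistical distance for drift detection (SafeML-style idea).
import numpy as np
from scipy.stats import ks_2samp

def dataset_distance(train: np.ndarray, field: np.ndarray) -> float:
    """train, field: (n_samples, n_features) arrays; returns the mean per-feature
    Kolmogorov-Smirnov statistic (one of several possible ECDF-based distances)."""
    stats = [ks_2samp(train[:, j], field[:, j]).statistic for j in range(train.shape[1])]
    return float(np.mean(stats))

def drift_detected(train: np.ndarray, field: np.ndarray, threshold: float = 0.2) -> bool:
    # threshold is an assumed value; in practice it would be calibrated offline.
    return dataset_distance(train, field) > threshold
```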

Matthias Bergler, Ramin Tavakoli Kolagari, Kristina Lundqvist
User-Biased Food Recognition for Health Monitoring

This paper presents a user-biased food recognition system. The presented approach has been developed in the context of the FoodRec project, which aims to define an automatic framework for the monitoring of people’s health and habits, during their smoke quitting program. The goal of food recognition is to extract and infer semantic information from the food images to classify diverse foods present in the image. We propose a novel Deep Convolutional Neural Network able to recognize food items of specific users and monitor their habits. It consists of a food branch to learn visual representation for the input food items and a user branch to take into account the specific user’s eating habits. Furthermore, we introduce a new FoodRec-50 dataset with 2000 images and 50 food categories collected by the iOS and Android smartphone applications, taken by 164 users during their smoking cessation therapy. The information inferred from the users’ eating habits is then exploited to track and monitor the dietary habits of people involved in a smoke quitting protocol. Experimental results show that the proposed food recognition method outperforms the baseline model results on the FoodRec-50 dataset. We also performed an ablation study which demonstrated that the proposed architecture is able to tune the prediction based on the users’ eating habits.

Mazhar Hussain, Alessandro Ortis, Riccardo Polosa, Sebastiano Battiato
Multi-view Spectral Clustering via Integrating Label and Data Graph Learning

Nowadays, one-step multi-view clustering algorithms attract much interest. The main issue of multi-view clustering approaches is how to combine the information extracted from the available views. A popular approach is to use view-based graphs and/or a consensus graph to describe the different views. We introduce a novel one-step graph-based multi-view clustering approach in this study. Our suggested method, in contrast to existing graph-based one-step clustering methods, introduces two major novelties with respect to the method called Nonnegative Embedding and Spectral Embedding (NESE) proposed in the recent paper [1]. First, we use the cluster-label correlation to create an additional graph alongside the graphs associated with the data space. Second, the cluster-label matrix is constrained with additional restrictions to make it more consistent. The effectiveness of the proposed method is demonstrated by experimental results on many public datasets.

Sally El Hajjar, Fadi Dornaika, Fahed Abdallah, Hichem Omrani
Distance-Based Random Forest Clustering with Missing Data

In recent years there has been an increased interest in clustering methods based on Random Forests, due to their flexibility and their capability in describing data. One problem of current RF-clustering approaches is that they are not able to directly deal with missing data, a common scenario in many application fields (e.g. Bioinformatics): the usual solution in this case is to pre-impute incomplete data before running standard clustering methods. In this paper we present the first Random Forest clustering approach able to directly deal with missing data. We start from the very recent RatioRF distance for clustering [3], which has been shown to outperform all other distance-based RF clustering schemes, extending the framework in two directions that allow the integration of missing data mechanisms directly inside the clustering pipeline. Experimental results, based on 6 standard UCI ML datasets, are promising, also in comparison with some alternatives from the literature.

Matteo Raniero, Manuele Bicego, Ferdinando Cicalese

Video Analysis and Understanding

Frontmatter
Unsupervised Person Re-identification Based on Skeleton Joints Using Graph Convolutional Networks

With the remarkable progress of deep learning methods, person re-identification has received a lot of attention from researchers. However, the majority of previous works focus on the supervised learning setting, which requires expensive data annotations. In this paper, we address this problem by proposing a purely unsupervised learning model. Inspired by the effectiveness of modeling the spatio-temporal information of pedestrian video, we mine the relationships between human body joints. Specifically, we propose a novel framework by learning inter-frame and intra-frame relationships for discriminative feature learning via two Graph Convolutional Network (GCN) modules: spatial and temporal. The spatial module captures the structural information of the human body and the temporal module propagates information across adjacent frames. Finally, we perform hierarchical clustering by selecting P identities and K instances (PK sampling) to generate pseudo-labels for the unlabeled data. By iteratively optimizing these modules, our model extracts robust spatial-temporal information that can alleviate the occlusion problem. We conduct experiments on two benchmarks, the MARS and DukeMTMC-VideoReID datasets, where we demonstrate the effectiveness of our proposed method.

Khadija Khaldi, Pranav Mantini, Shishir K. Shah
Keyframe Insights into Real-Time Video Tagging of Compressed UHD Content

We present a method that can analyze coded ultra-high resolution (UHD) video content an order of magnitude faster than real-time. We observe that the larger the resolution of a video, the larger the fraction of the overall processing time spent on decoding frames from the video. In this paper, we exploit the way video is coded to significantly speed up the frame decoding process. More precisely, we only decode keyframes, which can be decoded significantly faster than ‘random’ frames in the video. A key insight is that in modern video codecs, keyframes are often placed around scene changes (shot boundaries), and hence form a very representative subset of frames of the video. We show, on the example of video genre tagging, that keyframes nicely lend themselves to video analysis tasks. Unlike previous genre prediction methods which include a multitude of signals, we train a per-frame genre classification system using a CNN that solely takes (key-)frames as input. We show that the aggregated genre predictions are very competitive with much more involved methods at predicting the video genre(s), and even outperform state-of-the-art genre tagging methods that rely solely on video frames as input. The proposed system can reliably tag video genres of a compressed video between 12× (8K content) and 96× (1080p content) faster than real-time.
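One possible way to realize keyframe-only decoding (an illustrative sketch, not the paper's implementation) is to instruct the decoder to skip all non-keyframes, for example with PyAV:

```python
# Sketch: decode only keyframes from a compressed video and yield RGB arrays.
import av  # PyAV

def decode_keyframes(path: str):
    with av.open(path) as container:
        stream = container.streams.video[0]
        stream.codec_context.skip_frame = "NONKEY"   # decoder drops non-keyframes
        for frame in container.decode(stream):
            yield frame.to_ndarray(format="rgb24")   # feed to the per-frame classifier
```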

Dominic Rüfenacht
Exploring the Use of Efficient Projection Kernels for Motion Saliency Estimation

In this paper we investigate the potential of a family of efficient filters – the Gray-Code Kernels – for addressing visual saliency estimation guided by motion. Our implementation relies on the use of 3D kernels applied to overlapping blocks of frames and is able to gather meaningful spatio-temporal information with a very light computation. We introduce an attention module that reasons on the use of pooling strategies, combined in an unsupervised way to derive a saliency map highlighting the presence of motion in the scene. In the experiments we show that our method is able to effectively and efficiently identify the portion of the image where the motion is occurring, providing tolerance to a variety of scene conditions.

Elena Nicora, Nicoletta Noceti
FirstPiano: A New Egocentric Hand Action Dataset Oriented Towards Augmented Reality Applications

Research on hand action recognition has achieved very interesting performance in recent years, notably thanks to deep learning methods. With those improvements, we can see new visions towards real applications of new Human-Machine Interfaces (HMI) using this recognition. Such new interactions and interfaces need data to develop the best user experience iteratively. However, current datasets for hand action recognition in an egocentric view, even if perfectly useful for these recognition problems, generally lack a limited but coherent context for the proposed actions. Indeed, these datasets tend to provide a wide range of actions, more or less in relation to each other, which does not help to create an interesting context for HMI application purposes. Thereby, we present in this paper a new dataset, FirstPiano, for hand action recognition in an egocentric view, in the context of piano training. FirstPiano provides a total of 672 video sequences directly extracted from the sensors of the Microsoft HoloLens Augmented Reality device. Each sequence is provided in depth, infrared and grayscale data, with four different points of view for the latter, for a total of six streams for each video. We also present a first benchmark of experiments using a Capsule Network over different classification problems and different stream combinations. Our dataset and experiments can therefore be interesting for the research communities of action recognition and human-machine interfaces.

Théo Voillemin, Hazem Wannous, Jean-Philippe Vandeborre
Learning Video Retrieval Models with Relevance-Aware Online Mining

Due to the amount of videos and related captions uploaded every hour, deep learning-based solutions for cross-modal video retrieval are attracting more and more attention. A typical approach consists in learning a joint text-video embedding space, where the similarity of a video and its associated caption is maximized, whereas a lower similarity is enforced with all the other captions, called negatives. This approach assumes that only the video and caption pairs in the dataset are valid, but different captions - positives - may also describe its visual contents, hence some of them may be wrongly penalized. To address this shortcoming, we propose the Relevance-Aware Negatives and Positives mining (RANP) which, based on the semantics of the negatives, improves their selection while also increasing the similarity of other valid positives. We explore the influence of these techniques on two video-text datasets: EPIC-Kitchens-100 and MSR-VTT. By using the proposed techniques, we achieve considerable improvements in terms of nDCG and mAP, leading to state-of-the-art results, e.g. +5.3% nDCG and +3.0% mAP on EPIC-Kitchens-100. We share code and pretrained models at https://github.com/aranciokov/ranp .

Alex Falcon, Giuseppe Serra, Oswald Lanz
Foreground Detection Using an Attention Module and a Video Encoding

Foreground detection is the task of labelling the foreground or background pixels in a video sequence, and it depends on the context of the scene. For many years, methods based on background models have been the most widely used approaches for detecting foreground; however, these methods are sensitive to error propagation from the first background model estimations. To address this problem, we propose a U-net based architecture with an attention module, where the encoding of the entire video sequence is used as attention context to get features related to the background model. We tested our network on sixteen scenes from the CDnet2014 dataset, with an average F-measure of 88.42. The results also show that our model outperforms traditional and neural network methods. Thus, we demonstrate that an attention module on a U-net based architecture can deal with the foreground detection challenges.

Anthony A. Benavides-Arce, Victor Flores-Benites, Rensso Mora-Colque
Test-Time Adaptation for Egocentric Action Recognition

Egocentric action recognition is becoming an increasingly researched topic thanks to the rising popularity of wearable cameras. Despite the numerous publications in the field, the learned representations still suffer from an intrinsic “environmental bias”. To address this issue, domain adaptation and generalization approaches have been proposed, which operate by either adapting the model to target data during training or by learning a model able to generalize to unseen videos by exploiting the knowledge from multiple source domains. In this work, we propose to adapt a model trained on source data to novel environments at test time, making adaptation practical for real-world scenarios where target data are not available at training time. On the popular EPIC-Kitchens dataset, we present a new benchmark for Test-Time Adaptation (TTA) in egocentric action recognition. Moreover, we propose a new multi-modal TTA approach, which we call RNA++, and combine it with a new set of losses aiming at reducing the classifier’s uncertainty, showing remarkable results w.r.t. existing TTA methods inherited from image classification. Code available: https://github.com/EgocentricVision/RNA-TTA .
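For context, a generic test-time adaptation baseline in the spirit of entropy minimization (an illustrative sketch only; RNA++ and the uncertainty-reducing losses proposed in the paper differ) updates the normalization-layer parameters to reduce prediction entropy on each unlabeled target batch:

```python
# Sketch: entropy-minimization test-time adaptation on normalization parameters.
import torch
import torch.nn.functional as F

def collect_norm_params(model):
    params = []
    for m in model.modules():
        if isinstance(m, (torch.nn.BatchNorm1d, torch.nn.BatchNorm2d, torch.nn.BatchNorm3d)):
            params += [p for p in m.parameters() if p.requires_grad]
    return params

def tta_step(model, batch, optimizer):
    """One adaptation step on an unlabeled target batch; returns the predictions."""
    logits = model(batch)
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * torch.log(probs + 1e-8)).sum(dim=1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return logits.detach()

# usage (assumed setup): optimizer = torch.optim.SGD(collect_norm_params(model), lr=1e-3)
```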

Mirco Plananamente, Chiara Plizzari, Barbara Caputo
Combining EfficientNet and Vision Transformers for Video Deepfake Detection

Deepfakes are the result of digital manipulation to forge realistic yet fake imagery. With the astonishing advances in deep generative models, fake images or videos are nowadays obtained using variational autoencoders (VAEs) or Generative Adversarial Networks (GANs). These technologies are becoming more accessible and accurate, resulting in fake videos that are very difficult to detect. Traditionally, Convolutional Neural Networks (CNNs) have been used to perform video deepfake detection, with the best results obtained using methods based on EfficientNet B7. In this study, we focus on video deepfake detection on faces, given that most methods are becoming extremely accurate in the generation of realistic human faces. Specifically, we combine various types of Vision Transformers with a convolutional EfficientNet B0 used as a feature extractor, obtaining comparable results with some very recent methods that use Vision Transformers. Differently from the state-of-the-art approaches, we use neither distillation nor ensemble methods. Furthermore, we present a straightforward inference procedure based on a simple voting scheme for handling multiple faces in the same video shot. The best model achieved an AUC of 0.951 and an F1 score of 88.0%, very close to the state-of-the-art on the DeepFake Detection Challenge (DFDC). The code for reproducing our results is publicly available here: https://tinyurl.com/cnn-vit-dfd .

Davide Alessandro Coccomini, Nicola Messina, Claudio Gennaro, Fabrizio Falchi
Human Action Recognition with Transformers

Having a reliable tool to predict the actions performed in a video can be very useful for intelligent security systems, for many applications related to robotics, and for limiting human interactions with the system. In this work we present an architecture trained to predict the action present in digital video sequences. The proposed architecture consists of two main blocks: (i) a 3D backbone that extracts features from each frame of the video sequence and (ii) a temporal pooling. In this case, we use Bidirectional Encoder Representations from Transformers (BERT) as the technology for temporal pooling instead of a Temporal Global Average Pooling (TGAP). The output of the proposed architecture is the prediction of the action taking place in the video sequence. We use two different backbones, ip-CSN and ir-CSN, in order to evaluate the performance of the entire architecture on two publicly available datasets: HMDB-51 and UCF-101. A comparison has been made with the most important architectures that constitute the state of the art for this task. We obtain results that outperform the state of the art in terms of Top-1 and Top-3 accuracy.

Pier Luigi Mazzeo, Paolo Spagnolo, Matteo Fasano, Cosimo Distante
Decontextualized I3D ConvNet for Ultra-Distance Runners Performance Analysis at a Glance

In May 2021, the site runnersworld.com reported that participation in ultra-distance races has increased by 1,676% in the last 23 years. Moreover, nearly 41% of those runners participate in more than one race per year. The development of wearable devices has undoubtedly contributed to motivating participants by providing performance measures in real-time. However, we believe there is room for improvement, particularly from the organizers’ point of view. This work aims to determine how the runners’ performance can be quantified and predicted by considering a non-invasive technique focusing on the ultra-running scenario. In this sense, participants are captured when they pass through a set of locations placed along the race track. In our work, each piece of footage is fed to an I3D ConvNet to extract the participant’s running gait. Furthermore, weather and illumination capture conditions or occlusions may affect this footage due to the race staff and other runners. To address this challenging task, we have tracked and codified the participant’s running gait at some RPs and removed the context, intending to ensure a proper evaluation of the runner of interest. The evaluation suggests that the features extracted by an I3D ConvNet provide enough information to estimate the participant’s performance along the different race tracks.

David Freire-Obregón, Javier Lorenzo-Navarro, Modesto Castrillón-Santana
Densification of Sparse Optical Flow Using Edges Information

Optical flow methods, which estimate a dense motion field starting from a sparse one, play an important role in many visual learning and recognition applications. The proposed system is based only on sparse optical flow and a line detector. It is able to densify the starting optical flow, reaching good performance in both objective and subjective terms, using common applications like clustering and the standard KITTI evaluation kit. In particular, an appreciable improvement has been achieved in terms of the growth in the number of motion vectors (up to 540%). Since both optical flow and lines are often available in smart cameras, the proposed approach avoids overloading the Engine Control Unit with the transmission of the entire image flow and allows reducing the power consumption, realizing a real-time robust system.

Antonio Buemi, Giuseppe Spampinato, Arcangelo Bruna, Viviana D’Alto
Cycle Consistency Based Method for Learning Disentangled Representation for Stochastic Video Prediction

Video frame prediction is an interesting computer vision problem of predicting the future frames of a video sequence from a given set of context frames. Video prediction models have found wide-scale perspective applications in autonomous navigation, representation learning, and healthcare. However, predicting future frames is challenging due to the high dimensional and stochastic nature of video data. This work proposes a novel cycle consistency loss to disentangle video representation into a low dimensional time-dependent pose and time-independent content latent factors in two different VAE based video prediction models. The key motivation behind cycle consistency loss is that future frame predictions are more plausible and realistic if they reconstruct the previous frames. The proposed cycle consistency loss is also generic because it can be applied to other VAE-based stochastic video prediction architectures with slight architectural modifications. We validate our disentanglement hypothesis and the quality of long-range predictions on standard synthetic and challenging real-world datasets such as Stochastic Moving MNIST and BAIR.

Ujjwal Tiwari, P. Aditya Sreekar, Anoop Namboodiri
SeeFar: Vehicle Speed Estimation and Flow Analysis from a Moving UAV

Visual perception from drones has been largely investigated for Intelligent Traffic Monitoring Systems (ITMS) recently. In this paper, we introduce SeeFar to achieve vehicle speed estimation and traffic flow analysis based on YOLOv5 and DeepSORT from a moving drone. SeeFar differs from previous works in three key ways: the speed estimation and flow analysis components are integrated into a unified framework; our method of predicting car speed has the fewest constraints while maintaining a high accuracy; and our flow analyser is direction-aware and outlier-aware. Specifically, we design the speed estimator using only the camera imaging geometry, where the transformation between world space and image space is completed by the variable Ground Sampling Distance. Besides, since previous papers do not evaluate their speed estimators at scale due to the difficulty of obtaining the ground truth, we propose a simple yet efficient approach to estimate the true speeds of vehicles via the prior size of road signs. We evaluate SeeFar on our ten videos that contain 929 vehicle samples. Experiments on these sequences demonstrate the effectiveness of SeeFar by achieving 98.0% accuracy of speed estimation and 99.1% accuracy of traffic volume prediction, respectively.
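The geometric part of the speed estimator can be sketched as follows (an illustration under simplifying assumptions such as near-nadir imaging and externally compensated drone motion, not SeeFar's code): the Ground Sampling Distance converts pixel displacements of a tracked vehicle into metres on the ground, and dividing by the frame interval gives the speed.

```python
# Sketch: GSD-based vehicle speed from pixel displacement between two frames.
def ground_sampling_distance(sensor_width_m, focal_length_m, altitude_m, image_width_px):
    """Metres on the ground covered by one pixel, for near-nadir imaging."""
    return (sensor_width_m * altitude_m) / (focal_length_m * image_width_px)

def vehicle_speed_kmh(disp_px, gsd_m_per_px, dt_s):
    """disp_px: pixel displacement of the tracked vehicle between two frames,
    assumed already compensated for the drone's own motion."""
    return disp_px * gsd_m_per_px / dt_s * 3.6  # m/s -> km/h
```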

Mang Ning, Xiaoliang Ma, Yao Lu, Simone Calderara, Rita Cucchiara
Spatial-Temporal Autoencoder with Attention Network for Video Compression

Deep learning-based approaches are now state of the art in numerous tasks, including video compression, and are having a revolutionary influence on video processing. Recently, learned video compression methods have exhibited a fast development trend with promising results. In this paper, taking advantage of the powerful non-linear representation ability of neural networks, we replace each standard component of video compression with a neural network. We propose a spatial-temporal video compression network (STVC) using spatial-temporal priors with an attention module (STPA). On the one hand, joint spatial-temporal priors are used for generating latent representations and reconstructing compressed outputs, because efficient temporal and spatial information representation plays a crucial role in video coding. On the other hand, we also add an efficient and effective attention module such that the model devotes more effort to restoring the artifact-rich areas. Moreover, we formalize the rate-distortion optimization into a single loss function, in which the network learns to leverage the spatial-temporal redundancy present in the frames and decreases the bit rate while maintaining visual quality in the decoded frames. The experimental results show that our approach delivers state-of-the-art learned video compression performance in terms of MS-SSIM and PSNR.

Neetu Sigger, Naseer Al-Jawed, Tuan Nguyen
On the Evaluation of Video-Based Crowd Counting Models

Crowd counting is a challenging and relevant computer vision task. Most of the existing methods are image-based, i.e., they only exploit the spatial information of a single image to estimate the corresponding people count. Recently, video-based methods have been proposed to improve counting accuracy by also exploiting temporal information coming from the correlation between adjacent frames. In this work, we point out the need to properly evaluate the temporal information’s specific contribution over the spatial one. This issue has not been discussed by existing work, and in some cases such evaluation has been carried out in a way that may lead to overestimating the contribution of the temporal information. To address this issue we propose a categorisation of existing video-based models, discuss how the contribution of the temporal information has been evaluated by existing work, and propose an evaluation approach aimed at providing a more complete evaluation for two different categories of video-based methods. We finally illustrate our approach, for a specific category, through experiments on several benchmark video data sets.

Emanuele Ledda, Lorenzo Putzu, Rita Delussu, Giorgio Fumera, Fabio Roli
Frame-Wise Action Recognition Training Framework for Skeleton-Based Anomaly Behavior Detection

We propose a novel training framework for frame-wise action recognition from a single video frame. Most existing action recognition methods employ sequence-wise feature extraction, but these approaches constrain the setting of the camera’s frame rate and are not always effective in the variety of existing network environments. To this end, our framework employs temporal attention obtained from a pretrained action recognition model, allowing the training of action recognition from a single video frame selected based on its “action-ness,” even from existing training data without the need for additional data. In this paper, we demonstrate the effectiveness of our framework in addressing the challenging task of anomaly behavior detection from a single-frame skeleton. Our method is realized by using frame-wise features extracted from a skeleton-based action recognition model trained with our framework. By coupling our framework and our anomaly behavior detection method, we develop a powerful detector of anomaly behavior that humans can recognize from a single video frame. We evaluate our method on the “ShanghaiTech Campus” anomaly behavior detection benchmark dataset, and confirm its effectiveness when the input consists of single-frame skeletons.

Hiroaki Tani, Tomoyuki Shibata
The Automated Temporal Analysis of Gaze Following in a Visual Tracking Task

The attention assessment of an individual in following the motion of a target object provides valuable insights into understanding one’s behavioural patterns in cognitive disorders including Autism Spectrum Disorder (ASD). Existing frameworks often require dedicated devices for gaze capture, focus on stationary target objects, or fail to conduct a temporal analysis of the participant’s response. Thus, in order to address the persisting research gap in the analysis of video capture of a visual tracking task, this paper proposes a novel framework to analyse the temporal relationship between the 3D head pose angles and the object displacement, and demonstrates its validity via application to the EYEDIAP video dataset. The conducted multivariate time-series analysis is two-fold: the statistical correlation computes the similarity between the time series as an overall measure of attention, and the Dynamic Time Warping (DTW) algorithm aligns the two sequences and computes relevant temporal metrics. The temporal features of latency and maximum time of focus retention enabled an intragroup comparison of the performance of the participants. Further analysis disclosed valuable insights into the behavioural response of participants, including the superior response to horizontal motion of the target and the improvement in retention of focus on vertical motion over time, implying that following a vertical target initially proved a challenging task.
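For reference, a compact DTW sketch (illustrative, not the paper's implementation) aligns two 1-D series, such as a head-pose angle and the target displacement, from which temporal metrics like latency can be read off the warping path:

```python
# Sketch: dynamic time warping of two 1-D time series with absolute-difference cost.
import numpy as np

def dtw(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Return the accumulated DTW cost matrix for sequences x and y."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[1:, 1:]

def warping_path(D: np.ndarray):
    """Backtrack the optimal alignment; the (i, j) pairs can be used to estimate latency."""
    i, j = D.shape[0] - 1, D.shape[1] - 1
    path = [(i, j)]
    while i > 0 or j > 0:
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        steps = [(a, b) for a, b in steps if a >= 0 and b >= 0]
        i, j = min(steps, key=lambda ab: D[ab])
        path.append((i, j))
    return path[::-1]
```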

Vidushani Dhanawansa, Pradeepa Samarasinghe, Bryan Gardiner, Pratheepan Yogarajah, Anuradha Karunasena
Untrimmed Action Anticipation

Egocentric action anticipation consists in predicting a future action the camera wearer will perform from egocentric video. While the task has recently attracted the attention of the research community, current approaches assume that the input videos are “trimmed”, meaning that a short video sequence is sampled a fixed time before the beginning of the action. We argue that, despite the recent advances in the field, trimmed action anticipation has a limited applicability in real-world scenarios where it is important to deal with “untrimmed” video inputs and it cannot be assumed that the exact moment in which the action will begin is known at test time. To overcome such limitations, we propose an untrimmed action anticipation task, which, similarly to temporal action detection, assumes that the input video is untrimmed at test time, while still requiring predictions to be made before the actions actually take place. We propose an evaluation procedure for methods designed to address this novel task, and compare several baselines on the EPIC-KITCHENS-100 dataset. Experiments show that the performance of current models designed for trimmed action anticipation is very limited and more research on this task is required.

Ivan Rodin, Antonino Furnari, Dimitrios Mavroeidis, Giovanni Maria Farinella
Forecasting Future Instance Segmentation with Learned Optical Flow and Warping

For an autonomous vehicle it is essential to observe the ongoing dynamics of a scene and consequently predict imminent future scenarios to ensure safety to itself and others. This can be done using different sensors and modalities. In this paper we investigate the usage of optical flow for predicting future semantic segmentations. To do so we propose a model that forecasts flow fields autoregressively. Such predictions are then used to guide the inference of a learned warping function that moves instance segmentations on to future frames. Results on the Cityscapes dataset demonstrate the effectiveness of optical-flow methods.
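A standard way to move a segmentation with a predicted flow field is differentiable backward warping; the sketch below is an assumption about the kind of warping involved, not the learned warping function of the paper:

```python
# Sketch: warp a segmentation map to a future frame with a predicted flow field.
import torch
import torch.nn.functional as F

def warp_with_flow(seg: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """seg: (B, C, H, W) map at the source frame; flow: (B, 2, H, W) displacement
    (in pixels) mapping each target-frame pixel back to its source location."""
    b, _, h, w = seg.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=seg.device),
                            torch.arange(w, device=seg.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float()           # (2, H, W), (x, y) order
    pos = base.unsqueeze(0) + flow                        # sampling locations in pixels
    # normalise to [-1, 1] as expected by grid_sample
    norm_x = 2.0 * pos[:, 0] / (w - 1) - 1.0
    norm_y = 2.0 * pos[:, 1] / (h - 1) - 1.0
    grid = torch.stack((norm_x, norm_y), dim=-1)          # (B, H, W, 2)
    return F.grid_sample(seg, grid, align_corners=True)
```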

Andrea Ciamarra, Federico Becattini, Lorenzo Seidenari, Alberto Del Bimbo
Depth-Aware Multi-object Tracking in Spherical Videos

This paper deals with the multi-object tracking (MOT) problem in videos acquired by 360-degree cameras. Targets are tracked by a frame-by-frame association strategy. At each frame, candidate targets are detected by a pre-trained state-of-the-art deep model. Associations to the targets known up to the previous frame are found by solving a data association problem considering the locations of the targets in the scene. In case of a missing detection, a Kalman filter is used to track the target. Differently from works at the state of the art, the proposed tracker considers the depth of the targets in the scene. The distance of the targets from the camera can be estimated from geometrical facts peculiar to the adopted 360-degree camera and by assuming targets move on the ground plane. Distance estimates are used to model the location of the targets in the scene, solve the data association problem, and handle missing detections. Experimental results on publicly available data demonstrate the effectiveness of the adopted approach.

Liliana Lo Presti, Giuseppe Mazzola, Guido Averna, Edoardo Ardizzone, Marco La Cascia
FasterVideo: Efficient Online Joint Object Detection and Tracking

Object detection and tracking in videos represent essential and computationally demanding building blocks for current and future visual perception systems. In order to reduce the efficiency gap between available methods and computational requirements of real-world applications, we propose to re-think one of the most successful methods for image object detection, Faster R-CNN, and extend it to the video domain. Specifically, we extend the detection framework to learn instance-level embeddings which prove beneficial for data association and re-identification purposes. Focusing on the computational aspects of detection and tracking, our proposed method reaches a very high computational efficiency necessary for relevant applications, while still managing to compete with recent and state-of-the-art methods as shown in the experiments we conduct on standard object tracking benchmarks (Code available at https://github.com/Malga-Vision/fastervideo ).

Issa Mouawad, Francesca Odone
A Large-scale TV Dataset for Partial Video Copy Detection

This paper is concerned with the performance evaluation of partial video copy detection. Several public datasets designed from web videos exist. The detection problem is inherent to continuous video broadcasting; an alternative is therefore to work with TV datasets, which offer greater scalability and control of degradations for a fine-grained performance evaluation. We propose in this paper a TV dataset called STVD. It is designed with a protocol ensuring scalable capture and robust groundtruthing. STVD is the largest public dataset for the task, with nearly 83k videos having a total duration of 10,660 h. Performance evaluation results of representative methods on the dataset are reported in the paper for a baseline comparison.

Van-Hao Le, Mathieu Delalandre, Donatello Conte
Poker Bluff Detection Dataset Based on Facial Analysis

Unstaged data with people acting naturally in real-world scenarios is essential for high-stakes deception detection (HSDD) research. Unfortunately, multiple HSDD studies involve staged scenarios in controlled settings with subjects who were instructed to lie. Using in-the-wild footage of subjects and analyzing facial expressions instead of invasive tracking of biological processes enables the collection of real-world data. Poker is a high-stakes game involving a deceptive strategy called bluffing and is an ideal research subject for improving HSDD techniques. Videos of professional poker tournaments online provide a convenient data source. Because proficiency in HSDD generalizes well for different high-stakes situations, findings from poker bluff detection research have the potential to transfer well to other more practical HSDD applications like interrogations and customs inspections. In the hopes of encouraging additional research on real-world HSDD, we present a novel in-the-wild dataset for poker bluff detection. To verify the quality of our dataset, we test its regression accuracy and achieve a Mean Square Error of 0.0288 with an InceptionV3 model.

Jacob Feinland, Jacob Barkovitch, Dokyu Lee, Alex Kaforey, Umur Aybars Ciftci
Engagement Detection with Multi-Task Training in E-Learning Environments

Recognition of user interaction, in particular engagement detection, has become highly crucial for online working and learning environments, especially during the COVID-19 outbreak. Such recognition and detection systems significantly improve the user experience and efficiency by providing valuable feedback. In this paper, we propose a novel Engagement Detection with Multi-Task Training (ED-MTT) system which minimizes mean squared error and triplet loss together to determine the engagement level of students in an e-learning environment. The performance of this system is evaluated and compared against the state-of-the-art on a publicly available dataset as well as videos collected from real-life scenarios. The results show that ED-MTT achieves 6% lower MSE than the best state-of-the-art performance with highly acceptable training time and lightweight feature extraction.
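The joint objective can be sketched as follows (variable names and the weighting factor are assumptions for illustration, not taken from the paper): an engagement regression term is combined with a triplet loss on the learned embeddings.

```python
# Sketch: multi-task objective combining MSE regression and a triplet loss.
import torch
import torch.nn as nn

mse = nn.MSELoss()
triplet = nn.TripletMarginLoss(margin=1.0)

def multi_task_loss(pred, target, anchor_emb, positive_emb, negative_emb, alpha=0.5):
    """pred/target: engagement scores; *_emb: embeddings of the anchor clip, a clip
    with a similar engagement level (positive) and a dissimilar one (negative)."""
    return mse(pred, target) + alpha * triplet(anchor_emb, positive_emb, negative_emb)
```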

Onur Copur, Mert Nakıp, Simone Scardapane, Jürgen Slowack

Special Session

Frontmatter
Ship Detection and Tracking Based on a Custom Aerial Dataset

This paper presents an approach based on machine learning techniques for ship detection and tracking in marine environment monitoring, with a focus on a custom large dataset based on aerial images. The work is placed in the context of autonomous navigation by an unmanned surface naval platform assisted by an aerial drone. The work follows a data-centric Artificial Intelligence (AI) approach, which involves building AI systems with quality data, with a focus on ensuring that the data clearly conveys what the AI must learn. Machine learning techniques are applied for automatic target detection and tracking. Target information in the surrounding environment allows context-awareness and obstacle identification, and it can support the naval platform in the management of collision avoidance. The paper focuses on the need for large amounts of data in the training stage to perform robust detection and tracking even under critical glare and wave variations. The paper presents a custom dataset which includes fine-tuned public ship aerial images and images acquired by UAV flights over different maritime scenarios. The network’s training results are described and the detection and tracking performance is evaluated in different video sequences from UAV flights over such scenarios.

Luigi Paiano, Francesca Calabrese, Marco Cataldo, Luca Sebastiani, Nicola Leonardi
Prediction of Fish Location by Combining Fisheries Data and Sea Bottom Temperature Forecasting

This paper combines fisheries-dependent data and environmental data in a machine learning pipeline to predict the spatio-temporal abundance of two species (plaice and sole) commonly caught by the Belgian fishery in the North Sea. By combining fisheries-related features with environmental data, namely sea bottom temperature derived from remote sensing, a higher accuracy can be achieved. In a forecast setting, the predictive accuracy is further improved by predicting, using a recurrent deep neural network, the sea bottom temperature up to four days in advance instead of relying on the last available temperature measurement.

Matthieu Ospici, Klaas Sys, Sophie Guegan-Marat
Robust Human-Identifiable Markers for Absolute Relocalization of Underwater Robots in Marine Data Science Applications

Since global navigation satellite systems (GNSS) for determining the absolute geolocation do not reach into the ocean, underwater robots typically obtain a GNSS position at the water surface and then use a combination of different sensors for estimating their pose while diving, including inertial navigation, acoustic doppler velocity logs, ultra short baseline localization systems and pressure sensors. When re-navigating to the same seafloor location after several days, months or years, e.g. for coastal monitoring, the absolute uncertainty of such systems can be in the range of meters for shallow water, and tens of meters for deeper waters in practice. To enable absolute relocalization in marine data science applications that require absolute seafloor positions in the range of centimeter precision, in this contribution we suggest to equip the monitoring area with visual markers that can be detected reliably even in case they are partially overgrown or partially buried by sediment, which can happen quickly in coastal waters. Inspired by patterns successful in camera calibration, we create robust markers that exhibit features at different scales, in order to allow detection, identification and pose estimation from different cameras and various altitudes as visibility (and therefore the maximum possible survey altitude) in coastal waters can vary significantly across seasons, tides and weather. The low frequency content of the marker resembles a human-readable digit, in order to allow easy identification by scientists. We present early results including promising initial tests in coastal waters.

Philip Herrmann, Sylvia Reissmann, Marcel Rothenbeck, Felix Woelk, Kevin Köser
Towards Cross Domain Transfer Learning for Underwater Correspondence Search

Underwater images are challenging for correspondence search algorithms, which are traditionally designed based on images captured in air and under uniform illumination. In water, however, medium interactions have a much higher impact on the light propagation. Absorption and scattering cause wavelength- and distance-dependent color distortion, blurring and contrast reductions. For deeper or turbid waters, artificial illumination is required that usually moves rigidly with the camera and thus increases the appearance differences of the same seafloor spot in different images. Correspondence search, e.g. using image features, is however a core task in underwater visual navigation employed in seafloor surveys and is also required for 3D reconstruction, image retrieval and object detection. For underwater images, it has to be robust against the challenging imaging conditions to avoid decreased accuracy or even failure of computer vision algorithms. However, explicitly taking underwater nuisances into account during the feature extraction and matching process is challenging. On the other hand, learned feature extraction models have achieved high performance on many in-air problems in recent years. Hence we investigate how such a learned robust feature model, D2Net, can be applied to the underwater environment, and particularly look into the issue of cross-domain transfer learning as a strategy to deal with the lack of annotated underwater training data.

Patricia Schöntag, David Nakath, Stefan Röhrl, Kevin Köser
On the Evaluation of Two Methods Applied to the Morphometry of Linear Dunes

The evolution of dunes is a relevant topic in many areas, including the study of the marine environment. In particular, the widespread linear dunes are characterized by roughly parallel ridges, which extend for long distances. The identification of the main morphometric characteristics of linear dunes is a relevant topic, with wide applicability. This work evaluates two approaches for linear dune morphometry. The methods are built over the Radon transform, the Discrete Fourier Transform, and autocorrelation. The evaluation is performed over a set of publicly available dune images. Results show that the performance of the methods varies over the morphometric characteristics considered.
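The Radon-based step can be sketched as follows (an illustrative fragment, not the authors' full morphometry pipeline): the dominant ridge orientation of a linear dune field is the projection angle whose Radon projection has maximum variance.

```python
# Sketch: dominant ridge orientation of a dune image via the Radon transform.
import numpy as np
from skimage.transform import radon

def dominant_orientation(image: np.ndarray) -> float:
    """image: 2-D grayscale array; returns the dominant orientation in degrees."""
    angles = np.arange(180)
    sinogram = radon(image, theta=angles, circle=False)   # one projection per angle
    variances = sinogram.var(axis=0)                       # ridges aligned with the angle -> high variance
    return float(angles[np.argmax(variances)])
```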

Tatiana Taís Schein, Leonardo R. Emmendorfer, Fabiano NobreMendes, Bárbara D. A. Rodriguez, Luis Pedro Almeida, Vinícius Menezes de Oliveira
Underwater Image Enhancement Using Pre-trained Transformer

The goal of this work is to apply a denoising image transformer to remove distortion from underwater images and to compare it with other similar approaches. Automatic restoration of underwater images plays an important role since it makes it possible to increase the quality of the images without the need for more expensive equipment. This is a critical example of the important role of machine learning algorithms in supporting marine exploration and monitoring, reducing the need for human intervention such as the manual processing of the images, thus saving time, effort, and cost. This paper is the first application of the image transformer-based approach called “Pre-Trained Image Processing Transformer” to underwater images. This approach is tested on the UFO-120 dataset, containing 1500 images with the corresponding clean images.

Abderrahmene Boudiaf, Yuhang Guo, Adarsh Ghimire, Naoufel Werghi, Giulia De Masi, Sajid Javed, Jorge Dias
Backmatter
Metadata
Title
Image Analysis and Processing – ICIAP 2022
Editors
Prof. Stan Sclaroff
Cosimo Distante
Marco Leo
Dr. Giovanni M. Farinella
Prof. Dr. Federico Tombari
Copyright Year
2022
Electronic ISBN
978-3-031-06433-3
Print ISBN
978-3-031-06432-6
DOI
https://doi.org/10.1007/978-3-031-06433-3
