
2020 | Book

Image Analysis and Recognition

17th International Conference, ICIAR 2020, Póvoa de Varzim, Portugal, June 24–26, 2020, Proceedings, Part I

About this book

This two-volume set LNCS 12131 and LNCS 12132 constitutes the refereed proceedings of the 17th International Conference on Image Analysis and Recognition, ICIAR 2020, held in Póvoa de Varzim, Portugal, in June 2020.
The 54 full papers presented together with 15 short papers were carefully reviewed and selected from 123 submissions. The papers are organized in the following topical sections: image processing and analysis; video analysis; computer vision; 3D computer vision; machine learning; medical image and analysis; analysis of histopathology images; diagnosis and screening of ophthalmic diseases; and grand challenge on automatic lung cancer patient management.

Due to the COVID-19 pandemic, ICIAR 2020 was held as a fully virtual conference.

Table of Contents

Frontmatter

Image Processing and Analysis

Frontmatter
Exploring Workout Repetition Counting and Validation Through Deep Learning

Studying human motion from images and videos has become an interesting research topic given the recent advances in computer vision and deep learning algorithms. When automatically tracking physical exercises, cameras can be used for full human pose estimation, in contrast to worn sensors. In this work, we propose a method for workout repetition counting and validation based on a set of skeleton-based and deep semantic features obtained from a 2D human pose estimation network. Given that some of the individuals' body parts might be occluded during physical exercises, we also perform a multi-view analysis on supporting cameras to improve our recognition rates. Nevertheless, the results obtained for a single-view approach show that we are able to count valid repetitions with over 90% precision for 4 out of 5 considered exercises, while recognizing more than 50% of the invalid ones.

Bruno Ferreira, Pedro M. Ferreira, Gil Pinheiro, Nelson Figueiredo, Filipe Carvalho, Paulo Menezes, Jorge Batista
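The abstract above gives no implementation details; as a minimal illustration of counting repetitions from 2D pose features, one can apply peak detection to a joint-angle signal computed from skeleton keypoints. The keypoint indices and thresholds below are illustrative assumptions, not the authors' configuration.

```python
import numpy as np
from scipy.signal import find_peaks

def joint_angle(a, b, c):
    """Angle (degrees) at keypoint b formed by segments b->a and b->c."""
    v1, v2 = a - b, c - b
    cos = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

def count_repetitions(poses, min_prominence=30.0, min_distance=15):
    """poses: (T, K, 2) array of 2D keypoints per frame.
    Counts repetitions as prominent peaks of the elbow-angle signal;
    keypoint indices 5/7/9 (shoulder/elbow/wrist) are an assumption."""
    signal = np.array([joint_angle(p[5], p[7], p[9]) for p in poses])
    peaks, _ = find_peaks(signal, prominence=min_prominence, distance=min_distance)
    return len(peaks)
```

Validation of a repetition would then be a separate check on the signal shape (e.g. range of motion), which this sketch omits.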
FlowChroma - A Deep Recurrent Neural Network for Video Colorization

We develop an automated video colorization framework that minimizes color flickering across frames. Image colorization techniques, when applied to the successive frames of a video, treat each frame as a separate colorization task and thus do not necessarily maintain the colors of a scene consistently across subsequent frames. The proposed solution includes a novel deep recurrent encoder-decoder architecture that is capable of maintaining temporal and contextual coherence between consecutive frames of a video. We use a high-level semantic feature extractor to automatically identify the context of a scenario, including objects, together with a custom fusion layer that combines the spatial and temporal features of a frame sequence. We present experimental results showing, qualitatively, that recurrent neural networks can be successfully used to improve color consistency in video colorization.

Thejan Wijesinghe, Chamath Abeysinghe, Chanuka Wijayakoon, Lahiru Jayathilake, Uthayasanker Thayasivam
Benchmark for Generic Product Detection: A Low Data Baseline for Dense Object Detection

Object detection in densely packed scenes is a new area where standard object detectors fail to train well [6]. Dense object detectors like RetinaNet [7] trained on large and dense datasets show great performance. We train a standard object detector on a small, normally packed dataset with data augmentation techniques. This dataset is 265 times smaller than the standard dataset in terms of the number of annotations. This low-data baseline achieves satisfactory results (mAP = 0.56) at the standard IoU of 0.5. We also create a varied benchmark for generic SKU product detection by providing full annotations for multiple public datasets. It can be accessed at this URL. We hope that this benchmark helps in building robust detectors that perform reliably across different settings in the wild.

Srikrishna Varadarajan, Sonaal Kant, Muktabh Mayank Srivastava
Supervised and Unsupervised Detections for Multiple Object Tracking in Traffic Scenes: A Comparative Study

In this paper, we propose a multiple object tracker, called MF-Tracker, that integrates multiple classical features (spatial distances and colours) and modern features (detection labels and re-identification features) in its tracking framework. Since our tracker can work with detections coming from either unsupervised or supervised object detectors, we also investigated the impact of supervised and unsupervised detection inputs, both in our method and for tracking road users in general. We also compared our results with existing methods applied to the UA-Detrac and UrbanTracker datasets. Results show that our proposed method performs very well on both datasets with different inputs (MOTA ranging from 0.3491 to 0.5805 for unsupervised inputs on the UrbanTracker dataset and an average MOTA of 0.7638 for supervised inputs on the UA-Detrac dataset) under different circumstances. A well-trained supervised object detector can give better results in challenging scenarios. However, in simpler scenarios, if good training data is not available, unsupervised methods can perform well and can be a good alternative.

Hui-Lee Ooi, Guillaume-Alexandre Bilodeau, Nicolas Saunier
Variation of Perceived Colour Difference Under Different Surround Luminance

With the wider availability of High Dynamic Range (HDR) Wide Colour Gamut (WCG) content, both consumers and content producers have become more concerned about the preservation of creative intent. While the accurate representation of colour plays a vital role in preserving creative intent, relatively few of the available objective image and video quality assessment methods consider colour quality. This paper studies the effect of surrounding luminance on the perception of a colour stimulus, specifically whether perceptual uniformity is preserved in colour spaces and colour differencing methods as the surrounding luminance changes. The work presented in this paper provides important information and insight required for the future development of a successful colour quality assessment model.

Thilan Costa, Vincent Gaudet, Edward R. Vrscay, Zhou Wang
4K or Not? - Automatic Image Resolution Assessment

Recent years have witnessed a growing popularity of 4K or ultra high definition (UHD) content. However, the acquisition, production, post-production, and distribution pipelines of such content often go through stages where the actual video resolution drops below the 4K/UHD level and is then upscaled to 4K/UHD resolution at later stages. As a result, claimed 4K content in the real world often falls below the intended 4K quality, while final consumers are not well informed about such quality degradation. Here, we present our recent research progress on automatic image resolution assessment methods that determine whether a given image has true 4K resolution or not. Specifically, we developed the largest database of its kind, containing more than 10,000 true and fake 4K/UHD images with ground-truth labels. We have also made initial attempts at constructing edge-feature, Fourier-transform-feature, and deep-learning-based methods for the classification task. We believe that the built database and the attempted methods will help accelerate research progress on automatic image resolution assessment.

Vyas Anirudh Akundy, Zhou Wang
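The abstract mentions Fourier-transform features; a minimal sketch of one plausible such feature is shown below, assuming that genuinely 4K images retain more high-frequency spectral energy than images upscaled to 4K. The radial cutoff value is an illustrative assumption.

```python
import numpy as np

def high_freq_energy_ratio(gray_image, cutoff=0.5):
    """Fraction of FFT energy outside the central (low-frequency) region.
    gray_image: 2D float array; cutoff: fraction of the Nyquist radius
    treated as 'low frequency' (illustrative choice)."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(gray_image))) ** 2
    h, w = spectrum.shape
    yy, xx = np.ogrid[:h, :w]
    r = np.hypot(yy - h / 2, xx - w / 2) / (min(h, w) / 2)
    return spectrum[r > cutoff].sum() / spectrum.sum()
```

A classifier could threshold such a ratio, or learn from it alongside edge and deep features, to separate true from upscaled 4K images.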
Detecting Macroblocking in Images Caused by Transmission Error

Macroblocking is a widely observed video artifact in which severe block-shaped artifacts appear in video frames. Macroblocking may be produced by heavy lossy compression, but is visually most annoying when a transmission error such as packet loss occurs during network video transmission. Since receivers do not have access to the pristine-quality original videos, macroblocking detection needs to be performed using no-reference (NR) approaches. This paper presents our recent research progress on detecting macroblocking caused by packet loss. We build the first database of its kind for macroblocking, containing approximately 150,000 video frames with labels. Using the database, we make initial attempts at applying transfer-learning-based deep learning techniques to this challenging problem, with and without the Apache Spark big data processing framework. Our results show that using Spark is beneficial. We believe that the current work will help the future development of macroblocking detection methods.

Ganesh Rajasekar, Zhou Wang
Bag of Tricks for Retail Product Image Classification

Retail product image classification is an important computer vision and machine learning problem for building real-world systems like self-checkout stores and automated retail execution evaluation. In this work, we present various tricks to increase the accuracy of deep learning models on different types of retail product image classification datasets. These tricks enable us to increase the accuracy of fine-tuned ConvNets for retail product image classification by a large margin. As the most prominent trick, we introduce a new neural network layer called the Local-Concepts-Accumulation (LCA) layer, which gives consistent gains across multiple datasets. Two other tricks we find to increase accuracy on retail product identification are using an Instagram-pretrained ConvNet and using Maximum Entropy as an auxiliary loss for classification.

Muktabh Mayank Srivastava
Detection and Recognition of Food in Photo Galleries for Analysis of User Preferences

Food analysis is one of the most important parts of user preference prediction engines for recommendation systems in the travel domain. In this paper, we describe and study a neural network method for recognizing food in galleries of photos taken with mobile devices. The described method consists of three main stages: scene classification, food detection, and subsequent classification. An essential feature of the developed method is the use of lightweight neural network models, which allows it to run on mobile devices. The method was developed using both well-known open data and a proprietary dataset.

Evgeniy Miasnikov, Andrey Savchenko
Real Time Automatic Urban Traffic Management Framework Based on Convolutional Neural Network Under Limited Resources Constraint

Automatic traffic flow monitoring and control systems have become one of the most in-demand tasks due to the massive growth of the urban population, particularly in large cities. While numerous methods are available to address this issue with unconstrained use of computational resources, a resource-constrained solution is yet to become publicly available. This paper proposes a real-time system framework to control traffic flow and signals under resource limitation constraints. Experimental results showed high accuracy on the desired task and the scalability of the proposed framework.

Antoine Meicler, Assan Sanogo, Nadiya Shvai, Arcadi Llanza, Abul Hasnat, Marouan Khata, Ed-Doughmi Younes, Alami Khalil, Yazid Lachachi, Amir Nakib
Slicing and Dicing Soccer: Automatic Detection of Complex Events from Spatio-Temporal Data

The automatic detection of events in sport videos has important applications for data analytics, as well as for broadcasting and media companies. This paper presents a comprehensive approach for detecting a wide range of complex events in soccer videos starting from positional data. The event detector is designed as a two-tier system that detects atomic and complex events. Atomic events are detected based on temporal and logical combinations of the detected objects, their relative distances, as well as spatio-temporal features such as velocity and acceleration. Complex events are defined as temporal and logical combinations of atomic and complex events, and are expressed by means of a declarative Interval Temporal Logic (ITL). The effectiveness of the proposed approach is demonstrated over 16 different events, including complex situations such as tackles and filtering passes. By formalizing events based on a principled ITL, it is possible to easily perform reasoning tasks, such as understanding which passes or crosses result in a goal being scored. To counterbalance the lack of suitable, annotated public datasets, we built on an open source soccer simulation engine to release the synthetic SoccER (Soccer Event Recognition) dataset, which includes complete positional data and annotations for more than 1.6 million atomic events and 9,000 complex events. The dataset and code are available at https://gitlab.com/grains2/slicing-and-dicing-soccer .

Lia Morra, Francesco Manigrasso, Giuseppe Canto, Claudio Gianfrate, Enrico Guarino, Fabrizio Lamberti

Video Analysis

Frontmatter
RN-VID: A Feature Fusion Architecture for Video Object Detection

Consecutive frames in a video are highly redundant. Therefore, to perform the task of video object detection, executing single-frame detectors on every frame without reusing any information is quite wasteful. It is with this idea in mind that we propose RN-VID (standing for RetinaNet-VIDeo), a novel approach to video object detection. Our contributions are twofold. First, we propose a new architecture that allows the use of information from nearby frames to enhance feature maps. Second, we propose a novel module to merge feature maps of the same dimensions using a re-ordering of channels and 1×1 convolutions. We then demonstrate that RN-VID achieves better mean average precision (mAP) than corresponding single-frame detectors with little additional cost during inference.

Hughes Perreault, Maguelonne Heritier, Pierre Gravel, Guillaume-Alexandre Bilodeau, Nicolas Saunier
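A minimal PyTorch sketch of the channel re-ordering plus 1×1 convolution fusion described above; the exact interleaving order used by RN-VID is an assumption here.

```python
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Merge per-frame feature maps of equal shape: interleave channels
    across frames, then mix with a 1x1 convolution. The interleaving
    order is an assumption about the paper's 're-ordering'."""
    def __init__(self, channels, num_frames):
        super().__init__()
        self.mix = nn.Conv2d(channels * num_frames, channels, kernel_size=1)

    def forward(self, feature_maps):
        # feature_maps: list of num_frames tensors, each (B, C, H, W)
        stacked = torch.stack(feature_maps, dim=2)     # (B, C, F, H, W)
        b, c, f, h, w = stacked.shape
        interleaved = stacked.reshape(b, c * f, h, w)  # same-index channels adjacent
        return self.mix(interleaved)
```

The 1×1 convolution lets the network learn how to weight corresponding channels from neighbouring frames without altering spatial resolution.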
Color Inference from Semantic Labeling for Person Search in Videos

We propose an explainable model for classifying the color of pixels in images. Our method is based on binary search trees and a large peer-labeled color name dataset, allowing us to synthesize the average human perception of colors. We test our method on the application of person search. In this context, persons are described by their semantic parts, such as hat or shirt, and person search consists of looking for people based on these descriptions. We label segments of pedestrians with their associated colors and evaluate our solution on datasets such as PCN and Colorful-Fashion. We show a precision as high as 83%, as well as the model's ability to generalize to multiple domains with no retraining.

Jules Simon, Guillaume-Alexandre Bilodeau, David Steele, Harshad Mahadik
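As a toy illustration of colour naming from a labelled colour dataset: the paper builds binary search trees over a large peer-labeled dataset, whereas the k-d tree and the tiny reference list below are assumptions made purely for brevity.

```python
import numpy as np
from scipy.spatial import cKDTree

# Hypothetical subset of a peer-labeled colour-name dataset: (RGB, name).
REFERENCE = [
    ((255, 0, 0), "red"), ((0, 128, 0), "green"), ((0, 0, 255), "blue"),
    ((255, 255, 0), "yellow"), ((0, 0, 0), "black"), ((255, 255, 255), "white"),
]

class ColorNamer:
    """Nearest-neighbour colour naming over labelled reference colours."""
    def __init__(self, reference=REFERENCE):
        self.names = [name for _, name in reference]
        self.tree = cKDTree(np.array([rgb for rgb, _ in reference], dtype=float))

    def name(self, rgb):
        _, idx = self.tree.query(np.asarray(rgb, dtype=float))
        return self.names[idx]

print(ColorNamer().name((250, 10, 10)))  # -> "red"
```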
2D Bidirectional Gated Recurrent Unit Convolutional Neural Networks for End-to-End Violence Detection in Videos

Abnormal behavior detection, action recognition, and fight and violence detection in videos form an area that has attracted a lot of interest in recent years. In this work, we propose an architecture that combines a Bidirectional Gated Recurrent Unit (BiGRU) and a 2D Convolutional Neural Network (CNN) to detect violence in video sequences. A CNN is used to extract spatial characteristics from each frame, while the BiGRU extracts temporal and local motion characteristics from the CNN features of multiple frames. The proposed end-to-end deep learning network is tested on three public datasets with varying scene complexities and achieves accuracies up to 98%. The obtained results are promising and demonstrate the effectiveness of the proposed end-to-end approach.

Abdarahmane Traoré, Moulay A. Akhloufi
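A minimal PyTorch sketch of the CNN + BiGRU pipeline described above; the ResNet-18 backbone, hidden size, and last-step classification are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
from torchvision import models

class CnnBiGru(nn.Module):
    """Per-frame CNN features -> bidirectional GRU -> violence classifier."""
    def __init__(self, hidden=256, num_classes=2):
        super().__init__()
        backbone = models.resnet18(weights=None)
        feat_dim = backbone.fc.in_features
        backbone.fc = nn.Identity()            # keep the 512-d frame features
        self.cnn = backbone
        self.bigru = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, num_classes)

    def forward(self, clips):                  # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.cnn(clips.flatten(0, 1)).view(b, t, -1)
        seq, _ = self.bigru(feats)             # (B, T, 2*hidden)
        return self.head(seq[:, -1])           # classify from the last time step
```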
Video Based Live Tracking of Fishes in Tanks

We explore video tracking and classification in the context of real-time marine wildlife observation. Among other applications, it can help biologists by automating the process of gathering data, which is often done manually. In this paper we present a system to tackle the challenge of tracking and classifying fish in real time. We apply background subtraction techniques to detect the fish, followed by feature matching methods to track their movements over time. To deal with the shortcomings of tracking by detection, we use a Kalman filter to predict fish positions and a local search recovery method to re-identify fish tracks that are temporarily lost due to occlusions or lack of contrast. The species of tracked fish is recognized through image classification methods using environment-dependent features. We developed and tested our system on a custom-built dataset with several labeled image sequences of the fish tanks in the Oceanário de Lisboa. The impact of the proposed tracking methods is quantified and discussed. The proposed system is able to track and classify fish in real time in two scenarios, the main tank and the coral reef, reflecting different challenges.

José Castelo, H. Sofia Pinto, Alexandre Bernardino, Núria Baylina
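A minimal OpenCV sketch of the Kalman-filter prediction step used to bridge lost tracks; the constant-velocity model and noise magnitudes are illustrative assumptions, not the paper's tuned values.

```python
import numpy as np
import cv2

def make_fish_tracker():
    """Constant-velocity Kalman filter: state (x, y, vx, vy), measurement (x, y)."""
    kf = cv2.KalmanFilter(4, 2)
    kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                    [0, 1, 0, 1],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], np.float32)
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], np.float32)
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2
    kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1
    return kf

kf = make_fish_tracker()
predicted = kf.predict()                               # estimate while detection is lost
kf.correct(np.array([[120.0], [80.0]], np.float32))    # update when the fish is re-detected
```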
Using External Knowledge to Improve Zero-Shot Action Recognition in Egocentric Videos

Zero-shot learning is a very promising research topic. For a vision-based action recognition system, for instance, zero-shot learning makes it possible to recognise actions never seen during the training phase. Previous works in zero-shot action recognition have exploited the visual appearance of input videos in several ways to infer actions. Here, we propose adding external knowledge to improve the performance of purely vision-based systems. Specifically, we have explored three different sources of knowledge in the form of text corpora. Following the literature, our system disentangles actions into verbs and objects. In particular, we independently train two vision-based detectors: (i) a verb detector and (ii) an active object detector. During inference, we combine the probability distributions generated by those detectors to obtain a probability distribution over actions. Finally, the vision-based estimate is further combined with an action prior extracted from text corpora (external knowledge). We evaluate our approach on the EGTEA Gaze+ dataset, an egocentric action recognition dataset, demonstrating that the use of external knowledge improves the recognition of actions never seen by the detectors.

Adrián Núñez-Marcos, Gorka Azkune, Eneko Agirre, Diego López-de-Ipiña, Ignacio Arganda-Carreras
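The inference step described above (combining verb and object detector distributions with a corpus prior) can be sketched as follows, assuming a simple product-of-probabilities fusion; the paper's exact combination rule may differ.

```python
import numpy as np

def action_scores(p_verb, p_obj, p_prior, actions):
    """Combine detector outputs with a text-corpus action prior.
    p_verb, p_obj: dicts mapping verbs/objects to probabilities.
    p_prior: dict mapping (verb, object) pairs to corpus-derived priors.
    actions: iterable of candidate (verb, object) pairs."""
    scores = np.array([p_verb[v] * p_obj[o] * p_prior[(v, o)] for v, o in actions])
    return scores / scores.sum()               # renormalize over candidate actions

actions = [("cut", "tomato"), ("wash", "tomato"), ("cut", "knife")]
p = action_scores({"cut": 0.6, "wash": 0.4},
                  {"tomato": 0.7, "knife": 0.3},
                  {("cut", "tomato"): 0.5, ("wash", "tomato"): 0.3, ("cut", "knife"): 0.2},
                  actions)
```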
A Semantics-Guided Warping for Semi-supervised Video Object Instance Segmentation

In the semi-supervised video object instance segmentation domain, the mask warping technique, which warps the mask of the target object along flow vectors frame by frame, is widely used to extract the target object. The big issue with this approach is that the generated warped map is not always accurate: the background or other objects may be wrongly detected as the target object. To cope with this problem, we propose using the semantics of the target object as guidance during the warping process. A warping confidence computation first judges the confidence of the generated warped map. A semantic selection is then introduced to optimize warped maps with low confidence, where the target object is re-identified using its semantic labels. The proposed method is assessed on the recently published large-scale YouTube-VOS dataset and compared to several state-of-the-art methods. The experimental results show that the proposed approach has promising performance.

Qiong Wang, Lu Zhang, Kidiyo Kpalma
Two-Stream Framework for Activity Recognition with 2D Human Pose Estimation

Two-stream frameworks combining spatial information and optical flow information have achieved great performance on video action recognition tasks. Optical flow captures low-level motion characteristics over a fixed number of consecutive video frames; however, it contains noise and is ill-suited to characterizing actions of varying posture and duration. Usually the ten frames before and after a given frame are used for optical flow, which may be too long or too short to capture the useful motion features of different actions. Moreover, the cost of computing optical flow over several consecutive video frames is high. To address these issues, we propose a novel framework that recognizes actions by capturing a high-level motion feature, human pose estimation, instead of optical flow. Our framework uses 2D human pose estimation as the motion feature and fuses it with the spatial information using attention mechanisms. We conduct extensive experiments on two challenging datasets of realistic human actions, HMDB-51 and UCF-101. The experimental results show that our two-stream framework outperforms state-of-the-art approaches in terms of accuracy.

Wei Chang, Chunyang Ye, Hui Zhou
Video Object Segmentation Using Convex Optimization of Foreground and Background Distributions

In this study, a video object segmentation approach using convex optimization of foreground and background distributions is proposed. The proposed approach consists of four stages. First, optical flow computation and superpixel segmentation are performed on video frames. Second, convex optimization with a mixed energy function is employed to estimate the initial foreground and background distributions of video frames. Third, binary label maps for video frames are generated by maximum a posteriori (MAP) estimation. Fourth, the binary label maps are refined to obtain the final video object segmentation maps. Experimental results show that the performance of the proposed approach is better than that of three comparison approaches.

Jia-Wei Chen, Jin-Jang Leou

Computer Vision

Frontmatter
Deep Learning for Partial Fingerprint Inpainting and Recognition

Image completion and inpainting have been widely studied by the computer vision research community. With the recent growth and availability of computational power, we are now able to perform more complex inpainting than ever before. Techniques based on both learning and non-learning methods have been proposed for image inpainting, and some of these approaches have been used for fingerprint image enhancement. However, we lack techniques for fingerprint completion using deep learning, especially techniques aimed at augmenting the number of correct minutiae match points for fingerprint recognition. This paper proposes new deep architectures to improve the accuracy of fingerprint matching in live-scan images. The proposed techniques have been tested using professional fingerprint-matching software to evaluate the performance of deep learning in this respect. The obtained results are promising and show an increase of 36.94% in minutiae match point identification.

Marc-André Blais, Andy Couturier, Moulay A. Akhloufi
A Visual Perception Framework to Analyse Neonatal Pain in Face Images

Neonatal pain assessment based on facial expressions is currently among the most used methods in clinical practice, because human beings at an early stage of life are not able to verbally communicate pain. Therefore, pain assessment and its subsequent treatment are carried out through an indirect and non-objective analysis of the neonate's reactions when facing a painful procedure. This work proposes a computational framework to investigate the visual perception patterns of adults when assessing pain, in order to better understand the relevance of the neonate facial features commonly used by health professionals when evaluating pain in newborn babies. The results showed no statistical difference in visual fixation among the groups of volunteers, whether they were health professionals or not.

Lucas Pereira Carlini, Juliana C. A. Soares, Giselle V. T. Silva, Tatiany M. Heideirich, Rita C. X. Balda, Marina C. M. Barros, Ruth Guinsburg, Carlos Eduardo Thomaz
Combining Asynchronous Events and Traditional Frames for Steering Angle Prediction

Advances in deep learning over the last decade, enabled by the availability of more computing resources, have revived interest in end-to-end neural network methods for command prediction in vehicle control. Most of the existing frameworks in the literature make use of visual data from conventional video cameras to infer low-level (steering wheel, speed, etc.) or high-level (curvature, driving path, and more) commands for actuation. In this paper, we propose an efficient convolutional neural network model that takes both perceptual data in the form of signals (events) from an event-based sensor and traditional frames from the same camera to predict the steering wheel angle. We show that our model outperforms many state-of-the-art deep learning approaches that use just one type of input, either regular frames or events, while being much more efficient.

Abdoulaye O. Ly, Moulay A. Akhloufi
Survey of Preprocessing Techniques and Classification Approaches in Online Signature Verification

This paper reviews the latest results in the field of online signature verification, summarizing the previously published major surveys as well as over 30 papers from the last decade. We examine the steps of the verification process and show the most popular approaches used. Our results show that alignment and scaling are the most common methods used in preprocessing. Position, velocity, and pressure are the most commonly used measures for feature extraction, while dynamic time warping is the most commonly used approach for verification. A comparison between these methods using different databases concludes this work. The error rate varies from 0.77% to 7.13%, with an average of 2.94%. The results and comparisons published in this paper may help researchers choose the most promising approaches for their systems.

Mohammad Saleem, Bence Kovari
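Since dynamic time warping (DTW) is reported above as the most common verification approach, a minimal reference implementation for 1-D feature sequences is sketched below; accept/reject thresholds are system-specific and omitted.

```python
import numpy as np

def dtw_distance(a, b):
    """Classic dynamic time warping between two 1-D feature sequences
    (e.g. pen position, velocity, or pressure over time). A test signature
    is typically accepted when its DTW distance to enrolled references
    falls below a chosen threshold."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return cost[n, m]

print(dtw_distance(np.array([0.0, 1.0, 2.0, 1.0]), np.array([0.0, 0.9, 2.1, 1.1])))
```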
SSIM Based Signature of Facial Micro-Expressions

Facial micro-expressions (MEs) play a crucial role in non-verbal communication. Their automatic detection and recognition in real video is a topic of great interest in different fields. However, the main difficulty in automatically capturing this kind of feature lies in its rapid temporal evolution: MEs occur in very few frames of video acquired by a conventional camera. In this paper, a first study concerning the perceptual characteristics of MEs is performed. The study is based on the observation that MEs are visible to a human observer, even though they are very rapid, and almost independently of the context. The Structural SIMilarity index (SSIM), a common perception-based metric, has then been used to detect a sort of fingerprint of MEs, referred to as the PES (Perceptual Expression Signature). The PES can efficiently guide the preprocessing step of ME recognition procedures, as it allows for fast video segmentation by providing only those frames where an ME occurs with high probability. Preliminary empirical studies on MEs in the wild have confirmed the feasibility of this approach.

Vittoria Bruni, Domenico Vitulano
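A minimal sketch of using SSIM on consecutive frames to flag candidate micro-expression frames, assuming uint8 grayscale input; this illustrates the general idea only and is not the paper's exact PES computation.

```python
import numpy as np
from skimage.metrics import structural_similarity

def ssim_signature(frames):
    """SSIM between consecutive grayscale frames: sharp local dips flag
    frames where a rapid facial change (a candidate micro-expression)
    is likely. frames: list of 2D uint8 arrays."""
    return np.array([structural_similarity(frames[i], frames[i + 1])
                     for i in range(len(frames) - 1)])

def candidate_frames(frames, k=2.0):
    sig = ssim_signature(frames)
    threshold = sig.mean() - k * sig.std()     # illustrative dip threshold
    return np.where(sig < threshold)[0]
```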
Learning to Search for Objects in Images from Human Gaze Sequences

Human vision relies on saccades to extract high quality information on small areas of the field of view, pointing the high resolution region of the retina (i.e. fovea) to the regions of interest. The eye motions are guided by top-down information provided by the task, which in our case is the search for a given object. In this work we propose a Recurrent Neural Network (RNN) model that learns from human demonstrations how to explore an image. The exploration samples are obtained from eye tracking data acquired while subjects inspect images. The proposed model extracts visual features from Convolutional Neural Networks (CNNs), which correspond to the input of the RNN. The contribution of this work is to consider the visual features along with the object label in a new model that is able to search for a given object in an image. We make a comparative study on the importance of context during object search tasks, showing that foveated images perform better than uniform image region crops.

Afonso Nunes, Rui Figueiredo, Plinio Moreno
Detecting Defects in Materials Using Deep Convolutional Neural Networks

This paper proposes representing and detecting manufacturing defects at the micrometre scale using deep convolutional neural networks. The information-theoretic notion of entropy is used to quantify the information gain, or mutual information, of filters throughout the network; the deepest network layers are generally shown to exhibit the highest mutual information between filter responses and defects, and thus serve as the most discriminative features. Quantitative detection experiments based on the AlexNet architecture investigate a variety of design parameters pertaining to data preprocessing and network architecture, with the optimal architectures achieving an average accuracy of 98.54%. CNNs are relatively easy to apply and achieve impressive results in classification tasks; however, the informational complexity arising from network depth remains a limit on improving their capabilities.

Quentin Boyadjian, Nicolas Vanderesse, Matthew Toews, Philippe Bocher
Visual Perception Ranking of Chess Players

In this work, we have carried out a performance analysis of chess players, comparing a standard ranking measure with a novel one proposed here. Treating participants' eye movements, recorded while answering several on-screen chess questions of varying complexity, as high-dimensional spatial attention patterns, we show that expertise is consistently associated with the ability to process visual information holistically using fewer fixations, rather than focusing locally on individual pieces. These findings may disclose new insights for predicting chess skills.

Laercio R. Silva Junior, Carlos E. Thomaz
Video Tampering Detection for Decentralized Video Transcoding Networks

This paper introduces a complete methodology based on machine learning and computer vision techniques for the verification of video transcoding computations in decentralized networks, particularly the open source project Livepeer. A base video dataset is presented, with over 180k samples transcoded using the x264 codec. As a novelty, we propose a set of four features computed as a full-reference comparison between the source and the rendered videos. Using these features, a one-class Support Vector Machine is trained to identify good encodings with high accuracy. Experimental results are presented and the particular constraints of this use case are explained.

Rabindranath Andujar, Ignacio Peletier, Jesus Oliva, Marc Cymontkowski, Yondon Fu, Eric Tang, Josh Allman
Generalized Subspace Learning by Roweis Discriminant Analysis

We present a new method which generalizes subspace learning based on eigenvalue and generalized eigenvalue problems. This method, Roweis Discriminant Analysis (RDA), named after Sam Roweis, is an infinite family of algorithms of which Principal Component Analysis (PCA), Supervised PCA (SPCA), and Fisher Discriminant Analysis (FDA) are special cases. One of the extreme special cases, named Double Supervised Discriminant Analysis (DSDA), uses the labels twice and is novel. We propose a dual for RDA for some special cases. We also propose kernel RDA, generalizing kernel PCA, kernel SPCA, and kernel FDA, using both dual RDA and representation theory. Our theoretical analysis explains previously known facts such as why SPCA can use regression but FDA cannot, why PCA and SPCA have duals but FDA does not, why kernel PCA and kernel SPCA use the kernel trick but kernel FDA does not, and why PCA is the best linear method for reconstruction. Roweisfaces and kernel Roweisfaces are also proposed, generalizing eigenfaces, Fisherfaces, supervised eigenfaces, and their kernel variants. We also report experiments showing the effectiveness of RDA and kernel RDA on benchmark datasets.

Benyamin Ghojogh, Fakhri Karray, Mark Crowley
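For reference, the common structure that RDA generalizes is the (generalized) eigenvalue problem; the well-known special cases below are standard results, with RDA's own parametrization omitted.

```latex
% Subspace learning as a generalized eigenproblem  A v = \lambda B v :
\begin{align*}
  \text{PCA:} \quad & S_T\, v = \lambda\, v
      && (A = S_T,\ \text{total scatter};\; B = I) \\
  \text{FDA:} \quad & S_B\, v = \lambda\, S_W\, v
      && (A = S_B,\ \text{between-class};\; B = S_W,\ \text{within-class})
\end{align*}
% RDA parametrizes the choices of A and B so that PCA, SPCA, and FDA
% are recovered as special cases of a single family.
```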
Understanding Public Speakers’ Performance: First Contributions to Support a Computational Approach

Communication is part of our everyday life, and our ability to communicate can play a significant role in a variety of contexts in our personal, academic, and professional lives. The characterization of what makes a good communicator has long been the subject of research and debate in several areas, particularly in education, with a focus on improving the performance of teachers. In this context, the literature suggests that the ability to communicate is defined not only by the verbal component, but also by a plethora of non-verbal contributions that provide redundant or complementary information and, sometimes, are the message itself. However, even though we can recognize a good or bad communicator, objectively little is known about what aspects, and to what extent, define the quality of a presentation. The goal of this work is to lay the grounds for studying the defining characteristics of a good communicator in a more systematic and objective form. To this end, we conceptualize and provide a first prototype of a computational approach that characterizes the different elements involved in communication from audiovisual data, illustrating the outcomes and applicability of the proposed methods on a video database of public speakers.

Fábio Barros, Ângelo Conde, Sandra C. Soares, António J. R. Neves, Samuel Silva
Open Source Multipurpose Multimedia Annotation Tool

Efficient tools and frameworks for image and video annotation are becoming increasingly necessary for pattern recognition and computer vision research as datasets for training and testing of algorithms grow ever larger. Different software packages have been developed for these tasks, but they are usually designed for specific demands or problems, or are not open to the public. This paper presents an open source multipurpose tool for annotation of multimedia datasets with extended flexibility through customizable labels, the option of working on a shared database for collaborative annotation, and special care given to usability and efficiency for the best user experience. The annotation tool is available at the following link: www.thi.de/go/thi-labeling-tool.

Joed Lopes da Silva, Alan Naoto Tabata, Lucas Cardoso Broto, Marta Pereira Cocron, Alessandro Zimmer, Thomas Brandmeier
SLAM-Based Multistate Tracking System for Mobile Human-Robot Interaction

The transfer from the use of simple robots for specifically predefined tasks to the integration of generalized autonomous systems poses a number of challenges for the collaboration between humans and robots. These include the independent orientation of robots in unknown environments and intuitive interaction with human cooperation partners. We present a robust human-robot interaction (HRI) system that proactively searches for interaction partners and follows them in unknown real environments. For this purpose, an algorithm for simultaneous localization and mapping of the environment is integrated along with a dynamic system for determining the partner's willingness to interact and tracking the partner's location. Interrupted interactions are recovered by a separate recovery mode that is able to identify prior collaboration partners.

Thorsten Hempel, Ayoub Al-Hamadi

3D Computer Vision

Frontmatter
Dense Disparity Maps from RGB and Sparse Depth Information Using Deep Regression Models

A dense and accurate disparity map is relevant for a large number of applications, ranging from autonomous driving to robotic grasping. Recent developments in machine learning techniques enable us to bypass sensor limitations, such as low resolution, by using deep regression models to complete otherwise sparse representations of the 3D space. This article proposes two main approaches that use a single RGB image and sparse depth information gathered from a variety of sensors/techniques (stereo, LiDAR, and Light Stripe Ranging (LSR)): a Convolutional Neural Network (CNN) and a cascade architecture that aims to improve the results of the first. Ablation studies were conducted to infer the impact of these depth cues on the performance of each model. The models trained with sparse LiDAR information are the most reliable, achieving an average Root Mean Squared Error (RMSE) of 11.8 cm on our own Inhouse dataset, while LSR proved to be too sparse an input to support accurate predictions on its own.

Pedro Nuno Leite, Renato Jorge Silva, Daniel Filipe Campos, Andry Maykol Pinto
Exploitation of Dense MLS City Maps for 3D Object Detection

In this paper we propose a novel method for the exploitation of High Density Localization (HDL) maps obtained by Mobile Laser Scanning in order to increase the performance of state-of-the-art real time dynamic object detection (RTDOD) methods utilizing Rotating Multi-Beam (RMB) Lidar measurements. First, we align the onboard measurements to the 3D HDL map with a multimodal point cloud registration algorithm operating in the Hough space. Next we apply a grid based probabilistic step to filter out the object regions on the RMB Lidar data which were falsely predicted as dynamic objects by RTDOD, although they are part of the static background scene. On the other hand, to find objects erroneously missed by the RTDOD predictions, we implement a Markov Random Field based point level change detection approach between the map and the current onboard measurement frame. Finally, to analyse the changed but previously unclassified segments of the RMB Lidar clouds, we apply a geometric blob separation and a Support Vector Machine based classification to distinguish the different object types. Comparative tests are provided in high traffic road sections of Budapest, Hungary, and we show an improvement of 5.96% in precision, 9.21% in recall and 7.93% in F-score metrics against the state-of-the-art RTDOD algorithm.

Örkény Zováthi, Balázs Nagy, Csaba Benedek
Automatic Stereo Disparity Search Range Detection on Parallel Computing Architectures

From the earliest to the state-of-the-art algorithms, stereo depth estimation techniques often require a disparity search range (DSR) value to be chosen manually. However, the optimal DSR varies from one scene to another, making the results depend on operator input, with the operator having to tune the configuration by trial and error. In this paper we present a novel technique, suitable for parallel computing architectures, which detects the optimal DSR for a given scene without requiring operator input or prior knowledge of the scene. Experiments on stereo images from the Middlebury, KITTI, and SceneFlow benchmark datasets indicate that our technique can automatically extract a suitable DSR value from different scenes, which leads to better consistency in matching. The technique presented here can be used with existing stereo algorithms to limit the size of the cost volume as it is being built, without requiring pre-processing or operator input. A CUDA-based implementation of our method delivers real-time performance on consumer-grade GPUs at high frame rates.

Ruveen Perera, Tobias Low
Multi-camera Motion Estimation with Affine Correspondences

We present a study of minimal-case motion estimation with affine correspondences and introduce a new solution for multi-camera motion estimation with affine correspondences. Ego-motion estimation using one or more cameras is a well-studied topic with applications in 3D reconstruction and mobile robotics. Most feature-based motion estimation techniques use point correspondences. Recently, several researchers have developed novel epipolar constraints using affine correspondences. In this paper, we extend the epipolar constraint on affine correspondences to the multi-camera setting and develop and evaluate a novel minimal solver using this new constraint. Our solver uses six affine correspondences in the minimal case, which is a significant improvement over the point-based version that requires seventeen point correspondences. Experiments on synthetic and real data show that, in comparison to the point-based solver, our affine solver effectively reduces the number of RANSAC iterations needed for motion estimation while maintaining comparable accuracy.

Khaled Alyousefi, Jonathan Ventura
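The 6-versus-17 correspondence counts quoted above follow from a standard constraint-counting argument; the constraint forms below are well-known results, not reproduced from the paper.

```latex
% A point correspondence (x_1, x_2) gives one epipolar constraint,
%     x_2^\top E\, x_1 = 0 ,
% while an affine correspondence (x_1, x_2, A) additionally constrains the
% local affinity A, yielding two more linear equations, i.e. 3 per AC.
% The linear solver for generalized (multi-camera) relative pose needs
% 17 constraints from point correspondences, so
%     6 \times 3 = 18 \ge 17
% explains why six affine correspondences suffice in the minimal case.
```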
Backmatter
Metadata
Title
Image Analysis and Recognition
Editors
Prof. Aurélio Campilho
Fakhri Karray
Zhou Wang
Copyright Year
2020
Electronic ISBN
978-3-030-50347-5
Print ISBN
978-3-030-50346-8
DOI
https://doi.org/10.1007/978-3-030-50347-5
