2019 | Book

Computer Vision Systems

12th International Conference, ICVS 2019, Thessaloniki, Greece, September 23–25, 2019, Proceedings

Editors: Dr. Dimitrios Tzovaras, Dr. Dimitrios Giakoumis, Prof. Dr. Markus Vincze, Prof. Antonis Argyros

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 12th International Conference on Computer Vision Systems, ICVS 2019, held in Thessaloniki, Greece, in September 2019.

The 72 papers presented were carefully reviewed and selected from 114 submissions. The papers are organized in the following topical sections: hardware accelerated and real time vision systems; robotic vision; vision systems applications; high-level and learning vision systems; cognitive vision systems; movement analytics and gesture recognition for human-machine collaboration in industry; cognitive and computer vision assisted systems for energy awareness and behavior analysis; and vision-enabled UAV and counter UAV technologies for surveillance and security of critical infrastructures.

Table of Contents


Hardware Accelerated and Real Time Vision Systems

Hardware Accelerated Image Processing on an FPGA-SoC Based Vision System for Closed Loop Monitoring and Additive Manufacturing Process Control

In many industrial sectors such as aeronautics, power generation, and oil & gas, complex metal parts, especially the critical ones, are constructed and manufactured for a very long lifespan (more than 10 years). 4D Hybrid, an EU research project, develops a new concept of hybrid additive manufacturing (AM) modules to ensure first-time-right production. To achieve that, in-line process monitoring can be carried out continuously by a complex sensing and vision system mounted on the equipment to ensure that the process stays within nominal operating conditions. This paper presents the concepts and design of the vision system for process monitoring activities, with hardware accelerated image processing on camera hardware based on SoC FPGA devices, a hybrid of an FPGA and an ARM Cortex-A9 dual-core CPU.

Dietmar Scharf, Bach Le Viet, Thi Bich Hoa Le, Janine Rechenberg, Stefan Tschierschke, Ernst Vogl, Ambra Vandone, Mattia Giardini
Real-Time Binocular Vision Implementation on an SoC TMS320C6678 DSP

In recent years, computer binocular vision has been commonly utilized to provide depth information for autonomous vehicles. This paper presents an efficient binocular vision system implemented on an SoC TMS320C6678 DSP for real-time depth information extrapolation, where the search range propagates from the bottom of an image to its top. To further improve the stereo matching efficiency, the cost function is factorized into five independent parts. The value of each part is pre-calculated and stored in the DSP memory for direct data indexing. The experimental results illustrate that the proposed algorithm performs in real time, when processing the KITTI stereo datasets with eight cores in parallel.

Rui Fan, Sicheng Duanmu, Yanan Liu, Yilong Zhu, Jianhao Jiao, Mohammud Junaid Bocus, Yang Yu, Lujia Wang, Ming Liu
Real-Time Lightweight CNN in Robots with Very Limited Computational Resources: Detecting Ball in NAO

This paper proposes a lightweight CNN architecture called Binary-8 for ball detection on NAO robots, together with a labelled dataset of 1000+ images containing balls in various scenarios, to address the most basic and key issue in robot soccer games: detecting the ball. In contrast to the existing ball detection methods based on traditional machine learning and image processing, this paper presents a lightweight CNN object detection approach for CPU. In order to deal with the problems of tiny size, blurred images, occlusion and many other similar objects during detection, the paper designs a network structure with strong enough feature extraction ability. In order to achieve real-time performance, the paper uses the ideas of depthwise separable convolution and binary weights. In addition, SIMD (Single Instruction Multiple Data) instructions are used to accelerate the operations. The full procedure and network structure are given in the paper. Experimental results show that the proposed CNN architecture can run at full frame rate (140 fps on CPU) with an accuracy of 97.13%.

Qingqing Yan, Shu Li, Chengju Liu, Qijun Chen
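
To illustrate why the depthwise separable convolution and binary weights mentioned in the abstract above reduce cost, the following minimal sketch counts weights for a standard and a depthwise separable layer and binarizes a weight list by sign. The layer sizes and function names are hypothetical and chosen only for illustration; this is not the Binary-8 architecture itself.

```python
def standard_conv_params(k, c_in, c_out):
    """Number of weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise convolution plus a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def binarize(weights):
    """Map real-valued weights to {-1, +1} by sign (zero is mapped to +1)."""
    return [1 if w >= 0 else -1 for w in weights]


# Hypothetical layer sizes, not taken from the paper.
K, C_IN, C_OUT = 3, 32, 64
standard = standard_conv_params(K, C_IN, C_OUT)
separable = depthwise_separable_params(K, C_IN, C_OUT)
print(standard, separable, round(standard / separable, 2))
```

For these assumed sizes the separable layer needs roughly an eighth of the weights, and binarized weights can additionally be stored in one bit each.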
Reference-Free Adaptive Attitude Determination Method Using Low-Cost MARG Sensors

In this paper, an improved iterative method for attitude determination using microelectromechanical-system (MEMS) Magnetic, Angular Rate, and Gravity (MARG) sensors is proposed. The proposed complementary filter is motivated by several existing algorithms, and it decreases the number of variables for iteration, which consequently lowers the convergence time. To enhance the adaptive ability of the proposed method, i.e. its performance under external acceleration, a novel scheme is designed in which the gravity estimation residual is utilized for adaptive tuning of the complementary gain. Experiments are carried out to demonstrate the advantages of the proposed method. The comparisons with representative methods show that the proposed method is more effective, not only in convergence speed but also in dynamic performance under harsh conditions.

Jian Ding, Jin Wu, Mingsen Deng, Ming Liu
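
The idea of tuning a complementary gain from a gravity residual, as described in the abstract above, can be sketched for a single pitch angle. This is a generic textbook-style complementary filter, not the authors' algorithm: the residual used here (the deviation of the accelerometer norm from g), the gain law, and the parameters `base_gain` and `sensitivity` are all assumptions made for illustration.

```python
import math

G = 9.81  # standard gravity in m/s^2


def adaptive_gain(accel, base_gain=0.02, sensitivity=5.0):
    """Reduce the accelerometer weight as the measured norm departs from g.

    The residual |norm(a) - g| / g is an assumed stand-in for a gravity
    estimation residual; a large value suggests external acceleration.
    """
    norm = math.sqrt(sum(a * a for a in accel))
    residual = abs(norm - G) / G
    return base_gain / (1.0 + sensitivity * residual)


def accel_pitch(accel):
    """Pitch angle (rad) implied by the gravity direction, one common convention."""
    ax, ay, az = accel
    return math.atan2(-ax, math.sqrt(ay * ay + az * az))


def update_pitch(pitch, gyro_rate, accel, dt):
    """One filter step: integrate the gyro, then blend in the accelerometer pitch."""
    gain = adaptive_gain(accel)
    predicted = pitch + gyro_rate * dt
    return (1.0 - gain) * predicted + gain * accel_pitch(accel)
```

With the sensor at rest the gain stays near `base_gain`; when the accelerometer norm doubles, the gain drops by a factor of six under these assumed parameters, so the estimate leans more on the gyroscope.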
Feature-Agnostic Low-Cost Place Recognition for Appearance-Based Mapping

The agent’s ability to locate itself in an unfamiliar environment is essential for reliable navigation. To address this challenge, place recognition methods are widely adopted. A common trend among most of these methods is that they are either tailored to work in specific environments or need prior training overhead [11]. Others demand extreme computational resources, such as CNN [8]. In this paper, we study the existing GSOM-based place recognition framework [12] and investigate the question of translating the system to other feature spaces, such as HOG, for low-cost place recognition. The experiments performed on four challenging sequences demonstrate the algorithm’s ability to learn the representation of the new feature space without parameter tuning, provided the scaling factor along each dimension of the descriptor is taken into account. This highlights the feature-agnostic characteristic of the algorithm. We further observed that despite the low dimensionality of the HOG descriptor, the algorithm shows comparable place recognition results to the gist features, while offering threefold speed-ups in execution time.

S. M. Ali Musa Kazmi, Mahmoud A. Mohamed, Bärbel Mertsching

Robotic Vision

Semi-semantic Line-Cluster Assisted Monocular SLAM for Indoor Environments

This paper presents a novel method to reduce the scale drift for indoor monocular simultaneous localization and mapping (SLAM). We leverage the prior knowledge that in the indoor environment, the line segments form tight clusters, e.g. many door frames in a straight corridor are of the same shape, size and orientation, so the same edges of these door frames form a tight line segment cluster. We implement our method in the popular ORB-SLAM2, which also serves as our baseline. In the front end we detect the line segments in each frame and incrementally cluster them in the 3D space. In the back end, we optimize the map imposing the constraint that the line segments of the same cluster should be the same. Experimental results show that our proposed method successfully reduces the scale drift for indoor monocular SLAM.

Ting Sun, Dezhen Song, Dit-Yan Yeung, Ming Liu
Appearance-Based Loop Closure Detection with Scale-Restrictive Visual Features

In this paper, an appearance-based loop closure detection pipeline for autonomous robots is presented. Our method uses scale-restrictive visual features for image representation with a view to reducing the computational cost. In order to achieve this, a training process is performed, where a feature matching technique indicates the features’ repeatability with respect to scale. Votes are distributed into the database through a nearest neighbor method, while a binomial probability function is responsible for the selection of the most suitable loop closing pair. Subsequently, a geometrical consistency check on the chosen pair follows. The method is subjected to an extensive evaluation on a variety of outdoor, publicly available datasets, revealing high recall rates for 100% precision, as compared against its baseline version as well as other state-of-the-art approaches.

Konstantinos A. Tsintotas, Panagiotis Giannis, Loukas Bampis, Antonios Gasteratos
Grasping Unknown Objects by Exploiting Complementarity with Robot Hand Geometry

Grasping unknown objects with multi-fingered hands is challenging due to incomplete information regarding scene geometry and the complicated control and planning of robot hands. We propose a method for grasping unknown objects with multi-fingered hands based on shape complementarity between the robot hand and the object. Taking as input a point cloud of the scene, we locally perform shape completion and then search for hand poses and finger configurations that optimize a local shape complementarity metric. We validate the proposed approach in the MuJoCo physics engine. Our experiments show that the explicit consideration of shape complementarity of the hand leads to robust grasping of unknown objects.

Marios Kiatos, Sotiris Malassiotis
Grapes Visual Segmentation for Harvesting Robots Using Local Texture Descriptors

This paper investigates the performance of Local Binary Patterns variants in grape segmentation for autonomous agricultural robots, namely Agrobots, applied to viniculture and winery. Robust fruit detection is challenging and needs to be accurate to enable the Agrobot to execute demanding tasks of precise farming. The segmentation task is handled by classification with the supervised machine learning model k-Nearest Neighbor (k-NN), including extracted features from Local Binary Patterns (LBP) and their variants in combination with color components. LBP variants are tested for both varieties of red and white grapes, subject to performance measures of accuracy, recall and precision. The results for red grapes indicate an approximate intended accuracy of 94% of detection, while the results relating to white grapes confirm the concerns of complex indiscreet visual cues providing accuracies of 83%.

Eftichia Badeka, Theofanis Kalabokas, Konstantinos Tziridis, Alexander Nicolaou, Eleni Vrochidou, Efthimia Mavridou, George A. Papakostas, Theodore Pachidis
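
As background for the Local Binary Patterns named in the abstract above, the following minimal sketch computes the basic 8-neighbour LBP code and a 256-bin histogram for a small grayscale image given as a list of rows. It shows only the basic operator; the LBP variants, the color components and the k-NN classifier studied in the paper are not reproduced, and the neighbour ordering is an assumed convention.

```python
# Neighbour offsets, clockwise from the top-left pixel (an assumed ordering).
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(image, row, col):
    """Basic LBP code of an interior pixel.

    A bit is set when the neighbour is greater than or equal to the centre.
    """
    centre = image[row][col]
    code = 0
    for bit, (dr, dc) in enumerate(OFFSETS):
        if image[row + dr][col + dc] >= centre:
            code |= 1 << bit
    return code


def lbp_histogram(image):
    """256-bin histogram of LBP codes over all interior pixels."""
    hist = [0] * 256
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            hist[lbp_code(image, r, c)] += 1
    return hist


flat = [[5, 5, 5], [5, 5, 5], [5, 5, 5]]
corner = [[9, 1, 1], [1, 5, 1], [1, 1, 1]]
print(lbp_code(flat, 1, 1), lbp_code(corner, 1, 1))
```

A histogram of this kind, computed per image region, is the sort of texture feature vector that a classifier such as k-NN could consume.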
Open Space Attraction Based Navigation in Dark Tunnels for MAVs

This work establishes a novel framework for characterizing the open space of featureless dark tunnel environments for Micro Aerial Vehicles (MAVs) navigation tasks. The proposed method leverages the processing of a single camera to identify the deepest area in the scene in order to provide a collision free heading command for the MAV. In the sequel and inspired by haze removal approaches, the proposed novel idea is structured around a single image depth map estimation scheme, without metric depth measurements. The core contribution of the developed framework stems from the extraction of a 2D centroid in the image plane that characterizes the center of the tunnel’s darkest area, which is assumed to represent the open space, while the robustness of the proposed scheme is being examined under varying light/dusty conditions. Simulation and experimental results demonstrate the effectiveness of the proposed method in challenging underground tunnel environments [1].

Christoforos Kanellakis, Petros Karvelis, George Nikolakopoulos
6D Gripper Pose Estimation from RGB-D Image

This paper proposes an end-to-end system to directly estimate the 6D pose of a gripper given RGB and depth images of an object. A dataset containing RGB-D images and 6D poses of 20 kinds, 10 for known objects and 10 for unknown ones, is developed in the first place. With all coordinate information gained from successful grasps, the separation between object properties and grasping strategies can be avoided. To improve the usability and uniformity of the raw data, a distinctive data preprocessing approach is illustrated immediately after the creation of the dataset. The entire convolutional neural network framework is given subsequently, and training with a unique loss function adjusts the model to the desired accuracy. Testing on both known and unknown objects verifies the grasping precision of our system.

Qirong Tang, Xue Hu, Zhugang Chu, Shun Wu
Robust Rotation Interpolation Based on SO(n) Geodesic Distance

A novel interpolation algorithm for smoothing of successive rotation matrices based on the geodesic distance of the special orthogonal group SO(n) is proposed. The derived theory is capable of achieving optimal interpolation and offers better accuracy and robustness than representative methods.

Jin Wu, Ming Liu, Jian Ding, Mingsen Deng
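
For readers unfamiliar with the geodesic distance on the special orthogonal group mentioned in the abstract above, here is a minimal sketch for the n = 3 case: the distance is the rotation angle of the relative rotation, and interpolation moves along that rotation at constant rate. This is the standard geodesic interpolation on SO(3), written in plain Python for illustration; it is not the smoothing algorithm proposed in the paper, and the axis extraction used here is unreliable when the relative angle is close to pi.

```python
import math


def matmul(a, b):
    """Product of two 3 x 3 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def transpose(a):
    return [[a[j][i] for j in range(3)] for i in range(3)]


def geodesic_angle(r0, r1):
    """Geodesic distance on SO(3): the rotation angle of r0^T r1, in radians."""
    rel = matmul(transpose(r0), r1)
    trace = rel[0][0] + rel[1][1] + rel[2][2]
    return math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))


def axis_angle_to_matrix(axis, angle):
    """Rodrigues' formula for a unit rotation axis."""
    x, y, z = axis
    c, s = math.cos(angle), math.sin(angle)
    t = 1.0 - c
    return [[c + x * x * t, x * y * t - z * s, x * z * t + y * s],
            [y * x * t + z * s, c + y * y * t, y * z * t - x * s],
            [z * x * t - y * s, z * y * t + x * s, c + z * z * t]]


def interpolate(r0, r1, t):
    """Point at fraction t along the geodesic from r0 to r1."""
    angle = geodesic_angle(r0, r1)
    if angle < 1e-12:
        return [row[:] for row in r0]
    rel = matmul(transpose(r0), r1)
    s = 2.0 * math.sin(angle)  # close to zero near angle = pi (not handled here)
    axis = ((rel[2][1] - rel[1][2]) / s,
            (rel[0][2] - rel[2][0]) / s,
            (rel[1][0] - rel[0][1]) / s)
    return matmul(r0, axis_angle_to_matrix(axis, t * angle))


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
quarter_turn_z = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
halfway = interpolate(identity, quarter_turn_z, 0.5)
```

In this example `halfway` is a rotation of 45 degrees about the z axis, so its geodesic distance to either endpoint is half the distance between the endpoints.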
Estimation of Wildfire Size and Location Using a Monocular Camera on a Semi-autonomous Quadcopter

This paper addresses the problem of estimating the location and size of a wildfire, within the frame of a semi-autonomous recon and data analytics quadcopter. We approach this problem by developing three different algorithms. Two of these take into account that the middle of the camera’s FOV is horizontal with respect to the drone on which it is mounted. The third algorithm relates to the bottom point of the FOV, directly under the drone in 3D space. The evaluation shows that having the pixels correlate to ratios in percentages rather than predetermined values, with respect to the edges of the fire, results in better performance and higher accuracy. Placing the monocular camera horizontally in relation to the drone provides an accuracy of 68.20%, while mounting the camera at an angle delivers an accuracy of 60.76%.

Lucas Goncalves de Paula, Kristian Hyttel, Kenneth Richard Geipel, Jacobo Eduardo de Domingo Gil, Iuliu Novac, Dimitrios Chrysostomou
V-Disparity Based Obstacle Avoidance for Dynamic Path Planning of a Robot-Trailer

Structured space exploration with mobile robots is imperative for autonomous operation in challenging outdoor applications. To this end, robots should be equipped with global path planners that ensure coverage and full exploration of the operational area as well as dynamic local planners that address local obstacle avoidance. The paper at hand proposes a local obstacle detection algorithm based on a fast stereo vision processing step, integrated with a dynamic path planner to avoid the detected obstacles in real time, while simultaneously keeping track of the global path. It considers a robot-trailer articulated system, in which the trailer trace should cover the entire operational space in order to perform a dedicated application. This is achieved by exploiting a model predictive controller to keep track of the trailer path while performing stereo vision-based local obstacle detection. A global path is initially posed that ensures full coverage of the operational space and, during the robot’s motion, the detected obstacles are reported in the robot’s occupancy grid map, which is used by a hybrid global and local planner approach to avoid them locally. The developed algorithm has been evaluated in a simulation environment and showed adequate performance.

Efthimios Tsiogas, Ioannis Kostavelis, Dimitrios Giakoumis, Dimitrios Tzovaras
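
The v-disparity representation named in the title above can be sketched as a row-wise histogram of a disparity map: the ground plane shows up as a slanted line, while an upright obstacle keeps the same disparity over many image rows and forms a vertical streak. The code below is a minimal illustration on a toy disparity map; the streak detector, its thresholds and the toy data are assumptions and not the detection step used by the authors.

```python
def v_disparity(disparity_map, max_disparity):
    """Row-wise histogram: hist[v][d] counts pixels in image row v with disparity d."""
    hist = [[0] * (max_disparity + 1) for _ in disparity_map]
    for v, row in enumerate(disparity_map):
        for d in row:
            if 0 <= d <= max_disparity:
                hist[v][d] += 1
    return hist


def vertical_streaks(hist, min_rows, min_count):
    """Disparities populated by at least min_count pixels in at least min_rows rows."""
    flagged = []
    for d in range(len(hist[0])):
        rows = sum(1 for row in hist if row[d] >= min_count)
        if rows >= min_rows:
            flagged.append(d)
    return flagged


# Toy map: ground disparity grows towards the bottom row; columns 2-3 of the
# upper rows contain an obstacle at constant disparity 4.
toy_map = [
    [1, 1, 4, 4, 1, 1],
    [2, 2, 4, 4, 2, 2],
    [3, 3, 4, 4, 3, 3],
    [4, 4, 4, 4, 4, 4],
    [5, 5, 5, 5, 5, 5],
    [6, 6, 6, 6, 6, 6],
]
toy_hist = v_disparity(toy_map, max_disparity=6)
print(vertical_streaks(toy_hist, min_rows=3, min_count=2))
```

Each ground disparity is populated in only one row of the toy map, so only the obstacle's disparity is flagged.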
Intersection Recognition Using Results of Semantic Segmentation for Visual Navigation

It is popular to use three-dimensional sensing devices such as LiDAR and RADAR for autonomous navigation of ground vehicles in modern approaches. However, there are significant problems: the price of 3D sensing devices, the cost of 3D map building, and the robustness against errors accumulated in long-term moving. Visual navigation based on a topological map using only cheap cameras as external sensors has the potential to solve these problems; road-following and intersection recognition can enable robust navigation. This paper proposes a novel scheme for intersection recognition using results of semantic segmentation, which has a high affinity for vision-based road-following strongly depending on semantic segmentation. The proposed scheme, mainly composed of mode filtering for a segmented image and similarity computation like the Hamming distance, showed good accuracy on the Tsukuba-Challenge 2018 dataset constructed by the authors: perfect results were obtained for more than half of the intersections included in the dataset. In addition, a running experiment using the proposed scheme with vision-based road-following showed that the proposed scheme could classify intersections appropriately in actual environments.

Hiroki Ishida, Kouchi Matsutani, Miho Adachi, Shingo Kobayashi, Ryusuke Miyamoto
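
The two building blocks named in the abstract above, mode filtering of a segmented image and a Hamming-style similarity, can be sketched on small label maps as follows. The window size, the normalisation of the similarity and the toy data are assumptions for illustration, not the authors' settings.

```python
from collections import Counter


def mode_filter(labels, radius=1):
    """Replace each label by the most frequent label in its square window."""
    h, w = len(labels), len(labels[0])
    out = [[0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            window = [labels[rr][cc]
                      for rr in range(max(0, r - radius), min(h, r + radius + 1))
                      for cc in range(max(0, c - radius), min(w, c + radius + 1))]
            out[r][c] = Counter(window).most_common(1)[0][0]
    return out


def hamming_similarity(a, b):
    """Fraction of pixel positions at which two equally sized label maps agree."""
    total = len(a) * len(a[0])
    differing = sum(x != y for row_a, row_b in zip(a, b)
                    for x, y in zip(row_a, row_b))
    return 1.0 - differing / total


reference = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
noisy = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # one mislabelled pixel
print(hamming_similarity(noisy, reference),
      hamming_similarity(mode_filter(noisy), reference))
```

The isolated mislabelled pixel lowers the raw similarity to 8/9; after mode filtering the two maps agree everywhere.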
Autonomous MAV Navigation in Underground Mines Using Darkness Contours Detection

This article considers a low-cost and lightweight platform for the task of autonomous flying for inspection in underground mine tunnels. The main contribution of this paper is integrating simple, efficient and well-established methods in the computer vision community in a state-of-the-art vision-based system for Micro Aerial Vehicle (MAV) navigation in dark tunnels. These methods include Otsu’s threshold and Moore-Neighborhood object tracing. The vision system can detect the position of low-illuminated tunnels in the image frame by exploiting the inherent darkness in the longitudinal direction. This position is then converted from pixel coordinates to the heading rate command of the MAV for adjusting the heading towards the center of the tunnel. The efficacy of the proposed framework has been evaluated in multiple experimental field trials in an underground mine in Sweden, thus demonstrating the capability of low-cost and resource-constrained aerial vehicles to fly autonomously through tunnel confined spaces.

Sina Sharif Mansouri, Miguel Castaño, Christoforos Kanellakis, George Nikolakopoulos
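
A minimal sketch of the pipeline outlined in the abstract above: Otsu's threshold separates dark from bright pixels, the centroid of the dark pixels stands in for the tunnel centre, and its horizontal offset from the image centre could drive a heading correction. The Moore-Neighborhood contour tracing step is omitted, and the centroid rule, the normalised offset and the toy image are assumptions, not the authors' implementation.

```python
def otsu_threshold(pixels, levels=256):
    """Threshold t maximising the between-class variance of {<= t} and {> t}."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so the variance is undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var = w0 * w1 * (mean0 - mean1) ** 2
        if var > best_var:
            best_t, best_var = t, var
    return best_t


def dark_centroid(image, threshold):
    """(row, col) centroid of pixels at or below the threshold, or None if empty."""
    dark = [(r, c) for r, row in enumerate(image)
            for c, value in enumerate(row) if value <= threshold]
    if not dark:
        return None
    return (sum(r for r, _ in dark) / len(dark),
            sum(c for _, c in dark) / len(dark))


def heading_offset(col, width):
    """Horizontal offset of a column from the image centre, scaled to [-1, 1]."""
    half = (width - 1) / 2.0
    return (col - half) / half


# Toy 4 x 6 grayscale frame: a dark patch right of centre on a bright background.
frame = [
    [200, 200, 200, 200, 200, 200],
    [200, 200, 200, 10, 10, 200],
    [200, 200, 200, 10, 10, 200],
    [200, 200, 200, 200, 200, 200],
]
threshold = otsu_threshold([p for row in frame for p in row])
centroid = dark_centroid(frame, threshold)
offset = heading_offset(centroid[1], len(frame[0]))
```

A positive offset means the dark region lies to the right of the image centre; scaling it by a controller gain (a hypothetical tuning parameter) would give a heading rate command.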
Improving Traversability Estimation Through Autonomous Robot Experimentation

The ability to have unmanned ground vehicles navigate unmapped off-road terrain has high impact potential in application areas ranging from supply and logistics, to search and rescue, to planetary exploration. To achieve this, robots must be able to estimate the traversability of the terrain they are facing, in order to be able to plan a safe path through rugged terrain. In the work described here, we pursue the idea of fine-tuning a generic visual recognition network to our task and to new environments, but without requiring any manually labelled data. Instead, we present an autonomous data collection method that allows the robot to derive ground truth labels by attempting to traverse a scene and using localization to decide if the traversal was successful. We then present and experimentally evaluate two deep learning architectures that can be used to adapt a pre-trained network to a new environment. We prove that the networks successfully adapt to their new task and environment from a relatively small dataset.

Christos Sevastopoulos, Katerina Maria Oikonomou, Stasinos Konstantopoulos
Towards Automated Order Picking Robots for Warehouses and Retail

Order picking is one of the most expensive tasks in warehouses nowadays and at the same time one of the hardest to automate. Technical progress in automation technologies has, however, enabled the first robotic products for fully automated picking in certain applications. This paper presents a mobile order picking robot for retail store or warehouse order fulfillment on typical packaged retail store items. This task is especially challenging due to the variety of items which need to be recognized and manipulated by the robot. Besides providing a comprehensive system overview, the paper discusses the chosen techniques for textured object detection and manipulation in greater detail. The paper concludes with a general evaluation of the complete system and elaborates on various potential avenues of further improvement.

Richard Bormann, Bruno Ferreira de Brito, Jochen Lindermayr, Marco Omainska, Mayank Patel

Vision Systems Applications

Tillage Machine Control Based on a Vision System for Soil Roughness and Soil Cover Estimation

Soil roughness and soil cover are important control variables for plant cropping. A certain level of soil roughness can prevent soil erosion, but soil that is too rough prevents good plant emergence. Local heterogeneities in the field make it difficult to get homogeneous soil roughness. Residues, like straw, influence the soil roughness estimation and play an important role in preventing soil erosion. We propose a system to control the tillage intensity of a power harrow by varying the driving speed and PTO speed of a tractor. The basis for the control algorithm is a roughness estimation system based on an RGB stereo camera. A soil roughness index is calculated from the reconstructed soil surface point cloud. The vision system also integrates an algorithm to detect soil cover, like residues. Two different machine learning methods for pixel-wise semantic segmentation of soil cover were implemented: an entangled random forest and a convolutional neural net. The pixel-wise classification of each image into soil, living organic matter, dead organic matter and stone allows for mapping of soil cover during tillage. The results of the semantic segmentation of soil cover were compared to ground truth labelled data using the grid method. The soil roughness measurements were validated using the manual sieve analysis. The whole control system was validated in field trials at different locations.

Peter Riegler-Nurscher, Johann Prankl, Markus Vincze
Color Calibration on Human Skin Images

Many recent medical developments rely on image analysis; however, it is neither convenient nor cost-efficient to use professional image acquisition tools in every clinic or laboratory. Hence, a reliable color calibration is necessary; color calibration refers to adjusting the pixel colors to a standard color space. During a real-life project on neonatal jaundice disease detection, we faced the problem of performing skin color calibration on already taken images of neonatal babies. These images were captured with a smartphone (Samsung Galaxy S7, equipped with a 12 Mega Pixel camera to capture 4032 × 3024 resolution images) in the presence of a specific calibration pattern. This post-processing image analysis prevented us from calibrating the camera itself. There is currently no comprehensive study on color calibration methods applied to human skin images, particularly when using amateur cameras (e.g. smartphones). We made a comprehensive study and we proposed a novel approach for color calibration, Gaussian process regression (GPR), a machine learning model that adapts to environmental variables. The results show that the GPR achieves results equal to state-of-the-art color calibration techniques, while also creating more general models.

Mahdi Amani, Håvard Falk, Oliver Damsgaard Jensen, Gunnar Vartdal, Anders Aune, Frank Lindseth
Hybrid Geometric Similarity and Local Consistency Measure for GPR Hyperbola Detection

The recent development of novel powerful sensor topologies, namely Ground Penetrating Radar (GPR) antennas, gave a thrust to the modeling of underground environment. An important step towards underground modelling is the detection of the typical hyperbola patterns on 2D images (B-scans), formulated due to the reflections of underground utilities. This work introduces a soil-agnostic approach for hyperbola detection, starting from one dimensional GPR signals, viz. A-scans, to perform a segmentation of each trace into candidate reflection pulses. Feature vector representations are calculated for segmented pulses through multilevel DWT decomposition. A theoretical geometric model of the corresponding hyperbola pattern is generated on the image plane for all point coordinates of the area under inspection. For each theoretical model, measured pulses that best support it are extracted and are utilized to validate it with a novel hybrid measure. The novel measure simultaneously controls the geometric plausibility of the examined hyperbola model and the consistency of the pulses contributing to this model across all the examined A-scan traces. Implementation details are discussed and experimental evaluation is exhibited on real GPR data.

Evangelos Skartados, Ioannis Kostavelis, Dimitrios Giakoumis, Dimitrios Tzovaras
Towards a Professional Gesture Recognition with RGB-D from Smartphone

The goal of this work is to build the basis for a smartphone application that provides functionalities for recording human motion data, training machine learning algorithms and recognizing professional gestures. First, we take advantage of the new mobile phone cameras, either infrared or stereoscopic, to record RGB-D data. Then, a bottom-up pose estimation algorithm based on Deep Learning extracts the 2D human skeleton and exports the 3rd dimension using the depth. Finally, we use a gesture recognition engine, which is based on K-means and Hidden Markov Models (HMMs). The performance of the machine learning algorithm has been tested with professional gestures using a silk-weaving and a TV-assembly dataset.

Pablo Vicente Moñivar, Sotiris Manitsaris, Alina Glushkova
Data Anonymization for Data Protection on Publicly Recorded Data

Data protection in Germany has a long tradition. For a long time, the German Federal Data Protection Act or Bundesdatenschutzgesetz (BDSG) was considered one of the strictest. Since May 2017 the EU General Data Protection Regulation (GDPR) regulates data protection all over Europe and it is strongly influenced by the German law. When recording data in public areas, the recordings may contain personal data, such as license plates or persons. According to the GDPR this processing of personal data has to fulfill certain requirements to be considered lawful. In this paper, we address recording visual data in public while abiding by the applicable laws. Towards this end, a formal data protection concept is developed for a mobile sensor platform. The core part of this data protection concept is the anonymization of personal data, which is implemented with state-of-the-art deep learning based methods achieving almost human-level performance. The methods are evaluated quantitatively and qualitatively on example data recorded with a real mobile sensor platform in an urban environment.

David Münch, Ann-Kristin Grosselfinger, Erik Krempel, Marcus Hebel, Michael Arens
Water Streak Detection with Convolutional Neural Networks for Scrubber Dryers

Avoiding gray water remainders behind wet floor cleaning machines is an essential requirement for the safety of passersby and the quality of cleaning results. Nevertheless, operators of scrubber dryers frequently do not pay sufficient attention to this aspect and automatic robotic cleaners cannot even sense water leakage. This paper introduces a compact, low-cost, low-energy water streak detection system for use with existing and new cleaning machines. It comprises a Raspberry Pi with an Intel Movidius Neural Compute Stick, an illumination source, and a camera to observe the floor after cleaning. The paper evaluates six different Convolutional Neural Network (CNN) architectures on a self-recorded water streak data set which contains nearly 43000 images of 59 different floor types. The results show that up to 97% of all water events can be detected at a low false positive rate of only 2.6%. The fastest CNN, SqueezeNet, can process images at a sufficient speed of over 30 Hz on the low-cost hardware, such that real applicability in practice is provided. When using an NVIDIA Jetson Nano as an alternative low-cost computing system, five out of the six networks can be operated faster than 30 Hz.

Uriel Jost, Richard Bormann
Segmenting and Detecting Nematode in Coffee Crops Using Aerial Images

A challenge in precision agriculture is the detection of pests in agricultural environments. This paper describes a methodology to detect the presence of the nematode pest in coffee crops. An Unmanned Aerial Vehicle (UAV) is used to obtain high-resolution RGB images of a commercial coffee plantation. The proposed methodology enables the extraction of visual features from image regions and uses supervised machine learning (ML) techniques to classify areas into two classes: pests and non-pests. Several learning techniques were compared using approaches with and without segmentation. Results demonstrate the methodology’s potential, with an average f-measure of 63% for a Convolutional Neural Network (U-net) with manual segmentation.

Alexandre J. Oliveira, Gleice A. Assis, Vitor Guizilini, Elaine R. Faria, Jefferson R. Souza
Automatic Detection of Obstacles in Railway Tracks Using Monocular Camera

This paper presents an algorithm for automatic detection of obstructions on railway tracks. Based on computer vision techniques, this algorithm extracts the railway tracks from the image feed and automatically detects obstacles that can endanger normal railway system operation, as well as the safety of its users. To segment the railway tracks, two techniques are explored. First, the Hough transform is used to detect straight lines, which proves to be inefficient when dealing with curves. To overcome this problem, an alternative solution is developed based on mathematical morphology techniques and BLOB (Binary Large OBject) analysis, leading to a more robust segmentation. The surrounding terrain is also subject to analysis. The algorithm’s performance is evaluated considering different scenarios with and without simulated anomalies, demonstrating the effectiveness of the proposed solution.

Guilherme Kano, Tiago Andrade, Alexandra Moutinho
A Sequential Approach for Pain Recognition Based on Facial Representations

Pain assessment is a hard subjective problem, but still, it is critical in many medical situations. Many computational approaches explore pain detection and estimation using different types of data and descriptors. Among these, spontaneous facial expressions coded by the Facial Action Coding System (FACS) have achieved outstanding results in frame-by-frame stationary analysis, but not in temporal analysis. We explore spatiotemporal features extracted from video sequences considering pain stimuli as references in the temporal analysis. Our proposal focuses on guided learning by warping the appearance surrounding the facial action units (AUs). The facial features from frames are processed sequentially to extract their temporal correspondences. These sequences are generated from the original videos and must represent a single-stimulus effect in a short period, so we develop generation policies. Experimental results on the publicly available UNBC-McMaster database have demonstrated that our approach yields significant advances over the state-of-the-art.

Antoni Mauricio, Fábio Cappabianco, Adriano Veloso, Guillermo Cámara
A Computer Vision System Supporting Blind People - The Supermarket Case

The proposed application builds on the latest advancements of computer vision with the aim to improve the autonomy of people with visual impairment at both practical and emotional level. More specifically, it is an assistive system that relies on visual information to recognise the objects and faces surrounding the user. The system is supported by a set of sensors for capturing the visual information and for transmitting the auditory messages to the users. In this paper, we present a computer vision application, e-vision, in the context of visiting the supermarket for buying groceries.

Kostas Georgiadis, Fotis Kalaganis, Panagiotis Migkotzidis, Elisavet Chatzilari, Spiros Nikolopoulos, Ioannis Kompatsiaris

High-Level and Learning Vision Systems

Comparing Ellipse Detection and Deep Neural Networks for the Identification of Drinking Glasses in Images

This study compares a deep learning approach with the traditional computer vision method of ellipse detection on the task of detecting semi-transparent drinking glasses filled with water in images. Deep neural networks can, in principle, be trained until they exhibit excellent performance in terms of detection accuracy. However, their ability to generalise to different types of surroundings relies on large amounts of training data, while ellipse detection can work in any environment without requiring additional data or algorithm tuning. Two deep neural networks trained on different image data sets containing drinking glasses were tested in this study. Both networks achieved high levels of detection accuracy, independently of the test image resolution. In contrast, the ellipse detection method was less consistent, greatly depending on the visibility of the top and bottom of the glasses, and water levels. The method detected the top of the glasses in less than half of the cases, at lower resolutions; and detection results were even worse for the water level and bottom of the glasses, in all resolutions.

Abdul Jabbar, Alexandre Mendes, Stephan Chalup
Detecting Video Anomaly with a Stacked Convolutional LSTM Framework

Automatic anomaly detection in real-world video surveillance is still challenging. In this paper, we propose an autoencoder architecture based on a stacked convolutional LSTM framework that highlights both spatial and temporal aspects in detecting anomalies of surveillance videos. The spatial component (i.e. spatial encoder/decoder) uses a Convolutional Neural Network (CNN) and carries information about scenes and objects. The temporal component (i.e. temporal encoder/decoder) uses a stacked convolutional LSTM and conveys object movement. Specifically, we integrate the CNN and the stacked convolutional LSTM to learn normal patterns from the training data, which contains only normal events. With the integrated approach, our method can better model spatio-temporal information than many others. We train our models in an unsupervised manner, and labels are required only in the testing phase. Our method is evaluated on the datasets of Avenue, UCSD and ShanghaiTech Campus. The results show that the accuracy of our method rivals state-of-the-art methods with a faster detection speed.

Hao Wei, Kai Li, Haichang Li, Yifan Lyu, Xiaohui Hu
Multi-scale Relation Network for Few-Shot Learning Based on Meta-learning

Deep neural networks can learn a huge function space, because they have millions of parameters to fit large amounts of labeled data. However, this advantage is a major obstacle for few-shot learning, which has to make predictions based on only a few samples of each class. In this work, inspired by multi-scale feature methods and the relation network, which uses a neural network to learn metrics, we propose a concise and efficient network, the multi-scale relation network. The network consists of a feature extractor and a metric learner. Firstly, the feature extractor extracts multi-scale features by combining features from different convolutional layers. Secondly, we generate the relation feature by calculating the absolute value of the difference between multi-scale features. The results on benchmark sets show that our method avoids overfitting and prolongs the learning process, providing higher performance with simple design choices.

Yueming Ding, Xia Tian, Lirong Yin, Xiaobing Chen, Shan Liu, Bo Yang, Wenfeng Zheng
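
The relation feature mentioned in the abstract above (the absolute value of the difference between multi-scale features) can be illustrated with a minimal Python sketch. The function names, the plain concatenation used to combine layers, and the toy vectors are assumptions made for illustration; the paper's actual feature extractor and metric learner are not reproduced here.

```python
def multi_scale_feature(layer_features):
    """Combine per-layer feature vectors into one multi-scale vector.

    Assumption: layers are combined by simple concatenation.
    """
    return [value for layer in layer_features for value in layer]


def relation_feature(support, query):
    """Element-wise absolute difference between two multi-scale features."""
    if len(support) != len(query):
        raise ValueError("features must have the same length")
    return [abs(s - q) for s, q in zip(support, query)]


# Toy features from two hypothetical convolutional layers per image.
support = multi_scale_feature([[1.0, 2.0], [0.5]])
query = multi_scale_feature([[0.0, 4.0], [1.5]])
relation = relation_feature(support, query)
```

In a full model, a vector like `relation` would be passed to a learned metric network that scores how likely the query belongs to the support class.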
Planar Pose Estimation Using Object Detection and Reinforcement Learning

Pose estimation concerns systems or models dealing with the determination of a static object’s pose using, in this case, vision. This paper approaches the problem with an active vision-based solution that integrates both perception and action in the same model. The problem is solved using a combination of neural networks for object detection and a reinforcement learning architecture for moving a camera and estimating the pose. A robotic implementation of the proposed active vision system is used for testing with promising results. Experiments show that our approach not only solves the simple task of planar visual pose estimation, but also exhibits robustness to changes in the environment.

Frederik Nørby Rasmussen, Sebastian Terp Andersen, Bjarne Grossmann, Evangelos Boukas, Lazaros Nalpantidis
A Two-Stage Approach for Commonality-Based Temporal Localization of Periodic Motions

We present an unsupervised method for the detection of all temporal segments of videos or motion capture data that correspond to periodic motions. The proposed method is based on the detection of similar segments (commonalities) in different parts of the input sequence and employs a two-stage approach that operates on the matrix of pairwise distances of all input frames. The quantitative evaluation of the proposed method on three standard ground-truth-annotated datasets (two video datasets, one 3D human motion capture dataset) demonstrates its improved performance in comparison to existing approaches.

Costas Panagiotakis, Antonis Argyros
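
The matrix of pairwise distances of all input frames, on which the method in the abstract above operates, can be sketched as follows. This is a minimal illustration that assumes each frame is already described by a fixed-length feature vector and that Euclidean distance is used; the function name and the toy sequence are hypothetical, and the two-stage detection itself is not shown.

```python
import math


def pairwise_distance_matrix(frames):
    """Return a symmetric matrix D with D[i][j] = distance(frame i, frame j)."""
    n = len(frames)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(frames[i], frames[j])  # Euclidean distance
            dist[i][j] = d
            dist[j][i] = d
    return dist


# Toy sequence whose frame features repeat with a period of two frames.
frames = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0), (3.0, 4.0)]
D = pairwise_distance_matrix(frames)
```

In such a matrix, a periodic motion shows up as low-distance entries along diagonals offset from the main diagonal by multiples of the period, for example `D[0][2]` and `D[1][3]` in the toy sequence.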
Deep Residual Temporal Convolutional Networks for Skeleton-Based Human Action Recognition

Deep residual networks for action recognition based on skeleton data can avoid the degradation problem, and a 56-layer Res-Net has recently achieved good results. Since a much “shallower” 11-layer model (Res-TCN) with a temporal convolution network and a simplified residual unit achieved almost competitive performance, we investigate deep variants of Res-TCN and compare them to Res-Net architectures. Our results outperform the other approaches in this class of residual networks. Our investigation suggests that the resistance of deep residual networks to degradation is not only determined by the architecture but also by data and task properties.

R. Khamsehashari, K. Gadzicki, C. Zetzsche
Monte Carlo Tree Search on Directed Acyclic Graphs for Object Pose Verification

Reliable object pose estimation is an integral part of robotic vision systems as it enables robots to manipulate their surroundings. Powerful methods exist that estimate object poses from RGB and RGB-D images, yielding a set of hypotheses per object. However, determining the best hypotheses from the set of possible combinations is a challenging task. We apply MCTS to this problem to find an optimal solution in limited time and propose to share information between equivalent object combinations that emerge during the tree search, so-called transpositions. Thereby, the number of combinations that need to be considered is reduced and the search gathers information on these transpositions in a single statistic. We evaluate the resulting verification method on the YCB-VIDEO dataset and show more reliable detection of the best solution as compared to state of the art. In addition, we report a significant speed-up compared to previous MCTS-based methods for object pose verification.

Dominik Bauer, Timothy Patten, Markus Vincze
Leveraging Symmetries to Improve Object Detection and Pose Estimation from Range Data

Many man-made objects around us exhibit rotational symmetries. This fact can be exploited to improve object detection and 6D pose estimation performance. To this end we propose a set of extensions to the state-of-the-art PPF pipeline. We describe how a fundamental region is selected on symmetrical objects and used to construct a compact model hash table and a Hough voting space without redundancies. We also introduce a symmetry-aware distance metric for the pose clustering step. Our experiments on T-LESS and ToyotaLight datasets demonstrate that these extensions lead to a consistent improvement in the pose estimation recall score compared to the baseline pipeline, while simultaneously reducing computation time by up to 4 times.

Sergey V. Alexandrov, Timothy Patten, Markus Vincze
Towards Meaningful Uncertainty Information for CNN Based 6D Pose Estimates

Image based object recognition and pose estimation is nowadays a heavily focused research field important for robotic object manipulation. Despite the impressive recent success of CNNs, to our knowledge none includes a self-estimation of its predicted pose’s uncertainty. In this paper we introduce a novel fusion-based CNN output architecture for 6d object pose estimation, obtaining competitive performance on the YCB-Video dataset while also providing meaningful uncertainty information per 6d pose estimate. It is motivated by the recent success in semantic segmentation, which means that CNNs can learn to know what they see in a pixel. Therefore our CNN produces a per-pixel output of a point in object coordinates with image space uncertainty, which is then fused by (generalized) PnP resulting in a 6d pose with $$6\times 6$$ covariance matrix. We show that a CNN can compute image space uncertainty while the way from there to pose uncertainty is well solved analytically. In addition, the architecture allows fusing additional sensor and context information (e.g. binocular or depth data) and makes the CNN independent of the camera parameters by which a training sample was taken. (Code available under .)

Jesse Richter-Klug, Udo Frese
QuiltGAN: An Adversarially Trained, Procedural Algorithm for Texture Generation

We investigate a generative method that synthesises high-resolution images based on a single constraint source image. Our approach consists of three types of conditional deep convolutional generative adversarial networks (cDCGAN) that are trained to generate samples of an image patch conditional on the surrounding image regions. The cDCGAN discriminator evaluates the realism of the generated sample concatenated with the surrounding pixels that were conditioned on. This encourages the cDCGAN generator to create image patches that seamlessly blend with their surroundings while maintaining the randomisation of the standard GAN process. After training, the cDCGANs recursively generate a sequence of samples which are then stitched together to synthesise a larger image. Our algorithm is able to produce a nearly infinite collection of variations of a single input image that have enough variability while preserving the essential large-scale constraints. We test our system on several types of images, including urban landscapes, building facades and textures, comparing very favourably against standard image quilting approaches.

Renato Barros Arantes, George Vogiatzis, Diego Faria
Automated Mechanical Multi-sensorial Scanning

The 3D reconstruction of Cultural Heritage objects is a significant and advantageous technology for conservators and restorers. It contributes to the proper documentation of CH items, allows researchers, scholars and the general public to better manipulate and understand CH objects and gives the opportunity for remote and enhanced on-site experiences through virtual museums or even personal digital collections. The latest technological advances in computer vision in conjunction with robotics facilitate the development of automated and optimal solutions for digitizing complicated artifacts. In this direction, the current study presents an integrated, portable solution based on a modular architecture, for accurate multi-sensorial 3D scanning via a dedicated motorized mechanical arm and efficient automatic 3D reconstruction of a big variety of cultural heritage assets even in situ. The system is composed of a customized 3D reconstruction module, an automated motion planning module and a physical positioning system built by combining a mechanical arm and a rotary table. The key strength of the proposed system is that it is a cost-effective and time-saving solution applying computer vision and robotic technologies in order to serve Cultural Heritage preservation.

Vaia Rousopoulou, Konstantinos Papachristou, Nikolaos Dimitriou, Anastasios Drosou, Dimitrios Tzovaras

Cognitive Vision Systems

Point Pair Feature Matching: Evaluating Methods to Detect Simple Shapes

A recent benchmark for 3D object detection and 6D pose estimation from RGB-D images shows the dominance of methods based on Point Pair Feature Matching (PPFM). Since its invention in 2010 several modifications have been proposed to cope with its weaknesses, which are computational complexity, sensitivity to noise, and difficulties in the detection of geometrically simple objects with planar surfaces and rotational symmetries. In this work we focus on the latter. We present a novel approach to automatically detect rotational symmetries by matching the object model to itself. Furthermore, we adapt methods for pose verification and use more discriminative features which incorporate global information into the Point Pair Feature. We also examine the effects of other, already existing extensions by testing them on our specialized dataset for geometrically primitive objects. Results show that particularly our handling of symmetries and the augmented features are able to boost recognition rates.

Markus Ziegler, Martin Rudorfer, Xaver Kroischke, Sebastian Krone, Jörg Krüger
Multi-DisNet: Machine Learning-Based Object Distance Estimation from Multiple Cameras

In this paper, a novel method for distance estimation from multiple cameras to the object viewed with these cameras is presented. The core element of the method is a multilayer neural network named Multi-DisNet, which is used to learn the relationship between the sizes of the object bounding boxes in the camera images and the distance between the object and the cameras. Multi-DisNet was trained using a supervised learning technique where the input features were manually calculated parameters of the objects' bounding boxes in the camera images and the outputs were ground-truth distances between the objects and the cameras. The presented distance estimation system can be of benefit for all applications where object (obstacle) distance estimation is essential for safety, such as autonomous driving applications in automotive or railway. The presented object distance estimation system was evaluated on images of real-world railway scenes. As a proof-of-concept, results on the fusion of two sensors, an RGB and a thermal camera mounted on a moving train, in the Multi-DisNet distance estimation system are shown. The results demonstrate both the good performance of the Multi-DisNet system in estimating mid-range (up to 200 m) and long-range (up to 1000 m) object distance and the benefit of sensor fusion in overcoming the problem of unreliable object detection.

Haseeb Muhammad Abdul, Ristić-Durrant Danijela, Gräser Axel, Banić Milan, Stamenković Dušan
Hierarchical Image Inpainting by a Deep Context Encoder Exploiting Structural Similarity and Saliency Criteria

The purpose of this paper is to present a context learning algorithm for inpainting missing regions using visual features. This encoder learns physical structure and semantic information from the image and this representation differentiates it from simple auto encoders. Such properties are crucial for tasks like image inpainting, classification and detection. Training was performed by patch-wise reconstruction loss using Structural Similarity (SSIM) jointly with an adversarial loss. The reconstruction loss is also augmented using spatially varying saliency maps that increase the error penalty on distinctive regions and thus promote image sharpness. Furthermore, in order to improve image continuity on the boundary of the missing region, distance functions with increasing importance towards the center of the inpainting region are also used either independently or in conjunction with the saliency maps. We also show that our choice of reconstruction loss outperforms conventional criteria such as the L2 norm. This means giving more weight to pixels closer to the border of the missing image parts and also giving more importance to salient parts of the image to guide the reconstruction, thus producing more realistic images.

Nikolaos Stagakis, Evangelia I. Zacharaki, Konstantinos Moustakas
Online Information Augmented SiamRPN

Recently, many Siamese network based object tracking methods have been proposed and have shown good performance. These methods give two images to two identical artificial neural networks as inputs and find the target area based on the similarity measured by the Siamese network. However, the measure used in the Siamese network is based on offline training and therefore easily fails to adapt to online changes. In this paper, we propose to apply a distance measure which considers the relative position between the objects and the histogram information as additional online information. This additional information prevents the tracking from failing when hard negative cases appear in the scene.

Edward Budiman Sutanto, Sukho Lee
Deep-Learning-Based Computer Vision System for Surface-Defect Detection

Automating optical-inspection systems using machine learning has become an interesting and promising area of research. In particular, deep-learning approaches have shown a very high and direct impact on the application domain of visual inspection. This paper presents a complete inspection system for automated quality control of a specific industrial product. Both the hardware and software parts of the system are described, with machine vision used for image acquisition and pre-processing followed by a segmentation-based deep-learning model used for surface-defect detection. The deep-learning model is compared with state-of-the-art commercial software, showing that the proposed approach outperforms the related method on the specific domain of surface-crack detection. Experiments are performed on a real-world quality-control case and demonstrate that the deep-learning model can be successfully used even when only 33 defective training samples are available. This makes the deep-learning method practical for use in industry where the number of available defective samples is limited.

Domen Tabernik, Samo Šela, Jure Skvarč, Danijel Skočaj
Color-Guided Adaptive Support Weights for Active Stereo Systems

In this paper we present a color-guided adaptive support weight scheme for the cost aggregation of active stereo matching systems. These systems work by stereo matching two images using the texture provided by infrared pseudo-random dot pattern projectors. This method might fail in regions where the pattern is absent, due to the geometry of the scene and/or the reflectivity properties of the test material. However, leveraging the texture information provided by a separate color sensor might uncover details otherwise unseen by the infrared sensors. We propose a cost aggregation method that utilizes both an infrared and a color image of the scene, making smart aggregation choices depending on the underlying texture information provided by the two separate images. We use our cost aggregation method with the fully self-supervised real-time architecture presented in [14], having in mind the usage of low-cost commercial active stereo matching sensors, like the Intel Realsense D435 sensor, in industrial applications demanding high-quality depth maps. We evaluate our results on our own dataset composed of vehicle surface data, and give qualitative evidence of the disparity estimation improvements.

Ioannis Kleitsiotis, Nikolaos Dimitriou, Konstantinos Votis, Dimitrios Tzovaras
Image Enhancing in Poorly Illuminated Subterranean Environments for MAV Applications: A Comparison Study

This work focuses on a comprehensive study and evaluation of existing low-level vision techniques for low light image enhancement, targeting applications in subterranean environments. More specifically, an emerging effort is currently pursuing the deployment of Micro Aerial Vehicles in subterranean environments for search and rescue missions, infrastructure inspection and other tasks. A major part of the autonomy of these vehicles, as well as the feedback to the operator, has been based on the processing of the information provided from onboard visual sensors. Nevertheless, subterranean environments are characterized by a low natural illumination that directly affects the performance of the utilized visual algorithms. In this article, a novel extensive comparison study is presented among five state-of-the-art low light image enhancement algorithms for evaluating their performance and identifying further developments needed. The evaluation has been performed on datasets collected in real underground tunnel environments with challenging conditions from the onboard sensor of a MAV.

Christoforos Kanellakis, Petros Karvelis, George Nikolakopoulos
Robust Optical Flow Estimation Using the Monocular Epipolar Geometry

The estimation of optical flow in cases of illumination change, sparsely-textured regions or fast moving objects is a challenging problem. In this paper, we analyze the use of a texture constancy constraint based on local descriptors (i.e., HOG) integrated with the monocular epipolar geometry to robustly estimate optical flow. The framework is implemented in differential data fidelities using a total variation model in a multi-resolution scheme. In addition, we propose an effective method to refine the fundamental matrix along with the estimation of the optical flow. Experimental results based on the challenging KITTI dataset show that the integration of the texture constancy constraint with the monocular epipolar line constraint and the enhancement of the fundamental matrix significantly increase the accuracy of the estimated optical flow. Furthermore, a comparison with existing state-of-the-art approaches shows better performance for the proposed approach.

Mahmoud A. Mohamed, Bärbel Mertsching
3D Hand Tracking by Employing Probabilistic Principal Component Analysis to Model Action Priors

This paper addresses the problem of 3D hand pose estimation by modeling specific hand actions using probabilistic Principal Component Analysis. For each of the considered actions, a parametric subspace is learned based on a dataset of sample action executions. The developed method tracks the 3D hand pose either in the case of unconstrained hand motion or in the case that the hand is engaged in some of the modelled actions. The tracker uses gradient descent optimization to fit a 3D hand model to the available observations. An online criterion is used to automatically switch between tracking the hand in the unconstrained case and tracking it in the case of learned action sub-spaces. To train and evaluate the proposed method, we captured a new dataset that contains sample executions of 5 different grasp-like hand actions and hand/object interactions. We tested the proposed method both quantitatively and qualitatively. For the quantitative evaluation we relied on our dataset to create synthetic sequences from which we artificially removed observations to simulate occlusions. The obtained results show that the proposed method improves 3D hand pose estimation over existing approaches, especially in the presence of occlusions, where the employed action models assist the accurate recovery of the 3D hand pose despite the missing observations.

Emmanouil Oulof Porfyrakis, Alexandros Makris, Antonis Argyros
Cross-Domain Interpolation for Unpaired Image-to-Image Translation

Unpaired image-to-image translation is a brand new, challenging problem that consists of extracting and matching latent vectors from a source domain A and a target domain B. Both latent spaces are matched and interpolated by a directed correspondence function F for $$A \rightarrow B$$ and G for $$B \rightarrow A$$. Current efforts point to models based on Generative Adversarial Networks (GANs) because they synthesize quite realistic new samples across different domains by learning critical features from their latent spaces. Nonetheless, domain exploration is not explicitly supervised; thereby most GAN-based models fail to learn the key features. In consequence, the correspondence function overfits and fails in reverse or loses translation quality. In this paper, we propose a guided learning model through manifold bi-directional translation loops between the source and the target domains considering the Wasserstein distance between their probability distributions. The bi-directional translation is CycleGAN-based but considers the latent space Z as an intermediate domain which guides the learning process and reduces the induced error from loops. We show experimental results on several public datasets including Cityscapes, Horse2zebra, and Monet2photo at the EECS-Berkeley webpage ( ). Our results are competitive with the state-of-the-art regarding visual quality, stability, and other baseline metrics.

Jorge López, Antoni Mauricio, Jose Díaz, Guillermo Cámara
A Short-Term Biometric Based System for Accurate Personalized Tracking

Surveillance systems have long been in the focus of the research community. Although the accurate detection of human presence in the scene is now possible even under extreme environmental conditions via advanced modern camera sensors, efficient personalized tracking is still an open issue and a significant challenge for researchers. Moreover, personalized tracking will not only enhance the tracking robustness but can also find useful application in several commercial surveillance use-cases, ranging from security to occupancy statistics (i.e. per building, per space and per human). In this respect, this paper introduces a novel biometric approach for enhanced privacy preserving human tracking based on a novel soft-biometric feature of humans. The moving blobs in the recorded scene can be easily detected in the colour images, while the human silhouettes are detected from the corresponding depth ones. The state-of-the-art 3D Weighted Walkthroughs (3DWW) transformation is applied on the extracted human 3D point cloud, thus forming a short-term soft biometric signature. The re-authentication of the humans is performed via the comparison of their last valid signature with the current one. A thorough analysis on the adjustment of the system’s optimal operational settings has been carried out and the experimental results illustrate the promising robustness, accuracy and efficiency of the human tracking performance.

Georgios Stavropoulos, Nikolaos Dimitriou, Anastasios Drosou, Dimitrios Tzovaras

Workshop on: Movement Analytics and Gesture Recognition for Human-Machine Collaboration in Industry 4.0

Real-Time Gestural Control of Robot Manipulator Through Deep Learning Human-Pose Inference

With the rise of collaborative robots, human-robot interaction needs to be as natural as possible. In this work, we present a framework for real-time continuous motion control of a real collaborative robot (cobot) from gestures captured by an RGB camera. Through existing deep learning techniques, we obtain human skeletal pose information both in 2D and 3D. We use it to design a controller that makes the robot mirror in real-time the movements of a human arm or hand.

Jesus Bujalance Martin, Fabien Moutarde
A Comparison of Computational Intelligence Techniques for Real-Time Discrete Multivariate Time Series Classification of Conducting Gestures

Gesture classification is a computational process that can identify and classify human gestures. More specifically, gesture classification is often a discrete multivariate time series classification problem and various computational intelligence solutions have been developed for these problems. It is difficult to determine which existing techniques and approaches to algorithms will produce the most effective solutions for discrete multivariate time series classification problems. In this study, we compare twelve different classification algorithms to report which techniques and approaches are most effective for recognizing conducting beat pattern gestures. After performing 10-fold cross-validation tests on twelve commonly used algorithms, the results show that of the algorithms tested, the most accurate were RNN, LSTM, and DTW; all of which had an accuracy of 100%. We found that in general, algorithms which can take in a dynamic sequence input and classification algorithms that are discriminative performed consistently well, while their counterparts varied in performance. From these results we determine that when selecting a computational intelligence technique to solve these classification problems, it would be advantageous to consider the top performing algorithms along with furthering research into new dynamic input and discriminative algorithms.

Justin van Heek, Gideon Woo, Jack Park, Herbert H. Tsang
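
Dynamic time warping (DTW), one of the top-performing techniques named in the abstract above, accepts sequences of different lengths, which is what makes it a dynamic-input method. The following minimal Python sketch shows the classic DTW distance between two multivariate time series. The function name, the Euclidean local cost, and the toy sequences are assumptions for illustration and are not taken from the study.

```python
import math


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two sequences.

    Each sequence is a list of equal-dimension points; the local cost is
    the Euclidean distance between points.
    """
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost aligning a[:i] with b[:j]
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its point
                cost[i][j - 1],      # b advances, a repeats its point
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Toy one-dimensional gesture traces (hypothetical data).
template = [(0.0,), (1.0,), (2.0,)]
slower = [(0.0,), (1.0,), (1.0,), (2.0,)]   # same shape, performed slower
different = [(0.0,), (2.0,), (2.0,)]

same_shape = dtw_distance(template, slower)
other_shape = dtw_distance(template, different)
```

A simple classifier can assign a new gesture the label of the stored template with the smallest DTW distance; here the slower trace aligns with the template at zero cost while the different trace does not.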
A Deep Network for Automatic Video-Based Food Bite Detection

Past research has now provided compelling evidence pointing towards correlations among individual eating styles and the development of (un)healthy eating patterns, obesity and other medical conditions. In this setting, an automatic, non-invasive food bite detection system can be a really useful tool in the hands of nutritionists, dietary experts and medical doctors in order to explore real-life eating behaviors and dietary habits. Unfortunately, the automatic detection of food bites can be challenging due to occlusions between hands and mouth, use of different kitchen utensils and personalized eating habits. On the other hand, although accurate, manual bite detection is time-consuming for the annotator, making it infeasible for large scale experimental deployments or real-life settings. In this regard, we propose a novel deep learning methodology that relies solely on human body and face motion data extracted from videos depicting people eating meals. The purpose is to develop a system that can accurately, robustly and automatically identify food bite instances, with the long-term goal to complement or even replace manual bite-annotation protocols currently in use. The experimental results on a large dataset reveal the superb classification performance of the proposed methodology on the task of bite detection and pave the way for additional research on automatic bite detection systems.

Dimitrios Konstantinidis, Kosmas Dimitropoulos, Ioannis Ioakimidis, Billy Langlet, Petros Daras
Extracting the Inertia Properties of the Human Upper Body Using Computer Vision

Currently, biomechanics analyses of the upper human body are mostly kinematic, i.e., they are concerned with the positions, velocities, and accelerations of the joints on the human body with little consideration of the forces required to produce them. Though kinetic analysis can give insight into the torques required by the muscles to generate motion and therefore provide more information regarding human movements, it is generally used in a relatively small scope (e.g. one joint or the contact forces the hand applies). The problem is that in order to calculate the joint torques on an articulated body, such as the human arm, the correct shape and weight must be measured. For robot manipulators, this is done by the manufacturer during the design phase; however, on the human arm, direct measurement of the volume and the weight is very difficult and extremely impractical. Methods for indirect estimation of those parameters have been proposed, such as the use of medical imaging or standardized scaling factors (SF). However, there is always a trade-off between accuracy and practicality. This paper uses computer vision (CV) to extract the shape of each body segment and find the inertia parameters. The joint torques are calculated using those parameters and they are compared to joint torques that were calculated using SF to establish the inertia properties. The purpose here is to examine a practical method for real-time joint torque calculation that can be personalized and accurate.

Dimitrios Menychtas, Alina Glushkova, Sotirios Manitsaris
Single Fingertip Detection Using Simple Geometric Properties of the Hand Image: A Case Study for Augmented Reality in an Educational App

We propose a fingertip detection method suitable for portable devices’ applications where the user can interact with objects or interface elements located in the augmented space. The method has been experimentally implemented in the context of an application that uses a board containing ArUco markers [1]. The user can press virtual buttons laid on the board or drag items along predetermined paths on the board level by extending the index finger (or thumb) and placing its edge on the object of interest, while the other hand holds the device so that both the object and the hand are visible. We present brief but indicative results of our technique.

Nikolaos Nomikos, Dimitris Kalles
Leveraging Pre-trained CNN Models for Skeleton-Based Action Recognition

Skeleton-based human action recognition has recently drawn increasing attention thanks to the availability of low-cost motion capture devices and the accessibility of large-scale 3D skeleton datasets. One of the key challenges in action recognition lies in the high dimensionality of the captured data. In recent works, researchers draw inspiration from the success of deep learning in computer vision in order to improve the performance of action recognition systems. Unfortunately, most of these studies do not leverage the different available deep architectures but develop new ones. Most of the available architectures achieve very high accuracy in different image classification problems. In this paper, we use these architectures, which are already pre-trained on other image classification tasks. Skeleton sequences are first transformed into an image-like data representation. The resulting images are used to train different state-of-the-art CNN architectures following different training procedures. The experimental results obtained on the popular NTU RGB+D dataset are very promising and outperform most of the state-of-the-art results.
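
The transformation of a skeleton sequence into an image-like representation can be illustrated with a minimal sketch. The abstract does not specify the encoding, so everything below is an assumption made for illustration: joints are mapped to image rows, frames to columns, and the x, y, z coordinates are rescaled to 0..255 and used as the three colour channels. The function name `skeleton_to_image` is hypothetical.

```python
def skeleton_to_image(sequence):
    """Encode a skeleton sequence as an RGB-like grid (assumed encoding).

    sequence: list of frames; each frame is a list of (x, y, z) joint tuples.
    Returns image[joint][frame] = (r, g, b) with integers in 0..255.
    """
    if not sequence or not sequence[0]:
        raise ValueError("sequence needs at least one frame and one joint")
    n_joints = len(sequence[0])
    # Per-axis minimum and maximum over the whole sequence.
    lo = [min(joint[a] for frame in sequence for joint in frame) for a in range(3)]
    hi = [max(joint[a] for frame in sequence for joint in frame) for a in range(3)]
    # Fall back to 1.0 when an axis is constant, to avoid division by zero.
    span = [(hi[a] - lo[a]) or 1.0 for a in range(3)]

    def to_pixel(joint):
        return tuple(round((joint[a] - lo[a]) / span[a] * 255) for a in range(3))

    # Rows index joints, columns index frames.
    return [[to_pixel(frame[j]) for frame in sequence] for j in range(n_joints)]


# Toy sequence: 3 frames, 2 joints each (values are made up).
demo = [
    [(0.0, 0.0, 1.0), (0.5, 1.0, 1.0)],
    [(1.0, 0.5, 3.0), (0.0, 2.0, 2.0)],
    [(0.5, 1.0, 2.0), (1.0, 0.0, 1.0)],
]
image = skeleton_to_image(demo)
print(len(image), "rows x", len(image[0]), "columns")
```

An image built this way can be resized and passed to an image classifier; whether the paper uses this particular layout or normalization is not stated in the abstract.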

Sohaib Laraba, Joëlle Tilmanne, Thierry Dutoit

Workshop on: Cognitive and Computer Vision Assisted Systems for Energy Awareness and Behavior Analysis

An Augmented Reality Game for Energy Awareness

Energy efficiency requires a behavioral shift towards sustainable consumption. Such a change can be supported by persuasive IT applications, which employ a variety of stimuli to increase the energy literacy and awareness of consumers. We describe FunergyAR, an Augmented Reality digital game targeting children and their families. FunergyAR incorporates Computer Vision and Augmented Reality components within traditional game mechanics and can be used either in a standalone manner or together with Funergy, a card game designed for improving energy savvy behaviors in children.

Piero Fraternali, Sergio Luis Herrera Gonzalez
Energy Consumption Patterns of Residential Users: A Study in Greece

Electricity is an integral part of our lives and is directly linked to all areas of indoor human activity. In order to achieve good management of household electricity consumption, it is first necessary to make a correct and detailed measurement of it. Based on that aspect, this paper utilizes smart meters to monitor the electricity consumption of 120 different houses for a year in Greece. The measurements are saved and analyzed in order to gain a perspective of energy consumption patterns in comparison to temperature and personal energy profiling. The results and information of this paper could be used by current and future users as a guide to shift electricity behavior towards energy saving and also create new standardized profiles regarding demand response management to achieve energy efficiency.

Aristeidis Karananos, Asimina Dimara, Konstantinos Arvanitis, Christos Timplalexis, Stelios Krinidis, Dimitrios Tzovaras
Overview of Legacy AC Automation for Energy-Efficient Thermal Comfort Preservation

The rapid maturity of everyday sensor technologies has had a significant impact on our ability to collect information from the physical world. There are tremendous opportunities in using sensor technologies (both wired and wireless) for building operation, monitoring and control. The key promise of sensor technology in building operation is to reduce the cost of installing data acquisition and control systems (typically 40% of the cost of controls technology in a heating, ventilation, and air conditioning (HVAC) system). Reducing or eliminating this cost component has a dramatic effect on the overall installed system cost. With low-cost sensor and control systems, not only will the cost of system installation be significantly reduced, but it will become economical to use more sensors, thereby establishing highly energy efficient building operations and demand responsiveness that will enhance our electric grid reliability.

Michail Terzopoulos, Christos Korkas, Iakovos T. Michailidis, Elias Kosmatopoulos
Can I Shift My Load? Optimizing the Selection of the Best Electrical Tariff for Tertiary Buildings

Sustainability is strongly related to the appropriate use of available resources, being an important cornerstone in any company’s administration due to the direct influence on its efficiency and ability to compete in the global market. Therefore, the intelligent and proper management of these resources is a pressing matter in terms of cost savings. Among the possible alternatives for optimisation, the one regarding electricity consumption stands out due to its strong influence on the expenses account. In general, this type of optimisation can be carried out from two different perspectives: one that concerns the efficient use of energy itself and the other related to the proper adjustment of the electricity contract so that it meets the infrastructure needs while avoiding extra costs derived from poorly sized bills. This paper describes the application of an artificial intelligence based methodology for the optimisation of the parameters contracted in the electricity tariff in the Spanish market. This technique is able to adjust the power term needed so that the global economic cost derived from energy consumption is significantly reduced. The paper discusses the impact that this proposal may have on a demand response scenario associated with load shifting practices within university buildings. Furthermore, the role of human beings, specifically university employees, and their actions towards reducing the overuse of power is also addressed.

Oihane Kamara-Esteban, Cruz E. Borges, Diego Casado-Mansilla
Occupancy Inference Through Energy Consumption Data: A Smart Home Experiment

This work addresses the problem of occupancy detection in domestic environments, which is considered crucial for increasing energy efficiency in buildings. In particular, in contrast with most previous research, which obtained occupancy data through dedicated sensors, this study investigates the possibility of using only the total consumption obtained from central smart meters installed in the examined buildings. In order to evaluate the feasibility of this simplified approach, the supervised machine learning classifier Random Forest was trained and tested on the experimental dataset. Repeated simulation tests show encouraging results, achieving a high average performance with an accuracy of 85%.

Adamantia Chouliara, Konstantinos Peppas, Apostolos C. Tsolakis, Thanasis Vafeiadis, Stelios Krinidis, Dimitrios Tzovaras
A Dynamic Convergence Algorithm for Thermal Comfort Modelling

This paper attempts to utilize experimental results in order to correlate clothing insulation and metabolic rate with indoor temperature. Inferring clothing insulation and metabolic rate values from ASHRAE standards is an alternative that totally ignores environmental conditions that actually affect human clothing and activity. In this work, comfort feedback regarding occupants’ thermal sensation is utilized by an algorithm that predicts clothing insulation and metabolic rate values. The analysis of those values reveals certain patterns that lead to the formulation of two non-linear equations between clothing – indoor temperature and metabolic rate – indoor temperature. The formulation of the equations is based on the experimental results derived from the thermal comfort feedback provided by actual building occupants. On trial tests are presented and conclusions regarding the method’s effectiveness and limitations are drawn.

Asimina Dimara, Christos Timplalexis, Stelios Krinidis, Dimitrios Tzovaras
Thermal Comfort Metabolic Rate and Clothing Inference

This paper examines the implementation of an algorithm for the prediction of metabolic rate (M) and clothing insulation ($$I_{cl}$$) values in indoor spaces. Thermal comfort is calculated according to Fanger’s steady state model. In Fanger’s approach, M and $$I_{cl}$$ are two parameters that have a strong impact on the calculation of thermal comfort. Those parameters are usually estimated using tables that match certain activities with metabolic rate values, and garments with insulation values that aggregate to a person’s total clothing. In this work, M and $$I_{cl}$$ are predicted utilizing indoor temperature (T), indoor humidity (H) and thermal comfort feedback provided by the building occupants. Training the predictive model required generating a set of training data using values within pre-defined boundaries for each variable. The accuracy of the algorithm is showcased by experimental results. The promising capabilities that derive from the successful implementation of the proposed method are discussed in the conclusions.

Christos Timplalexis, Asimina Dimara, Stelios Krinidis, Dimitrios Tzovaras
User-Centered Visual Analytics Approach for Interactive and Explainable Energy Demand Analysis in Prosumer Scenarios

As part of the energy transition, the spread of prosumers in the energy market requires utilities to look for new approaches in managing local energy demand and supply. Doing this effectively requires better understanding and managing of local energy consumption and production patterns in prosumer scenarios. This situation is particularly challenging for small municipal utilities who traditionally do not have access to sophisticated modeling and forecasting methods and solutions. To this end, we propose a user-centered and a visual analytics approach for the development of a tool for an interactive and explainable day-ahead forecasting and analysis of energy demand in local prosumer environments. We also suggest supporting this with behavioral analysis to enable the analysis of potential relationships between consumption patterns and the interaction of prosumers with energy analysis tools such as customer portals, recommendation systems, and similar. In order to achieve this, we propose a combination of explainable machine learning methods such as kNN and decision trees with interactive visualization and explorative data analysis. This should enable utility analysts to understand how different factors influence expected consumption and perform what-if analyses to better assess possible demand forecasts under uncertain conditions.

Ana I. Grimaldo, Jasminko Novak

Workshop on: Vision-Enabled UAV and Counter-UAV Technologies for Surveillance and Security of Critical Infrastructures

Critical Infrastructure Security Against Drone Attacks Using Visual Analytics

The recent developments in the field of unmanned aerial vehicle (UAV or drone) technology have generated a lot of interdisciplinary applications, ranging from remote surveillance of energy infrastructure to agriculture. However, in the context of national security, low-cost drone equipment has also been viewed as an easy means to cause destructive effects against national critical infrastructures and the civilian population. Addressing the challenge of real-time detection and continuous tracking, this paper presents a holistic architecture consisting of both software and hardware design. The software-based video analytics component leverages the advancement of the Region-based Fully Convolutional Network model for drone detection. The hardware component includes low-cost sensing equipment powered by a Raspberry Pi, which controls the camera platform to continuously track the orientation of the drone while streaming the video footage captured from the long-range surveillance camera. The novelty of the proposed framework is twofold, namely the detection of the drone in real time and the continuous tracking of the detected drone through control of the camera platform. The framework relies on the capability of the long-range camera to lock onto the drone and subsequently track the drone through space. The analytics processing component utilises the NVIDIA® GeForce® GTX 1080 with 8 GB GDDR5X GPU. The experimental results of the proposed framework have been validated against real-world threat scenarios simulated for the protection of the national critical infrastructure.

Xindi Zhang, Krishna Chandramouli
Classification of Drones with a Surveillance Radar Signal

This paper deals with the automatic classification of drones using a surveillance radar signal. We show that, using state-of-the-art feature-based machine learning techniques, UAV tracks can be automatically distinguished from other object (e.g. bird, airplane, car) tracks. In fact, on a collection of real data, we measure an accuracy higher than 98%. We have also explored the possibility of using the same features to distinguish the wing type of the drone, Fixed Wing or Rotary Wing, reaching an accuracy higher than 93%.

Marco Messina, Gianpaolo Pinelli
Minimal-Time Trajectories for Interception of Malicious Drones in Constrained Environments

This work is motivated by the need to improve existing systems for the interception of drones by other drones. Physical neutralization of malicious drones has recently been attracting interest in the field of counter-drone technologies. The exposure time of these threats is a key factor in environments of high population density such as cities, where the presence of obstacles can complicate the task of pursuing and capturing the intruder drone. This paper is therefore focused on the development and optimization of a strategy for tracking and intercepting malicious drones in a scenario with obstacles. A simulation environment is designed in Matlab-Simulink to test and compare traditional interception methods, such as Pure Pursuit, which is quite common in the missile guidance field, with the proposed strategy. The results show an improvement in the interception strategy by means of a reduction in the exposure time of the threat with the developed algorithm, even when considering an environment with obstacles.
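
As an illustration of the Pure Pursuit baseline mentioned above, the following is a minimal 2D sketch in Python (the paper itself uses Matlab-Simulink). It is not the authors' simulation: the obstacle-free plane, the constant-velocity target, the fixed time step and all parameter values are assumptions, and the function name `pure_pursuit_intercept` is hypothetical. In Pure Pursuit, the pursuer always steers toward the target's current position.

```python
import math


def pure_pursuit_intercept(pursuer, target, target_velocity, pursuer_speed,
                           capture_radius=1.0, dt=0.05, max_time=120.0):
    """Simulate 2D Pure Pursuit and return the capture time in seconds.

    The pursuer heads toward the target's current position at every step.
    Returns None if the target is not captured within max_time.
    """
    px, py = pursuer
    tx, ty = target
    vx, vy = target_velocity
    t = 0.0
    while t <= max_time:
        dx, dy = tx - px, ty - py
        dist = math.hypot(dx, dy)
        if dist <= capture_radius:
            return t
        # Advance along the line of sight without overshooting the target.
        step = min(pursuer_speed * dt, dist)
        px += step * dx / dist
        py += step * dy / dist
        # Constant-velocity target (assumed motion model).
        tx += vx * dt
        ty += vy * dt
        t += dt
    return None


# Target crossing at 5 m/s, pursuer at 15 m/s, 100 m initial separation.
crossing_time = pure_pursuit_intercept((0.0, 0.0), (100.0, 0.0), (0.0, 5.0), 15.0)
# A pursuer slower than a fleeing target never captures it.
escape_time = pure_pursuit_intercept((0.0, 0.0), (10.0, 0.0), (5.0, 0.0), 3.0)
print(crossing_time, escape_time)
```

The capture time returned here plays the role of the exposure time discussed in the abstract; an interception strategy that leads the target rather than chasing its current position would be compared against this number.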

Manuel García, Antidio Viguria, Guillermo Heredia, Aníbal Ollero
UAV Classification with Deep Learning Using Surveillance Radar Data

The proliferation of Unmanned Aerial Vehicles (UAVs) has raised many concerns, since their potentially malicious usage renders them a detrimental tool for a number of illegal activities. Radar-based counter-UAV applications provide a robust solution for UAV detection and classification. Most of the existing research addresses the problem of UAV classification by extracting features from the time variations of the Fourier spectra. Yet, these solutions require that the UAV is illuminated by the radar for a longer time, a condition that can only be met by a tracking radar architecture. On the other hand, surveillance radar architectures do not have such a cumbersome requirement and are generally superior in maintaining situational awareness, due to their ability to constantly search a 360° area for targets. Nevertheless, the available automatic UAV classification methods for this type of radar sensor are relatively inefficient. This work proposes the incorporation of the deep learning paradigm in the classification pipeline, to provide an alternative UAV classification method that can handle data from a surveillance radar. To this end, a Deep Neural Network (DNN) model is employed to discern between UAVs and negative examples (e.g. birds, noise, etc.). The conducted experiments demonstrate the validity of the proposed method, where the overall classification accuracy can reach up to 95.0%.

Stamatios Samaras, Vasileios Magoulianitis, Anastasios Dimou, Dimitrios Zarpalas, Petros Daras
UAV Localization Using Panoramic Thermal Cameras

Drone detection and localization have become a real challenge for many companies over the past years, and will remain so for the years to come. Several technologies of different kinds have given some results. Among them, thermal sensors, particularly panoramic thermal imagers, can provide good results. In this paper, we address two subjects. First, we introduce the characteristics that panoramic thermal imaging systems should reach to prove their efficiency in a C-UAV system. Then, in a second part, we present the combined use of data captured from multiple 360° cameras in order to localize targets in a 3D environment, and the benefits that flow from it: distance, altitude, GPS coordinates, speed and physical dimensions can then be estimated.
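
The idea of localizing a target from several panoramic cameras can be sketched as a simple bearing intersection. This is not the authors' method, which the abstract does not detail; it is a minimal example that assumes two cameras at known ground positions, each reporting an azimuth toward the same target, plus an elevation angle from the first camera. The angle convention (degrees counter-clockwise from the +x axis) and the function name `locate_target` are assumptions.

```python
import math


def locate_target(cam_a, azimuth_a_deg, elevation_a_deg, cam_b, azimuth_b_deg,
                  cam_a_height=0.0):
    """Intersect two azimuth rays and return the target position (x, y, z).

    cam_a, cam_b: (x, y) camera positions in metres.
    Azimuths are in degrees counter-clockwise from the +x axis (assumed).
    Returns None if the rays are parallel or meet behind a camera.
    """
    ax, ay = cam_a
    bx, by = cam_b
    dax = math.cos(math.radians(azimuth_a_deg))
    day = math.sin(math.radians(azimuth_a_deg))
    dbx = math.cos(math.radians(azimuth_b_deg))
    dby = math.sin(math.radians(azimuth_b_deg))
    # Solve cam_a + s * d_a = cam_b + u * d_b for the ray parameters s and u.
    det = dbx * day - dax * dby
    if abs(det) < 1e-12:
        return None
    s = (dbx * (by - ay) - dby * (bx - ax)) / det
    u = (dax * (by - ay) - day * (bx - ax)) / det
    if s < 0 or u < 0:
        return None
    x, y = ax + s * dax, ay + s * day
    # s is the horizontal range from camera A, so elevation gives altitude.
    z = cam_a_height + s * math.tan(math.radians(elevation_a_deg))
    return x, y, z


# Two cameras 100 m apart, both seeing the target at 45 degrees inward.
position = locate_target((0.0, 0.0), 45.0, 30.0, (100.0, 0.0), 135.0)
print(position)
```

With the position known in successive frames, distance and speed follow directly; converting to GPS coordinates would additionally require the geodetic position and orientation of the cameras, which this sketch does not model.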

Anthony Thomas, Vincent Leboucher, Antoine Cotinat, Pascal Finet, Mathilde Gilbert
Multimodal Deep Learning Framework for Enhanced Accuracy of UAV Detection

Counter-Unmanned Aerial Vehicle (c-UAV) systems are considered an emerging technology dedicated to addressing the critical issue of malicious UAV detection. Acquiring useful information from a multitude of data gathered using a topology of different sensors for UAV detection constitutes a problem of substantial importance. In this paper, we present a novel multimodal deep learning methodology to filter and combine data from a variety of unimodal approaches dedicated to UAV detection. Specifically, the aim of this work is to detect and classify potential UAVs based on a fusion procedure of features from UAV detections provided by unimodal components. To this end, we propose a general fusion neural network framework in order to merge features extracted from unimodal modules and make deductions with increased accuracy. Our method is validated by thorough application to UAV detection and classification tasks. Our approach achieves a significant performance improvement over the unimodal detection results.

Eleni Diamantidou, Antonios Lalas, Konstantinos Votis, Dimitrios Tzovaras
Multi-scale Feature Fused Single Shot Detector for Small Object Detection in UAV Images

Small object detection is a challenging computer vision problem due to the low feature representation of small objects in images and factors such as occlusions and noise. In images captured from a camera mounted on an unmanned aerial vehicle (UAV), objects usually appear at small sizes, depending on the UAV flight altitude. State-of-the-art object detectors often have lower detection accuracy with small objects. New approaches that combine features at multiple levels in the network help improve the object detection performance. In this paper, we propose a multi-scale approach of low-level feature combinations with deconvolutional modules on a single shot multibox detection (SSD) object detector to improve small object detection in images acquired from a UAV. The proposed SSD-based architecture is evaluated on UAV datasets to compare its performance with state-of-the-art detectors.

Manzoor Razaak, Hamideh Kerdegari, Vasileios Argyriou, Paolo Remagnino
Autonomous Swarm of Heterogeneous Robots for Surveillance Operations

The introduction of Unmanned vehicles (UxVs) in recent years has created a new security field in which they can act both as a potential threat and as new technological weapons against those threats. Dealing with these issues from the counter-threat perspective, the proposed architecture project focuses on designing and developing a complete system which utilizes the capabilities of multiple UxVs for surveillance objectives in different operational environments. Utilizing a combination of diverse UxVs equipped with various sensors, the developed architecture involves the detection and the characterization of threats based on both visual and thermal data. The identification of objects is enriched with additional information extracted from other sensors such as radars and RF sensors to secure the efficiency of the overall system. The current prototype displays diverse interoperability concerning the multiple visual sources that feed the system with the required optical data. Novel detection models identify the necessary threats while this information is enriched with higher-level semantic representations. Finally, the operator is informed properly according to the visual identification modules and the outcomes of the UxVs operations. The system can provide optimal surveillance capacities to the relevant authorities towards an increased situational awareness.

Georgios Orfanidis, Savvas Apostolidis, Athanasios Kapoutsis, Konstantinos Ioannidis, Elias Kosmatopoulos, Stefanos Vrochidis, Ioannis Kompatsiaris