
2022 | Book

Computer Vision, Imaging and Computer Graphics Theory and Applications

15th International Joint Conference, VISIGRAPP 2020 Valletta, Malta, February 27–29, 2020, Revised Selected Papers

Edited by: Kadi Bouatouch, A. Augusto de Sousa, Manuela Chessa, Alexis Paljic, Andreas Kerren, Christophe Hurter, Giovanni Maria Farinella, Petia Radeva, Jose Braz

Publisher: Springer International Publishing

Book Series: Communications in Computer and Information Science


About this Book

This book constitutes thoroughly revised and selected papers from the 15th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP 2020, held in Valletta, Malta, in February 2020.
The 25 thoroughly revised and extended papers presented in this volume were carefully reviewed and selected from 455 submissions. The papers contribute to the understanding of relevant trends in current research on computer graphics, human-computer interaction, information visualization, and computer vision.

Table of Contents

Frontmatter
Correction to: Regression-Based 3D Hand Pose Estimation for Human-Robot Interaction
Chaitanya Bandi, Ulrike Thomas

Computer Graphics Theory and Applications

Frontmatter
Unified Model and Framework for Interactive Mixed Entity Systems
Abstract
Mixed reality, natural user interfaces, and the internet of things converge towards a new class of advanced interactive systems. These systems enable new forms of interactivity, allowing intuitive user interactions with ubiquitous services in mixed environments. However, they require synchronizing multiple platforms and various technologies. Their heterogeneity makes them complex and only sparsely interoperable or extensible. Therefore, designers and developers require new models, tools, and methodologies to support their creation. We present a unified model of the entities composing these systems, breaking them down into graphs of mixed entities. This model decouples the real and the virtual while still describing their interplay. It characterizes and classifies both the external and internal interactions of mixed entities. We also present a design and implementation framework based on our unified model. Our framework takes advantage of our model to simplify, accelerate, and unify the production of these systems. We showcase the use of our framework by designers and developers in the case of a smart building management system.
Guillaume Bataille, Valérie Gouranton, Jérémy Lacoche, Danielle Pelé, Bruno Arnaldi
Skeleton-and-Trackball Interactive Rotation Specification for 3D Scenes
Abstract
We present a new technique for specifying rotations of 3D shapes around axes inferred from the local shape structure, in support of 3D exploration and manipulation tasks. We compute such axes by extracting approximations of the 3D curve skeleton of such shapes using the skeletons of their 2D image silhouettes and depth information present in the Z buffer. Our method allows specifying rotations around parts of arbitrary 3D shapes with a single click, works in real time for large scenes, can be easily added to any OpenGL-based scene viewer, and is simple to implement. We compare our method with classical trackball rotation, both in isolation and in combination, in a controlled user study. Our results show that, when combined with trackball, skeleton-based rotation reduces task completion times and increases user satisfaction, while not introducing additional costs, thus making it an interesting addition to the palette of 3D manipulation tools.
Xiaorui Zhai, Xingyu Chen, Lingyun Yu, Alexandru Telea
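
The following sketch illustrates the core geometric idea of the entry above under simple assumptions (it is not the authors' implementation): given a rendered shape's Z-buffer, here a depth image with the background at the far plane, and hypothetical pinhole intrinsics fx, fy, cx, cy, the 2D silhouette skeleton is back-projected to an approximate 3D curve skeleton and a rotation axis is fitted to it.

```python
# Illustrative sketch: estimating a rotation axis from the silhouette skeleton
# of a rendered shape and its Z-buffer. All intrinsics are assumptions.
import numpy as np
from skimage.morphology import skeletonize

def rotation_axis_from_depth(depth, fx, fy, cx, cy, background=1.0):
    mask = depth < background                 # 2D silhouette of the rendered shape
    skel = skeletonize(mask)                  # 2D skeleton of the silhouette
    ys, xs = np.nonzero(skel)
    zs = depth[ys, xs]                        # depth of skeleton pixels from the Z-buffer
    # Back-project skeleton pixels to camera space (approximate 3D curve skeleton).
    X = (xs - cx) * zs / fx
    Y = (ys - cy) * zs / fy
    pts = np.column_stack([X, Y, zs])
    # Fit a local axis through the skeleton points with PCA; the dominant
    # direction serves as the rotation axis, anchored at the centroid.
    center = pts.mean(axis=0)
    _, _, vt = np.linalg.svd(pts - center, full_matrices=False)
    return center, vt[0]                      # axis origin and direction
```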
CSG Tree Extraction from 3D Point Clouds and Meshes Using a Hybrid Approach
Abstract
The problem of Constructive Solid Geometry (CSG) tree reconstruction from 3D point clouds or 3D triangle meshes is hard to solve. First, the input data set (point cloud, triangle soup, or triangle mesh) has to be segmented and geometric primitives (spheres, cylinders, ...) have to be fitted to each subset. Then, a size- and shape-optimal CSG tree has to be extracted. We propose a pipeline for CSG reconstruction consisting of multiple stages: a primitive extraction step, which uses deep learning for primitive detection, a clustered variant of RANSAC for parameter fitting, and a Genetic Algorithm (GA) for convex polytope generation. It directly transforms 3D point clouds or triangle meshes into solid primitives. The filtered primitive set is then used as input for a GA-based CSG extraction stage. We evaluate two different CSG extraction methodologies and furthermore compare our pipeline to current state-of-the-art methods.
Markus Friedrich, Steffen Illium, Pierre-Alain Fayolle, Claudia Linnhoff-Popien
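
As a flavor of the parameter-fitting stage mentioned in the entry above, the sketch below shows a minimal RANSAC-style fit of a single sphere primitive to a point set. The deep-learning-based primitive detection, the clustering, and the GA-based polytope and CSG extraction stages are not reproduced, and all names and thresholds are assumptions.

```python
# Illustrative sketch: least-squares sphere fitting inside a simple RANSAC loop.
import numpy as np

def fit_sphere(points):
    """Least-squares sphere fit using |p|^2 = 2 c.p + (r^2 - |c|^2)."""
    A = np.hstack([2 * points, np.ones((len(points), 1))])
    b = (points ** 2).sum(axis=1)
    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    center, d = sol[:3], sol[3]
    radius = np.sqrt(d + center @ center)
    return center, radius

def ransac_sphere(points, n_iters=500, threshold=0.01, seed=0):
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        sample = points[rng.choice(len(points), 4, replace=False)]
        center, radius = fit_sphere(sample)
        dist = np.abs(np.linalg.norm(points - center, axis=1) - radius)
        inliers = dist < threshold
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on all inliers of the best hypothesis.
    return fit_sphere(points[best_inliers]), best_inliers
```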

Human Computer Interaction Theory and Applications

Frontmatter
Intention Understanding for Human-Aware Mobile Robots: Comparing Cues and the Effect of Demographics
Abstract
Mobile robots are becoming more and more ubiquitous in our everyday living environments. Therefore, it is very important that people can easily interpret what the robot's intentions are. This is especially important when a robot is driving down a crowded corridor: it is essential for people in its vicinity to understand which way the robot wants to go next. To explore which signals best convey its intention to turn, we implemented three lighting schemes and tested them in an online experiment. We found that signals resembling automotive signaling also work best for logistic mobile robots. We further find that people's opinions of these signaling methods are influenced by their demographic background (gender, age).
Oskar Palinko, Eduardo Ruiz Ramirez, Norbert Krüger, Leon Bodenhagen
Tracking Eye Movement for Controlling Real-Time Image-Abstraction Techniques
Abstract
Acquisition and consumption of visual media such as digital images and videos is becoming one of the most important forms of modern communication. However, since the creation and sharing of images is increasing exponentially, images as a media form suffer from being devalued, as the quality of single images becomes less and less important and the frequency of the shared content becomes the focus. In this work, an interactive system is presented which allows users to interact with volatile and diverting artwork based on their eye movements alone. The system uses real-time image-abstraction techniques to create an artwork unique to each situation. It supports multiple, distinct interaction modes, which share common design principles, enabling users to experience game-like interactions focusing on eye movement and the diverting image content itself. This approach hints at possible future research in the field of relaxation exercises and casual art consumption and creation.
Maximilian Söchting, Matthias Trapp

Information Visualization Theory and Applications

Frontmatter
Improving Deep Learning Projections by Neighborhood Analysis
Abstract
Visualization of multidimensional data is a difficult task, for which many tools exist. Among these tools, dimensionality reduction methods have been shown to be particularly helpful for exploring data visually. Techniques with good visual separation, such as those from the SNE class, are very popular, but they are often computationally expensive and non-parametric. An approach based on neural networks was recently proposed to address those shortcomings, but it introduces some fuzziness in the generated projection, which is not desired. In this paper, we thoroughly explain the parameter space of this neural network approach and propose a new neighborhood-based learning paradigm, which further improves the quality of the projections learned by the neural networks, and we illustrate our approach on large real-world datasets.
Terri S. Modrakowski, Mateus Espadoto, Alexandre X. Falcão, Nina S. T. Hirata, Alexandru Telea
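
For readers unfamiliar with the neural-network projection approach the entry above builds on, the sketch below shows the baseline idea under simple assumptions: a small regressor is trained to reproduce the 2D coordinates that a ground-truth projection (here t-SNE) assigns to a training subset, after which it can project unseen samples. The neighborhood-based learning paradigm proposed in the paper itself is not reproduced here.

```python
# Illustrative sketch: a parametric projection learned by regressing onto
# t-SNE coordinates of a training subset (baseline idea only).
import numpy as np
from sklearn.manifold import TSNE
from sklearn.neural_network import MLPRegressor

def train_parametric_projection(X_train, X_new):
    y_2d = TSNE(n_components=2).fit_transform(X_train)   # ground-truth 2D layout
    net = MLPRegressor(hidden_layer_sizes=(256, 512, 256), max_iter=500)
    net.fit(X_train, y_2d)                                # learn data -> 2D mapping
    return net.predict(X_new)                             # project unseen samples
```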
Scalable Visual Exploration of 3D Shape Databases via Feature Synthesis and Selection
Abstract
We present a set of techniques to address the problem of scalable creation of visual overview representations of large 3D shape databases based on dimensionality reduction of feature vectors extracted from shape descriptions. We address the problem of feature extraction by exploring both combinations of hand-engineered geometric features and the latent feature vectors generated by a deep learning classification method, and discuss the comparative advantages of both approaches. Separately, we address the problem of generating insightful 2D projections of these feature vectors that separate different groups of similar shapes well, by two approaches. First, we create quality projections both by automatic search in the space of feature combinations and, alternatively, by leveraging human insight to improve projections through iterative feature selection. Second, we use deep learning to automatically construct projections from the extracted features. We show that our three variations of deep learning, which jointly treat feature extraction, selection, and projection, allow efficient creation of high-quality visual overviews of large shape collections, require minimal user intervention, and are easy to implement. We demonstrate our approach on several real-world 3D shape databases.
Xingyu Chen, Guangping Zeng, Jiří Kosinka, Alexandru Telea
Visual Analysis of Linked Musicological Data with the musiXplora
Abstract
While digitizing data is the first major step for many digital humanities projects, visual analysis is of high value for humanists, as it brings a wide range of possibilities for working with data. While rather traditional analysis often concentrates on standalone pieces or sets of information (close reading), global inspections of linked data are also requested by today's researchers and made possible through digital processing. Hence, distant reading approaches are found more and more in humanities projects. Next to such approaches, which allow new research questions of quantitative analysis, linking previously separate information at the data level is another way of providing humanists with access to further, previously unreachable, global inspections of faceted datasets.
As a domain with both faceted data and a rather low level of digitization, musicology is a prime example of how the digital humanities may improve and support the daily workflows of humanists. Despite the generally low level of digitization, multiple projects already build a basis to help digitize the field. As an example, the musiXplora project has collected a vast amount of musicological data over the last 16 years and now offers both a detailed biography of persons, places, objects, events, media, institutions, and terms, and the linkage between these kinds of entities, helping to give users a comprehensible overview of the traditionally fragmented field of musicology. Supported by a set of visualizations, the website of the project allows for visual analysis at close reading and distant reading levels. This not only helps researchers in their daily workflows but also offers more casual users an interesting view into the domain of musicology.
Richard Khulusi, Josef Focht, Stefan Jänicke
A Research-Teaching Guide for Visual Data Analysis in Digital Humanities
Abstract
The use of visualization to underpin distant reading arguments on cultural heritage data has become well established in the digital humanities domain. Novel strategies to represent data visually typically arise from interdisciplinary projects involving humanities and visualization scholars. However, the quality of outcomes might be inhibited as typical challenges of interdisciplinary research arise and, at the same time, problem-solving strategies are missing. I taught a course on visual data analysis in the digital humanities to let students with diverse study backgrounds experience those challenges early in their academic careers. This paper illustrates the research-teaching components of my course. This includes the contents of the theoretical training with active learning tasks, aspects of the practical training, and considerations for teachers aiming to compose a related course.
Stefan Jänicke
Coherent Topological Landscapes for Simulation Ensembles
Abstract
The topological structure is an intrinsic feature of a scalar field of any spatial dimensionality. The dependence of the topology on the isovalue of the field can be represented in the form of merge and split trees, which are usually combined into a contour tree. Topological landscapes are algorithmically constructed 2D scalar fields which have the same topological structure (and, therefore, correspond to the same contour tree) as the given multidimensional scalar field and serve as an intuitive low-dimensional depiction of its topological features. Topological landscapes computed for a set of scalar fields, e.g., created by variation over time or by varying simulation parameter values in a simulation ensemble, are not necessarily coherent among themselves. Therefore, a comparative analysis of topology in an ensemble is hindered. We propose a concept for the generation of coherent contour trees for simulation ensembles that is based on merging the contour trees of all scalar fields of the ensemble. The coherent contour tree can be exploited to generate coherent topological landscapes. Visual analysis of varying scalar field topology is then supported by animating landscapes or by volume rendering of a stack of temporal slices representing color-coded landscapes. We apply the proposed methodology to synthetic data for evaluation purposes as well as to 2D and 3D simulation ensemble data.
Marina Evers, Maria Herick, Vladimir Molchanov, Lars Linsen

Computer Vision Theory and Applications

Frontmatter
Efficient Range Sensing Using Imperceptible Structured Light
Abstract
A novel projector-camera method is presented that interleaves a sequence of pattern images in the dithering sequence of a DLP projector, such that the patterns are imperceptible and can be acquired cleanly with a synchronized high-speed camera. This capability enables the procam system to perform as a real-time range sensor without affecting the appearance of the projected data. The system encodes and decodes a stream of Gray code patterns imperceptibly, and is deployed on a calibrated and stereo-rectified procam system to perform depth triangulation from the extracted patterns. The bandwidth achieved imperceptibly is close to 8 million points per second using a general-purpose CPU, which is comparable to perceptible, hardware-accelerated commercial structured-light depth cameras.
Avery Cole, Sheikh Ziauddin, Jonathon Malcolm, Michael Greenspan
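
The sketch below illustrates the Gray code encoding and decoding principle mentioned in the entry above: stripe patterns are generated from binary-reflected Gray codes and the projector column is recovered per pixel from thresholded captures. It is a generic illustration, not the authors' synchronized DLP implementation, and the pattern sizes are assumptions.

```python
# Illustrative sketch: Gray code column patterns for structured-light range sensing.
import numpy as np

def gray_code_patterns(width, height):
    """One binary stripe pattern per bit; columns are encoded with Gray codes."""
    n_bits = int(np.ceil(np.log2(width)))
    cols = np.arange(width)
    gray = cols ^ (cols >> 1)                        # binary-reflected Gray code
    patterns = []
    for bit in range(n_bits - 1, -1, -1):            # most significant bit first
        row = ((gray >> bit) & 1).astype(np.uint8) * 255
        patterns.append(np.tile(row, (height, 1)))   # stripe pattern for this bit
    return patterns

def decode_gray(bit_images):
    """Recover the projector column index from thresholded captured patterns."""
    gray = np.zeros(bit_images[0].shape, dtype=np.int64)
    for img in bit_images:                            # most significant bit first
        gray = (gray << 1) | (img > 0).astype(np.int64)
    col = gray.copy()                                 # Gray -> binary conversion
    shift = 1
    while (gray >> shift).any():
        col ^= gray >> shift
        shift += 1
    return col                                        # per-pixel projector column
```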
Hierarchical Object Detection and Classification Using SSD Multi-Loss
Abstract
When merging existing similar datasets, it would be attractive to benefit from a higher detection rate of objects and the additional partial ground-truth samples for improving object classification. To this end, a novel CNN detector with a hierarchical binary classification system is proposed. The detector is based on the Single-Shot multibox Detector (SSD) and inspired by the hierarchical classification used in the YOLO9000 detector. Localization and classification are separated during training by introducing a novel loss term that handles hierarchical classification in the loss function (SSD-ML). We experiment with the proposed SSD-ML detector on the generic PASCAL VOC dataset and show that additional super-categories can be learned with minimal impact on the overall accuracy. Furthermore, we find that not all objects are required to have classification label information, as classification performance only drops from \(73.3\%\) to \(70.6\%\) when \(60\%\) of the label information is removed. The flexibility of the detector with respect to the different levels of detail in label definitions is investigated for a traffic surveillance application, involving public and proprietary datasets with non-overlapping class definitions. Including classification label information from our dataset raises the performance significantly, from \(70.7\%\) to \(82.2\%\). The experiments show that the desired hierarchical labels can be learned from the public datasets, while only using box information from our dataset. In general, this shows that it is possible to combine existing datasets with similar object classes and partial annotations and benefit from an increased detection rate and improved class categorization performance.
Matthijs H. Zwemer, Rob G. J. Wijnhoven, Peter H. N. de With
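
A minimal sketch of the kind of hierarchical, partially labeled classification loss described in the entry above (not the SSD-ML code): each hierarchy node receives an independent sigmoid output, and nodes without ground-truth information are masked out so that samples carrying only super-category labels still contribute.

```python
# Illustrative sketch: hierarchical binary classification loss with partial labels.
import torch
import torch.nn.functional as F

def hierarchical_classification_loss(logits, targets, label_mask):
    """
    logits:     (N, C) raw scores, one per hierarchy node (super- and sub-categories).
    targets:    (N, C) binary targets; a positive sub-category also sets its ancestors.
    label_mask: (N, C) 1 where label information exists, 0 where it is unknown.
    """
    per_node = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    per_node = per_node * label_mask                  # ignore nodes without annotations
    return per_node.sum() / label_mask.sum().clamp(min=1)
```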
Scene Text Localization Using Lightweight Convolutional Networks
Abstract
Various research initiatives have reported highly effective results for the text detection problem, which consists of detecting textual elements, such as words and phrases, in digital images. Text localization is an important step in widely used mobile applications, for instance, on-the-go translation and recognition of text for the visually impaired. At the same time, edge computing is revolutionizing the way embedded systems are architected by moving complex processing and analysis to end devices (e.g., mobile and wearable devices). In this context, the development of lightweight networks that can run on devices with restricted computing power and with as little latency as possible is essential to make many mobile-oriented solutions feasible in practice. In this work, we investigate the use of efficient object detection networks to address this task, proposing the fusion of two lightweight neural network architectures, MobileNetV2 and Single Shot Detector (SSD), into our approach named MobText. Experimental results on the ICDAR'11 and ICDAR'13 datasets demonstrate that our solution yields the best trade-off between effectiveness and efficiency in terms of processing time, achieving state-of-the-art results on the ICDAR'11 dataset with an F-measure of \(96.09\%\) and an average processing time of 464 ms on a smartphone device, in experiments executed both on dataset images and on images captured in real time with the portable device.
Luis Gustavo Lorgus Decker, Allan Pinto, Jose Luis Flores Campana, Manuel Cordova Neira, Andreza Aparecida dos Santos, Jhonatas Santos de Jesus Conceição, Helio Pedrini, Marcus de Assis Angeloni, Lin Tzy Li, Diogo Carbonera Luvizon, Ricardo da S. Torres
Early Stopping for Two-Stream Fusion Applied to Action Recognition
Abstract
Various information streams, such as scene appearance and the estimated movement of the objects involved, can help in characterizing actions in videos. These information modalities perform better in different scenarios, and complementary features can be combined to achieve results superior to the individual ones. As important as the definition of representative and complementary feature streams is the choice of proper combination strategies that exploit the strengths of each aspect. In this work, we analyze different fusion approaches to combine complementary modalities. In order to define the best parameters of our fusion methods using the training set, we have to reduce overfitting in the individual modalities; otherwise, the 100%-accurate outputs would not offer a realistic and relevant representation for the fusion method. Thus, we analyze an early stopping technique for training the individual networks. In addition to reducing overfitting, this method also reduces the training cost, since it usually requires fewer epochs to complete the classification process. Experiments are conducted on the UCF101 and HMDB51 datasets, which are two challenging benchmarks in the context of action recognition.
Helena de Almeida Maia, Marcos Roberto e Souza, Anderson Carlos Sousa e Santos, Julio Cesar Mendoza Bobadilla, Marcelo Bernardes Vieira, Helio Pedrini
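
The early stopping technique referred to in the entry above can be sketched as follows, assuming hypothetical train_one_epoch and evaluate helpers and a PyTorch-style model: training of an individual stream halts once validation accuracy stops improving, and the best checkpoint is restored before fusion.

```python
# Illustrative sketch: early stopping on a held-out split for one modality stream.
def train_with_early_stopping(model, train_one_epoch, evaluate,
                              patience=5, max_epochs=100):
    best_acc, best_state, epochs_without_improvement = 0.0, None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model)
        acc = evaluate(model)                    # validation accuracy of this stream
        if acc > best_acc:
            best_acc, epochs_without_improvement = acc, 0
            best_state = {k: v.clone() for k, v in model.state_dict().items()}
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                            # stop before the stream overfits
    if best_state is not None:
        model.load_state_dict(best_state)        # restore the best checkpoint
    return model, best_acc
```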
RGB-D Images Based 3D Plant Growth Prediction by Sequential Images-to-Images Translation with Plant Priors
Abstract
This paper presents a neural network based method for 3D plant growth prediction based on sequential images-to-images translation. In particular, we extend an existing image-to-image translation technique based on U-Net to images-to-images translation by incorporating convLSTM into the skip connections of U-Net. With this architecture, we can achieve sequential image prediction tasks in which future images are predicted from several past ones. Since depth images are incorporated as an additional channel into our network, the prediction can be represented in 3D space. As an application of our method, we develop a 3D plant growth prediction system. In the evaluation, the performance of our network was investigated in terms of the importance of each module in the network. We verified how the prediction accuracy was affected by the internal structure of the network. In addition, the extension of our network with plant priors was further investigated to evaluate its impact on plant growth prediction tasks.
Tomohiro Hamamoto, Hideaki Uchiyama, Atsushi Shimada, Rin-ichiro Taniguchi
ConvPoseCNN2: Prediction and Refinement of Dense 6D Object Pose
Abstract
Object pose estimation is a key perceptual capability in robotics. We propose a fully-convolutional extension of the PoseCNN method, which densely predicts object translations and orientations. This has several advantages, such as improved spatial resolution of the orientation predictions (useful in highly cluttered arrangements), a significant reduction in parameters by avoiding full connectivity, and fast inference. We propose and discuss several aggregation methods for dense orientation predictions that can be applied as a post-processing step, such as averaging and clustering techniques. We demonstrate that our method achieves the same accuracy as PoseCNN on the challenging YCB-Video dataset and provide a detailed ablation study of several variants of our method. Finally, we demonstrate that the model can be further improved by inserting an iterative refinement module into the middle of the network, which enforces consistency of the prediction.
Arul Selvam Periyasamy, Catherine Capellen, Max Schwarz, Sven Behnke
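
One of the aggregation strategies mentioned in the entry above, averaging dense per-pixel orientation predictions, can be sketched with the standard eigenvector method for quaternion averaging; this is an illustrative post-processing example, not the ConvPoseCNN2 code.

```python
# Illustrative sketch: weighted averaging of dense per-pixel quaternion predictions.
import numpy as np

def average_quaternions(quats, weights=None):
    """
    quats:   (N, 4) unit quaternions predicted for pixels of one object mask.
    weights: optional (N,) confidences (e.g. segmentation scores).
    Returns the quaternion maximizing the weighted sum of squared dot products.
    """
    q = np.asarray(quats, dtype=np.float64)
    w = np.ones(len(q)) if weights is None else np.asarray(weights, dtype=np.float64)
    # Accumulate the weighted outer-product matrix; its dominant eigenvector is
    # the average orientation (the q / -q sign ambiguity is handled implicitly).
    M = (w[:, None, None] * q[:, :, None] * q[:, None, :]).sum(axis=0)
    eigenvalues, eigenvectors = np.linalg.eigh(M)
    return eigenvectors[:, -1]                   # eigenvector of the largest eigenvalue
```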
Expression Modeling Using Dynamic Kernels for Quantitative Assessment of Facial Paralysis
Abstract
Facial paralysis is a syndrome that causes difficulty in the movement of facial muscles on one or both sides of the face. In this paper, a quantitative assessment of facial paralysis is proposed that uses dynamic kernels to detect facial paralysis and its various severity levels on a person's face by modeling different facial expressions. Initially, the movements of facial muscles are captured locally by spatio-temporal features for each video. Using the extracted spatio-temporal features from all the videos, a large Gaussian mixture model (GMM) is trained to learn the dynamics of facial muscles globally. In order to handle these local and global features in variable-length patterns like videos, we propose to use a dynamic kernel modeling approach. Dynamic kernels are generally known for handling variable-length data patterns like speech, videos, etc., either by mapping them into fixed-length data patterns or by creating new kernels, for example by selecting discriminative sets of representations obtained from GMM statistics. In the proposed work, we explore three different kinds of dynamic kernels, namely, explicit mapping kernels, probability-based kernels, and intermediate matching kernels, for modeling facial expressions. These kernels are then used as feature vectors for classification with a support vector machine (SVM) to detect severity levels of facial paralysis. The efficacy of the proposed dynamic kernel modeling approach for the quantitative assessment of facial paralysis is demonstrated on a self-collected facial paralysis video dataset of 39 facially paralyzed patients of different severity levels. The collected dataset contains patients from different age groups and genders; furthermore, the videos are recorded from seven different viewing angles to make the proposed model robust to subject and view variations.
Nazil Perveen, Chalavadi Krishna Mohan, Yen Wei Chen
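
A minimal sketch of one dynamic-kernel variant in the spirit of the entry above (the explicit mapping idea): variable-length sets of spatio-temporal descriptors are mapped to fixed-length GMM posterior statistics that a standard SVM can classify. Feature extraction is assumed given, and the helper names are hypothetical.

```python
# Illustrative sketch: GMM-based explicit mapping of variable-length videos for an SVM.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

def fit_background_gmm(all_descriptors, n_components=64):
    """Learn the global GMM on descriptors pooled from all training videos."""
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag")
    return gmm.fit(all_descriptors)

def video_embedding(gmm, descriptors):
    """Map a variable-length descriptor set to a fixed-length vector of
    average posterior probabilities (soft occupancy of each mixture component)."""
    posteriors = gmm.predict_proba(descriptors)   # (T, K)
    return posteriors.mean(axis=0)                # (K,)

def train_severity_classifier(gmm, videos, labels):
    X = np.stack([video_embedding(gmm, d) for d in videos])
    clf = SVC(kernel="rbf")                       # SVM on fixed-length embeddings
    return clf.fit(X, labels)
```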
Perceptually-Informed No-Reference Image Harmonisation
Abstract
Many image synthesis tasks, such as image compositing, rely on the process of image harmonisation. The goal of harmonisation is to create a plausible combination of component elements. The subjective quality of this combination is directly related to the existence of human-detectable appearance differences between these component parts, suggesting that consideration of human perceptual tolerances is an important aspect of designing automatic harmonisation algorithms. In this paper, we first investigate the impact of a perceptually-calibrated composite artifact detector on the performance of a state-of-the-art deep harmonisation model. We begin by evaluating a two-stage model, whereby the performance of both pre-trained models and their naive combination is assessed against a large dataset of 68,128 automatically generated image composites. We find that without any task-specific adaptations, the two-stage model achieves results comparable to the baseline harmoniser fed with ground-truth composite masks. Based on these findings, we design and train an end-to-end model, and evaluate its performance against a set of baseline models. Overall, our results indicate that explicit modeling and incorporation of image features conditioned on a human perceptual task improves the performance of no-reference harmonisation algorithms. We conclude by discussing the generalisability of our approach in the context of related work.
Alan Dolhasz, Carlo Harvey, Ian Williams
On-board UAV Pilots Identification in Counter UAV Images
Abstract
Among Unmanned Aerial Vehicle (UAV) countermeasures, the detection of the drone position and the identification of the human pilot represent two crucial tasks, as demonstrated by the attention they have already received from security agencies in different countries. Many research works focus on UAV detection, but they rarely take into account the problem of detecting the pilot of another approaching UAV. This work proposes a fully autonomous pipeline that, taking images from a flying UAV, can detect the humans in the scene and recognize the possible presence of the pilot(s). The system has been designed to run on board the UAV, and tests have been performed on an NVIDIA Jetson TX2. Moreover, the SnT-ARG-PilotDetect dataset, designed to assess the capability to identify UAV pilots in realistic scenarios, is introduced for the first time and made publicly available. A thorough comparison of different classification approaches on the pilot and non-pilot images of the proposed dataset has been performed, and the results show the validity of the proposed pipeline for piloting behavior classification.
Dario Cazzato, Claudio Cimarelli, Holger Voos
Exploring Tele-Assistance for Cyber-Physical Systems with MAUI
Abstract
In this paper, we present an improved version of MAUI (Maintenance Assistance User Interface) [9], extending the user study, giving detailed insight into the implementation, and introducing a new user interface for mobile use. MAUI is a novel take on tele-assisted tasks on cyber-physical systems. At its core, we not only provide real-time communication between workers and experts, but also allow an expert to take full control of the worker's user interface.
By precisely separating the levels of our software stack, we enable features like hot-patching and hot-loading of any content or web-based application. Our results show a reduced error rate on task performance once an expert takes over virtual tasks to relieve the worker. The conditions evaluated include the worker operating alone and the worker being actively supported. During operation, gauges have to be read, switches have to be pressed, and virtual user interfaces of machinery have to be controlled. Furthermore, we explore the creation of user interfaces through a developer study and use the feedback to create a mobile version of MAUI.
Philipp Fleck, Fernando Reyes-Aviles, Christian Pirchheim, Clemens Arth, Dieter Schmalstieg
On the Use of 3D CNNs for Video Saliency Modeling
Abstract
There has been emerging interest recently in three-dimensional (3D) convolutional neural networks (CNNs) as a powerful tool to encode spatio-temporal representations in videos, by adding a third, temporal dimension to pre-existing 2D CNNs. In this chapter, we discuss the effectiveness of using 3D convolutions to capture important motion features in the context of video saliency prediction. The method filters the spatio-temporal features across multiple adjacent frames. This cubic convolution can be effectively applied to a dense sequence of frames, propagating the previous frames' information into the current one, reflecting processing mechanisms of the human visual system for better saliency prediction. We extensively evaluate the model performance compared to state-of-the-art video saliency models on both 2D and 360\(^\circ \) videos. The architecture can efficiently learn expressive spatio-temporal representations and produce high-quality video saliency maps on three large-scale 2D datasets, DHF1K, UCF-SPORTS and DAVIS. Investigations on 360\(^\circ \) datasets, including Salient360!, show how the approach can generalise.
Yasser Abdelaziz Dahou Djilali, Mohamed Sayah, Kevin McGuinness, Noel E. O’Connor
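
A minimal 3D-convolutional building block of the kind discussed in the entry above is sketched below; channel counts and shapes are assumptions, and the block is only meant to show how a cubic kernel filters spatio-temporal features across adjacent frames before a per-frame saliency read-out.

```python
# Illustrative sketch: a small spatio-temporal (3D convolution) block for saliency.
import torch
import torch.nn as nn

class SpatioTemporalBlock(nn.Module):
    def __init__(self, in_channels=3, channels=64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(in_channels, channels, kernel_size=3, padding=1),  # cubic kernel
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.read_out = nn.Conv3d(channels, 1, kernel_size=1)  # per-frame saliency logits

    def forward(self, clip):
        # clip: (batch, channels, frames, height, width)
        features = self.block(clip)
        return torch.sigmoid(self.read_out(features))  # saliency map per frame
```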
CNN-Based Deblurring of THz Time-Domain Images
Abstract
In recent years, terahertz (THz) time-domain imaging has attracted significant attention and has become a useful tool in many applications. A THz time-domain imaging system measures amplitude changes of the THz radiation across a range of frequencies so that the absorption coefficients of the materials in the sample can be obtained. THz time-domain images represent 3D hyperspectral cubes with several hundred bands corresponding to different wavelengths, i.e., frequencies. Moreover, a THz beam has a non-zero beam waist and therefore introduces band-dependent blurring effects in the resulting images, accompanied by system-dependent noise. Removal of blurring effects and noise from the whole 3D hyperspectral cube is addressed in the current work. We start by introducing THz beam shape effects and their formulation as a deblurring problem, followed by presenting a convolutional neural network (CNN)-based approach that is able to tackle all bands jointly. To the best of our knowledge, this is the first time that a CNN is used to remove the THz beam shape effects from all bands of THz time-domain images jointly. Experiments on synthetic images show that the proposed approach significantly outperforms conventional model-based deblurring methods and band-by-band approaches.
Marina Ljubenović, Shabab Bazrafkan, Pavel Paramonov, Jan De Beenhouwer, Jan Sijbers
Thermal Image Super-Resolution: A Novel Unsupervised Approach
Abstract
This paper proposes the use of a CycleGAN architecture for thermal image super-resolution under a transfer-domain strategy, where middle-resolution images from one camera are transferred to the higher-resolution domain of another camera. The proposed approach is trained with a large dataset acquired using three thermal cameras at different resolutions, and an unsupervised learning process is followed to train the architecture. An additional loss function is proposed to improve on results from state-of-the-art approaches. Evaluations are performed following the first thermal image super-resolution challenge (PBVS-CVPR2020). A comparison with previous works is presented, showing that the proposed approach reaches the best results.
Rafael E. Rivadeneira, Angel D. Sappa, Boris X. Vintimilla
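
The unpaired cycle-consistency idea underlying the entry above can be sketched as follows, with hypothetical generators G_up (mid- to high-resolution domain) and G_down (the reverse); the authors' additional loss term and full training procedure are not reproduced.

```python
# Illustrative sketch: cycle-consistency loss for unpaired domain transfer between
# mid-resolution and high-resolution thermal images.
import torch
import torch.nn.functional as F

def cycle_consistency_loss(G_up, G_down, mid_res, high_res, lambda_cycle=10.0):
    """G_up maps mid-resolution thermal images to the high-resolution domain,
    G_down maps them back; no paired ground truth is required."""
    fake_high = G_up(mid_res)
    fake_mid = G_down(high_res)
    recon_mid = G_down(fake_high)               # mid -> high -> mid
    recon_high = G_up(fake_mid)                 # high -> mid -> high
    loss = F.l1_loss(recon_mid, mid_res) + F.l1_loss(recon_high, high_res)
    return lambda_cycle * loss
```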
Regression-Based 3D Hand Pose Estimation for Human-Robot Interaction
Abstract
In shared workspaces where humans and robots interact, a significant task is to hand over objects. The hand-over process needs to be reliable and the human must not be injured during the process; hence, reliable tracking of human hands is necessary. To avoid collisions, we apply an encoder-decoder based 2D and 3D keypoint regression network on color images. In this paper, we introduce a complete pipeline based on the idea of stacked and cascaded convolutional neural networks and tune the parameters of the network for real-time applications. Experiments are conducted on multiple datasets with low and high occlusions, and we evaluate the trained models on multiple datasets for the human-robot interaction test set.
Chaitanya Bandi, Ulrike Thomas
Detection and Recognition of Barriers in Egocentric Images for Safe Urban Sidewalks
Abstract
The impact of walking in modern cities has proven to be quite significant, with many advantages especially for the environment and citizens' health. Although society is trying to promote it as the cheapest and most sustainable means of transportation, many road accidents have involved pedestrians and cyclists in recent years. The frequent presence of various obstacles on urban sidewalks puts the lives of citizens in danger. Their immediate detection and removal are of great importance for maintaining clean and safe access to the infrastructure of urban environments. Following the great success of egocentric applications that take advantage of the uninterrupted use of smartphone devices to address serious problems that concern humanity, we aim to develop methodologies for detecting barriers and other dangerous obstacles encountered by pedestrians on urban sidewalks. For this purpose, a dedicated image dataset is generated and used as the basis for analyzing the performance of three different deep learning architectures in detecting and recognizing different types of obstacles. The high accuracy of the experimental results shows that the development of egocentric applications can successfully help to maintain the safety and cleanliness of sidewalks and, at the same time, reduce pedestrian accidents.
Zenonas Theodosiou, Harris Partaourides, Simoni Panayi, Andreas Kitsis, Andreas Lanitis
Backmatter
Metadata
Title
Computer Vision, Imaging and Computer Graphics Theory and Applications
Edited by
Kadi Bouatouch
A. Augusto de Sousa
Manuela Chessa
Alexis Paljic
Andreas Kerren
Christophe Hurter
Giovanni Maria Farinella
Petia Radeva
Jose Braz
Copyright Year
2022
Electronic ISBN
978-3-030-94893-1
Print ISBN
978-3-030-94892-4
DOI
https://doi.org/10.1007/978-3-030-94893-1