
About this book

This book constitutes thoroughly revised and selected papers from the 12th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications, VISIGRAPP 2017, held in Porto, Portugal, February 27 - March 1, 2017. The 18 thoroughly revised and extended papers presented in this volume were carefully reviewed and selected from 402 submissions. The papers contribute to the understanding of relevant trends of current research on image and video formation, preprocessing, analysis and understanding; motion, tracking and stereo vision; computer graphics and rendering; data visualization and interactive visual data analysis; agent-based human-robot interactions; and user experience.

Table of Contents


Computer Graphics Theory and Applications


Calibrating, Rendering and Evaluating the Head Mounted Light Field Display

There are several benefits of using a light field display over a traditional HMD; in particular, the light field can avoid the vergence-accommodation conflict and can also correct for near- and farsightedness. When only the four corner cameras of a subimage array are rendered, these four views can be interpolated to create all subimages of the light field. We implement the interpolation of the subimages with pixel reprojection, while maintaining correct perspective and shading. We give a comprehensive explanation of the construction and calibration of a head mounted light field display. Finally, we evaluate the image quality through image difference and conduct a user evaluation to determine whether users can perceive a difference between light field images created with the full array of virtual cameras and those created with our method using four cameras and pixel reprojection. In most cases the users were unable to distinguish the images, and we conclude that pixel reprojection is a feasible method for rendering light fields as far as quality is concerned.
Anne Juhler Hansen, Jákup Klein, Martin Kraus

Human Computer Interaction Theory and Applications


More than Just Human: The Psychology of Social Robots

Social robots, specifically designed to interact with humans, already play an increasing role in many domains such as healthcare, transportation, or care of the elderly. However, research and design still lack a profound theoretical basis considering their role as social beings and the psychological rules that apply to the interaction between humans and robots. From a psychological perspective, social robots have ideal conditions to influence human judgments and behavior and to activate mechanisms of projection. On the one hand, researchers and practitioners in human-robot interaction (HRI) may see such effects as a welcome precondition for the general acceptance of social robots. On the other hand, such native trust provides a ground for dysfunctional effects like over-trust or manipulation. The present paper puts a focus on such questions concerning the “psychology of social robots”. Following an interdisciplinary approach, we combine theory and methods from HCI and psychology, aiming to form a basis for successful and human-centered robot design. We point out central research questions and areas of relevance, and summarize first results of our own and others’ research. Finally, we present a preliminary model of robot personality and discuss areas for future research.
Daniel Ullrich, Sarah Diefenbach

The Effect of Audio Guide on the Levels of Contentment of Museum Novices: Relationships Between Visitors’ Eye Movements, Audio Guide Contents, and the Levels of Contentment

Museums offer the opportunity to acquire knowledge about objects of artistic, cultural, historical or scientific interest through a large number of exhibitions. However, even if these masterpieces are visually accessible to all visitors, the background of these works is not necessarily acquired, because visitors do not have enough knowledge to fully appreciate them. An audio guide is a tool commonly used to fill this gap. The purpose of this study is to understand the relationships between the eye movements of visitors, through which they acquire information by seeing; the content of the audio guide, which should help them understand the objects by hearing; and the level of contentment with the museum experience. This paper reports the results of an eye-tracking experiment in which eighteen participants were invited to appreciate a variety of images with or without an audio guide used in an actual museum, to complete a questionnaire on subjective feelings, and to attend an interview. The results reveal the relationship between viewing time or fixation frequency and satisfaction with the viewing, as well as the effect of the audio guide on these eye movements. We also found that participants could be categorized into four categories, which suggests an effective way to provide an audio guide.
Kazumi Egawa, Muneo Kitajima

The Impact of Field of View on Robotic Telepresence Navigation Tasks

Telepresence interfaces for navigation tasks involving remote robots are generally designed to provide users with sensory and/or contextual feedback, mainly through an onboard camera video stream or map-based localization. This choice is motivated by the fact that operating a mobile robot from a distance may be mentally challenging for users when they do not possess a proper awareness of the environment. However, the fixed or narrow field of view cameras often available on these robots may lead to a lack of awareness or worse navigation performance due to missing or limited peripheral vision. The aim of this paper is to investigate, through a comparative analysis, how an augmented field of view and/or a pan-tilt camera can affect users’ performance in remote robot navigation tasks. To this end, a user study was carried out to assess three different experimental configurations, i.e., a fixed camera with a narrow (45\(^{\circ }\)) field of view, a pan-tilt camera with a wide-angle (180\(^{\circ }\)) horizontal field of view, and a fixed camera with a wide-angle (180\(^{\circ }\)) diagonal field of view. Results showed a strong preference for the wide-angle field of view navigation modality, which provided users with greater situational awareness while requiring lower cognitive effort.
Federica Bazzano, Fabrizio Lamberti, Andrea Sanna, Marco Gaspardone

Computer Vision Theory and Applications


One-Shot Learned Priors in Augmented Active Appearance Models for Anatomical Landmark Tracking

In motion science, biology and robotics, animal movement analyses are used for the detailed understanding of human bipedal locomotion. For these investigations, an immense amount of recorded image data has to be evaluated by biological experts. During this time-consuming evaluation, single anatomical landmarks, for example bone ends, have to be located and annotated in each image. In this paper we show how this effort can be reduced by automating the annotation with a minimum level of user interaction. Recent approaches, based on Active Appearance Models, are improved by priors based on anatomical knowledge and an online tracking method, requiring only a single labeled frame. In contrast, we propose a one-shot learned tracking-by-detection prior which overcomes the shortcomings of template drift without increasing the amount of training data. We evaluate our approach on a variety of real-world X-ray locomotion datasets and show that our method outperforms recent state-of-the-art concepts for the task at hand.
Oliver Mothes, Joachim Denzler

Line-Based SLAM Considering Prior Distribution of Distance and Angle of Line Features in an Urban Environment

In this paper, we propose a line-based SLAM method that operates on an image sequence captured by a vehicle-mounted camera and takes into account the prior distribution of line features detected in an urban environment. Since scenes captured by a vehicle in urban environments can be expected to include many line segments detected from road markings and buildings, we employ line segments as features for our SLAM. We use an additional prior on the line segments to improve the accuracy of the SLAM. We assume that the angle between the direction vector of a line segment and the vehicle’s direction of travel conforms to a four-component Gaussian mixture distribution. We define a new cost function that incorporates the prior distribution and optimize the relative camera pose, position, and the 3D line segments by bundle adjustment. The prior distribution is also extended to 2D, covering both the distance and the angle of the line segments. In addition, we make digital maps from the detected line segments. Our method increases the accuracy of localization and corrects tilted lines in the digital maps. We apply our method to both a single-camera system and a multi-camera system to demonstrate the accuracy improvement provided by the prior distribution of distance and angle of line features.
Kei Uehara, Hideo Saito, Kosuke Hara
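
As a rough illustration of the idea in the abstract above, a four-component Gaussian mixture prior on line angles can enter a cost function as a negative log-likelihood term. The sketch below is not the authors' cost function: the component means, standard deviations, weights, and all function names are hypothetical values chosen only to make the example concrete, and the mixture is not renormalized for the circular domain.

```python
import math

# Hypothetical mixture parameters (assumptions, not taken from the paper):
# lines are taken to be roughly parallel or perpendicular to the travel direction.
MEANS_DEG = [0.0, 90.0, 180.0, 270.0]
SIGMAS_DEG = [5.0, 5.0, 5.0, 5.0]
WEIGHTS = [0.25, 0.25, 0.25, 0.25]


def angular_diff_deg(a, b):
    """Signed difference a - b wrapped into [-180, 180) degrees."""
    return (a - b + 180.0) % 360.0 - 180.0


def mixture_density(angle_deg):
    """Value of the four-component Gaussian mixture at angle_deg."""
    total = 0.0
    for mean, sigma, weight in zip(MEANS_DEG, SIGMAS_DEG, WEIGHTS):
        d = angular_diff_deg(angle_deg, mean)
        norm = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
        total += weight * norm * math.exp(-0.5 * (d / sigma) ** 2)
    return total


def prior_cost(angle_deg, eps=1e-12):
    """Negative log-likelihood penalty; lower means the angle is more plausible."""
    # eps keeps the logarithm finite for angles far from every component.
    return -math.log(mixture_density(angle_deg) + eps)
```

In a bundle adjustment setting, a term like `prior_cost` would be added to the reprojection error for each line segment, so that lines at implausible angles (for example 45 degrees under these assumed parameters) are penalized more than lines close to a mixture component.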

Weak-Perspective and Scaled-Orthographic Structure from Motion with Missing Data

The Perspective n-Point (PnP) problem has been a focus of the 3D computer vision community since the late 1980s. Standard solutions deal with the pinhole camera model, and the problem is challenging due to the perspectivity. The well-known PnP algorithms assume that the intrinsic camera parameters are known; therefore, only the extrinsic ones need to be estimated. This is carried out by a rough estimation, usually given in closed form, after which the accurate camera parameters are obtained via numerical optimization. In this paper, we show that both the weak-perspective and scaled orthographic camera models can be optimally calibrated, including the intrinsic camera parameters. Moreover, the latter is done without iteration if the \(L_2\) norm is used. It is also shown that the calibration can be inserted into a structure from motion algorithm. We also show that the scaled orthographic version can be powered by GPUs, yielding real-time performance.
Levente Hajder

Spatiotemporal Optimization for Rolling Shutter Camera Pose Interpolation

Rolling Shutter (RS) cameras are predominant in the tablet and smart-phone market due to their low cost and small size. However, when either the camera or the scene is in motion, these cameras require specific geometric models to account for the sequential exposure of the different lines of the image. This paper proposes to improve a state-of-the-art model for RS cameras through the use of Non Uniformly Time-Sampled B-splines. This makes it possible to interpolate the pose of the camera while taking into account the varying dynamics of its motion, using a higher density of control points where needed while keeping a low number of control points where the motion is smooth. Two methods are proposed to determine adequate distributions for the control points, using either an IMU sensor or an iterative reprojection error minimization. The non-uniform camera model is integrated into a Bundle Adjustment optimization which is able to converge even from a poor initial estimate. A spatiotemporal optimization routine is presented to optimize both the spatial and temporal positions of the control points. Results on synthetic and real datasets are shown as a proof of concept, and future work is outlined that should lead to the integration of our model in a SLAM algorithm.
Philippe-Antoine Gohard, Bertrand Vandeportaele, Michel Devy

Hierarchical Hardware/Software Algorithm for Multi-view Object Reconstruction by 3D Point Clouds Matching

The matching or registration of 3D point clouds is a problem that arises in a variety of research areas, with applications ranging from heritage reconstruction to quality control of precision parts in industrial settings. The central problem in this research area is that of receiving two point clouds, usually representing different parts of the same object, and finding the best possible rigid alignment between them. Noise in data, a varying degree of overlap and different data acquisition devices make this a complex problem with a high computational cost. This issue is sometimes addressed by adding hardware to the scanning system, but this hardware is frequently expensive and bulky. We present an algorithm that makes use of cheap, widely available (smartphone) sensors to obtain extra information during data acquisition. This information then allows for fast software registration. The first such hybrid hardware-software approach was presented in [31]. In this paper we improve the performance of this algorithm by using hierarchical techniques. Experimental results using real data show that the presented algorithm greatly improves on the computation time of the previous algorithm and compares favorably to state-of-the-art algorithms.
Ferran Roure, Xavier Lladó, Joaquim Salvi, Tomislav Privanić, Yago Diez

Real-Time HDTV-to-8K TV Hardware Upconverter

8K is the pinnacle of video systems, and 8K broadcasting service will start in December 2018. However, the availability of content for 8K TV is still insufficient, a situation similar to that of HDTV in the 1990s. At that time, upconverting analogue content to HDTV content was important to supplement the insufficient HDTV content. This upconverted content was also important for news coverage, as HDTV equipment was heavy and bulky. The current situation for 8K TV is similar: covering news with 8K TV equipment is very difficult, as this equipment is much heavier and bulkier than that required for HDTV in the 1990s. Sufficient HDTV content is currently available, and HDTV equipment has evolved to facilitate news coverage; therefore, an HDTV-to-8K TV upconverter can be a solution to the problems described above. However, upconversion from interlaced HDTV to 8K TV enlarges the images by a factor of 32, making the upconverted images very blurry. Super resolution (SR) is a technology that addresses this enlargement blur. One of the most common SR technologies is super resolution image reconstruction (SRR). However, SRR has limitations when used in an HDTV-to-8K TV upconverter. To address this issue, this paper proposes an HDTV-to-8K TV upconverter with nonlinear processing SR.
Seiichi Gohshi
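
The factor of 32 quoted in the abstract above can be checked with simple pixel arithmetic. The sketch below assumes the common broadcast formats, which the abstract does not spell out: an interlaced HDTV frame of 1920x1080 whose single field carries 1920x540 pixels, and an 8K frame of 7680x4320 pixels. The constant and function names are hypothetical.

```python
# Assumed formats (not stated in the abstract): one field of 1920x1080
# interlaced HDTV, and a progressive 7680x4320 8K frame.
HDTV_FIELD = (1920, 540)
UHD_8K = (7680, 4320)


def pixel_count(size):
    """Number of pixels in a (width, height) image."""
    width, height = size
    return width * height


def enlargement_factor(src, dst):
    """Ratio of destination to source pixel counts."""
    return pixel_count(dst) / pixel_count(src)


factor = enlargement_factor(HDTV_FIELD, UHD_8K)
```

Under these assumptions the enlargement is 4 times horizontally and 8 times vertically, which multiplies to the factor of 32 mentioned in the abstract.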

Non-local Haze Propagation with an Iso-Depth Prior

The primary challenge in removing haze from a single image is the lack of decomposition cues between the original light transport and airlight scattering in a scene. Many dehazing algorithms start from an assumption on natural image statistics to estimate airlight from sparse cues. The sparsely estimated airlight cues need to be propagated according to the local density of airlight in the form of a transmission map, which allows us to obtain a haze-free image by subtracting airlight from the hazy input. Traditional airlight-propagation methods rely on ordinary regularization on a grid random field, which often results in isolated haze artifacts when they fail to estimate the local density of airlight properly. In this work, we propose a non-local regularization method for dehazing by combining Markov random fields (MRFs) with nearest-neighbor fields (NNFs) extracted from the hazy input using the PatchMatch algorithm. Our method starts from the observation that the extracted NNFs can associate pixels at similar depths. Since regional haze in the atmosphere is correlated with its depth, we can allow propagation across the iso-depth pixels by solving an MRF-based regularization problem with the NNFs. Our results show that our method can clearly restore a wide range of hazy images of natural landscapes without suffering from haze isolation artifacts. Also, our regularization method is directly applicable to various dehazing methods.
Incheol Kim, Min H. Kim
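
The step of "subtracting airlight from the hazy input" mentioned in the abstract above can be illustrated with the widely used atmospheric scattering model, in which an observed intensity is the scene radiance attenuated by a transmission value plus an airlight term. The abstract does not state this formula, so the model, the function names, and the lower bound `t_min` are assumptions of this sketch; the non-local MRF regularization that the paper proposes for estimating the transmission map is not implemented here.

```python
def add_haze(radiance, transmission, atmospheric_light):
    """Forward model (assumed): I = J * t + A * (1 - t), for one pixel channel."""
    return radiance * transmission + atmospheric_light * (1.0 - transmission)


def remove_haze(observed, transmission, atmospheric_light, t_min=0.1):
    """Invert the assumed model for one pixel channel, given a transmission value."""
    # Clamp the transmission so that dense haze does not cause division by ~0.
    t = max(transmission, t_min)
    airlight = atmospheric_light * (1.0 - t)
    # Subtract the airlight term, then undo the attenuation.
    return (observed - airlight) / t


# Round trip on a single intensity value with made-up numbers.
hazy = add_haze(radiance=0.3, transmission=0.6, atmospheric_light=0.9)
restored = remove_haze(hazy, transmission=0.6, atmospheric_light=0.9)
```

The quality of the result depends entirely on how well the transmission is estimated at each pixel, which is the part the propagation and regularization described in the abstract are meant to address.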

CUDA-Accelerated Feature-Based Egomotion Estimation

Egomotion estimation is a fundamental issue in structure from motion and autonomous navigation for mobile robots. Several methods have been proposed for estimating camera motion from a varying number of image correspondences. Seven- and eight-point methods were first designed to estimate the fundamental matrix. Five-point methods use the minimal number of correspondences required to estimate the essential matrix. These feature-based methods have attracted special interest for their application in a hypothesize-and-test framework to deal with the problem of outliers. This framework allows relative pose recovery at the expense of a much higher computational time when dealing with higher ratios of outliers. To reduce this cost, we propose in this work a CUDA-based solution for essential matrix estimation from eight, seven and five point correspondences, complemented with robust estimation. The mapping of these algorithms to the CUDA hardware architecture is given in detail, as are the hardware-specific performance considerations. The correspondences in the presented schemes are formulated as bearing vectors to be able to deal with all camera systems. A performance analysis against existing CPU implementations is also given. For the five-point-based estimation, it shows a speedup of 4 times over the CPU for an outlier ratio \(\epsilon =0.5\), which is common for essential matrix estimation from automatically computed point correspondences. Larger speedups were obtained for the seven- and eight-point-based implementations, reaching 76 times and 57 times, respectively.
Safa Ouerghi, Remi Boutteau, Xavier Savatier, Fethi Tlili
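
The growth in computation with the outlier ratio noted in the abstract above can be made concrete with the standard RANSAC estimate of how many random samples are needed to draw at least one all-inlier sample with a given confidence. This is a generic textbook formula rather than anything specific to the paper's CUDA implementation; the function name and the default confidence of 0.99 are assumptions of this sketch.

```python
import math


def ransac_iterations(outlier_ratio, sample_size, confidence=0.99):
    """Samples needed so that, with the given confidence, at least one
    sample of `sample_size` correspondences contains only inliers."""
    # Probability that one random sample is free of outliers.
    all_inlier_prob = (1.0 - outlier_ratio) ** sample_size
    if all_inlier_prob >= 1.0:
        return 1  # no outliers: a single sample suffices
    if all_inlier_prob <= 0.0:
        raise ValueError("no all-inlier sample is possible")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - all_inlier_prob))


# Iteration counts for the three solvers at the outlier ratio used in the abstract.
counts = {s: ransac_iterations(0.5, s) for s in (5, 7, 8)}
```

Because the required number of samples grows with the sample size, the minimal five-point solver needs fewer hypotheses than the seven- and eight-point solvers at the same outlier ratio, although each five-point hypothesis is more expensive to compute.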

Automatic Retinal Vascularity Identification and Artery/Vein Classification Using Near-Infrared Reflectance Retinographies

The retinal microcirculation structure is commonly used as an important source of information in many medical specialities for the diagnosis of relevant diseases such as hypertension, arteriosclerosis, or diabetes. Also, the progression of cerebrovascular and cardiovascular disease can be evaluated through the identification of abnormal signs in the retinal vasculature architecture. Given that these alterations affect the artery and vein vascularities differently, a precise characterization of both blood vessel types is also crucial for the diagnosis and treatment of a significant variety of retinal and systemic pathologies. In this work, we present a fully automatic method for retinal vessel identification and classification into arteries and veins using Optical Coherence Tomography scans. In our analysis, we used a dataset composed of 30 near-infrared reflectance retinography images from different patients to test and validate the proposed method. In particular, a total of 597 vessel segments were manually labelled by an expert clinician and used as ground truth for the validation process. As a result, this methodology achieved a satisfactory performance in the complex task of retinal vessel tree identification and classification.
Joaquim de Moura, Jorge Novo, Marcos Ortega, Noelia Barreira, Pablo Charlón

Change Detection and Blob Tracking of Fish in Underwater Scenarios

In this paper, the difficult task of detecting fish in underwater scenarios is analyzed, with a special focus on crowded scenes where the differentiation between separate fish is even more challenging. For the detection, an extension of the Gaussian Switch Model is developed which applies an intelligent update scheme to create more accurate background models even for difficult scenes. To deal with very crowded areas in the scene, we use the Flux Tensor to create a first coarse segmentation and only update areas that are background with high certainty. The spatial coherency is increased by the N\(^2\)Cut, which is an Ncut adaptation to change detection. More relevant information is gathered with a novel blob tracker that uses a specially developed energy function and handles errors made during the change detection. This method keeps the generality of the whole approach so that it can be used for any moving object. The proposed algorithm enabled us to obtain very accurate underwater segmentations as well as precise results in tracking scenarios.
Martin Radolko, Fahimeh Farhadifard, Uwe von Lukas

Optimizing Phoneme-to-Viseme Mapping for Continuous Lip-Reading in Spanish

Speech is the most used communication method between humans and it is considered a multisensory process. Even though there is a popular belief that speech is something that we hear, there is overwhelming evidence that the brain treats speech as something that we hear and see. Much of the research has focused on Automatic Speech Recognition (ASR) systems, treating speech primarily as an acoustic form of communication. In recent years, there has been an increasing interest in systems for Automatic Lip-Reading (ALR), although exploiting the visual information has proven to be challenging. One of the main problems in ALR is how to make the system robust to the visual ambiguities that appear at the word level. These ambiguities make the definition of the minimum distinguishable unit of the video domain confused and imprecise. In contrast to the audio domain, where the phoneme is the standard minimum auditory unit, there is no consensus on the definition of the minimum visual unit (the viseme). In this work, we focus on the automatic construction of a phoneme-to-viseme mapping based on visual similarities between phonemes to maximize word recognition. We investigate the usefulness of different phoneme-to-viseme mappings, obtaining the best results for intermediate vocabulary lengths. We construct an automatic system that uses DCT and SIFT descriptors to extract the main characteristics of the mouth region and HMMs to model the statistical relations of both viseme and phoneme sequences. We test our system on two Spanish corpora with continuous speech (AV@CAR and VLRF) containing 19 and 24 speakers, respectively. Our results indicate that we are able to recognize 47% (resp. 51%) of the phonemes and 23% (resp. 21%) of the words, for AV@CAR and VLRF. We also show additional results that support the usefulness of visemes.
Experiments on a comparable ALR system trained exclusively using phonemes at all its stages confirm the existence of strong visual ambiguities between groups of phonemes. This fact, together with the higher word accuracy obtained when using phoneme-to-viseme mappings, justifies the use of visemes instead of the direct use of phonemes for ALR.
Adriana Fernandez-Lopez, Federico M. Sukno
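
To make the notion of a phoneme-to-viseme mapping in the abstract above concrete, the sketch below shows a many-to-one lookup in which several phonemes that look alike on the lips share one viseme class. The paper constructs its mapping automatically from visual similarity; the hand-written groups, class names, and helper functions here are purely hypothetical and serve only to show why words can become visually ambiguous.

```python
# Hypothetical many-to-one mapping; groups and class names are illustrative only.
PHONEME_TO_VISEME = {
    "p": "V_bilabial", "b": "V_bilabial", "m": "V_bilabial",
    "f": "V_labiodental",
    "t": "V_alveolar", "d": "V_alveolar", "n": "V_alveolar",
    "a": "V_open",
    "o": "V_rounded", "u": "V_rounded",
}


def to_visemes(phonemes):
    """Map a phoneme sequence to its viseme sequence."""
    return [PHONEME_TO_VISEME.get(p, "V_other") for p in phonemes]


def visually_ambiguous(phonemes_a, phonemes_b):
    """True if two different phoneme sequences share the same viseme sequence."""
    return phonemes_a != phonemes_b and to_visemes(phonemes_a) == to_visemes(phonemes_b)


# Under this toy mapping, the Spanish words "pata" and "bata" look identical.
same_on_lips = visually_ambiguous(list("pata"), list("bata"))
```

A coarser mapping merges more phonemes and creates more such collisions between words, while a finer mapping asks the visual classifier to separate units that may not be distinguishable, which is the trade-off behind the search for a suitable mapping.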

A Context-Aware Method for View-Point Invariant Long-Term Re-identification

In this work, we propose a novel context-aware framework for long-term person re-identification (Re-ID). In contrast to the classical context-unaware architecture, this method exploits contextual features that can be identified reliably and guides the re-identification process in a much faster and more accurate manner. The system is designed for long-term Re-ID in walking scenarios, so persons are characterized by soft-biometric features (i.e., anthropometric and gait) acquired using a Kinect\(^\mathrm {TM}\) v.2 sensor. Context is associated with the posture of the person with respect to the camera, since the quality of the data acquired from this sensor depends significantly on this variable. Within each context, only the most relevant features are selected with the help of feature selection techniques, and custom individual classifiers are trained. Afterwards, a context-aware ensemble fusion strategy, which we term ‘Context specific score-level fusion’, merges the results of the individual classifiers. In typical ‘in-the-wild’ scenarios, the samples of a person may not appear in all contexts of interest. To tackle this problem, we propose a cross-context analysis where features are mapped between contexts, allowing the identification characteristics of a person to be transferred between different contexts. We experimentally verify the performance of the proposed context-aware system against the classical context-unaware system. The results include an analysis of switching context conditions within a video sequence, through a pilot study of circular path movement. All the analyses highlight the impact of contexts in simplifying the search process and show promising results.
Athira Nambiar, Alexandre Bernardino

On Using 3D Support Geometries for Measuring Human-Made Corner Structures with a Robotic Total Station

Performing accurate measurements on non-planar targets using a robotic total station in reflectorless mode is prone to errors. Besides requiring a fully reflected laser beam of the electronic distance meter, a proper orientation of the pan-tilt unit is required for each individual accurate 3D point measurement. Dominant physical 3D structures like corners and edges often do not fulfill these requirements and are not directly measurable.
In this work, three algorithms and user interfaces are evaluated through simulation and physical measurements for simple and efficient construction-side measurement correction of systematic errors. We incorporate additional measurements close to the non-measurable target, and our approach does not require any post-processing of single-point measurements. Our experimental results prove that the systematic error can be lowered by almost an order of magnitude by using support geometries, i.e. incorporating a 3D point, a 3D line or a 3D plane as additional measurements.
Christoph Klug, Dieter Schmalstieg, Thomas Gloor, Clemens Arth

