
2019 | Book

Advances in Visual Computing

14th International Symposium on Visual Computing, ISVC 2019, Lake Tahoe, NV, USA, October 7–9, 2019, Proceedings, Part I

Edited by: George Bebis, Richard Boyle, Bahram Parvin, Darko Koracin, Daniela Ushizima, Sek Chai, Shinjiro Sueda, Xin Lin, Aidong Lu, Daniel Thalmann, Chaoli Wang, Panpan Xu

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 14th International Symposium on Visual Computing, ISVC 2019, held in Lake Tahoe, NV, USA, in October 2019.

The 100 papers presented in this double volume were carefully reviewed and selected from 163 submissions. The papers are organized into the following topical sections: Deep Learning I; Computer Graphics I; Segmentation/Recognition; Video Analysis and Event Recognition; Visualization; ST: Computational Vision, AI and Mathematical Methods for Biomedical and Biological Image Analysis; Biometrics; Virtual Reality I; Applications I; ST: Vision for Remote Sensing and Infrastructure Inspection; Computer Graphics II; Applications II; Deep Learning II; Virtual Reality II; Object Recognition/Detection/Categorization; and Poster.

Table of Contents

Frontmatter
Correction to: DeepGRU: Deep Gesture Recognition Utility

The given name and family name of an author were not tagged correctly in the originally published article. The author’s given name is “Joseph J.” and his family name is “LaViola.” This was corrected.

Mehran Maghoumi, Joseph J. LaViola Jr.

Deep Learning I

Frontmatter
Application of Image Classification for Fine-Grained Nudity Detection

Many online social platforms need an image content filtering solution to detect nudity automatically. Existing solutions focus only on binary classification models that detect explicit nudity or pornography, which is not enough to distinguish between the wide variety of racy outfits people wear (swimsuits, summer outfits, etc.) that might be considered inappropriate according to a user's own preferences. This paper addresses the problem by proposing a robust technique that detects fine-grained human body parts (chest, back, abdomen, etc.) and assigns a level of nudity to each part using a multi-label image classification model. Since existing datasets were not sufficient for the given problem, we created a new dataset with a total of 37,872 images and 4,879,517 annotated labels that describe human body parts and nudity levels (6 labels for each body part). We fine-tuned multiple state-of-the-art convolutional neural network models (VGG, ResNet, MobileNet) on our dataset for multi-label image classification. Our solution has a total accuracy of 98.1% on the test dataset and a low false positive rate of 0.8%.

Cristian Ion, Cristian Minea
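A minimal sketch of the multi-label setup this abstract describes, assuming a hypothetical label layout of five body parts with six nudity levels each; the backbone, layer sizes, and training details are illustrative guesses, not the authors' configuration:

```python
# Multi-label fine-tuning sketch (PyTorch); not the paper's code.
import torch
import torch.nn as nn
from torchvision import models

NUM_PARTS, LEVELS = 5, 6               # hypothetical label layout
model = models.resnet50()              # pretrained weights would be loaded in practice
model.fc = nn.Linear(model.fc.in_features, NUM_PARTS * LEVELS)  # one logit per label

criterion = nn.BCEWithLogitsLoss()     # independent sigmoid per label -> multi-label

images = torch.randn(8, 3, 224, 224)   # dummy batch
targets = torch.randint(0, 2, (8, NUM_PARTS * LEVELS)).float()
loss = criterion(model(images), targets)
loss.backward()
```

Unlike a single-label softmax head, the sigmoid-per-label head lets several body-part/nudity-level labels be active in the same image, which is what the fine-grained setting requires.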
DeepGRU: Deep Gesture Recognition Utility

We propose DeepGRU, a novel end-to-end deep network model informed by recent developments in deep learning for gesture and action recognition, that is streamlined and device-agnostic. DeepGRU, which uses only raw skeleton, pose, or vector data, is quickly understood, implemented, and trained, yet achieves state-of-the-art results on challenging datasets. At the heart of our method lies a set of stacked gated recurrent units (GRU), two fully-connected layers and a novel global attention model. We evaluate our method on seven publicly available datasets, containing various numbers of samples and spanning a broad range of interactions (full-body, multi-actor, hand gestures, etc.). In all but one case we outperform the state-of-the-art pose-based methods. For instance, we achieve a recognition accuracy of 84.9% and 92.3% on cross-subject and cross-view tests of the NTU RGB+D dataset respectively, and 100% recognition accuracy on the UT-Kinect dataset. We show that even in the absence of powerful hardware, or a large amount of training data, and with as little as four samples per class, DeepGRU can be trained in under 10 min while beating traditional methods specifically designed for small training sets, making it an enticing choice for rapid application prototyping and development.

Mehran Maghoumi, Joseph J. LaViola Jr.
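A rough sketch of the kind of network the abstract outlines — stacked GRUs, a global attention step, and two fully-connected layers. All dimensions are guesses, and the attention below is a generic per-time-step weighting, not necessarily the paper's novel attention model:

```python
# Generic GRU-plus-attention classifier sketch (PyTorch); not DeepGRU itself.
import torch
import torch.nn as nn

class GestureNet(nn.Module):
    def __init__(self, in_dim, hidden=256, classes=10):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, num_layers=2, batch_first=True)
        self.attn = nn.Linear(hidden, 1)                  # scores each time step
        self.fc = nn.Sequential(nn.Linear(hidden, 128), nn.ReLU(),
                                nn.Linear(128, classes))  # two FC layers

    def forward(self, x):                                 # x: (batch, time, features)
        h, _ = self.gru(x)
        w = torch.softmax(self.attn(h), dim=1)            # attention weights over time
        context = (w * h).sum(dim=1)                      # weighted global summary
        return self.fc(context)

logits = GestureNet(in_dim=63)(torch.randn(4, 100, 63))   # e.g. 21 joints x 3 coords
```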
Delineation of Road Networks Using Deep Residual Neural Networks and Iterative Hough Transform

In this paper we present a complete pipeline for extracting road network vector data from satellite RGB orthophotos of urban areas. Firstly, a network based on the SegNeXt architecture with a novel loss function is employed for the semantic segmentation of the roads. Results show that the proposed network produces on average better results than other state-of-the-art semantic segmentation techniques. Secondly, we propose a fast post-processing technique for vectorizing the rasterized segmentation result, removing erroneous lines, and refining the road network. The result is a set of vectors representing the road network. We have extensively tested the proposed pipeline and provide quantitative and qualitative comparisons with other state-of-the-art methods based on a number of known metrics.

Pinjing Xu, Charalambos Poullis
DomainSiam: Domain-Aware Siamese Network for Visual Object Tracking

Visual object tracking is a fundamental task in the field of computer vision. Recently, Siamese trackers have achieved state-of-the-art performance on recent benchmarks. However, Siamese trackers do not fully utilize semantic and objectness information from pre-trained networks that have been trained on an image classification task. Furthermore, the pre-trained Siamese architecture is sparsely activated by the category label, which leads to unnecessary calculations and overfitting. In this paper, we propose to learn a Domain-Aware Siamese network that fully utilizes semantic and objectness information while producing a class-agnostic output using a ridge regression network. Moreover, to reduce the sparsity problem, we solve the ridge regression problem with a differentiable weighted-dynamic loss function. Our tracker, dubbed DomainSiam, improves feature learning in the training phase and generalization capability to other domains. Extensive experiments are performed on five tracking benchmarks, including OTB2013 and OTB2015 as a validation set, as well as VOT2017, VOT2018, LaSOT, TrackingNet, and GOT10k as a testing set. DomainSiam achieves state-of-the-art performance on these benchmarks while running at 53 FPS.

Mohamed H. Abdelpakey, Mohamed S. Shehata
Reconstruction Error Aware Pruning for Accelerating Neural Networks

This paper presents a pruning method, Reconstruction Error Aware Pruning (REAP), to reduce the redundancy of convolutional neural network models and accelerate inference. REAP is an extension of one of the state-of-the-art channel pruning methods. Our method takes three steps: (1) evaluating the importance of each channel based on the reconstruction error of the outputs in each convolutional layer, (2) pruning the less important channels, and (3) updating the remaining weights by the least squares method so as to reconstruct the outputs. By pruning with REAP, one can produce a fast and accurate model out of a large pretrained model. Moreover, REAP saves much of the time and effort required for retraining the pruned model. As our method requires a large computational cost, we have developed an algorithm based on a biorthogonal system to conduct the computation efficiently. In the experiments, we show that REAP can prune with a smaller sacrifice in model performance than several existing state-of-the-art methods, such as CP [9], ThiNet [17], and DCP [25].

Koji Kamma, Toshikazu Wada
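Step (3) above — refitting the remaining weights by least squares so that the layer's original outputs are reconstructed — can be illustrated on a toy dense layer. REAP actually operates on convolutional feature maps and ranks channels by reconstruction error; the weight-norm ranking below is only a crude stand-in:

```python
# Toy NumPy illustration of "prune, then least-squares reconstruct".
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 64))       # activations of 64 input channels
W = rng.standard_normal((64, 32))         # original layer weights
Y = X @ W                                 # outputs we want to preserve

keep = np.argsort(np.linalg.norm(W, axis=1))[-48:]       # crude importance proxy
W_new, *_ = np.linalg.lstsq(X[:, keep], Y, rcond=None)   # refit the kept weights

err = np.linalg.norm(X[:, keep] @ W_new - Y) / np.linalg.norm(Y)
print(f"relative reconstruction error after pruning: {err:.3f}")
```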

Computer Graphics I

Frontmatter
Bioinspired Simulation of Knotting Hagfish

Hagfish are capable of not only forming knots, but also sliding them along the length of their bodies. This remarkable behavior is used by the animal for a wide variety of purposes, such as feeding and manipulation. Clearly of interest to biologists, this knotting behavior is also relevant to other fields, such as bioinspired soft robotics. However, this knot-sliding behavior has been challenging to model and has not been simulated on a computer. In this paper, we present the first physics-based simulation of the knot-sliding behavior of hagfish. We show that a contact-based inverse dynamics approach, motivated by the biological concept called positive thigmotaxis, works very well for this challenging control problem.

Yura Hwang, Theodore A. Uyeno, Shinjiro Sueda
Interactive 3D Visualization for Monitoring and Analysis of Geographical Traffic Data of Various Domains

Visual interactive tools are of great importance for monitoring and analysis of geographical data, and, in particular, traffic data. Substantial research effort goes into visualization techniques for various kinds of geography-bound traffic data. Unfortunately, such techniques are very domain-specific and often lack useful features. We propose an interactive visualization system for monitoring and analyzing traffic data on a 3D globe. Our system is general and can be transparently used in different domains, which we exemplify by two simulated demonstrations of use cases: Logistic Service and Data Communication. Using these examples, we show that our approach is more general than the current state of the art, and that there are significant similarities between several domains in need of interactive visualization, which are mostly treated as completely separate.

Daniil Rodin, Oded Shmueli, Gershon Elber
Propagate and Pair: A Single-Pass Approach to Critical Point Pairing in Reeb Graphs

With the popularization of Topological Data Analysis, the Reeb graph has found new applications as a summarization technique in the analysis and visualization of large and complex data, whose usefulness extends beyond just the graph itself. Pairing critical points enables forming topological fingerprints, known as persistence diagrams, that provide insights into the structure and noise in data. Although the body of work addressing the efficient calculation of Reeb graphs is large, the literature on pairing is limited. In this paper, we discuss two algorithmic approaches for pairing critical points in Reeb graphs: first a multipass approach, followed by a new single-pass algorithm, called Propagate and Pair.

Junyi Tu, Mustafa Hajij, Paul Rosen
Real-Time Ray Tracing with Spherically Projected Object Data

As ray tracing becomes feasible in terms of computational cost for real-time applications, new challenges emerge to achieve sufficient quality. To reach an acceptable framerate, the number of consecutive rays is strongly reduced to keep the workload on the GPU low, but sophisticated approaches for denoising are then required. One of the major bottlenecks is finding the ray intersection with the geometry. In this work, we present a fast alternative by pre-computing a spherical projection of an object, reducing the cost of intersection testing independently of the vertex count by projecting the object onto a circumscribed sphere. We test our Spherical Projection Approximation (SPA) by implementing it in a DirectX Raytracing (DXR) framework and comparing framerates and output quality for indirect light with DXR's native triangle intersection for various dense objects. We find that our approach not only achieves comparable quality in representing the indirect light, but is also significantly faster, and consequently provides a ray tracing alternative for achieving real-time capabilities in complex scenes.

Bridget Makena Winn, Reed Garmsen, Irene Humer, Christian Eckhardt
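The saving comes from replacing many per-triangle tests with a single test against the circumscribed sphere. For reference, the standard ray/sphere intersection looks as follows; this is a NumPy sketch, not the paper's DXR implementation, and the subsequent lookup into the spherical projection is omitted:

```python
# Standard ray/sphere intersection test (NumPy).
import numpy as np

def ray_sphere(origin, direction, center, radius):
    """Return the smallest positive hit distance t, or None on a miss."""
    oc = origin - center
    b = 2.0 * np.dot(direction, oc)
    c = np.dot(oc, oc) - radius ** 2
    disc = b * b - 4.0 * c            # direction assumed unit length (a = 1)
    if disc < 0:
        return None
    t = (-b - np.sqrt(disc)) / 2.0
    return t if t > 0 else None

t = ray_sphere(np.zeros(3), np.array([0.0, 0.0, 1.0]),
               np.array([0.0, 0.0, 5.0]), 1.0)
print(t)  # 4.0: the ray hits the near side of the sphere
```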
Underwater Photogrammetry Reconstruction: GPU Texture Generation from Videos Captured via AUV

Photogrammetry is a useful tool for creating computer models of archaeological sites for monitoring and for general public outreach. Modeling archaeological sites found in the marine environment is particularly challenging due to danger to divers, the cost of underwater photography equipment, and lighting challenges. The automatic acquisition of video footage of underwater marine archaeology sites using an AUV can be an advantageous alternative, yet it also incurs its own obstacles. In this paper we present our system and enhancements for applying a standard photogrammetry reconstruction pipeline to underwater sites using video footage captured from an AUV. Our primary contribution is a GPU-driven algorithm for texture construction to reduce blur in the final model. We demonstrate the results of our system on a well-known wreck site in Malta.

Kolton Yager, Christopher Clark, Timmy Gambin, Zoë J. Wood

Segmentation/Recognition

Frontmatter
Adaptive Attention Model for Lidar Instance Segmentation

Detecting and categorizing instances of objects using Lidar scans is of critical importance for highly autonomous vehicles, which are expected to safely and swiftly maneuver through complex urban streets without the intervention of human drivers. In contrast to recent detection-based approaches [6, 10], we formulate the problem as a point-wise segmentation problem and focus on improving the recognition of small objects, which is very challenging due to the low resolution of commercial Lidar systems. Specifically, we propose a novel end-to-end convolutional neural network (CNN) that encapsulates adaptive attention information and achieves instance segmentation by fusing multiple auxiliary tasks. We examined our algorithms on the 2D projection data derived from the KITTI 3D object detection dataset [8] and achieved at least a 14.6% improvement in Intersection over Union (IoU) with faster inference time (25.3 ms per Lidar scan) than the state-of-the-art algorithms.

Peixi Xiong, Xuetao Hao, Yunming Shao, Jerry Yu
View Dependent Surface Material Recognition

The paper presents a detailed study of the dependence of surface material recognition on illumination and viewing conditions, which is a hard challenge in realistic scene interpretation. The results document a sharp decrease in classification accuracy when using the usual texture recognition approach, i.e., a small learning set and vertical viewing and illumination angles, which is a very inadequate representation of the enormous variability of material appearance. The visual appearance of materials is considered in the state-of-the-art Bidirectional Texture Function (BTF) representation and measured using an upper-end BTF gonioreflectometer. The materials in this study are sixty-five different wood species. The supervised material recognition uses a shallow convolutional neural network (CNN) for the error analysis of angular dependency. We propose a Gaussian mixture model-based method for robust material segmentation.

Stanislav Mikeš, Michal Haindl
3D Visual Object Detection from Monocular Images

3D visual object detection is a fundamental requirement for autonomous vehicles. However, accurately detecting 3D objects was until recently a quality unique to expensive LiDAR ranging devices. Approaches based on cheaper monocular imagery are typically incapable of identifying 3D objects. In this paper, we propose a novel approach to predict accurate 3D bounding box locations on monocular images. We first train a generative adversarial network (GAN) to perform monocular depth estimation. The ground truth training depth data is obtained via depth completion on LiDAR scans. Next, we combine both depth and appearance data into a bird's-eye-view representation with height, density, and grayscale intensity as the three feature channels. Finally, we train a convolutional neural network (CNN) on our feature map, leveraging bounding boxes annotated on corresponding LiDAR scans. Experiments show that our method performs favorably against baselines.

Qiaosong Wang, Christopher Rasmussen
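A sketch of the bird's-eye-view rasterization step the abstract mentions, with height, density, and intensity channels; the grid extents, resolution, and per-cell reductions are arbitrary choices, not the paper's settings:

```python
# Point cloud (x, y, z, intensity) -> 3-channel BEV map (NumPy sketch).
import numpy as np

def to_bev(points, x_rng=(0.0, 70.0), y_rng=(-35.0, 35.0), res=0.1):
    h = int((x_rng[1] - x_rng[0]) / res)
    w = int((y_rng[1] - y_rng[0]) / res)
    bev = np.zeros((3, h, w), dtype=np.float32)      # height, density, intensity
    xi = ((points[:, 0] - x_rng[0]) / res).astype(int)
    yi = ((points[:, 1] - y_rng[0]) / res).astype(int)
    ok = (xi >= 0) & (xi < h) & (yi >= 0) & (yi < w)
    for i, j, z, inten in zip(xi[ok], yi[ok], points[ok, 2], points[ok, 3]):
        bev[0, i, j] = max(bev[0, i, j], z)          # tallest point in the cell
        bev[1, i, j] += 1.0                          # point count
        bev[2, i, j] = max(bev[2, i, j], inten)      # strongest return
    bev[1] = np.log1p(bev[1])                        # compress the density range
    return bev

bev = to_bev(np.random.rand(10000, 4) * [70, 70, 3, 1] - [0, 35, 0, 0])
print(bev.shape)  # (3, 700, 700)
```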
Skin Identification Using Deep Convolutional Neural Network

Skin identification can be used in several security applications such as border security checkpoints and facial recognition in biometric systems. Traditional skin identification techniques were unable to deal with the high complexity and uncertainty of human skin in uncontrolled environments. To address this gap, this research proposes a new skin identification technique using a deep convolutional neural network. The proposed sequential deep model consists of three blocks of convolutional layers, followed by a series of fully connected layers, optimized to maximize skin texture classification accuracy. The proposed model's performance has been compared with some well-known texture-based skin identification techniques and delivered superior results in terms of overall accuracy. The experiments were carried out on two datasets, including the FSD Benchmark dataset as well as an in-house skin texture patch dataset. Results show that the proposed deep skin identification model, with a highest reported accuracy of 0.932 and a minimum loss of 0.224, delivers reliable and robust skin identification.

Mahdi Maktab Dar Oghaz, Vasileios Argyriou, Dorothy Monekosso, Paolo Remagnino
Resolution-Independent Meshes of Superpixels

Over-segmentation into superpixels is an important pre-processing step to smartly compress the input size and speed up higher-level tasks. A superpixel was traditionally considered as a small cluster of square-based pixels that have similar color intensities and are closely located to each other. In this discrete model the boundaries of superpixels often have irregular zigzags consisting of horizontal or vertical edges from a given pixel grid. However, digital images represent a continuous world, hence the following continuous model in the resolution-independent formulation can be more suitable for the reconstruction problem. Instead of uniting squares in a grid, a resolution-independent superpixel is defined as a polygon that has straight edges with any possible slope at subpixel resolution. The harder continuous version of the over-segmentation problem is to split an image into polygons and find a best (say, constant) color for each polygon so that the resulting colored mesh well approximates the given image. Such a mesh of polygons can be rendered at any higher resolution with all edges kept straight. We propose a fast conversion of any traditional superpixels into polygons that guarantees that their straight edges do not intersect. Meshes based on the superpixels SEEDS (Superpixels Extracted via Energy-Driven Sampling) and SLIC (Simple Linear Iterative Clustering) are compared with past meshes based on the Line Segment Detector. Experiments on the Berkeley Segmentation Database confirm that the new superpixels have more compact shapes than pixel-based superpixels.

Vitaliy Kurlin, Philip Smith

Video Analysis and Event Recognition

Frontmatter
Automatic Video Colorization Using 3D Conditional Generative Adversarial Networks

In this work, we present a method for automatic colorization of grayscale videos. The core of the method is a Generative Adversarial Network that is trained and tested on sequences of frames in a sliding window manner. Network convolutional and deconvolutional layers are three-dimensional, with frame height, width and time as the dimensions taken into account. Multiple chrominance estimates per frame are aggregated and combined with available luminance information to recreate a colored sequence. Colorization trials are run successfully on a dataset of old black-and-white films. The usefulness of our method is also validated with numerical results, computed with a newly proposed metric that measures colorization consistency over a frame sequence.

Panagiotis Kouzouglidis, Giorgos Sfikas, Christophoros Nikou
Improving Visual Reasoning with Attention Alignment

Since attention mechanisms were introduced, they have become an important component of neural network architectures. This is because they mimic how humans reason about visual stimuli by focusing on important parts of the input. In visual tasks like image captioning and visual question answering (VQA), networks can generate the correct answer or a comprehensible caption despite attending to the wrong part of an image or text. This lack of synchronization between human and network attention hinders the model's ability to generalize. To improve the human-like reasoning capabilities of the model, it is necessary to align what the network and a human will focus on, given the same input. We propose a mechanism to correct visual attention in the network by explicitly training the model to learn the salient parts of an image available in the VQA-HAT dataset. The results show an improvement in the visual question answering task across different types of questions.

Komal Sharan, Ashwinkumar Ganesan, Tim Oates
Multi-camera Temporal Grouping for Play/Break Event Detection in Soccer Games

Many current deep learning approaches to action recognition focus on recognizing concrete (e.g., single actor) actions in trimmed videos from datasets such as UCF-101 and HMDB-51. However, high-level semantic analysis of sports videos often requires recognizing more abstract events or situations involving multiple players with longer time-scale context. This paper builds upon inflated 3D (I3D) ConvNets for video action recognition to detect and differentiate six abstract categories of events in untrimmed videos of soccer games from multiple fixed cameras: normal play, plus breaks in play due to kick-offs, free kicks, throw-ins, and goal and corner kicks. Raw video unit classifications by variants of the basic I3D network are post-processed by two novel and efficient grouping methods for localizing the boundaries of events. Our experiments show that the proposed methods can achieve 84.2% weighted precision for event categories at the level of video units, and boost event temporal localization mean average precision at 0.5 tIoU (mAP@0.5) to 62.0%.

Chunbo Song, Christopher Rasmussen
Trajectory Prediction by Coupling Scene-LSTM with Human Movement LSTM

We develop a novel human trajectory prediction system that incorporates the scene information (Scene-LSTM) as well as individual pedestrian movement (Pedestrian-LSTM) trained simultaneously within static crowded scenes. We superimpose a two-level grid structure (grid cells and subgrids) on the scene to encode spatial granularity plus common human movements. The Scene-LSTM captures the commonly traveled paths that can be used to significantly influence the accuracy of human trajectory prediction in local areas (i.e. grid cells). We further design scene data filters, consisting of a hard filter and a soft filter, to select the relevant scene information in a local region when necessary and combine it with Pedestrian-LSTM for forecasting a pedestrian’s future locations. The experimental results on several publicly available datasets demonstrate that our method outperforms related works and can produce more accurate predicted trajectories in different scene contexts.

Manh Huynh, Gita Alaghband
Augmented Curiosity: Depth and Optical Flow Prediction for Efficient Exploration

Exploring novel environments for a specific target poses the challenge of how to adequately provide positive external rewards to an artificial agent. In scenarios with sparse external rewards, a reinforcement learning algorithm often cannot develop a successful policy function to govern an agent's behavior. However, intrinsic rewards can provide feedback on an agent's actions and enable updates towards a proper policy function in sparse scenarios. Our approaches, called the Optical Flow-Augmented Curiosity Module (OF-ACM) and the Depth-Augmented Curiosity Module (D-ACM), extend the Intrinsic Curiosity Model (ICM) by Pathak et al. The ICM forms an intrinsic reward signal from the error between a prediction and the ground truth of the next state. In experiments on visually rich and feature-sparse scenarios in ViZDoom, our predictive modules exhibit improved exploration capabilities and learning of an ideal policy function. Our modules leverage additional sources of information, such as depth images and optical flow, to generate superior embeddings that serve as inputs for next-state prediction. With D-ACM we show a 63.3% average improvement in time to convergence of a policy over ICM in “My Way Home” scenarios.

Juan Carvajal, Thomas Molnar, Lukasz Burzawa, Eugenio Culurciello
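The ICM idea the modules build on is simple: a forward model predicts the embedding of the next state, and its prediction error becomes the intrinsic reward. A minimal sketch with toy linear models follows; D-ACM and OF-ACM would swap in depth- and optical-flow-based embeddings:

```python
# ICM-style intrinsic reward sketch (PyTorch); toy models, not the paper's.
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Linear(64, 32), nn.ReLU())   # toy state encoder
forward_model = nn.Linear(32 + 4, 32)                 # predicts the next embedding

state, next_state = torch.randn(1, 64), torch.randn(1, 64)
action = torch.zeros(1, 4); action[0, 2] = 1.0        # one-hot action

pred = forward_model(torch.cat([embed(state), action], dim=1))
intrinsic_reward = 0.5 * (pred - embed(next_state)).pow(2).sum()
print(float(intrinsic_reward))  # larger prediction error -> more "curiosity"
```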

Visualization

Frontmatter
Information Visualization for Highlighting Conflicts in Educational Timetabling Problems

Scheduling is a very important problem in many organizations, such as hospitals, transportation companies, sports confederations, and educational institutions. Obtaining a good schedule results in the maximization of some desired benefit. In particular, educational institutions (from elementary schools to universities) experience this problem periodically, either during the preparation of class-teacher or examination timetabling. When a solution proposal is being elaborated (by manual, semi-automatic or automatic processes), the phenomenon called a conflict, clash or collision commonly occurs. It is characterized by the simultaneous use of a resource (human or material) that cannot be shared and, therefore, its occurrence makes that proposal impracticable for adoption. Semi-automatic systems commonly identify such problems and resolve them through user interaction. Automatic systems, on the other hand, try to identify and solve conflicts without user intervention. Despite this, conflicts are not easy to resolve. This article proposes the use of information visualization techniques to highlight the occurrence of conflicts and, using user hints, contribute to their resolution, aiming at obtaining better quality timetables for practical adoption. The proposed visualizations were evaluated in order to determine their expressiveness and effectiveness, considering four aspects: coverage of the research questions, efficiency of the adopted visual mapping, supported level of human interaction, and scalability. A conceptual qualitative study showed that the use of these techniques can aid users, especially non-specialists, in identifying conflicts and improving the desired educational timetables.

Wanderley de Souza Alencar, Hugo Alexandre Dantas do Nascimento, Walid Abdala Rfaei Jradi, Fabrizzio Alphonsus A. M. N. Soares, Juliana Paula Felix
ContourNet: Salient Local Contour Identification for Blob Detection in Plasma Fusion Simulation Data

We present ContourNet, a deep learning approach to identify salient local isocontours as blobs in large-scale 5D gyrokinetic tokamak simulation data. Blobs—regions of high turbulence that run along the edge wall down toward the divertor and can damage the tokamak—are not well-defined features but have been empirically localized by isocontours in 2D normalized fluctuating density fields. The key of our study is to train ContourNet to follow the empirical rules to detect blobs over the time-varying simulation data. The architecture of ContourNet is a convolutional neural segmentation network: the inputs are the density field and a rasterized isocontour; the output is a set of isocontours encircling blobs. At the training stage, we feed the network with manually identified isocontours and propagated labels. At the inference stage, we extract isocontours from the segmented blob regions. Results show that our approach can achieve both high accuracy and performance, which enables scientists to understand the blob dynamics influencing the confinement of the plasma.

Martin Imre, Jun Han, Julien Dominski, Michael Churchill, Ralph Kube, Choong-Seock Chang, Tom Peterka, Hanqi Guo, Chaoli Wang
Mutual Information-Based Texture Spectral Similarity Criterion

A fast novel texture spectral similarity criterion, capable of assessing the spectral modeling resemblance of color and Bidirectional Texture Function (BTF) textures, is presented. The criterion reliably compares the multi-spectral pixel values of two textures, and thus it can assist the development of an optimal modeling or acquisition setup by comparing the original data with its synthetic simulations. The suggested criterion, together with existing alternatives, is extensively tested in a long series of thousands of specially designed, monotonically degrading experiments; moreover, it is successfully compared on a wide variety of color and BTF textures.

Michal Haindl, Michal Havlíček
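For orientation, a generic histogram-based mutual information between two single-channel textures is shown below; the paper's criterion is a specialized multi-spectral variant, so this is only the underlying quantity, not the proposed measure:

```python
# Histogram-based mutual information between two gray images (NumPy).
import numpy as np

def mutual_information(a, b, bins=64):
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)
    nz = pxy > 0                                   # avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px[:, None] * py[None, :])[nz])).sum())

img = np.random.rand(128, 128)
print(mutual_information(img, img))                       # high: identical textures
print(mutual_information(img, np.random.rand(128, 128)))  # near zero: unrelated
```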
Accurate Computation of Interval Volume Measures for Improving Histograms

The interval volume measure is defined as the volume of the space occupied by a range of isosurfaces corresponding to an interval of isovalues. The interval volume measures are very useful since they can be taken as an alternative way of producing a smooth noise-suppressing histogram. This paper proposes two new methods (i.e., the subdividing method and the slicing method) that can calculate interval volume measures with very high accuracy for scalar regular-grid volumetric datasets. It is assumed that the underlying function inside the grid cell is defined by trilinear interpolation. A refined histogram method that can produce accurate interval volume measures is also presented in the paper. All three methods are compared against one another in terms of accuracy and performance. Their improvement for computing global and local histograms is demonstrated by comparing against the previous methods.

Cuilan Wang
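Both proposed methods assume the trilinear function model inside each cell; for reference, trilinear interpolation from the eight corner values is computed like this (the standard formula, not the paper's subdividing or slicing code):

```python
# Trilinear interpolation inside one grid cell (NumPy).
import numpy as np

def trilinear(f, x, y, z):
    """f: (2, 2, 2) corner values; x, y, z in [0, 1] local cell coordinates."""
    c00 = f[0, 0, 0] * (1 - x) + f[1, 0, 0] * x
    c01 = f[0, 0, 1] * (1 - x) + f[1, 0, 1] * x
    c10 = f[0, 1, 0] * (1 - x) + f[1, 1, 0] * x
    c11 = f[0, 1, 1] * (1 - x) + f[1, 1, 1] * x
    c0 = c00 * (1 - y) + c10 * y
    c1 = c01 * (1 - y) + c11 * y
    return c0 * (1 - z) + c1 * z

corners = np.arange(8, dtype=float).reshape(2, 2, 2)
print(trilinear(corners, 0.5, 0.5, 0.5))  # 3.5, the average of all corners
```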
Ant-SNE: Tracking Community Evolution via Animated t-SNE

We introduce a method for tracking the community evolution and a prototype (Ant-SNE) for analyzing multivariate time series and guiding interactive exploration through high-dimensional data. The method is based on t-distributed Stochastic Neighbor Embedding (t-SNE), a machine learning algorithm for nonlinear dimension reduction well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. By tracking the evolution of temporal multivariate data points, we are able to locate unusual behaviors (outliers) and interesting sub-series for further analysis. In the experiments, we conducted two case studies with the US employment dataset and the HPC health status dataset in order to confirm the effectiveness of the proposed system.

Ngan V. T. Nguyen, Tommy Dang

ST: Computational Vision, AI and Mathematical Methods for Biomedical and Biological Image Analysis

Frontmatter
Automated Segmentation of the Pectoral Muscle in Axial Breast MR Images

Pectoral muscle segmentation is a crucial step in various computer-aided applications of breast Magnetic Resonance Imaging (MRI). Due to imaging artifacts and homogeneity between the pectoral and breast regions, pectoral muscle boundary estimation is not a trivial task. In this paper, a fully automatic segmentation method based on deep learning is proposed for accurate delineation of the pectoral muscle boundary in axial breast MR images. The proposed method involves two main steps: pectoral muscle segmentation and boundary estimation. For pectoral muscle segmentation, a model based on the U-Net architecture is used to segment the pectoral muscle from the input image. Next, the pectoral muscle boundary is estimated through candidate point detection and contour segmentation. The proposed method was evaluated quantitatively on two real-world datasets: our own private dataset and a publicly available dataset. The first dataset includes breast MR images of 12 patients and the second dataset consists of breast MR images of 80 patients. The proposed method achieved a Dice score of 95% on the first dataset and 89% on the second dataset. The high segmentation performance of the proposed method when evaluated on large-scale quantitative breast MR images confirms its potential applicability in future breast cancer clinical applications.

Sahar Zafari, Mazen Diab, Tuomas Eerola, Summer E. Hanson, Gregory P. Reece, Gary J. Whitman, Mia K. Markey, Krishnaswamy Ravi-Chandar, Alan Bovik, Heikki Kälviäinen
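The Dice score used in the evaluation is the standard overlap measure for binary masks:

```python
# Dice score between two binary segmentation masks (NumPy).
import numpy as np

def dice(pred, truth, eps=1e-7):
    pred, truth = pred.astype(bool), truth.astype(bool)
    return 2.0 * np.logical_and(pred, truth).sum() / (pred.sum() + truth.sum() + eps)

a = np.zeros((10, 10)); a[2:7, 2:7] = 1   # predicted mask
b = np.zeros((10, 10)); b[3:8, 3:8] = 1   # ground-truth mask
print(f"Dice: {dice(a, b):.2f}")          # 0.64 for two shifted squares
```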
Angio-AI: Cerebral Perfusion Angiography with Machine Learning

Angiography is a medical imaging technique used to visualize blood vessels. Perfusion angiography, where perfusion is defined as the passage of blood through the vasculature and tissue, is a computational tool created to quantify blood flow from angiography images. Perfusion angiography is critical in areas such as stroke diagnosis, where identification of areas with low blood flow and assessment of revascularization are essential. Currently, perfusion angiography is performed through deconvolution methods that are susceptible to the noise present in angiographic imaging. This paper introduces a machine learning-based formulation of perfusion angiography that can greatly speed up the process. Specifically, kernel spectral regression (KSR) is used to learn the function mapping between digital subtraction angiography (DSA) frames and blood flow parameters. Model performance is evaluated by examining the similarity of the parametric maps produced by the model compared to those obtained via deconvolution. Our experiments on 15 patients show that the proposed Angio-AI framework can reliably compute parametric cerebral perfusion characterization in terms of cerebral blood volume (CBV), cerebral blood flow (CBF), arterial cerebral blood volume, and time-to-peak (TTP).

Ebrahim Feghhi, Yinsheng Zhou, John Tran, David S. Liebeskind, Fabien Scalzo
Conformal Welding for Brain-Intelligence Analysis

In this work, we present a geometric method to explore the relationship between brain anatomical structure and human intelligence based on conformal welding theory. We first generate the anatomical atlas on the structural MRI data; then compute the signature for each cortical region by welding the conformal maps of the region and its complement domain along the common boundary, and combine all the region signatures into a signature for the whole brain; and finally use the signatures for shape visualization and classification using learning methods. The signature is global, intrinsic to surface and curve geometry, and invariant to conformal transformations, and the computation is efficient through solving sparse linear systems. Experiments on a real dataset with 243 subjects demonstrate the efficacy of the proposed method and conclude that the conformal welding signature of the cortical surface can classify human intelligence with a competitive accuracy rate compared with traditional features.

Liqun Yang, Muhammad Razib, Kenia Chang He, Tianren Yang, Zhong-Lin Lu, Xianfeng Gu, Wei Zeng
Learning Graph Cut Class Prototypes for Thigh CT Tissue Identification

Perceptual grouping remains a challenging topic in the fields of image analysis and computer vision. Image segmentation is often formulated as an optimization problem that may be solved by graph partitioning, or variational approaches. The graph cut method has shown wide applicability in various segmentation and object recognition tasks. It approaches image segmentation as a graph partitioning problem. To find the optimal graph partition, graph cut methods minimize an energy function that consists of data and smoothness terms. An advantage of this method is that it can combine local and global visual information to obtain semantic segmentation of objects in the visual scene. In this work, we introduce unsupervised and supervised learning techniques for generating the class prototypes used by graph cuts. Our hypothesis is that computation of accurate statistical priors improves the accuracy of the graph cut solution. We utilize these techniques for tissue identification in the mid-thigh using CT scans. We evaluate the performance of the compared approaches against reference data. Our results show that inclusion of accurate statistical priors produces better delineation than unsupervised learning of the prototypes. In addition, these methods are suitable for tissue identification as they can model multiple tissue types to perform simultaneous segmentation and identification.

Taposh Biswas, Sokratis Makrogiannis
Automatic Estimation of Arterial Input Function in Digital Subtraction Angiography

Estimation of cerebral blood flow (CBF) from a digital subtraction angiogram (DSA) is typically obtained through deconvolution of the contrast concentration time-curve with the arterial input function (AIF). Automatic detection of the AIF through analysis of angiograms could expedite this computation and improve its accuracy by allowing fully automated angiogram processing. This optimization is decisive given the significance of CBF modeling in diagnosing and treating cases of acute ischemic stroke, arteriovenous malformation, brain tumor, and other deviations in cerebral or renal perfusion, for example. This study presents an AIF detection model that relies on the identification of the intracranial carotid artery (ICA) through image segmentation. A contrast agent is used to detect the presence of blood flow in the angiogram, which facilitates signal intensity monitoring throughout 20 frames, ultimately allowing us to compute the AIF. When compared to the manually outlined AIF, the proposed model reached an AUROC value of 98.54%. Automatic AIF detection using machine learning methods could therefore provide consistent, reproducible, and accurate results that could quantify CBF and allow physicians to expedite more informed diagnoses for a wide variety of conditions altering cerebral blood flow.

Alexander Liebeskind, Adit Deshpande, Julie Murakami, Fabien Scalzo

Biometrics

Frontmatter
One-Shot-Learning for Visual Lip-Based Biometric Authentication

Lip-based biometric authentication is the process of verifying an individual's identity based on visual information taken from the lips whilst speaking. To date, research in this area has involved more traditional approaches and produced inconsistent results that are difficult to compare. This work aims to push the field forward through the application of deep learning. A deep artificial neural network using spatiotemporal convolutional and bidirectional gated recurrent unit layers is trained end-to-end. For the first time, one-shot learning is applied to lip-based biometric authentication by implementing a siamese network architecture, meaning the model needs only a single prior example in order to authenticate new users. This approach sets a new state-of-the-art performance for lip-based biometric authentication on the XM2VTS dataset and Lausanne protocol, with an equal error rate of 0.93% on the evaluation set and a false acceptance rate of 1.07% at a 1% false rejection rate.

Carrie Wright, Darryl Stewart
Age Group and Gender Classification of Unconstrained Faces

Age and gender classification under unconstrained imaging conditions has attracted increased attention as it is applicable in many real-world applications. Recent deep learning-based methods have shown encouraging performance in this field. We therefore propose an end-to-end deep learning-based method for robust age group and gender classification of unconstrained images. In particular, we address the estimation problem with a classification-based model that treats each age value as a separate class and an independent label. The proposed deep convolutional neural network model learns the relevant informative age and gender representations directly from the image pixels. Technically, the model is first pre-trained on the large-scale IMDb-WIKI facial aging dataset, and then fine-tuned on MORPH-II, another large-scale facial aging dataset, to learn and pick up the bias and particularities of each dataset. Finally, it is fine-tuned on the original dataset (the OIU-Adience benchmark) with gender and age group labels. The experimental results, when analyzed for estimation accuracy on the OIU-Adience dataset, show that our model obtains state-of-the-art performance in both age group and gender classification, with exact and one-off accuracies of 83.1% and 93.8% on age, and an exact accuracy of 96.2% on gender.

Olatunbosun Agbo-Ajala, Serestina Viriri
Efficient 3D Face Recognition in Uncontrolled Environment

Face recognition in an uncontrolled environment is challenging, as body movement and pose variation can result in missing facial features. In this paper, we tackle this problem by fusing multiple RGB-D images with varying poses. In particular, we develop an efficient pose fusion algorithm that frontalizes the faces and combines the multiple inputs. We then introduce a new 3D registration method based on the unified coordinate system (UCS) to compensate for pose and scale variations and normalize the probe and gallery faces. To perform 3D face recognition, we train a Support Vector Machine (SVM) with both 2D color and 3D geometric features. Experimental results on an RGB-D dataset show that our method can achieve a high recognition rate and is robust in the presence of pose and expression variations.

Yuqi Ding, Nianyi Li, S. Susan Young, Jinwei Ye
Pupil Center Localization Using SOMA and CNN

We present a new method for eye pupil detection in images. The algorithm runs in two steps. First, a reasonable number of good candidates for the pupil position are determined quickly by making use of the self-organizing migrating algorithm. Subsequently, the final position of the pupil, among the preselected candidates, is determined precisely by making use of a convolutional neural network. The motivation for this two-step architecture is to create an algorithm that is both precise and fast. The favorable computational speed follows from the fact that only the meaningful positions and sizes are checked in the potentially most time-consuming second step. Moreover, the demands on training and the training set for the network are lower than if the network were used exclusively in a one-step architecture. The algorithm is capable of running on less powerful computers, e.g., on embedded computers in cars. In our tests, the algorithm achieved good results.

Radovan Fusek, Eduard Sojka, Michael Holusa
Real-Time Face Features Localization with Recurrent Refined Dense CNN Architectures

Based on an innovative, efficient recurrent deep learning architecture, we present a highly stable and robust technique to localize face features in still images, captured and live video sequences. This dense (fully convolutional) CNN architecture, referred to as the Refined Dense Mobilenet (RDM), is composed of (1) a main encoder-decoder block which aims to approximate face feature locations and (2) a sequence of refiners which aims to robustly converge in the vicinity of the features. On video sequences, the architecture is adapted into a Recurrent RDM where a shape prior component is re-injected in the form of temporal heatmaps obtained at the previous frame's inference. Accuracy and stability of the RDM/R-RDM architectures are compared with state-of-the-art Random Forest and CNN based approaches. The idea of combining a holistic feature localizer – taking advantage of large receptive fields to minimize large errors – and refiners – working at higher resolution to converge at feature vicinities – proves highly accurate in localizing face features. We demonstrate that the RDM/R-RDM architectures improve localization scores on the 300W and AFLW datasets. Moreover, by relying on modern, efficient convolutional blocks and based on our recurrent architecture, we deliver the first stable and accurate real-time implementation of face feature localization on low-end mobile devices.

Nicolas Livet

Virtual Reality I

Frontmatter
Estimation of the Distance Between Fingertips Using Silhouette and Texture Information of Dorsal of Hand

A three-dimensional virtual object can be manipulated by hand and finger movements with an optical hand tracking device which can recognize the posture of one's hand. Many conventional hand posture recognition methods are based on three-dimensional coordinates of fingertips and a skeletal model of the hand. It is difficult for the conventional methods to estimate the posture of the hand when a fingertip is hidden from an optical camera, and self-occlusion often hides the fingertip. Our study, therefore, proposes estimating the posture of a hand based on a hand dorsal image that can be taken even when the hand occludes its fingertips. Manipulation of a virtual object requires the recognition of movements like pinching, and many such movements can be recognized based on the distance between the fingertips of the thumb and the forefinger. Therefore, we use a regression model to estimate the distance between the fingertips of the thumb and forefinger using hand dorsal images. The regression model was constructed using convolutional neural networks (CNN). Our study proposes Silhouette and Texture methods for estimating the distance between fingertips using hand dorsal images and aggregates them into two methods: the Clipping method and the Aggregation method. The root mean squared error (RMSE) of the estimated distance between fingertips was 1.98 mm or less with the Aggregation method for hand dorsal images which do not contain any fingertips. The RMSE of the Aggregation method is smaller than that of the other methods. The result shows that the proposed Aggregation method could be an effective method which is robust to self-occlusion.

Takuma Shimizume, Takeshi Umezawa, Noritaka Osawa
Measuring Reflectance of Anisotropic Materials Using Two Handheld Cameras

In this paper, we propose a method for measuring the reflectance of anisotropic materials using a simple apparatus consisting of two handheld cameras, a small LED light source, a turntable, and a chessboard with markers. The system is configured to obtain the different incoming and outgoing light directions and the brightness of pixels on the surface of the material. The anisotropic Ward BRDF (Bidirectional Reflectance Distribution Function) model is used to approximate the reflectance, and the model parameters are estimated from the incoming and outgoing angles and the brightness of pixels by using a non-linear optimization method. The initial values of the anisotropic direction are given based on the peak specular lobe on the surface, and the best-fitted one is chosen as the anisotropic direction. The optimized parameters show well-fitted results between the observed brightness and the BRDF model for each RGB channel. It was confirmed that our system was able to measure the reflectance of different isotropic and anisotropic materials.

Zar Zar Tun, Seiji Tsunezaki, Takashi Komuro, Shoji Yamamoto, Norimichi Tsumura
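For reference, the anisotropic Ward model that the fitting targets is commonly written as follows (notation may differ slightly from the paper's):

```latex
f_r(\omega_i,\omega_o) \;=\; \frac{\rho_d}{\pi}
  \;+\; \frac{\rho_s}{4\pi\,\alpha_x\alpha_y\sqrt{\cos\theta_i\cos\theta_o}}
  \exp\!\left[-\tan^2\theta_h\left(\frac{\cos^2\varphi_h}{\alpha_x^2}
  + \frac{\sin^2\varphi_h}{\alpha_y^2}\right)\right]
```

Here ρ_d and ρ_s are the diffuse and specular albedos, α_x and α_y the roughness values along the two anisotropy axes, θ_i and θ_o the incident and outgoing elevation angles, and (θ_h, φ_h) the elevation and azimuth of the half vector; per-channel values of these parameters are what the non-linear optimization estimates.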
FunPlogs – A Serious Puzzle Mini-game for Learning Fundamental Programming Principles Using Visual Scripting

Learning to program can be a tedious task for students. The intrinsic motivation towards games can help to facilitate the first steps in such learning tasks. In this paper we introduce FunPlogs – a serious puzzle mini-game for learning fundamental programming principles. We use visual scripting aspects within this game, which must be used by the students to solve spatial puzzle-like tasks. Within this game we integrate a user-driven content creation approach, so that students can cooperatively create new levels. We show the feasibility of the game concept in a prototype implementation and observed a high joy of use during a user study.

Robin Horst, Ramtin Naraghi-Taghi-Off, Savina Diez, Tobias Uhmann, Arne Müller, Ralf Dörner
Automatic Camera Path Generation from 360 Video

Omnidirectional (360°) video is a novel media format, rapidly becoming adopted in media production and consumption as part of today's ongoing virtual reality revolution. The goal of automatic camera path generation is to automatically calculate a visually interesting camera path from a 360° video in order to provide a traditional, TV-like consumption experience. In this work, we describe our algorithm for automatic camera path generation, based on extracting information about the scene objects with deep learning based methods.

Hannes Fassold
Highlighting Techniques for 360° Video Virtual Reality and Their Immersive Authoring

Highlighting important elements is a fundamental task to guide the user's attention in Virtual Reality (VR) applications. Besides the authoring process of a VR application, the creation of these cues for highlighting in 360° video VR is already a non-trivial task in itself, and it is even more challenging for laymen in the field of VR. This paper has three main contributions: (1) We investigate existing highlighting methods for VR scene settings that are based on 3D models and explore their suitability for a 360° video VR setting. We consider six highlighting methods. (2) We propose immersive authoring methods suitable for laymen to create these highlights within a 360° video VR application. (3) In a knowledge communication use-case that demonstrates a virtual laboratory, we evaluate the highlighting techniques and their authoring methods. The user study indicates that the “outlining” method is highly suitable for a 360° setting and that the corresponding immersive authoring method we propose is appropriate for layman authoring.

Robin Horst, Savina Diez, Ralf Dörner

Applications I

Frontmatter
Jitter-Free Registration for Unmanned Aerial Vehicle Videos

Unmanned Aerial Vehicles (UAVs), such as tethered drones, have become increasingly popular for video acquisition within video surveillance or remote scientific measurement contexts. However, UAV recordings often present an unstable, variable viewpoint that is detrimental to the automatic exploitation of their content. This is often countered by one of two strategies, video registration and video stabilization, which are usually affected by distinct issues, namely jitter and drifting. This paper proposes a hybrid solution between both techniques that produces a jitter-free registration. A lightweight implementation enables real-time, automatic generation of videos with a constant viewpoint from unstable video sequences acquired with stationary UAVs. Performance evaluation is carried out using video recordings from traffic surveillance scenes up to 15 min long, including multiple mobile objects.

Pierre Lemaire, Carlos Fernando Crispim-Junior, Lionel Robinault, Laure Tougne
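A sketch of the classic feature-based registration to a fixed reference frame that such a pipeline builds on (OpenCV); the paper's contribution is the hybrid jitter/drift handling around this step, which is not reproduced here:

```python
# Homography-based registration of a frame to a reference view (OpenCV sketch).
import cv2
import numpy as np

def register(frame, reference):
    """Warp a grayscale frame onto the reference viewpoint."""
    orb = cv2.ORB_create(2000)
    k1, d1 = orb.detectAndCompute(frame, None)
    k2, d2 = orb.detectAndCompute(reference, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(d1, d2)
    src = np.float32([k1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([k2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)  # robust to outliers
    return cv2.warpPerspective(frame, H, reference.shape[1::-1])
```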
Heart Rate Based Face Synthesis for Pulse Estimation

With the technological advancements in non-invasive heart rate (HR) detection, it becomes more feasible to estimate heart rate using commodity digital cameras. However, achieving high accuracy in HR estimation still remains a challenge. One of the bottlenecks is the lack of sufficient facial videos annotated with corresponding HR signals. To address this bottleneck, we propose to create videos enriched with different HR values from existing datasets, in an attempt to increase the data size in a controllable manner. This paper presents a new method to generate facial videos with various heart rate values through a video synthesis procedure. Our method involves the synthesis of heart beat effects from the skin colors of a face. A new face video is generated with various heart rate values while taking identity information into account. The quality of the synthetic videos is evaluated by comparing them to the original ground truth videos at the pixel level as well as by computing their differentiability across the synthetic face videos. Furthermore, the usability of the new data is assessed through the application of remote-video HR estimation approaches.

Umur Aybars Ciftci, Lijun Yin
Light-Weight Novel View Synthesis for Casual Multiview Photography

Traditional view synthesis for image-based rendering requires various processes: camera synchronization with professional equipment, geometric calibration, multiview stereo, and surface reconstruction, resulting in heavy computation, in addition to manual user interactions throughout these processes. Therefore, view synthesis has been available exclusively to professional users. In this paper, we address these expensive costs to enable view synthesis for casual users, even with mobile-phone cameras. We assume that casual users take multiple photographs using their phone cameras, which are used for view synthesis. First, without relying on any expensive synchronization hardware, our method can capture synchronous multiview photographs by utilizing a wireless network protocol. Second, our method provides light-weight image-based rendering on the mobile phone, where heavy computational processes, such as estimating geometry proxies, alpha mattes, and inpainted textures, are handled by a server and shared within an interactive time. Finally, it allows us to render novel view synthesis along a virtual camera path on the mobile devices, enabling bullet-time photography from casual multiview captures.

Inchang Choi, Yeong Beum Lee, Dae R. Jeong, Insik Shin, Min H. Kim
DeepPrivacy: A Generative Adversarial Network for Face Anonymization

We propose a novel architecture which is able to automatically anonymize faces in images while retaining the original data distribution. We ensure total anonymization of all faces in an image by generating images based exclusively on privacy-safe information. Our model is based on a conditional generative adversarial network, generating images considering the original pose and image background. The conditional information enables us to generate highly realistic faces with a seamless transition between the generated face and the existing background. Furthermore, we introduce a diverse dataset of human faces, including unconventional poses, occluded faces, and a vast variability in backgrounds. Finally, we present experimental results reflecting the capability of our model to anonymize images while preserving the data distribution, making the data suitable for further training of deep learning models. As far as we know, no other solution has been proposed that guarantees the anonymization of faces while generating realistic images.

Håkon Hukkelås, Rudolf Mester, Frank Lindseth
Swarm Optimization Algorithm for Road Bypass Extrapolation

Ant Colony Optimization (ACO) algorithms work by leveraging a population of agents that communicate through interaction with deposited “pheromone,” and have been applied in various configurations to the long-standing problem of identifying trafficable terrain from aerial imagery. While these approaches have proven successful in highlighting paved roads in urban, highly-developed sites, they tend to fail in peri-urban and rural locations due to the lower frequency of unnatural features. In this work, we describe a workflow that uses site-specific, near-infrared and first-return LIDAR data to predict the “accessible space” of an image, i.e., the more open regions with shallow elevation gradient that may be readily traversable by both mounted (e.g., all-terrain vehicles) and dismounted forces. Collectively, these regions are supplied as input to an ACO algorithm, modified so that the agents perform a random walk weighted by local elevation change, which allows for a more comprehensive exploration of increasingly featureless imaged terrain. Performance of this workflow is evaluated using two study sites in the continental United States: the Muscatatuck Urban Training Center in rural Indiana, and Camp Shelby in Mississippi. Comparison of results with ground-truth datasets shows a high degree of success in predicting areas trafficable by a wide variety of mobile units.

Michael A. Rowland, Glenn M. Suir, Michael L. Mayo, Austin Davis

ST: Vision for Remote Sensing and Infrastructure Inspection

Frontmatter
Concrete Crack Pixel Classification Using an Encoder Decoder Based Deep Learning Architecture

Civil infrastructure inspection in hazardous areas, such as underwater beams and bridge decks, is a perilous task. In addition, other factors like labor intensity and time influence the inspection of infrastructure. Recent studies [11] indicate that autonomous inspection of civil infrastructure can eradicate most of the problems stemming from manual inspection. In this paper, we address the problem of detecting cracks in concrete surfaces. Most recent crack detection techniques use deep architectures. However, efficiently finding the exact location of cracks remains a difficult problem. Therefore, a deep architecture is proposed in this paper to identify the exact location of cracks. Our architecture labels each pixel as crack or non-crack, which eliminates the need for the post-processing techniques used in the current literature [5, 11]. Moreover, acquiring enough data for learning is another challenge in concrete defect detection. According to previous studies, only 10% of an image contains edge pixels (in our case, defect areas) [31]. We propose a robust data augmentation technique to alleviate the need for collecting more crack image samples. The experimental results show that, with our method, significant accuracy can be obtained with a very small sample of data. Our proposed method also outperforms existing methods of concrete crack classification.

Umme Hafsa Billah, Alireza Tavakkoli, Hung Manh La
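
A minimal PyTorch sketch of an encoder-decoder that emits one crack/non-crack logit per pixel; the layer configuration is an assumption and far smaller than any network a paper like this would use.

```python
import torch
import torch.nn as nn

class CrackSegNet(nn.Module):
    """Illustrative encoder-decoder for per-pixel crack classification:
    a downsampling encoder, an upsampling decoder, one logit per pixel."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 2, stride=2),  # crack logit per pixel
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = CrackSegNet()
logits = model(torch.randn(1, 3, 256, 256))           # (1, 1, 256, 256)
loss = nn.BCEWithLogitsLoss()(logits, torch.zeros_like(logits))
```

Because every pixel gets its own label, the crack map is read directly off the output, with no separate localization or post-processing stage.
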
A Geometry-Based Method for the Spatio-Temporal Detection of Cracks in 4D-Reconstructions

We present a novel geometry-based approach for the detection of small-scale cracks in a temporal series of 3D reconstructions of concrete objects such as pillars and beams of bridges and other infrastructure. The detection algorithm relies on a geometry-derived coloration of the 3D surfaces for computing the optical flow between time steps. Our filtering technique identifies cracks based on motion discontinuities in the local crack neighborhood. This approach avoids using the material color, which is likely to change over time due to weathering and other environmental influences. In addition, we detect and exclude regions with significant local changes in geometry over time, e.g., due to vegetation. We verified our method with reconstructions of a horizontal concrete beam under increasing vertical load at the center. For this case, where the main crack direction is known and a precise registration of the beam geometries over time exists, our approach produces accurate crack detections despite substantial color variations and is also able to mask out regions with simulated growth of vegetation over time.

Carl Matthes, Adrian Kreskowski, Bernd Froehlich
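
The motion-discontinuity filter can be illustrated with standard dense optical flow: flag pixels where the flow between two geometry-colored renders changes abruptly (the two sides of an opening crack move apart). A simplified OpenCV sketch, with the threshold and flow parameters as assumptions.

```python
import cv2
import numpy as np

def crack_candidates(render_t0, render_t1, thresh=1.0):
    """Compute dense optical flow between two geometry-derived colorations
    of the same surface and flag pixels with abrupt spatial changes in the
    flow field (illustrative simplification of the paper's filter)."""
    g0 = cv2.cvtColor(render_t0, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(render_t1, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(g0, g1, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # Spatial gradients of both flow components; large values mark
    # motion discontinuities in the local neighborhood.
    du = np.gradient(flow[..., 0])
    dv = np.gradient(flow[..., 1])
    discontinuity = np.hypot(du[0], du[1]) + np.hypot(dv[0], dv[1])
    return discontinuity > thresh
```
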
An Automatic Digital Terrain Generation Technique for Terrestrial Sensing and Virtual Reality Applications

The identification and modeling of terrain from point cloud data is an important component of Terrestrial Remote Sensing (TRS) applications. The main focus in terrain modeling is capturing the details of complex geological features of landforms. Traditional terrain modeling approaches rely on the user to exert control over terrain features; however, relying on user input to manually develop the digital terrain becomes intractable given the amount of data generated by new remote sensing systems capable of producing massive aerial and ground-based point clouds of scanned environments. This article presents a novel terrain modeling technique capable of automatically generating accurate and physically realistic Digital Terrain Models (DTM) from a variety of point cloud data. The proposed method runs efficiently on large-scale point cloud data with real-time performance over large segments of terrestrial landforms. Moreover, the generated digital models are designed to render effectively within a Virtual Reality (VR) environment in real time. The paper concludes with an in-depth discussion of possible research directions and outstanding technical and scientific challenges to improve the proposed approach.

Lee Easson, Alireza Tavakkoli, Jonathan Greenberg
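
A common automatic starting point for bare-earth extraction, shown here only as a generic NumPy sketch (not the paper's technique), is to rasterize the cloud and keep the lowest return per grid cell.

```python
import numpy as np

def rasterize_dtm(points, cell=1.0):
    """Bin a point cloud of shape (N, 3) into a regular grid and keep the
    lowest z per cell -- a common first approximation of the bare-earth
    terrain surface. Generic sketch, not the paper's algorithm."""
    xy_min = points[:, :2].min(axis=0)
    idx = np.floor((points[:, :2] - xy_min) / cell).astype(int)
    shape = idx.max(axis=0) + 1
    dtm = np.full(shape, np.nan)          # NaN marks empty cells
    for (i, j), z in zip(idx, points[:, 2]):
        if np.isnan(dtm[i, j]) or z < dtm[i, j]:
            dtm[i, j] = z
    return dtm
```
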
Rebar Detection and Localization for Non-destructive Infrastructure Evaluation of Bridges Using Deep Residual Networks

Nondestructive Evaluation (NDE) of civil infrastructure has been an active area of research for the past few decades. Traditional inspection of civil infrastructure, which relies mostly on visual inspection, is time-consuming, labor-intensive, and often yields subjective and erroneous results. To facilitate this process, different sensors for data collection and techniques for data analysis have been used to carry out this task effectively in an automated manner. The purpose of this research is to provide a novel deep-learning-based method for the detection of steel rebars in reinforced concrete bridge elements using data from Ground Penetrating Radar (GPR). In addition, a novel technique is proposed for the localization of rebar in B-scan images. To examine the performance of the rebar detection and localization system, results are presented that demonstrate the feasibility of the proposed system within relevant practical applications.

Habib Ahmed, Hung Manh La, Gokhan Pekcan
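
A hedged sketch of how a deep residual network might be fine-tuned for the binary rebar / no-rebar decision on GPR B-scan patches, using torchvision's stock ResNet-18; the paper's exact network and training setup may differ.

```python
import torch
import torch.nn as nn
from torchvision import models

# Illustrative fine-tuning setup: a stock ResNet-18 adapted to a binary
# rebar / no-rebar decision on B-scan patches (hyperbolic signatures).
model = models.resnet18(weights=None)          # or load pretrained weights
model.fc = nn.Linear(model.fc.in_features, 2)  # two classes

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

patches = torch.randn(8, 3, 224, 224)   # batch of B-scan patches
labels = torch.randint(0, 2, (8,))
loss = criterion(model(patches), labels)
loss.backward()
optimizer.step()
```

Sliding such a classifier across a B-scan and keeping the positive windows would give a coarse localization; the paper proposes its own dedicated localization technique.
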

Computer Graphics II

Frontmatter
Intrinsic Decomposition by Learning from Varying Lighting Conditions

Intrinsic image decomposition describes an image based on its reflectance and shading components. In this paper we tackle the problem of estimating the diffuse reflectance from a sequence of images captured from a fixed viewpoint under various illuminations. To this end, we propose a deep learning approach that avoids heuristics and strong assumptions about the reflectance prior. We compare two network architectures: a classic 'U'-shaped Convolutional Neural Network (CNN) and a Recurrent Neural Network (RNN) composed of Convolutional Gated Recurrent Units (CGRUs). We train our networks on a new dataset specifically designed for the task of intrinsic decomposition from sequences. We test our networks on the MIT and BigTime datasets and outperform state-of-the-art algorithms both qualitatively and quantitatively.

Gregoire Nieto, Mohammad Rouhani, Philippe Robert
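
For reference, a generic Convolutional GRU cell of the kind the CGRU-based architecture stacks: the gates of a GRU computed with convolutions, so the hidden state keeps a spatial layout across the image sequence. This is a standard formulation, not necessarily the exact cell used in the paper.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Generic convolutional GRU cell: update gate z, reset gate r, and
    candidate state are all convolutional, preserving spatial structure."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        p = k // 2
        self.gates = nn.Conv2d(in_ch + hid_ch, 2 * hid_ch, k, padding=p)
        self.cand = nn.Conv2d(in_ch + hid_ch, hid_ch, k, padding=p)

    def forward(self, x, h):
        z, r = torch.chunk(
            torch.sigmoid(self.gates(torch.cat([x, h], dim=1))), 2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde

# Fold a sequence of differently lit frames into one hidden state:
cell = ConvGRUCell(3, 16)
h = torch.zeros(1, 16, 64, 64)
for frame in torch.randn(5, 1, 3, 64, 64):
    h = cell(frame, h)
```
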
Pixel2Field: Single Image Transformation to Physical Field of Sports Videos

Locating players on a 2D field plane is the first step towards developing many types of sports analytics applications. Most existing mechanisms for locating players require them to wear sensors during play, whereas sports games can be recorded by cameras at low cost. Current human detection and tracking techniques can locate players in such video, which is typically distorted for a panoramic view. We propose an end-to-end system named Pixel2Field, which transforms every pixel location into its corresponding location on a scaled 2D field image. This is done by first undistorting the image by estimating the distortion coefficients, followed by a homography recovery. Experiments using detected soccer players from a distorted video show that the proposed transformation method works well. To the best of our knowledge, this is the first end-to-end system that can transform frame pixel locations to field locations without any human intervention, which unlocks many opportunities for developing sports analytics applications.

Liang Peng
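
Both geometric steps named in the abstract, undistortion followed by a homography, are standard OpenCV operations; a minimal sketch with placeholder camera parameters (the real `K`, `dist`, and `H` would come from the system's estimation steps).

```python
import cv2
import numpy as np

def pixel_to_field(pixels, K, dist, H):
    """Map detected player pixels to field coordinates: first undo lens
    distortion, then apply an image-to-field homography. K, dist, and H
    are assumed to have been estimated beforehand."""
    pts = np.asarray(pixels, dtype=np.float32).reshape(-1, 1, 2)
    undistorted = cv2.undistortPoints(pts, K, dist, P=K)  # back to pixels
    return cv2.perspectiveTransform(undistorted, H).reshape(-1, 2)

K = np.array([[1000., 0., 640.], [0., 1000., 360.], [0., 0., 1.]])
dist = np.array([-0.3, 0.1, 0., 0., 0.])   # example distortion coefficients
H = np.eye(3)                               # placeholder homography
field_xy = pixel_to_field([(700, 400)], K, dist, H)
```
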
UnrealGT: Using Unreal Engine to Generate Ground Truth Datasets

Large amounts of data have become an essential requirement in the development of modern computer vision algorithms, e.g., for the training of neural networks. Due to data protection laws, overflight permissions for UAVs, or expensive equipment, data collection is often a costly and time-consuming task, especially if the ground truth is generated by manually annotating the collected data. By means of synthetic data generation, large amounts of image data and metadata can be extracted directly from a virtual scene, which in turn can be customized to meet the specific needs of the algorithm or the use case. Furthermore, the use of virtual objects avoids problems that might arise from data protection issues and does not require the use of expensive sensors. In this work we propose a framework for synthetic test data generation utilizing the Unreal Engine, which provides a simulation environment that allows one to simulate complex situations in a virtual world, such as data acquisition with UAVs or autonomous driving. Our process is agnostic to the computer vision task for which the data is generated and can thus be used to create generic datasets. We evaluate our framework by generating synthetic test data with which a CNN for object detection as well as a V-SLAM algorithm are trained and evaluated. The evaluation shows that our generated synthetic data can be used as an alternative to real data.

Thomas Pollok, Lorenz Junglas, Boitumelo Ruf, Arne Schumann
Fast Omnidirectional Depth Densification

Omnidirectional cameras are commonly equipped with fisheye lenses to capture 360-degree visual information, and severe spherical projective distortion occurs when a 360-degree image is stored as a two-dimensional image array. As a consequence, traditional depth estimation methods are not directly applicable to omnidirectional cameras. Dense depth estimation for omnidirectional imaging has been achieved by applying several offline processes, such as patch-matching, optical flow, and convolutional propagation filtering, resulting in additional heavy computation. No dense depth estimation for real-time applications is available yet. In response, we propose an efficient depth densification method designed for omnidirectional imaging to achieve 360-degree dense depth video with an omnidirectional camera. First, we compute the sparse depth estimates using a conventional simultaneous localization and mapping (SLAM) method, and then use these estimates as input to a depth densification method. We propose a novel densification method using the spherical pull-push method by devising a joint spherical pyramid for color and depth, based on multi-level icosahedron subdivision surfaces. This allows us to propagate the sparse depth continuously over 360-degree angles efficiently in an edge-aware manner. The results demonstrate that our real-time densification method is comparable to state-of-the-art offline methods in terms of per-pixel depth accuracy. Combining our depth densification with a conventional SLAM allows us to capture real-time 360-degree RGB-D video with a single omnidirectional camera.

Hyeonjoong Jang, Daniel S. Jeon, Hyunho Ha, Min H. Kim
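
The pull-push idea is easiest to see on a flat image pyramid; the sketch below densifies sparse depth that way, whereas the paper performs the same two phases on a spherical pyramid built from icosahedron subdivision. A simplified NumPy sketch assuming a square, power-of-two resolution and a float depth map.

```python
import numpy as np

def pull_push(depth, valid):
    """Simplified planar pull-push: 'pull' builds a pyramid of
    validity-weighted depth sums; 'push' fills invalid pixels from the
    next-coarser level. Illustrative only; the paper works on a
    spherical (icosahedral) pyramid, not a flat image."""
    d, w = depth * valid, valid.astype(float)
    levels = [(d, w)]
    while d.shape[0] > 1:                                   # pull phase
        d = d.reshape(d.shape[0] // 2, 2, -1, 2).sum(axis=(1, 3))
        w = w.reshape(w.shape[0] // 2, 2, -1, 2).sum(axis=(1, 3))
        levels.append((d, w))
    for i in range(len(levels) - 2, -1, -1):                # push phase
        d, w = levels[i]
        dc, wc = levels[i + 1]
        coarse = np.divide(dc, wc, out=np.zeros_like(dc), where=wc > 0)
        up = np.repeat(np.repeat(coarse, 2, axis=0), 2, axis=1)
        hole = w == 0
        d[hole], w[hole] = up[hole], 1.0
    return np.divide(levels[0][0], levels[0][1],
                     out=np.zeros_like(depth), where=levels[0][1] > 0)

# Densify a handful of sparse SLAM-style depth estimates:
depth = np.zeros((256, 256))
valid = np.zeros((256, 256), dtype=bool)
ij = np.random.randint(0, 256, (500, 2))
depth[ij[:, 0], ij[:, 1]] = np.random.uniform(1.0, 5.0, 500)
valid[ij[:, 0], ij[:, 1]] = True
dense = pull_push(depth, valid)
```
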
Backmatter
Metadata
Title
Advances in Visual Computing
Edited by
George Bebis
Richard Boyle
Bahram Parvin
Darko Koracin
Daniela Ushizima
Sek Chai
Shinjiro Sueda
Xin Lin
Aidong Lu
Daniel Thalmann
Chaoli Wang
Panpan Xu
Copyright Year
2019
Electronic ISBN
978-3-030-33720-9
Print ISBN
978-3-030-33719-3
DOI
https://doi.org/10.1007/978-3-030-33720-9