
2020 | Book

Advances in Visual Computing

15th International Symposium, ISVC 2020, San Diego, CA, USA, October 5–7, 2020, Proceedings, Part II

Editors: George Bebis, Zhaozheng Yin, Edward Kim, Jan Bender, Kartic Subr, Bum Chul Kwon, Jian Zhao, Denis Kalkofen, George Baciu

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

This two-volume set of LNCS 12509 and 12510 constitutes the refereed proceedings of the 15th International Symposium on Visual Computing, ISVC 2020, which was originally planned to be held in San Diego, CA, USA in October 2020 but took place virtually instead due to the COVID-19 pandemic.

The 118 papers presented in these volumes were carefully reviewed and selected from 175 submissions. The papers are organized into the following topical sections:

Part I: deep learning; segmentation; visualization; video analysis and event recognition; ST: computational bioimaging; applications; biometrics; motion and tracking; computer graphics; virtual reality; and ST: computer vision advances in geo-spatial applications and remote sensing

Part II: object recognition/detection/categorization; 3D reconstruction; medical image analysis; vision for robotics; statistical pattern recognition; posters

Table of Contents

Frontmatter

Object Recognition/Detection/Categorization

Frontmatter
Few-Shot Image Recognition with Manifolds

In this paper, we extend the traditional few-shot learning (FSL) problem to the situation where the source-domain data is not accessible and only high-level information in the form of class prototypes is available. This limited-information setup for the FSL problem deserves attention because it keeps the source-domain data inaccessible and thus privacy-preserving, yet it has rarely been addressed before. Because of limited training data, we propose a non-parametric approach to this FSL problem by assuming that all the class prototypes are structurally arranged on a manifold. Accordingly, we estimate the novel-class prototype locations by projecting the few-shot samples onto the average of the subspaces on which the surrounding classes lie. During classification, we again exploit the structural arrangement of the categories by inducing a Markov chain on the graph constructed with the class prototypes. This manifold distance obtained using the Markov chain is expected to produce better results than a traditional nearest-neighbor-based Euclidean distance. To evaluate our proposed framework, we have tested it on two image datasets – the large-scale ImageNet and the small-scale but fine-grained CUB-200. We have also studied parameter sensitivity to better understand our framework.
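A minimal sketch of the graph-based distance idea, not the authors' exact formulation: it attaches the query to a k-nearest-neighbour graph over the class prototypes, row-normalizes the affinities into a Markov transition matrix, and uses a t-step diffusion distance in place of the Euclidean nearest-neighbour rule. The kernel bandwidth, k, and t are illustrative assumptions.

```python
import numpy as np

def diffusion_distance(prototypes, query, k=5, t=3, sigma=1.0):
    """Toy stand-in for a Markov-chain-based manifold distance.

    prototypes: (C, d) array of class prototypes; query: (d,) vector.
    Returns an array of C graph-based 'distances' from the query to each prototype.
    """
    X = np.vstack([prototypes, query[None, :]])           # attach the query as an extra node
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared Euclidean distances
    W = np.exp(-d2 / (2 * sigma ** 2))                    # RBF affinities
    np.fill_diagonal(W, 0.0)

    # keep only the k strongest edges per node (symmetrized kNN graph)
    keep = np.argsort(-W, axis=1)[:, :k]
    mask = np.zeros_like(W, dtype=bool)
    mask[np.arange(W.shape[0])[:, None], keep] = True
    W = W * (mask | mask.T)

    P = W / W.sum(axis=1, keepdims=True)                  # row-stochastic transition matrix
    Pt = np.linalg.matrix_power(P, t)                     # t-step transition probabilities
    return np.linalg.norm(Pt[:-1] - Pt[-1], axis=1)       # diffusion distance: query vs. prototypes

# toy usage: 10 prototypes in 16-D; the query is a perturbed copy of prototype 3
rng = np.random.default_rng(0)
protos = rng.normal(size=(10, 16))
query = protos[3] + 0.1 * rng.normal(size=16)
print(np.argmin(diffusion_distance(protos, query)))       # expected: 3
```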

Debasmit Das, J. H. Moon, C. S. George Lee
A Scale-Aware YOLO Model for Pedestrian Detection

Pedestrian detection is considered one of the most challenging problems in computer vision, as it involves the combination of classification and localization within a scene. Recently, convolutional neural networks (CNNs) have been demonstrated to achieve superior detection results compared to traditional approaches. Although YOLOv3 (an improved You Only Look Once model) is one of the state-of-the-art methods in CNN-based object detection, it remains very challenging to leverage this method for real-time pedestrian detection. In this paper, we propose a new framework called SA YOLOv3, a scale-aware You Only Look Once framework which improves YOLOv3 by detecting small-scale pedestrian instances in a real-time manner. Our network introduces two sub-networks which detect pedestrians of different scales. Outputs from the sub-networks are then combined to generate robust detection results. Experimental results show that the proposed SA YOLOv3 framework outperforms YOLOv3 on public datasets and runs at an average of 11 fps on a GPU.

Xingyi Yang, Yong Wang, Robert Laganière
Image Categorization Using Agglomerative Clustering Based Smoothed Dirichlet Mixtures

With the rapid growth of multimedia data and the diversity of the available image contents, it becomes necessary to develop advanced machine learning algorithms for the purpose of categorizing and recognizing images. Hierarchical clustering methods have shown promising results in computer vision applications. In this paper, we present a new unsupervised image categorization technique in which we cluster images using an agglomerative hierarchical procedure with a dissimilarity metric derived from the smoothed Dirichlet (SD) distribution. We propose a mixture of SD distributions and a maximum-likelihood learning framework, from which we derive a Kullback-Leibler divergence between two SD mixture models. Experiments on a challenging image dataset containing different indoor and outdoor places reveal the importance of hierarchical clustering when categorizing images. The conducted tests demonstrate the robustness of the proposed image categorization approach compared to other related works.
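As a loose illustration of agglomerative clustering driven by a divergence-based dissimilarity (here a symmetrized KL between plain histograms, not the paper's smoothed-Dirichlet-mixture divergence), one could write:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def sym_kl(p, q, eps=1e-10):
    """Symmetrized KL divergence between two discrete distributions."""
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

# stand-in image descriptors: 32-bin normalized histograms, one per "image"
rng = np.random.default_rng(0)
hists = rng.dirichlet(np.ones(32), size=20)

# pairwise dissimilarity matrix -> condensed form -> average-linkage dendrogram
n = len(hists)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = sym_kl(hists[i], hists[j])

Z = linkage(squareform(D), method="average")      # agglomerative hierarchical clustering
labels = fcluster(Z, t=4, criterion="maxclust")   # cut the tree into 4 categories
print(labels)
```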

Fatma Najar, Nizar Bouguila
SAT-CNN: A Small Neural Network for Object Recognition from Satellite Imagery

Satellite imagery presents a number of challenges for object detection such as significant variation in object size (from small cars to airports) and low object resolution. In this work we focus on recognizing objects taken from the xView Satellite Imagery dataset. The xView dataset introduces its own set of challenges, the most prominent being the imbalance between the 60 classes present. xView also contains considerable label noise as well as both semantic and visual overlap between classes. In this work we focus on techniques to improve performance on an imbalanced, noisy dataset through data augmentation and balancing. We show that a very small convolutional neural network (SAT-CNN) with approximately three million parameters can outperform a deep pre-trained classifier, VGG16 - which is used for many state-of-the-art tasks - with over 138 million parameters.

Dustin K. Barnes, Sara R. Davis, Emily M. Hand
Domain Adaptive Transfer Learning on Visual Attention Aware Data Augmentation for Fine-Grained Visual Categorization

Fine-Grained Visual Categorization (FGVC) is a challenging topic in computer vision. It is a problem characterized by large intra-class differences and subtle inter-class differences. In this paper, we tackle this problem in a weakly supervised manner, where neural network models are fed additional data produced by a data augmentation technique driven by a visual attention mechanism. We perform domain adaptive knowledge transfer via fine-tuning on our base network model. We perform our experiments on six challenging and commonly used FGVC datasets, and we show competitive improvement in accuracy by using attention-aware data augmentation techniques with features derived from the deep learning model InceptionV3, pre-trained on large-scale datasets. Our method outperforms competitor methods on multiple FGVC datasets and shows competitive results on the others. Experimental studies show that transfer learning from large-scale datasets can be utilized effectively with visual-attention-based data augmentation, which can obtain state-of-the-art results on several FGVC datasets. We present a comprehensive analysis of our experiments. Our method achieves state-of-the-art results on multiple fine-grained classification datasets including the challenging CUB200-2011 bird, Flowers-102, and FGVC-Aircraft datasets.

Ashiq Imran, Vassilis Athitsos

3D Reconstruction

Frontmatter
A Light-Weight Monocular Depth Estimation with Edge-Guided Occlusion Fading Reduction

Self-supervised monocular depth estimation methods suffer from occlusion fading, which is a result of a lack of supervision by ground-truth pixels. A recent work introduced a post-processing method to reduce occlusion fading; however, the results have a severe halo effect. This work proposes a novel edge-guided post-processing method that reduces occlusion fading for self-supervised monocular depth estimation. We also introduce Atrous Spatial Pyramid Pooling with Forward-Path (ASPPF) into the network to reduce computational costs and improve inference performance. The proposed ASPPF-based network is lighter, faster, and better than current depth estimation networks. Our light-weight network only needs 7.6 million parameters and can achieve up to 67 frames per second for 256 × 512 inputs using a single NVIDIA GTX1080 GPU. The proposed network also outperforms the current state-of-the-art methods on the KITTI benchmark. The ASPPF-based network and edge-guided post-processing produce better results, both quantitatively and qualitatively, than the competitors.

Kuo-Shiuan Peng, Gregory Ditzler, Jerzy Rozenblit
Iterative Closest Point with Minimal Free Space Constraints

The Iterative Closest Point (ICP) method is widely used for fitting geometric models to sensor data. By formulating the problem as a minimization of distances evaluated at observed surface points, the method is computationally efficient and applicable to a rich variety of model representations. However, when the scene surface is only partially visible, the model can be ill-constrained by surface observations alone. Existing methods that penalize free space violations may resolve this issue, but require that the explicit model surface is available or can be computed quickly, to remain efficient. We introduce an extension of ICP that integrates free space constraints, while the number of distance computations remains linear in the scene’s surface area. We support arbitrary shape spaces, requiring only that the distance to the model surface can be computed at a given point. We describe an implementation for range images and validate our method on implicit model fitting problems that benefit from the use of free space constraints.
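For context, the classical point-to-point ICP step that the paper extends with free space constraints looks roughly like this (a generic sketch, not the authors' implementation):

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_point_to_point(src, dst, iters=20):
    """Align src (N, 3) to dst (M, 3); returns rotation R and translation t."""
    R, t = np.eye(3), np.zeros(3)
    tree = cKDTree(dst)
    cur = src.copy()
    for _ in range(iters):
        _, idx = tree.query(cur)                   # 1) nearest-neighbour correspondences
        matched = dst[idx]
        mu_s, mu_d = cur.mean(0), matched.mean(0)  # 2) best rigid transform (Kabsch/SVD)
        H = (cur - mu_s).T @ (matched - mu_d)
        U, _, Vt = np.linalg.svd(H)
        Ri = Vt.T @ U.T
        if np.linalg.det(Ri) < 0:                  # guard against reflections
            Vt[-1] *= -1
            Ri = Vt.T @ U.T
        ti = mu_d - Ri @ mu_s
        cur = cur @ Ri.T + ti                      # 3) apply and accumulate
        R, t = Ri @ R, Ri @ t + ti
    return R, t

# toy usage: recover a known rotation about z and a known translation
rng = np.random.default_rng(0)
dst = rng.normal(size=(200, 3))
theta, t_true = 0.2, np.array([0.1, -0.2, 0.3])
R_true = np.array([[np.cos(theta), -np.sin(theta), 0],
                   [np.sin(theta),  np.cos(theta), 0],
                   [0, 0, 1]])
src = (dst - t_true) @ R_true                      # so that R_true @ src_i + t_true = dst_i
R, t = icp_point_to_point(src, dst)
print(np.abs(src @ R.T + t - dst).max())           # should be close to zero
```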

Simen Haugo, Annette Stahl
Minimal Free Space Constraints for Implicit Distance Bounds

A general approach for fitting implicit models to sensor data is to optimize an objective function measuring the quality of the fit. The objective function often involves evaluating the model’s implicit function at several points in space. When the model is expensive to evaluate, the number of points can become a bottleneck, making the use of volumetric information, such as free space constraints, challenging. When the model is the Euclidean distance function to its surface, previous work has been able to integrate free space constraints in the optimization problem, such that the number of distance computations is linear in the scene’s surface area. Here, we extend this work to only require the model’s implicit function to be a bound of the Euclidean distance. We derive necessary and sufficient conditions for the model to be consistent with free space. We validate the correctness of the derived constraints on implicit model fitting problems that benefit from the use of free space constraints.

Simen Haugo, Annette Stahl

Medical Image Analysis

Frontmatter
Fetal Brain Segmentation Using Convolutional Neural Networks with Fusion Strategies

Most Convolutional Neural Network (CNN) architectures are based on a single prediction map when optimising the loss function. This may lead to the following consequences: firstly, the model may not be optimised, and secondly, the model may be prone to noise and hence more sensitive to false positives/negatives, both resulting in poorer results. In this paper, we propose four fusion strategies to promote ensemble learning within a network architecture by combining its main prediction map with its side outputs. The architectures combine multi-source, multi-scale and multi-level local and global information together with spatial information. To evaluate the performance of the proposed fusion strategies, we integrated each of them into three baseline architectures, namely the classical U-Net, attention U-Net and recurrent residual U-Net. Subsequently, we evaluate each model by conducting two experiments; firstly, we train all models on 200 normal fetal brain cases and test them on 74 abnormal cases, and secondly we train and test all models on 200 normal cases using a 4-fold cross validation strategy. Experimental results show that all fusion strategies consistently improve the performance of the baseline models and outperform existing methods in the literature.

Andrik Rampun, Deborah Jarvis, Paul Griffiths, Paul Armitage
Fundus2Angio: A Conditional GAN Architecture for Generating Fluorescein Angiography Images from Retinal Fundus Photography

Carrying out clinical diagnosis of retinal vascular degeneration using Fluorescein Angiography (FA) is a time-consuming process and can pose significant adverse effects on the patient. Angiography requires insertion of a dye that may cause severe adverse effects and can even be fatal. Currently, there are no non-invasive systems capable of generating Fluorescein Angiography images. However, retinal fundus photography is a non-invasive imaging technique that can be completed in a few seconds. In order to eliminate the need for FA, we propose a conditional generative adversarial network (GAN) to translate fundus images to FA images. The proposed GAN consists of a novel residual block capable of generating high-quality FA images. These images are important tools in the differential diagnosis of retinal diseases without the need for an invasive procedure with possible side effects. Our experiments show that the proposed architecture achieves a low FID score of 30.3 and outperforms other state-of-the-art generative networks. Furthermore, our proposed model achieves better qualitative results that are indistinguishable from real angiograms.

Sharif Amit Kamran, Khondker Fariha Hossain, Alireza Tavakkoli, Stewart Zuckerbrod, Salah A. Baker, Kenton M. Sanders
Multiscale Detection of Cancerous Tissue in High Resolution Slide Scans

We present an algorithm for multi-scale tumor (chimeric cell) detection in high resolution slide scans. The broad range of tumor sizes in our dataset poses a challenge for current Convolutional Neural Networks (CNN), which often fail when image features are very small (8 pixels). Our approach modifies the effective receptive field at different layers in a CNN so that objects with a broad range of varying scales can be detected in a single forward pass. We define rules for computing adaptive prior anchor boxes which we show are solvable under the equal proportion interval principle. Two mechanisms in our CNN architecture alleviate the effects of non-discriminative features prevalent in our data: a foveal detection algorithm that incorporates a cascade residual-inception module and a deconvolution module with additional context information. When integrated into a Single Shot MultiBox Detector (SSD), these additions permit more accurate detection of small-scale objects. The results permit efficient real-time analysis of medical images in pathology and related biomedical research fields.

Qingchao Zhang, Coy D. Heldermon, Corey Toler-Franklin
DeepTKAClassifier: Brand Classification of Total Knee Arthroplasty Implants Using Explainable Deep Convolutional Neural Networks

Total knee arthroplasty (TKA) is one of the most successful surgical procedures worldwide. It improves quality of life, mobility, and functionality for the vast majority of patients. However, a TKA surgery may fail over time for several reasons, thus requiring a revision arthroplasty surgery. Identifying TKA implants is a critical consideration in preoperative planning of revision surgery. This study aims to develop, train, and validate deep convolutional neural network models to precisely classify four widely-used TKA implants based on only plain knee radiographs. Using 9,052 computationally annotated knee radiographs, we achieved a weighted average precision, recall, and F1-score of 0.97, 0.97, and 0.97, respectively, with a Cohen's kappa of 0.96.
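The reported weighted metrics can be reproduced with scikit-learn as below; the labels shown are toy placeholders, not data from the study.

```python
from sklearn.metrics import precision_recall_fscore_support, cohen_kappa_score

# hypothetical ground-truth implant brands and model predictions for ten radiographs
y_true = ["A", "A", "B", "B", "C", "C", "D", "D", "A", "C"]
y_pred = ["A", "A", "B", "C", "C", "C", "D", "D", "A", "C"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0
)
kappa = cohen_kappa_score(y_true, y_pred)
print(f"weighted P={precision:.2f} R={recall:.2f} F1={f1:.2f} kappa={kappa:.2f}")
```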

Shi Yan, Taghi Ramazanian, Elham Sagheb, Sunyang Fu, Sunghwan Sohn, David G. Lewallen, Hongfang Liu, Walter K. Kremers, Vipin Chaudhary, Michael Taunton, Hilal Maradit Kremers, Ahmad P. Tafti
Multi-modal Image Fusion Based on Weight Local Features and Novel Sum-Modified-Laplacian in Non-subsampled Shearlet Transform Domain

Multi-modal medical image fusion plays a significant role in clinical applications like noninvasive diagnosis and image-guided surgery. However, designing an efficient image fusion technique is still a challenging task. In this paper, we propose an improved multi-modal medical image fusion method to enhance the visual quality and contrast of the fused image. To this end, the registered source images are first decomposed into low-frequency (LF) and several high-frequency (HF) sub-images via the non-subsampled shearlet transform (NSST). Afterward, LF sub-images are combined using the proposed weight local features fusion rule based on local energy and standard deviation, while HF sub-images are fused based on the novel sum-modified-Laplacian (NSML) technique. Finally, the inverse NSST is applied to reconstruct the fused image. Furthermore, the proposed method is extended to color multi-modal image fusion, which effectively restrains color distortion and enhances spatial and spectral resolutions. To evaluate the performance, various experiments were conducted on different datasets of gray-scale and color images. Experimental results show that the proposed scheme achieves better performance than other state-of-the-art algorithms in both visual effects and objective criteria.
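A minimal sketch of a low-frequency fusion rule in this spirit, assuming (as an illustration, not the paper's exact rule) that each source's weight at a pixel grows with its local energy and local standard deviation:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def fuse_lowfreq(lf_a, lf_b, win=7, eps=1e-8):
    """Pixel-wise weighted fusion of two registered low-frequency sub-bands."""
    def activity(x):
        mean = uniform_filter(x, win)
        energy = uniform_filter(x * x, win)
        std = np.sqrt(np.maximum(energy - mean * mean, 0.0))
        return energy + std                    # simple local activity measure

    wa, wb = activity(lf_a), activity(lf_b)
    s = wa + wb + eps
    return (wa / s) * lf_a + (wb / s) * lf_b

# toy usage on two random "sub-bands"; real inputs would come from the NSST decomposition
rng = np.random.default_rng(0)
a, b = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
print(fuse_lowfreq(a, b).shape)
```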

Hajer Ouerghi, Olfa Mourali, Ezzeddine Zagrouba
Robust Prostate Cancer Classification with Siamese Neural Networks

Nuclear magnetic resonance (NMR) is a powerful and non-invasive diagnostic tool. However, NMR scanned images are often noisy due to patient motion or breathing. Although modern Computer Aided Diagnosis (CAD) systems, mainly based on Deep Learning (DL), together with expert radiologists, can obtain very accurate predictions, working with noisy data can induce a wrong diagnosis or require a new acquisition, wasting time and exposing the patient to an extra dose of radiation. In this paper, we propose a new DL model, based on a Siamese neural network, able to withstand random noise perturbations. We use data coming from the ProstateX challenge and demonstrate the superior robustness of our model to random noise compared to a similar architecture, albeit deprived of the Siamese branch. In addition, our approach is also resistant to adversarial attacks and shows overall better AUC performance.

Alberto Rossi, Monica Bianchini, Franco Scarselli

Vision for Robotics

Frontmatter
Simple Camera-to-2D-LiDAR Calibration Method for General Use

As systems that utilize computer vision move into the public domain, methods of calibration need to become easier to use. Though multi-plane LiDAR systems have proven to be useful for vehicles and large robotic platforms, many smaller platforms and low-cost solutions still require 2D LiDAR combined with RGB cameras. Current methods of calibrating these sensors make assumptions about camera and laser placement and/or require complex calibration routines. In this paper we propose a new method of feature correspondence in the two sensors and an optimization method capable of using a calibration target with unknown lengths in its geometry. Our system is designed with an inexperienced layperson as the intended user, which has led us to remove as many assumptions about both the target and laser as possible. We show that our system is capable of calibrating the 2-sensor system from a single sample in configurations other methods are unable to handle.

Andrew H. Palmer, Chris Peterson, Janelle Blankenburg, David Feil-Seifer, Monica Nicolescu
SalsaNext: Fast, Uncertainty-Aware Semantic Segmentation of LiDAR Point Clouds

In this paper, we introduce SalsaNext for the uncertainty-aware semantic segmentation of a full 3D LiDAR point cloud in real-time. SalsaNext is the next version of SalsaNet [1] which has an encoder-decoder architecture where the encoder unit has a set of ResNet blocks and the decoder part combines upsampled features from the residual blocks. In contrast to SalsaNet, we introduce a new context module, replace the ResNet encoder blocks with a new residual dilated convolution stack with gradually increasing receptive fields and add the pixel-shuffle layer in the decoder. Additionally, we switch from stride convolution to average pooling and also apply central dropout treatment. To directly optimize the Jaccard index, we further combine the weighted cross entropy loss with Lovász-Softmax loss [4]. We finally inject a Bayesian treatment to compute the epistemic and aleatoric uncertainties for each point in the cloud. We provide a thorough quantitative evaluation on the Semantic-KITTI dataset [3], which demonstrates that the proposed SalsaNext outperforms other published semantic segmentation networks and achieves 3.6% more accuracy over the previous state-of-the-art method. We also release our source code ( https://github.com/TiagoCortinhal/SalsaNext ).
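The loss combination can be sketched in PyTorch as below; the Lovász-Softmax term is replaced here by a simpler differentiable soft-Jaccard stand-in (the actual implementation is in the released source code):

```python
import torch
import torch.nn.functional as F

def soft_jaccard_loss(logits, target, eps=1e-6):
    """Differentiable (1 - soft IoU); a simple stand-in for Lovász-Softmax."""
    num_classes = logits.shape[1]
    probs = F.softmax(logits, dim=1)                              # (B, C, N)
    onehot = F.one_hot(target, num_classes).permute(0, 2, 1).float()
    inter = (probs * onehot).sum(dim=(0, 2))
    union = (probs + onehot - probs * onehot).sum(dim=(0, 2))
    return 1.0 - ((inter + eps) / (union + eps)).mean()

def combined_loss(logits, target, class_weights):
    """Weighted cross-entropy (for class imbalance) plus an IoU-oriented term."""
    wce = F.cross_entropy(logits, target, weight=class_weights)
    return wce + soft_jaccard_loss(logits, target)

# toy usage: a batch of 2 scans, 20 classes, 1000 points each
logits = torch.randn(2, 20, 1000, requires_grad=True)
target = torch.randint(0, 20, (2, 1000))
loss = combined_loss(logits, target, torch.ones(20))
loss.backward()
print(float(loss))
```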

Tiago Cortinhal, George Tzelepis, Eren Erdal Aksoy
Mobile Manipulator Robot Visual Servoing and Guidance for Dynamic Target Grasping

This paper deals with the problem of real-time closed-loop tracking and grasping of a dynamic target by a mobile manipulation robot. Dynamic object tracking and manipulation is crucial for robotic systems that intend to physically interact with the real world. The robot considered corresponds to an eye-in-hand gripper-and-arm combination mounted on a four-wheel base. The proposed policy is inspired by the principles of visual servoing, and leverages a computationally simple paradigm of virtual force-based formulation, due to the intended deployment for real-time closed-loop control. The main objective of our strategy is to align a dynamic target frame to the onboard gripper, while respecting the constraints of the mobile manipulator system. The algorithm was implemented on a real robot and evaluated across multiple diverse real-time experimental studies, detailed within this paper.

Prateek Arora, Christos Papachristos

Statistical Pattern Recognition

Frontmatter
Interpreting Galaxy Deblender GAN from the Discriminator’s Perspective

In large galaxy surveys it can be difficult to separate overlapping galaxies, a process called deblending. Generative adversarial networks (GANs) have shown great potential in addressing this fundamental problem. However, it remains a significant challenge to comprehend how the network works, which is particularly difficult for non-expert users. This research focuses on understanding the behaviors of one of the network's major components, the Discriminator, which plays a vital role but is often overlooked. Specifically, we propose an enhanced Layer-wise Relevance Propagation (LRP) algorithm called Polarized-LRP. It generates a heatmap-based visualization highlighting the area in the input image that contributes to the network decision. It consists of two parts, i.e., a positive contribution heatmap for the images classified as ground truth and a negative contribution heatmap for the ones classified as generated. As a use case, we have chosen the deblending of two overlapping galaxy images via a branched GAN model. Using the Galaxy Zoo dataset we demonstrate that our method clearly reveals the attention areas of the Discriminator to differentiate generated galaxy images from ground truth images, and outperforms the original LRP method. To connect the Discriminator's impact to the Generator, we also visualize the attention shift of the Generator across the training process. An interesting result we have achieved is the detection of a problematic data augmentation procedure that would otherwise have remained hidden. We find that our proposed method serves as a useful visual analytical tool for more effective training and a deeper understanding of GAN models.

Heyi Li, Yuewei Lin, Klaus Mueller, Wei Xu
Variational Bayesian Sequence-to-Sequence Networks for Memory-Efficient Sign Language Translation

Memory-efficient continuous Sign Language Translation is a significant challenge for the development of assisted technologies with real-time applicability for the deaf. In this work, we introduce a paradigm of designing recurrent deep networks whereby the output of the recurrent layer is derived from appropriate arguments from nonparametric statistics. A novel variational Bayesian sequence-to-sequence network architecture is proposed that consists of a) a full Gaussian posterior distribution for data-driven memory compression and b) a nonparametric Indian Buffet Process prior for regularization applied on the Gated Recurrent Unit non-gate weights. We dub our approach Stick-Breaking Recurrent network and show that it can achieve a substantial weight compression without diminishing modeling performance.

Harris Partaourides, Andreas Voskou, Dimitrios Kosmopoulos, Sotirios Chatzis, Dimitris N. Metaxas
A Gaussian Process Upsampling Model for Improvements in Optical Character Recognition

The automatic evaluation and extraction of financial documents is a key process in business efficiency. Most of the extraction relies on Optical Character Recognition (OCR), whose outcome is dependent on the quality of the document image. The image data fed to the automated systems can be of unreliable quality, inherently low-resolution, or downsampled and compressed by a transmitting program. In this paper, we illustrate a novel Gaussian Process (GP) upsampling model for the purpose of improving the OCR process and extraction through upsampling low-resolution documents.
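A minimal sketch of the idea using scikit-learn's Gaussian Process regressor (an illustrative toy, not the paper's model): fit the GP on the coarse pixel grid of a low-resolution patch and query it on a denser grid.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def gp_upsample(patch, factor=2):
    """Upsample a small 2D grayscale patch with Gaussian Process regression."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    X = np.column_stack([ys.ravel(), xs.ravel()]).astype(float)
    y = patch.ravel().astype(float)

    kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=1e-3)
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

    # query a grid that is `factor` times denser over the same extent
    ys2, xs2 = np.mgrid[0:h - 1:complex(0, h * factor), 0:w - 1:complex(0, w * factor)]
    Xq = np.column_stack([ys2.ravel(), xs2.ravel()])
    return gp.predict(Xq).reshape(h * factor, w * factor)

# toy usage on a synthetic gradient patch; a real patch would come from a document image
patch = np.outer(np.linspace(0, 1, 8), np.linspace(1, 0, 8))
print(gp_upsample(patch, factor=2).shape)   # (16, 16)
```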

Steven I. Reeves, Dongwook Lee, Anurag Singh, Kunal Verma

Posters

Frontmatter
Video Based Fire Detection Using Xception and Conv-LSTM

Immediate detection of wildfires can aid firefighters in saving lives. The research community has invested a lot of their efforts in detecting fires using vision-based systems, due to their ability to monitor vast open spaces. Most of the state-of-the-art vision-based fire detection systems operate on individual images, limiting them to only spatial features. This paper presents a novel system that explores the spatio-temporal information available within a video sequence to perform classification of a scene into fire or non-fire category. The system, in its initial step, selects 15 key frames from an input video sequence. The frame selection step allows the system to capture the entire movement available in a video sequence regardless of the duration. The spatio-temporal information among those frames can then be captured using a deep convolutional neural network (CNN) called Xception, which is pre-trained on the ImageNet, and a convolutional long short term memory network (ConvLSTM). The system is evaluated on a challenging new dataset, presented in this paper, containing 70 fire and 70 non-fire sequences. The dataset contains aerial shots of fire and fire-like sequences, such as fog, sunrise and bright flashing objects, captured using a dynamic/moving camera for an average duration of 13 s. The classification accuracy of 95.83% highlights the effectiveness of the proposed system in tackling such challenging scenarios.

Tanmay T. Verlekar, Alexandre Bernardino
Highway Traffic Classification for the Perception Level of Situation Awareness

The automotive industry is rapidly moving towards the highest level of autonomy. However, one of the major challenges for highly autonomous vehicles is the differentiation between driving modes according to different driving situations. Different driving zones have different driving safety regulations. For example, German traffic regulations require a higher degree of safety measures for highway driving. Therefore, a classification of the different driving scenarios on a highway is necessary to regulate these safety assessments. This paper presents a novel vision-based approach to the classification of German highway driving scenarios. We develop three different and precise algorithms utilizing image processing and machine learning approaches to recognize speed signs, traffic lights, and highway traffic signs. Based on the results of these algorithms, a weight-based classification process is performed, which determines whether the current driving situation is a highway driving mode or not. The main goal of this research work is to maintain and ensure the high safety specifications required for the German highway. Finally, the result of this classification process is provided as an extracted driving-scenario-based feature on the perceptual level of a system known as situation awareness to provide a high level of driving safety. This study was realized on a custom-made hardware unit called “CE-Box”, which was developed at the Department of Computer Engineering at TU Chemnitz as an automotive test solution for testing automotive software applications on an embedded hardware unit.

Julkar Nine, Shanmugapriyan Manoharan, Manoj Sapkota, Shadi Saleh, Wolfram Hardt
3D-CNN for Facial Emotion Recognition in Videos

In this paper, we present a video-based emotion recognition neural network operating on three dimensions. We show that 3D convolutional neural networks (3D-CNN) can be very good at predicting facial emotions that are expressed over a sequence of frames. We optimize the 3D-CNN architecture through hyper-parameter search, and show that this has a very strong influence on the results, even though architecture tuning of 3D-CNNs has not been much addressed in the literature. Our proposed resulting architecture improves over the results of state-of-the-art techniques when tested on the CK+ and Oulu-CASIA datasets. We compare the results with cross-validation methods. The designed 3D-CNN yields 97.56% using Leave-One-Subject-Out cross-validation and 100% using 10-fold cross-validation on the CK+ dataset, and 84.17% using 10-fold cross-validation on the Oulu-CASIA dataset.

Jad Haddad, Olivier Lezoray, Philippe Hamel
Reducing Triangle Inequality Violations with Deep Learning and Its Application to Image Retrieval

Given a distance matrix with triangle inequality violations, the metric nearness problem requires finding the closest matrix that satisfies the triangle inequality. It has been experimentally shown that deep neural networks can be used to efficiently produce close matrices with fewer triangle inequality violations. This paper further extends the deep learning approach to the metric nearness problem by applying it to content-based image retrieval. Since the vantage space representation of an image database requires distances to satisfy triangle inequalities, applying deep learning to the matrices in the vantage space with triangle inequality violations produces distance matrices with fewer violations. Experiments performed on the Corel-1k dataset demonstrate that fully convolutional autoencoders considerably reduce triangle inequality violations in distance matrices. Overall, the image retrieval accuracy based on the distance matrices generated by the deep learning model is better than that based on the original matrices 91.16% of the time.
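Counting the violations that the model is meant to reduce is straightforward; a small helper (illustrative only) is shown below.

```python
import numpy as np

def count_tiv(D, tol=1e-9):
    """Count ordered triples (i, j, k) with D[i, j] > D[i, k] + D[k, j]."""
    violations = 0
    for k in range(D.shape[0]):
        via_k = D[:, k][:, None] + D[k, :][None, :]   # via_k[i, j] = D[i, k] + D[k, j]
        violations += int(np.sum(D > via_k + tol))
    return violations

# toy usage: a symmetric matrix with one deliberate violation, d(0,2) > d(0,1) + d(1,2)
D = np.array([[0.0, 1.0, 5.0],
              [1.0, 0.0, 1.0],
              [5.0, 1.0, 0.0]])
print(count_tiv(D))   # 2, because the ordered pairs (0, 2) and (2, 0) both violate
```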

Izat Khamiyev, Magzhan Gabidolla, Alisher Iskakov, M. Fatih Demirci
A Driver Guidance System to Support the Stationary Wireless Charging of Electric Vehicles

Air pollution is a problem in many cities. Although it is possible to mitigate this problem by replacing combustion with electric engines, at the time of writing, electric vehicles are still a rarity in European cities. Reasons for not buying an electric vehicle are not only the high purchase costs but also the uncomfortable initiation of the charging process. A more convenient alternative is wireless charging, which is enabled by integrating an induction plate into the floor and installing a charging interface at the vehicle. To maximize efficiency, the vehicle's charging interface must be positioned accurately above the induction plate which is integrated into the floor. Since the driver cannot perceive the region below the vehicle, it is difficult to precisely align the position of the charging interface by maneuvering the vehicle. In this paper, we first discuss the requirements for driver guidance systems that help drivers to accurately position their vehicle and thus enable them to maximize the charging efficiency. Thereafter, we present a prototypical implementation of such a system. To minimize the deployment cost for charging station operators, our prototype uses an inexpensive off-the-shelf camera system to localize the vehicles that are approaching the station. To simplify the retrofitting of existing vehicles, the prototype uses a smartphone app to generate navigation visualizations. To validate the approach, we present experiments indicating that, despite its low cost, the prototype can technically achieve the necessary precision.

Bijan Shahbaz Nejad, Peter Roch, Marcus Handte, Pedro José Marrón
An Efficient Tiny Feature Map Network for Real-Time Semantic Segmentation

In this paper, we propose an efficient semantic segmentation network named Tiny Feature Map Network (TFMNet). This network significantly improves running speed while achieving good accuracy. Our scheme uses a lightweight backbone network to extract primary features from input images of particular sizes. The hybrid dilated convolution framework and the DenseASPP module are used to alleviate the gridding problem. We evaluate the proposed network on the Cityscapes and CamVid datasets, and obtain performance comparable with existing state-of-the-art real-time semantic segmentation methods. Specifically, it achieves 72.9% mIoU on the Cityscapes test dataset with only 2.4M parameters and a speed of 113 FPS on an NVIDIA GTX 1080 Ti without pre-training on the ImageNet dataset.

Hang Huang, Peng Zhi, Haoran Zhou, Yujin Zhang, Qiang Wu, Binbin Yong, Weijun Tan, Qingguo Zhou
A Modified Syn2Real Network for Nighttime Rainy Image Restoration

The restoration or enhancement of rainy images at nighttime is of great significance to outdoor computer vision applications such as self-driving and traffic surveillance. While image deraining has drawn increasing research attention and the majority of deraining methods are able to achieve satisfying performance for daytime image rain removal, there are few related studies for nighttime image deraining, as the conditions of nighttime rainy scenes are more complicated and challenging. To address the nighttime image deraining issues, we designed an improved model based on the Syn2Real network, called NIRR. In order to obtain good rain removal and visual effect under nighttime rainy scenes, we propose a new refined loss function for the supervised learning phase, which combines the perceptual loss and SSIM loss. The qualitative and quantitative experimental results show that our proposed method outperforms the state of the art on both synthetic and real-world nighttime rainy images.

Qunfang Tang, Jie Yang, Haibo Liu, Zhiqiang Guo
Unsupervised Domain Adaptation for Person Re-Identification with Few and Unlabeled Target Data

Existing, fully supervised methods for person re-identification (ReID) require annotated data acquired in the target domain in which the method is expected to operate. This includes the IDs as well as images of persons in that domain. This is an obstacle in the deployment of ReID methods in novel settings. To solve this problem, semi-supervised or even unsupervised ReID methods have been proposed. Still, due to their assumptions and operational requirements, such methods are not easily deployable and/or prove less performant in novel domains/settings, especially those involving small person galleries. In this paper, we propose a novel approach for person ReID that alleviates these problems. This is achieved by proposing a completely unsupervised method for fine-tuning the ReID performance of models learned in prior, auxiliary domains to new, completely different ones. The proposed model adaptation is achieved based on only a few unlabeled target persons' data. Extensive experiments investigate several aspects of the proposed method in an ablative study. Moreover, we show that the proposed method is able to considerably improve the performance of state-of-the-art ReID methods on state-of-the-art datasets.

George Galanakis, Xenophon Zabulis, Antonis A. Argyros
How Does Computer Animation Affect Our Perception of Emotions in Video Summarization?

With the exponential growth of film productions and the popularization of the web, the summarization of films has become a useful and important resource. Movie data specifically has become one of the most entertaining sources for viewers, especially during quarantine. However, browsing a movie in enormous collections and searching for a desired scene within a complete movie is a tedious and time-consuming task. As a result, automatic and personalized movie summarization has become a common research topic. In this paper, we focus on emotion summarization for single-shot videos and apply three independent methods for its summarization. We provide two different ways to visualize the main emotions of the generated summary and compare both approaches. The first one uses the original frames of the video and the other uses an open-source facial animation tool to create a virtual assistant that provides the emotion summarization. For evaluation, we conducted an extrinsic evaluation using a questionnaire to measure the quality of each generated video summary. Experimental results show that even though both videos received similar answers, a different technique for each video produced the most satisfying and informative summary.

Camila Kolling, Victor Araujo, Rodrigo C. Barros, Soraia Raupp Musse
Where’s Wally: A Gigapixel Image Study for Face Recognition in Crowds

Several devices are capable of capturing images with a large number of people, including high-resolution images known as gigapixel images. These images can be helpful for studies and investigations, such as finding people in a crowd. Although they can provide more details, the task of identifying someone in the crowd is quite challenging and complex. In this paper, we aim to assist the work of a human observer on large crowd images by reducing the search space from several images to a ranking of ten images related to a specific person. Our model collects faces in a crowded gigapixel image and then searches for people using three different poses (front, right and left). We built a handcrafted dataset with 42 people to evaluate our method, achieving a recognition rate of 69% on the complete dataset. We highlight that, of the 31% “not found” among the top ten in the ranking, many are very close to this boundary and, in addition, 92% of the non-matched are occluded by some accessory or another face. Experimental results showed great potential for our method to support a human observer in finding people in the crowd, especially in cluttered images, by providing her/him with a reduced search space.

Cristiane B. R. Ferreira, Helio Pedrini, Wanderley de Souza Alencar, William D. Ferreira, Thyago Peres Carvalho, Naiane Sousa, Fabrizzio Soares
Optical Flow Based Background Subtraction with a Moving Camera: Application to Autonomous Driving

In this research we present a novel algorithm for background subtraction using a moving camera. Our algorithm is based purely on visual information obtained from a camera mounted on an electric bus operating in downtown Reno, and it automatically detects moving objects of interest with a view to providing information for collision avoidance and vehicle counting for an autonomous vehicle. In our approach we exploit the optical flow vectors generated by the motion of the camera on the bus while keeping parameter assumptions at a minimum. First, we estimate the Focus of Expansion, which is used to model and simulate 3D points given the intrinsic parameters of the camera; we then perform multiple linear regression to estimate the regression equation parameters and apply them to the real data of every frame to identify moving objects. We validated our algorithm using data taken from a common bus route in the city of Reno.
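Under pure camera translation, flow vectors radiate from the Focus of Expansion, so each flow vector defines a line through its pixel and the FOE is their least-squares intersection. The generic sketch below illustrates this step only; it is not the authors' exact regression formulation.

```python
import numpy as np

def estimate_foe(points, flows):
    """Least-squares Focus of Expansion from sparse optical flow.

    points: (N, 2) pixel positions; flows: (N, 2) flow vectors (u, v).
    Each flow defines the line v*(X - x) - u*(Y - y) = 0 through its pixel.
    """
    u, v = flows[:, 0], flows[:, 1]
    x, y = points[:, 0], points[:, 1]
    A = np.column_stack([v, -u])
    b = v * x - u * y
    foe, *_ = np.linalg.lstsq(A, b, rcond=None)
    return foe

# toy usage: synthesize radial flow out of a known FOE at (320, 240)
rng = np.random.default_rng(0)
pts = rng.uniform(0, 640, size=(100, 2))
true_foe = np.array([320.0, 240.0])
flo = 0.05 * (pts - true_foe) + rng.normal(scale=0.1, size=(100, 2))
print(estimate_foe(pts, flo))   # approximately [320, 240]
```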

Sotirios Diamantas, Kostas Alexis
Deep Facial Expression Recognition with Occlusion Regularization

In computer vision, occlusions are mainly known as a challenge to cope with. For instance, partial occlusions of the face may lower the performance of facial expression recognition systems. However, when incorporated into the training, occlusions can also be helpful in improving the overall performance. In this paper, we propose and evaluate occlusion augmentation as a simple but effective regularizing tool for improving the general performance of deep learning based facial expression and action unit recognition systems, even if no occlusion is present in the test data. In our experiments we consistently found significant performance improvements on three databases (Bosphorus, RAF-DB, and AffectNet) and three CNN architectures (Xception, MobileNet, and a custom model), suggesting that occlusion regularization works independently of the dataset and architecture. Based on our clear results, we strongly recommend integrating occlusion regularization into the training of all CNN-based facial expression recognition systems, because it promises performance gains at very low cost.
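The augmentation itself can be as simple as the random-erasing-style sketch below; the patch size range and fill value are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

def random_occlusion(image, min_frac=0.1, max_frac=0.3, rng=None):
    """Blank out a random rectangular patch of an (H, W, C) uint8 face image."""
    rng = rng if rng is not None else np.random.default_rng()
    h, w = image.shape[:2]
    ph = int(h * rng.uniform(min_frac, max_frac))
    pw = int(w * rng.uniform(min_frac, max_frac))
    top = int(rng.integers(0, h - ph + 1))
    left = int(rng.integers(0, w - pw + 1))
    out = image.copy()
    out[top:top + ph, left:left + pw] = 0      # simulate an occluder (e.g., a hand or glasses)
    return out

# toy usage: occlude a synthetic 96x96 RGB "face" as a training-time augmentation
img = np.full((96, 96, 3), 128, dtype=np.uint8)
aug = random_occlusion(img, rng=np.random.default_rng(0))
print((aug == 0).any())
```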

Nikul Pandya, Philipp Werner, Ayoub Al-Hamadi
Semantic Segmentation with Peripheral Vision

Deep convolutional neural networks exhibit exceptional performance on many computer vision tasks, including image semantic segmentation. Pre-trained networks trained on a relevant and large benchmark have a notable impact on these successful achievements. However, when confronting a domain shift, using pre-trained deep encoders cannot boost the performance of those models. In general, transfer learning is not a universal solution for computer vision applications with small accessible image databases. An alternative approach is to develop stronger deep network models applicable to any problem rather than encouraging scientists to explore available pre-trained encoders for their computer vision tasks. To steer the research trend in image semantic segmentation toward more effective models, we propose an innovative convolutional module simulating the peripheral vision of the human eye. By utilizing our module in an encoder-decoder configuration, after extensive experiments, we achieved acceptable outcomes on several challenging benchmarks, including PASCAL VOC2012 and CamVid.

M. Hamed Mozaffari, Won-Sook Lee
Generator from Edges: Reconstruction of Facial Images

Applications that involve supervised training require paired images. Researchers of single image super-resolution (SISR) create such images by artificially generating blurry input images from the corresponding ground truth. Similarly, we can create paired images with the Canny edge detector. We propose Generator From Edges (GFE) [Fig. 1]. Our aim is to determine the best architecture for GFE, along with a review of perceptual loss [1, 2]. To this end, we conducted three experiments. First, we explored the effects of the adversarial loss often used in SISR. In particular, we uncovered that it is not an essential component to form a perceptual loss. Eliminating the adversarial loss leads to a more effective architecture from the perspective of hardware resources. It also means that considerations for the problems pertaining to generative adversarial networks (GAN) [3], such as mode collapse, are not necessary. Second, we reexamined the VGG loss and found that the mid-layers yield the best results. By extracting the full potential of VGG loss, the overall performance of perceptual loss improves significantly. Third, based on the findings of the first two experiments, we reevaluated the dense network to construct GFE. Using GFE as an intermediate process, reconstructing a facial image from a pencil sketch can become an easy task.
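Generating the (edge, photo) training pairs mentioned above can be done with OpenCV's Canny detector; the thresholds here are illustrative.

```python
import cv2
import numpy as np

def make_edge_pair(image_bgr, low=100, high=200):
    """Return (edge_map, image): the edge map plays the role of the degraded
    input (like the artificially blurred input in SISR) and the photo is the
    ground truth."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, low, high)
    return edges, image_bgr

# toy usage on a synthetic image; a real pipeline would loop over a face dataset
img = np.zeros((128, 128, 3), dtype=np.uint8)
cv2.circle(img, (64, 64), 40, (255, 255, 255), -1)
edges, target = make_edge_pair(img)
print(edges.shape, target.shape)
```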

Nao Takano, Gita Alaghband
CD: Combined Distances of Contrast Distributions for Image Quality Analysis

The quality of visual input impacts both human and machine perception. Consequently many processing techniques exist that deal with different distortions. Usually they are applied freely and unsupervised. We propose a novel method called CD² to protect against errors that arise during image processing. It is based on distributions of image contrast and custom distance functions which capture the effect of noise, compression, etc. CD² achieves excellent performance on image quality analysis benchmarks and in a separate user test with only a small data and computation overhead.

Sascha Xu, Jan Bauer, Benjamin Axmann, Wolfgang Maass
Real-Time Person Tracking and Association on Doorbell Cameras

This paper presents key techniques for real-time, multi-person tracking and association on doorbell surveillance cameras at the edge. The challenges for this task are: significant person size changes during tracking caused by the person approaching or departing from the doorbell camera, person occlusions due to the limited camera field of view and occluding objects in the camera view, and the requirement for a lightweight algorithm that can run in real time on the doorbell camera at the edge. To address these challenges, we propose a multi-person tracker that uses a detect-track-associate strategy to achieve good performance in speed and accuracy. The person detector only runs at every n-th frame, and between person detection frames a low-cost point-based tracker is used to track the subjects. To maintain subject tracking accuracy, at each person detection frame a person association algorithm is used to associate persons detected in the current frame to the current and recently tracked subjects and identify any new subjects. To improve the performance of the point-based tracker, human-shaped masks are used to filter out background points. Further, to address the challenge of drastic target scale change during tracking, we introduced an adaptive image resizing strategy to dynamically adjust the tracker input image size so that the point-based tracker can operate at the optimal image resolution given a fixed number of feature points. For fast and accurate person association, we introduced the Sped-Up LOMO, a fast version of the popular local maximal occurrence (LOMO) person descriptor. The experimental results on doorbell surveillance videos illustrate the efficacy of the proposed person tracking and association framework.

Sung Chun Lee, Gang Qian, Allison Beach
MySnapFoodLog: Culturally Sensitive Food Photo-Logging App for Dietary Biculturalism Studies

It is believed that immigrants to the U.S. have increased rates of chronic diseases due to their adoption of the Western diet. There is a need to better understand the dietary intake of these immigrants. Tracking food consumption can be easily done by using a food app, but there is currently no culturally-appropriate food tracking app that is relatively easy for participants and the research community to use. The MySnapFoodLog app was developed using the cross-platform Flutter framework to track users’ food consumption with the goal of using AI to recognize Filipino foods and determine if a meal is healthy or unhealthy. A pilot study demonstrates the feasibility of the app alpha release and the need for further data collection and training to improve the Filipino food recognition system.

Paul Stanik III, Brendan Tran Morris, Reimund Serafica, Kelly Harmon Webber
Hand Gesture Recognition Based on the Fusion of Visual and Touch Sensing Data

The use of computers has evolved so rapidly that our daily lives revolve around them. With the advancement of computer science and technology, the interaction between humans and computers is no longer limited to mice and keyboards. Whole-body interaction is the trend supported by the newest techniques. Hand gesture interaction is becoming more and more common; however, it is challenged by lighting conditions, limited hand movements, and the occlusion of hand images. The objective of this paper is to reduce those challenges by fusing vision and touch sensing data to accommodate the requirements of advanced human-computer interaction. In the development of this system, vision and touchpad sensing data were used to detect the fingertips using machine learning. The fingertip detection results were fused by a K-nearest neighbor classifier to form the proposed hybrid hand gesture recognition system. The classifier is then trained to classify four hand gestures. The classifier was tested in three different scenarios with static, slow motion, and fast movement of the hand. The overall performance of the system on both static and slow-moving hands is 100% precision for both training and testing sets, and a 0% false-positive rate. In the fast-moving hand scenario, the system achieved 95.25% accuracy, 94.59% precision, 96% recall, and a 5.41% false-positive rate. Finally, using the proposed classifier, a real-time, simple, accurate, reliable, and cost-effective system was realised to control the Windows media player. The outcome of fusing the two input sensors offered better precision and recall performance of the system.

F. T. Timbane, S. Du, R. Aylward
Gastrointestinal Tract Anomaly Detection from Endoscopic Videos Using Object Detection Approach

Endoscopy is a medical procedure used for the imaging and examination of internal body organs for detecting, visualizing and localising anomalies to facilitate their further treatment. Currently, the medical practitioner's expertise is vastly relied upon to analyse these endoscopic videos. This can be a bottleneck in rural areas where specialized medical practitioners are scarce. By learning from and improving upon existing research, the proposed system leverages object detection methods to achieve an automated detection mechanism that provides real-time annotations to assist medical professionals performing endoscopy and provides insights for educational purposes. It works by extracting video frames and processing them using a real-time object detection deep learning model trained on a standard dataset to detect two anomalies, namely Esophagitis and Polyps. The output is in the form of an annotated video. Using the Intersection over Union (IoU) metric, the model is observed to perform accurately on the training set but shows lower accuracy on the test set of images. This, however, can be improved using alternate metrics which are better suited to irregularly shaped, multi-class, multiple object detection and can better explain the observed results.
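For reference, the IoU metric used here can be computed for axis-aligned boxes as below.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# toy usage: a predicted polyp box against an annotated ground-truth box
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))   # ~0.143
```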

Tejas Chheda, Rithvika Iyer, Soumya Koppaka, Dhananjay Kalbande
A Multimodal High Level Video Segmentation for Content Targeted Online Advertising

In this paper we introduce a novel advertisement system dedicated to multimedia documents broadcast over the Internet. The proposed approach takes into account the consumer's perspective and inserts contextually relevant ads at scene boundaries, while reducing the degree of intrusiveness. From the methodological point of view, the major contribution of the paper concerns a temporal video segmentation method into scenes based on a multimodal (visual, audio and semantic) fusion of information. The experimental evaluation, carried out on a large dataset with more than 30 video documents, validates the proposed methodology with average F1 scores above 85%.

Bogdan Mocanu, Ruxandra Tapu, Titus Zaharia
AI Playground: Unreal Engine-Based Data Ablation Tool for Deep Learning

Machine learning requires data, but acquiring and labeling real-world data is challenging, expensive, and time-consuming. More importantly, it is nearly impossible to alter real data post-acquisition (e.g., change the illumination of a room), making it very difficult to measure how specific properties of the data affect performance. In this paper, we present AI Playground (AIP), an open-source, Unreal Engine-based tool for generating and labeling virtual image data. With AIP, it is trivial to capture the same image under different conditions (e.g., fidelity, lighting, etc.) and with different ground truths (e.g., depth or surface normal values). AIP is easily extendable and can be used with or without code. To validate our proposed tool, we generated eight datasets of otherwise identical but varying lighting and fidelity conditions. We then trained deep neural networks to predict (1) depth values, (2) surface normals, or (3) object labels and assessed each network’s intra- and cross-dataset performance. Among other insights, we verified that sensitivity to different settings is problem-dependent. We confirmed the findings of other studies that segmentation models are very sensitive to fidelity, but we also found that they are just as sensitive to lighting. In contrast, depth and normal estimation models seem to be less sensitive to fidelity or lighting, and more sensitive to the structure of the image. Finally, we tested our trained depth-estimation networks on two real-world datasets and obtained results comparable to training on real data alone, confirming that our virtual environments are realistic enough for real-world tasks.

Mehdi Mousavi, Aashis Khanal, Rolando Estrada
Homework Helper: Providing Valuable Feedback on Math Mistakes

Many parents feel uncomfortable helping their children with homework, with only 66% of parents consistently checking their child’s homework [22]. Because of this, many turn to math games and problem solvers as they have become widely available in recent years [12, 21]. Many of these applications rely on multiple choice or keyboard entry submission of answers, limiting their adoption. Auto graders and applications, such as PhotoMath, deprive students of the opportunity to correct their own mistakes, automatically generating a solution with no explanation [19]. This work introduces a novel homework assistant – Homework Helper (HWHelper) – that is capable of determining mathematical errors in order to provide meaningful feedback to students without solutions. In this paper, we focus on simple arithmetic calculations, specifically multi-digit addition, introducing 2D-Add, a new dataset of worked addition problems. We design a system that acts as a guided learning tool for students allowing them to learn from and correct their mistakes. HWHelper segments a sheet of math problems, identifies the student’s answer, performs arithmetic and pinpoints mistakes made, providing feedback to the student. HWHelper fills a significant gap in the current state-of-the-art for student math homework feedback.
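One way the mistake-pinpointing step could work for multi-digit addition is sketched below; the column-based feedback wording is an illustrative assumption, not the HWHelper implementation.

```python
def check_addition(a: int, b: int, student_answer: int):
    """Compare a student's answer to a + b digit by digit (right to left).

    Returns (is_correct, list of 0-based columns, counted from the ones place,
    where the digits differ).
    """
    correct = str(a + b)
    answer = str(student_answer)
    width = max(len(correct), len(answer))
    correct, answer = correct.rjust(width, "0"), answer.rjust(width, "0")
    wrong_cols = [i for i in range(width)
                  if correct[width - 1 - i] != answer[width - 1 - i]]
    return len(wrong_cols) == 0, wrong_cols

# toy usage: the student forgot the carry into the tens column (correct answer is 85)
ok, cols = check_addition(47, 38, 75)
print(ok, cols)   # False, [1] -> feedback could say "check the tens column"
```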

Sara R. Davis, Carli DeCapito, Eugene Nelson, Karun Sharma, Emily M. Hand
Interface Design for HCI Classroom: From Learners’ Perspective

Having a good Human-Computer Interaction (HCI) design is challenging. Previous works have contributed significantly to fostering HCI, including design principles with report studies from the instructor's view. The questions of how and to what extent students perceive the design principles are still left open. To answer this question, this paper conducts a study of HCI adoption in the classroom. The studio-based learning method is adapted to teach 83 graduate and undergraduate students over 16 weeks with four activities. A standalone presentation tool for instant online peer feedback during the presentation session is developed to help students justify and critique others' work. Our tool provides a sandbox which supports multiple application types, including Web applications, Object Detection, Web-based Virtual Reality (VR), and Augmented Reality (AR). After presenting one assignment and two projects, our results show that students acquired a better understanding of the Golden Rules principles over time, which is demonstrated by the development of their visual interface designs. The Wordcloud reveals that the primary focus was on the user interface and sheds light on students' interest in user experience. The inter-rater score indicates agreement among students that they have the same level of understanding of the principles. The results show a high level of guideline compliance with HCI principles, in which we witness variations in visual cognitive styles. Regardless of diversity in visual preference, the students present high consistency and a similar perspective on adopting HCI design principles. The results also elicit suggestions for the development of the HCI curriculum in the future.

Huyen N. Nguyen, Vinh T. Nguyen, Tommy Dang
Pre-trained Convolutional Neural Network for the Diagnosis of Tuberculosis

Tuberculosis (TB) is an infectious disease that claimed about 1.5 million lives in 2018. TB is most prevalent in developing regions. Even though TB is curable, early detection is necessary to prevent its spread and casualties. Chest radiographs are one of the most reliable screening techniques, although their accuracy depends on professional radiologists' interpretation of the individual images. Consequently, we present a computer-aided detection system that uses a pre-trained convolutional neural network as a feature extractor and a logistic regression classifier to automatically analyze chest radiographs and provide a timely and accurate interpretation of multiple images. The chest radiographs were pre-processed before distinctive features were extracted and fed to the classifier to detect which images show infection. This work establishes the potential of applying pre-trained Convolutional Neural Network models in the medical domain to obtain good results despite limited datasets.
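As a rough illustration of this kind of pipeline (a pre-trained CNN used only as a feature extractor, followed by logistic regression), the sketch below uses an ImageNet-pretrained ResNet-50 from torchvision and scikit-learn; the backbone choice and the dummy data are assumptions, not details from the paper.

```python
import torch
import torchvision.models as models
from sklearn.linear_model import LogisticRegression

# Pre-trained backbone with its classification head removed, used as a fixed feature extractor.
backbone = models.resnet50(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()
backbone.eval()

@torch.no_grad()
def extract_features(batch):
    # batch: (N, 3, 224, 224) pre-processed radiographs replicated to 3 channels
    return backbone(batch).numpy()  # (N, 2048) feature vectors

# Dummy stand-ins for pre-processed chest radiographs and infected/healthy labels.
x_train, y_train = torch.randn(40, 3, 224, 224), torch.randint(0, 2, (40,)).numpy()
x_test, y_test = torch.randn(10, 3, 224, 224), torch.randint(0, 2, (10,)).numpy()

clf = LogisticRegression(max_iter=1000).fit(extract_features(x_train), y_train)
print("accuracy:", clf.score(extract_features(x_test), y_test))
```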

Mustapha Oloko-Oba, Serestina Viriri
Near-Optimal Concentric Circles Layout

The majority of graph visualization algorithms emphasize improving the readability of graphs by focusing on various vertex and edge rendering techniques. However, revealing the global connectivity structure of a graph by identifying significant vertices is an important and useful part of any graph analytics system. Centrality measures reveal the “most important” vertices of a graph, commonly referred to as central or influential vertices. Hence, a centrality-oriented visualization may highlight these important vertices and give deep insights into graph data. This paper proposes a mathematical optimization-based clustered graph layout called Near-Optimal Concentric Circles (NOCC) to visualize medium to large scale-free graphs. We cluster the vertices by their betweenness values and optimally place them on concentric circles to reveal the extensive connectivity structure of the graph while achieving aesthetically pleasing layouts. In addition, we incorporate different edge rendering techniques to improve graph readability and interaction.
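The core idea (bin vertices by betweenness and place each bin on its own ring, with more central vertices on inner circles) can be sketched as below. This is a plain binned placement, not the authors' optimization; the NetworkX calls and the parameters are assumptions.

```python
import math
import networkx as nx

def concentric_layout(G, n_rings=4):
    """Place vertices on concentric circles, inner rings holding the most central vertices."""
    bc = nx.betweenness_centrality(G)
    order = sorted(G.nodes, key=lambda v: -bc[v])          # most central first
    per_ring = math.ceil(len(order) / n_rings)
    pos = {}
    for ring in range(n_rings):
        ring_nodes = order[ring * per_ring:(ring + 1) * per_ring]
        radius = ring + 1                                   # ring 0 is the innermost circle
        for k, v in enumerate(ring_nodes):
            angle = 2 * math.pi * k / max(len(ring_nodes), 1)
            pos[v] = (radius * math.cos(angle), radius * math.sin(angle))
    return pos

G = nx.barabasi_albert_graph(200, 2)   # a small scale-free test graph
pos = concentric_layout(G)
```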

Prabhakar V. Vemavarapu, Mehmet Engin Tozal, Christoph W. Borst
Facial Expression Recognition and Ordinal Intensity Estimation: A Multilabel Learning Approach

Facial Expression Recognition has gained considerable attention in the field of affective computing, but only a few works have considered the intensity of the emotion embedded in the expression. The available studies on expression intensity estimation either assign a nominal/regression value or classify emotion into a range of intervals. These multiclass approaches and their extensions do not conform to the human heuristic manner of recognising an emotion together with its intensity. This work presents a multi-label CNN-based model which simultaneously recognises emotion and provides an ordinal metric for its intensity. In experiments conducted on the BU-3DFE and Cohn-Kanade (CK+) datasets, we evaluate how well our model adapts and generalises. Our model gives promising results under multilabel evaluation metrics and generalises well when trained on BU-3DFE and evaluated on CK+.

Olufisayo Ekundayo, Serestina Viriri
Prostate MRI Registration Using Siamese Metric Learning

The process of registering intra-procedural prostate magnetic resonance images (MRI) with corresponding pre-procedural images improves the accuracy of certain surgeries, such as a prostate biopsy. Aligning the two images by means of rigid and elastic deformation may permit more precise use of the needle during the operation. However, gathering the necessary data and computing the ground truth is a problematic step. Currently, a single dataset is available, and it is composed of only a few cases, making the training of standard deep convolutional neural networks difficult. To address this issue, the moving image (intra-procedural) is randomly augmented to produce different copies, and a convolutional siamese neural network tries to choose the best-aligned copy with respect to the reference image (pre-procedural). The results of this research show that this method is superior to both a simple baseline obtained with standard image processing techniques and a deep CNN model. Furthermore, the best policy found for building the pair set for the siamese neural network is a rule based on mutual information that keeps only the highest and the lowest values, representing similar and dissimilar cases respectively. The use of mutual information allows the model to be unsupervised, since segmentation is no longer necessary. Finally, a study on the size of the augmented set shows that producing 18 different candidates is sufficient for good performance.
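The mutual-information-based pair selection described above can be approximated as follows: among the augmented copies of the moving image, the copy with the highest MI against the fixed image serves as the similar example and the one with the lowest MI as the dissimilar example. This is a generic histogram-based sketch with assumed function names, not the authors' code.

```python
import numpy as np

def mutual_information(a, b, bins=32):
    """Histogram-based mutual information between two grayscale images of the same shape."""
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1, keepdims=True), pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0                                   # avoid log(0) on empty bins
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

def select_training_pair(fixed, augmented_copies):
    """Return (most similar copy, least similar copy) w.r.t. the fixed image."""
    scores = [mutual_information(fixed, m) for m in augmented_copies]
    return augmented_copies[int(np.argmax(scores))], augmented_copies[int(np.argmin(scores))]
```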

Alexander Lyons, Alberto Rossi
Unsupervised Anomaly Detection of the First Person in Gait from an Egocentric Camera

Assistive technology is increasingly important as the senior population grows. The purpose of this study is to develop a means of preventing fatal injury by monitoring the movements of the elderly and sounding an alarm if an accident occurs. We present a method for detecting an anomaly in a first person's gait from an egocentric video. Following conventional anomaly detection methods, we train the model in an unsupervised manner. We use optical flow images to capture the ego-motion information of the first person. To verify the effectiveness of our model, we introduce a novel first-person video anomaly detection dataset, conduct experiments on it, and show that our model outperforms the baseline method.

Mana Masuda, Ryo Hachiuma, Ryo Fujii, Hideo Saito
Emotion Categorization from Video-Frame Images Using a Novel Sequential Voting Technique

Emotion categorization is the process of identifying different emotions in humans based on their facial expressions. It is time-consuming, and human classifiers often find it hard to agree with each other about the emotion category of a facial expression. Machine learning classifiers, however, have performed well in classifying different emotions and have been widely used in recent years to facilitate the task of emotion categorization. Much research on emotion video databases uses only a few frames from when the emotion is expressed at its peak, which may not give good classification accuracy when predicting frames where the emotion is less intense. In this paper, using the CK+ emotion dataset as an example, we use more frames to analyze emotion from mid and peak frame images and compare our results to a method using fewer peak frames. Furthermore, we propose an approach based on sequential voting and apply it to more frames of the CK+ database. Our approach resulted in up to 85.9% accuracy for the mid frames and an overall accuracy of 96.5% for the CK+ database, compared with accuracies of 73.4% and 93.8% from existing techniques.
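A minimal form of such sequential voting is sketched below: each frame of a clip is classified independently and the clip label is the majority vote over the per-frame predictions. The paper's scheme may weight or order votes differently; the helper names here are assumptions.

```python
from collections import Counter

def classify_sequence(frames, classify_frame):
    """classify_frame: callable mapping a single frame to an emotion label."""
    votes = [classify_frame(f) for f in frames]
    return Counter(votes).most_common(1)[0][0]   # majority label over the sequence

# Example with a dummy per-frame classifier that simply returns precomputed labels.
frame_labels = ["happy", "happy", "neutral", "happy", "surprise"]
print(classify_sequence(frame_labels, lambda label: label))  # "happy"
```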

Harisu Abdullahi Shehu, Will Browne, Hedwig Eisenbarth
Systematic Optimization of Image Processing Pipelines Using GPUs

Real-time computer vision systems require fast and efficient image processing pipelines. Experiments have shown that GPUs are highly suited for image processing operations, since many tasks can be processed in parallel. However, calling GPU-accelerated functions requires uploading the input parameters to the GPU’s memory, calling the function itself, and downloading the result afterwards. In addition, since not all functions benefit from an increase in parallelism, many pipelines cannot be implemented exclusively using GPU functions. As a result, the optimization of pipelines requires a careful analysis of the achievable function speedup and the cost of copying data. In this paper, we first define a mathematical model to estimate the performance of an image processing pipeline. Thereafter, we present a number of micro-benchmarks gathered using OpenCV which we use to validate the model and which quantify the cost and benefits for different classes of functions. Our experiments show that comparing the function speedup without considering the time for copying can overestimate the achievable performance gain of GPU acceleration by a factor of two. Finally, we present a tool that analyzes the possible combinations of CPU and GPU function implementations for a given pipeline and computes the most efficient composition. By using the tool on their target hardware, developers can easily apply our model to optimize their application performance systematically.
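The trade-off this model captures can be illustrated with a toy cost calculation: each stage is charged either its CPU time, or its GPU kernel time plus any upload/download needed at CPU/GPU boundaries, and stage assignments are enumerated to find the cheapest composition. The timings and per-stage transfer costs below are illustrative assumptions, not measurements from the paper.

```python
from itertools import product

def pipeline_cost(stages, assignment):
    """stages: list of dicts with 'cpu', 'gpu', 'up', 'down' times in ms.
    assignment: one 'cpu' or 'gpu' choice per stage."""
    total, on_gpu = 0.0, False
    for stage, where in zip(stages, assignment):
        if where == "gpu":
            total += (0.0 if on_gpu else stage["up"]) + stage["gpu"]    # upload only at a CPU->GPU boundary
            on_gpu = True
        else:
            total += (stage["down"] if on_gpu else 0.0) + stage["cpu"]  # download only at a GPU->CPU boundary
            on_gpu = False
    return total + (stages[-1]["down"] if on_gpu else 0.0)              # final result must end on the host

# Illustrative per-stage timings (ms); a real tool would micro-benchmark these.
stages = [
    {"cpu": 4.0, "gpu": 1.0, "up": 1.5, "down": 1.5},
    {"cpu": 2.0, "gpu": 0.5, "up": 1.5, "down": 1.5},
    {"cpu": 1.0, "gpu": 1.2, "up": 1.5, "down": 1.5},
]
best = min(product(["cpu", "gpu"], repeat=len(stages)),
           key=lambda a: pipeline_cost(stages, a))
print(best, pipeline_cost(stages, best))
```

Note how the last stage stays on the CPU in the cheapest assignment even though its GPU kernel is not much slower: the extra download outweighs the kernel speedup, which is exactly the effect the paper's model is meant to expose.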

Peter Roch, Bijan Shahbaz Nejad, Marcus Handte, Pedro José Marrón
A Hybrid Approach for Improved Image Similarity Using Semantic Segmentation

Content-Based Image Retrieval (CBIR) is the task of finding images in a dataset that are considered similar to an input query based on its visual characteristics. Several state-of-the-art methods, based on visual features (bag of visual words, VLAD, ...) or recent deep learning approaches, try to solve the CBIR problem. In particular, deep learning is used for several vision applications, including CBIR. But even with the increasing performance of deep learning algorithms, this problem is still a challenge in computer vision. To tackle it, we present in this paper an efficient CBIR framework that combines deep-learning-based semantic segmentation with visual features. We show experimentally that this combination increases the accuracy of our CBIR framework. We study the performance of the proposed approach on four different datasets (Wang, MSRC V1, MSRC V2, Linnaeus).

Achref Ouni, Eric Royer, Marc Chevaldonné, Michel Dhome
Automated Classification of Parkinson’s Disease Using Diffusion Tensor Imaging Data

Parkinson’s Disease (PD) is one of the most common neurological disorders in the world, affecting over 6 million people globally. In recent years, Diffusion Tensor Imaging (DTI) biomarkers have been established as one of the leading techniques to help diagnose the disease. However, identifying patterns and deducing even preliminary results requires a neurologist to manually analyze the scan. In this paper, we propose a Machine Learning (ML) based algorithm that can analyze DTI data and predict whether a person has PD. We were able to obtain a classification accuracy of 80% and an F1 score of 0.833 using our approach. The proposed method is expected to reduce the number of misdiagnoses by assisting neurologists in making a decision.

Harsh Sharma, Sara Soltaninejad, Irene Cheng
Nonlocal Adaptive Biharmonic Regularizer for Image Restoration

In this paper, we propose a nonlocal adaptive biharmonic regularization term for image denoising and restoration, combining the advantages of fourth order models (without the staircase effect while preserving slopes) and nonlocal methods (preserving texture). For its numerical solution, we employ the $$L^2$$ gradient descent and finite difference methods to design explicit, semi-implicit, and implicit schemes. Numerical results for denoising and restoration are shown on synthetic images, real images, and texture images. Comparisons with a local fourth order regularizer and the nonlocal total variation are made, which help illustrate the advantages of the proposed model.
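To illustrate the fourth-order structure of such a model, the sketch below runs explicit L2 gradient descent for the local biharmonic denoising energy lam/2 ||u - f||^2 + 1/2 ||Δu||^2, whose gradient is lam(u - f) + Δ²u. The paper's regularizer is nonlocal and adaptive rather than this local Laplacian, and the step size and iteration count here are illustrative.

```python
import numpy as np

def laplacian(u):
    # 5-point stencil with replicated (Neumann-like) boundaries.
    p = np.pad(u, 1, mode="edge")
    return p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:] - 4 * u

def biharmonic_denoise(f, lam=0.1, dt=0.01, n_iter=500):
    """Explicit gradient descent on lam/2*||u - f||^2 + 1/2*||laplacian(u)||^2."""
    u = f.copy()
    for _ in range(n_iter):
        grad = lam * (u - f) + laplacian(laplacian(u))   # lam(u - f) + bilaplacian of u
        u -= dt * grad
    return u

# Dummy noisy image standing in for a degraded input.
noisy = np.clip(0.5 + 0.1 * np.random.randn(64, 64), 0, 1)
restored = biharmonic_denoise(noisy)
```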

Ying Wen, Luminita A. Vese
A Robust Approach to Plagiarism Detection in Handwritten Documents

Plagiarism detection is a widely used technique for assessing the originality of a piece of work. In this paper, we address the problem of predicting similarities amongst a collection of documents, a technique with widespread uses in academic institutions. We propose a simple yet effective method for detecting plagiarism by using a robust word detection and segmentation procedure followed by a convolutional neural network (CNN) and bidirectional long short-term memory (biLSTM) pipeline to extract the text. Our approach also extracts and encodes common patterns, such as scratches in handwriting, to improve accuracy on real-world use cases. The information extracted from multiple documents is compared using similarity metrics to find the documents that have been plagiarized from a source. Extensive experiments in our research show that this approach may help simplify the examining process and can act as a cheap, viable alternative to many modern approaches used to detect plagiarism in handwritten documents.

Om Pandey, Ishan Gupta, Bhabani S. P. Mishra
Optical Coherence Tomography Latent Fingerprint Image Denoising

Latent fingerprints are fingerprint impressions left on the surfaces a finger comes into contact with. They are found in almost every crime scene. Conventionally, latent fingerprints have been obtained using chemical or physical methods, which are destructive techniques. The forensic community is moving towards contactless acquisition methods. Contactless acquisition presents some advantages over destructive methods, including multiple acquisitions of the sample and the possibility of further analysis such as touch DNA. This work proposes a speckle-noise denoising method for optical coherence tomography (OCT) latent fingerprint images. The proposed denoising technique is derived from adaptive thresholding and NormalShrink. Experimental results show that the proposed method suppresses speckle noise better than adaptive thresholding, NormalShrink, VisuShrink, SUREShrink and BayesShrink.
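A generic wavelet soft-thresholding sketch in the spirit of NormalShrink-style speckle suppression is shown below, with a per-subband threshold adapted to each band's spread. It is not the authors' exact adaptive rule, and the wavelet choice and PyWavelets usage are assumptions.

```python
import numpy as np
import pywt

def wavelet_despeckle(img, wavelet="db4", level=2):
    coeffs = pywt.wavedec2(img, wavelet, level=level)
    # Robust noise estimate (MAD) from the finest diagonal detail band.
    sigma = np.median(np.abs(coeffs[-1][-1])) / 0.6745
    out = [coeffs[0]]
    for details in coeffs[1:]:
        shrunk = []
        for band in details:
            # Per-subband threshold scaled by the band's own spread, loosely
            # following the NormalShrink idea of adapting to each subband.
            t = sigma ** 2 / max(np.std(band), 1e-8)
            shrunk.append(pywt.threshold(band, t, mode="soft"))
        out.append(tuple(shrunk))
    return pywt.waverec2(out, wavelet)

# Dummy noisy image standing in for an OCT fingerprint slice.
noisy = np.random.rand(128, 128) + 0.2 * np.random.randn(128, 128)
clean = wavelet_despeckle(noisy)
```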

Sboniso Sifiso Mgaga, Jules-Raymond Tapamo, Nontokozo Portia Khanyile
CNN, Segmentation or Semantic Embeddings: Evaluating Scene Context for Trajectory Prediction

For autonomous vehicles (AVs) and social robots to navigate naturally and safely, it is important that they completely understand their surroundings. While it is often recognized that scene context is important for understanding pedestrian behavior, it has received less attention than modeling social context, i.e., the influence of interactions between pedestrians. In this paper, we evaluate the effectiveness of various scene representations for deep trajectory prediction. Our work focuses on characterizing the impact of scene representations (semantic images vs. semantic embeddings) and scene quality (competing semantic segmentation networks). We leverage a hierarchical RNN autoencoder to encode historical pedestrian motion, social interaction and scene semantics into a low-dimensional subspace, and then decode it to generate future motion predictions. Experimental evaluation on the ETH and UCY datasets shows that using full scene semantics, specifically segmented images, can improve trajectory prediction over using just embeddings.

Arsal Syed, Brendan Tran Morris
Automatic Extraction of Joint Orientations in Rock Mass Using PointNet and DBSCAN

Measurement of joint orientation is an essential task for rock mass discontinuity characterization. This work presents a methodology for the automatic extraction of joint orientations in a rock mass from 3D point cloud data generated using Unmanned Aerial Vehicles and photogrammetry. Our algorithm first automatically classifies joints on the 3D rock surface using the state-of-the-art deep network architecture PointNet. It then identifies individual joints by Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and computes their orientations by fitting least-square planes using Random Sample Consensus. A major case study has been developed to evaluate the performance of the entire methodology. Our results show that the proposed approach outperforms similar approaches in the literature in terms of both accuracy and time complexity. Our experiments show the great potential of 3D deep learning techniques for discontinuity characterization, which might be used for the estimation of other parameters as well.
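The clustering and plane-fitting stage can be sketched as follows: points already labeled as joints are grouped with DBSCAN, and each cluster's unit normal (from which dip and dip direction follow) is obtained with a small RANSAC loop refined by a least-squares SVD fit. Parameters such as eps, min_samples and the inlier tolerance are placeholders, not values from the paper.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def fit_plane_ransac(pts, n_iter=200, tol=0.05):
    """Return the unit normal of the dominant plane through a cluster of 3D points."""
    best_inliers, rng = None, np.random.default_rng(0)
    for _ in range(n_iter):
        sample = pts[rng.choice(len(pts), 3, replace=False)]
        n = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        if np.linalg.norm(n) < 1e-9:
            continue                                     # degenerate (collinear) sample
        n /= np.linalg.norm(n)
        inliers = np.abs((pts - sample[0]) @ n) < tol    # point-to-plane distances
        if best_inliers is None or inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Refit on the consensus set; the last right-singular vector is the plane normal.
    c = pts[best_inliers].mean(axis=0)
    _, _, vt = np.linalg.svd(pts[best_inliers] - c)
    return vt[-1]

def joint_orientations(joint_points, eps=0.2, min_samples=10):
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(joint_points)
    return {k: fit_plane_ransac(joint_points[labels == k])
            for k in set(labels) if k != -1}             # -1 is DBSCAN noise

# Dummy thin, roughly planar patch standing in for points classified as one joint.
pts = np.random.rand(500, 3); pts[:, 2] *= 0.02
print(joint_orientations(pts))
```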

Rushikesh Battulwar, Ebrahim Emami, Masoud Zare Naghadehi, Javad Sattarvand
Feature Map Retargeting to Classify Biomedical Journal Figures

In this work, we propose a layer to retarget feature maps in Convolutional Neural Networks (CNNs). Our “Retarget” layer densely samples values for each feature map channel at locations inferred by our proposed spatial attention regressor. Our layer extends an existing saliency-based distortion layer by replacing its convolutional components with depthwise convolutions. This reformulation, together with the tuning of its hyper-parameters, makes the Retarget layer applicable at any depth of feed-forward CNNs. In keeping with the spirit of content-aware image resizing (retargeting) methods, we introduce our layers at the bottlenecks of three pre-trained CNNs. We validate our approach on the ImageCLEF2013, ImageCLEF2015, and ImageCLEF2016 document subfigure classification task. Our redesigned DenseNet121 model with the Retarget layer achieved state-of-the-art results in the visual category when no data augmentations were performed. Performing spatial sampling for each channel of the feature maps at deeper layers exponentially increases computational cost and memory requirements. To address this, we experiment with an approximation of nearest neighbor interpolation and show consistent improvement over the baseline models and other state-of-the-art attention models. The code is available at https://github.com/VimsLab/CNN-Retarget .

Vinit Veerendraveer Singh, Chandra Kambhamettu
Automatic 3D Object Detection from RGB-D Data Using PU-GAN

3D object detection from RGB-D data in outdoor scenes is crucial in various industrial applications such as autonomous driving, robotics, etc. However, the points obtained from range sensor scans are usually sparse and non-uniform, which seriously limits detection performance, especially for far-away objects. By learning a rich variety of point distributions from the latent space, we believe that 3D upsampling techniques may fill in the knowledge missing due to the sparsity of the 3D points. Hence, a 3D object detection method using 3D upsampling techniques is presented in this paper. The main contributions of the paper are two-fold. First, based on the Frustum PointNets pipeline, a 3D object detection method using PU-GAN has been implemented. A state-of-the-art 3D upsampling method, PU-GAN, is used to compensate for the sparsity of the point cloud. Second, some effective strategies have been proposed to improve detection performance using upsampled dense points. Extensive experimental results on the KITTI benchmark show that the impact of PU-GAN upsampled points on object detection is closely related to the object's distance from the camera; the upsampled points show their superiority when applied to objects located around 30 meters away. By carefully designing the criteria for employing the upsampled points, the developed method outperforms the baseline Frustum PointNets by a large margin for all categories, including car, pedestrian and cyclist objects.
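A toy version of such a distance-based criterion is shown below: upsampled points are substituted for the raw frustum points only when the candidate object is far enough away that sparsity hurts. The 30 m switch-over value comes from the reported results; everything else (inputs, function name) is an assumption.

```python
import numpy as np

def select_points(raw_points, upsampled_points, switch_distance=30.0):
    """raw_points / upsampled_points: (N, 3) arrays in the camera frame."""
    centroid_depth = float(np.mean(raw_points[:, 2]))   # forward (z) distance to the object
    return upsampled_points if centroid_depth >= switch_distance else raw_points
```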

Xueqing Wang, Ya-Li Hou, Xiaoli Hao, Yan Shen, Shuai Liu
Nodule Generation of Lung CT Images Using a 3D Convolutional LSTM Network

In the US, the American Cancer Society report for 2020 estimates about 228,820 new lung cancer cases, which could result in 135,720 deaths; this translates to 371 deaths per day, compared with about 1,660 overall daily cancer deaths. The Cancer Society of South Africa (CANSA) reports that lung cancer and other chronic lung diseases are leading causes of death nationally. Research in this area is necessary in order to reduce the number of reported deaths through early detection and diagnosis. A number of studies have used Computed Tomography (CT) image datasets for diagnosis and prognosis by oncologists, radiologists and other medical professionals, and a number of machine learning methods are being developed using convolutional neural networks (CNNs) for feature extraction and binary classification. Only a few studies make use of combined (hybrid) methods, which have shown the capability to increase performance and accuracy in the prediction and detection of early-stage lung cancer. In this paper, a combined model is proposed that uses 3D images as input to a combination of a CNN and a long short-term memory (LSTM) network, a type of recurrent neural network (RNN). Hybridization often leads to an increased need for computational resources; this is addressed by restricting nodule generation to the search space around the lung nodules. The proposed model therefore requires fewer computational resources, avoiding the need to feed the whole 3D CT image into the network: only the region of interest near candidate nodule regions is pre-processed. The results of a previous traditional CNN architecture are compared to this combined 3D convolutional LSTM for nodule generation. In the experiments, the proposed hybrid model outperforms the traditional CNN architecture, which shows how much improvement a hybridization of suitable models can contribute to lung cancer research.

Kolawole Olulana, Pius Owolawi, Chunling Tu, Bolanle Abe
Conditional GAN for Prediction of Glaucoma Progression with Macular Optical Coherence Tomography

The estimation of glaucoma progression is a challenging task, as the rate of disease progression varies among individuals, in addition to other factors such as measurement variability and the lack of standardization in defining progression. Structural tests, such as thickness measurements of the retinal nerve fiber layer or the macula with optical coherence tomography (OCT), are able to detect anatomical changes in glaucomatous eyes. Such changes may be observed before any functional damage. In this work, we built a generative deep learning model using the conditional GAN architecture to predict glaucoma progression over time. The patient's OCT scan is predicted from two or three prior measurements. The predicted images demonstrate high similarity to the ground truth images. In addition, our results suggest that OCT scans obtained from only two prior visits may be sufficient to predict the next OCT scan of the patient after six months.

Osama N. Hassan, Serhat Sahin, Vahid Mohammadzadeh, Xiaohe Yang, Navid Amini, Apoorva Mylavarapu, Jack Martinyan, Tae Hong, Golnoush Mahmoudinezhad, Daniel Rueckert, Kouros Nouri-Mahdavi, Fabien Scalzo
Backmatter
Metadata
Title: Advances in Visual Computing
Edited by: George Bebis, Zhaozheng Yin, Edward Kim, Jan Bender, Kartic Subr, Bum Chul Kwon, Jian Zhao, Denis Kalkofen, George Baciu
Copyright year: 2020
Electronic ISBN: 978-3-030-64559-5
Print ISBN: 978-3-030-64558-8
DOI: https://doi.org/10.1007/978-3-030-64559-5
