
2020 | Book

Advanced Concepts for Intelligent Vision Systems

20th International Conference, ACIVS 2020, Auckland, New Zealand, February 10–14, 2020, Proceedings

Edited by: Jacques Blanc-Talon, Patrice Delmas, Prof. Wilfried Philips, Dan Popescu, Paul Scheunders

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the proceedings of the 20th International Conference on Advanced Concepts for Intelligent Vision Systems, ACIVS 2020, held in Auckland, New Zealand, in February 2020. The 48 papers presented in this volume were carefully reviewed and selected from a total of 78 submissions. They were organized in topical sections named: deep learning; biomedical image analysis; biometrics and identification; image analysis; image restoration, compression and watermarking; and tracking, mapping and scene analysis.

Table of contents

Frontmatter

Deep Learning

Frontmatter
Deep Learning-Based Techniques for Plant Diseases Recognition in Real-Field Scenarios

Deep Learning has solved complicated applications with increasing accuracy over time. The recent interest in this technology, especially in its potential application to agriculture, has powered the growth of efficient systems that solve real problems, such as non-destructive methods for plant anomaly recognition. Despite the advances in the area, performance in real-field scenarios remains limited. To address these issues, our research proposes an efficient solution that provides farmers with a technology that facilitates proper management of crops. We present two efficient deep learning-based techniques for plant disease recognition. The first introduces a practical solution based on a deep meta-architecture and a feature extractor to recognize plant diseases and their location in the image. The second addresses the problems of class imbalance and false positives through a refinement function called Filter Bank. We validate the performance of our methods on our tomato plant diseases and pests dataset, for which we collected the data and designed the annotation process. Qualitative and quantitative results show that, despite the complexity of real-field scenarios, plant diseases are successfully recognized. The insights drawn from our research help to better understand the strengths and limitations of plant disease recognition.

Alvaro Fuentes, Sook Yoon, Dong Sun Park
EpNet: A Deep Neural Network for Ear Detection in 3D Point Clouds

The human ear is full of distinctive features, and its stability under facial expressions and ageing has made it attractive to biometric research communities. Accurate and robust ear detection is one of the essential steps towards biometric systems, substantially affecting the efficiency of the entire identification system. Existing ear detection methods are prone to failure in the presence of typical day-to-day circumstances, such as partial occlusions due to hair or accessories, pose variations, and different lighting conditions. Recently, some researchers have proposed different state-of-the-art deep neural network architectures for ear detection in two-dimensional (2D) images. However, ear detection directly from three-dimensional (3D) point clouds using deep neural networks is still an unexplored problem. In this work, we propose a deep neural network architecture named EpNet for 3D ear detection, which detects ears directly from 3D point clouds. We also propose an automatic pipeline to annotate ears in the profile face images of the public UND J2 data set. The experimental results on the public data show that our proposed method can be an effective solution for 3D ear detection.

Md Mursalin, Syed Mohammed Shamsul Islam
Fire Segmentation in Still Images

In this paper, we propose a novel approach to fire segmentation in still images based on the state-of-the-art semantic segmentation method DeepLabV3. We compiled a data set of 1775 fire images from various sources, for which we created polygon annotations. The data set is augmented with hard non-fire images from the SUN397 data set. The segmentation method trained on our data set achieves better results than the state of the art on the BowFire data set. We believe the created data set (http://www.fit.vutbr.cz/research/view_pub.php.cs?id=12124) will facilitate further development of fire detection and segmentation methods, and that such methods should be based on general-purpose segmentation networks.
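
As a rough illustration of the approach (not the authors' code), the sketch below adapts the off-the-shelf DeepLabV3 model from torchvision to a two-class fire/background task; the input size, batch and head replacement are placeholder assumptions.

```python
# Minimal sketch: adapting a pretrained DeepLabV3 to binary fire
# segmentation (hypothetical setup, not the paper's training code).
import torch
import torch.nn as nn
from torchvision.models.segmentation import deeplabv3_resnet101

model = deeplabv3_resnet101(pretrained=True)  # newer torchvision uses weights=...
# Replace the final classifier convolution: 2 classes (background / fire).
model.classifier[4] = nn.Conv2d(256, 2, kernel_size=1)

images = torch.rand(4, 3, 513, 513)            # dummy batch
masks = torch.randint(0, 2, (4, 513, 513))     # polygon annotations rasterized

out = model(images)["out"]                     # (4, 2, 513, 513) logits
loss = nn.functional.cross_entropy(out, masks)
loss.backward()
```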

Jozef Mlích, Karel Koplík, Michal Hradiš, Pavel Zemčík
Region Proposal Oriented Approach for Domain Adaptive Object Detection

Faster R-CNN has become a standard model in deep-learning-based object detection. However, in many cases, few annotations are available for images in the application domain, referred to as the target domain, whereas full annotations are available for closely related public or synthetic datasets, referred to as source domains. Domain adaptation is thus needed to train a model that performs well in the target domain with few or no annotations from it. In this work, we address this domain adaptation problem for object detection in the case where no annotations are available in the target domain. Most existing approaches consider adaptation at both the global and instance level but do not adapt the region proposal sub-network, leading to a residual domain shift. After a detailed analysis of the classical Faster R-CNN detector, we show that adapting the region proposal sub-network is crucial and propose an original way to do it. We run experiments in two different application contexts, namely autonomous driving and ski-lift video surveillance, and show that our adaptation scheme clearly outperforms previous solutions.
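
The abstract does not detail how the region proposal sub-network is adapted; a common generic mechanism for aligning features across domains is adversarial training through a gradient reversal layer, sketched below on hypothetical RPN features. This is an illustrative stand-in, not the authors' exact scheme.

```python
# Sketch: gradient reversal layer (GRL) plus a small domain classifier
# attached to RPN feature maps for adversarial domain alignment.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reverse (and scale) gradients: the backbone learns
        # domain-invariant features while the head learns to
        # distinguish source from target.
        return -ctx.lambd * grad_output, None

class DomainHead(nn.Module):
    def __init__(self, channels=256, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.net = nn.Sequential(
            nn.Conv2d(channels, 64, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 1))

    def forward(self, rpn_features):
        x = GradReverse.apply(rpn_features, self.lambd)
        return self.net(x)  # logit: source (0) vs. target (1)
```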

Hiba Alqasir, Damien Muselet, Christophe Ducottet
Deep Convolutional Network-Based Framework for Melanoma Lesion Detection and Segmentation

Analysis of skin lesion images is crucial in melanoma detection. Melanoma is a form of skin cancer with a high mortality rate. Both semi- and fully automated systems have been proposed in the recent past for the analysis of skin lesions and detection of melanoma. The performance of these systems has, however, been restricted by the complex visual characteristics of skin lesions. Skin lesion images are characterised by fuzzy borders, low contrast between lesion and background, variability in size and resolution, and the possible presence of noise and artefacts. In this work, an efficient deep learning framework is proposed for melanoma lesion detection and segmentation. The proposed method performs pixel-wise classification of skin lesion images to identify melanoma pixels. The framework employs an end-to-end, pixel-by-pixel learning approach using Deep Convolutional Networks with a softmax classifier. It learns the complex visual characteristics of skin lesions via encoder and decoder subnetworks connected through a series of skip pathways that bring the semantic level of the encoder feature maps closer to that of the decoder feature maps, efficiently handling multi-size, multi-resolution and noisy skin lesion images. The proposed system was evaluated on both the ISBI 2018 and PH2 skin lesion datasets.

Adekanmi Adegun, Serestina Viriri
A Novel Framework for Early Fire Detection Using Terrestrial and Aerial 360-Degree Images

In this paper, in order to contribute to the protection of the value and potential of forest ecosystems and the future of global forests, we propose a novel fire detection framework which combines recently introduced 360-degree remote sensing technology, multidimensional texture analysis and deep convolutional neural networks. Once 360-degree data are obtained, we convert the distorted 360-degree equirectangular projection images to cubemap images. Subsequently, we divide the extracted cubemap images into blocks of two different sizes. This allows us to apply h-LDS multidimensional spatial texture analysis to the larger blocks and then, depending on the probability of fire presence, to the smaller blocks; in this way we aim to accurately identify candidate fire regions while reducing computational time. Finally, the candidate fire regions are fed into a CNN in order to distinguish between fire-coloured objects and actual fire. For evaluating the performance of the proposed framework, a dataset named "360-FIRE", consisting of 100 images with unlimited field of view that contain synthetic fire, was created. Experimental results demonstrate the potential of the proposed framework.

Panagiotis Barmpoutis, Tania Stathaki

Biomedical Image Analysis

Frontmatter
Segmentation of Phase-Contrast MR Images for Aortic Pulse Wave Velocity Measurements

Aortic stiffness is an important diagnostic and prognostic parameter for many diseases and is estimated by measuring the Pulse Wave Velocity (PWV) from Cardiac Magnetic Resonance (CMR) images. However, this process requires combining multiple sequences, which makes the acquisition long and the processing tedious. We propose a method for aorta segmentation and centerline extraction from para-sagittal Phase-Contrast (PC) CMR images. The method uses the order of appearance of the blood flow in PC images to track the aortic centerline from the starting seed position to the ending seed position of the aorta. The only required user interaction is the selection of two input seed points for the start and end positions of the aorta. We validate our results against ground-truth centerlines manually extracted from para-sagittal PC images and anatomical MR images. The resulting measurements of both centerline length and PWV show high accuracy and low variability, which allows for use in a clinical setting. The main advantage of our method is that it requires only a velocity-encoded PC image, while being able to process images encoded in only one direction.

Danilo Babin, Daniel Devos, Ljiljana Platiša, Ljubomir Jovanov, Marija Habijan, Hrvoje Leventić, Wilfried Philips
On the Uncertainty of Retinal Artery-Vein Classification with Dense Fully-Convolutional Neural Networks

Retinal imaging is a valuable tool in diagnosing many eye diseases, and it also offers a direct view of the central nervous system and its blood vessels. Accurate measurement of the characteristics of retinal vessels allows the analysis not only of retinal diseases but also of many systemic diseases like diabetes and other cardiovascular or cerebrovascular diseases. This analysis benefits from precise blood vessel characterization. Automatic machine learning methods are typically trained in a supervised manner, where a training set with ground-truth data is available. Due to the difficulty of precise pixelwise labeling, the question of the reliability of a trained model arises. This paper addresses this question using Bayesian deep learning and extends recent research on the uncertainty quantification of retinal vasculature and artery-vein classification. It is shown that state-of-the-art results can be achieved using the trained model. An analysis of the predictions for cases where the class labels are unavailable is given.
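
A standard recipe for this kind of uncertainty quantification is Monte-Carlo dropout; the sketch below shows the generic idea (the paper's exact Bayesian formulation may differ).

```python
# Sketch: Monte-Carlo dropout at test time to quantify per-pixel
# uncertainty of artery/vein predictions (generic recipe, not
# necessarily the paper's exact procedure).
import torch

def mc_predict(model, image, n_samples=20):
    model.eval()
    # Keep dropout layers stochastic at inference.
    for m in model.modules():
        if isinstance(m, (torch.nn.Dropout, torch.nn.Dropout2d)):
            m.train()
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(image), dim=1)
                             for _ in range(n_samples)])
    mean = probs.mean(dim=0)   # predictive mean per class
    std = probs.std(dim=0)     # per-pixel uncertainty estimate
    return mean, std
```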

Azat Garifullin, Lasse Lensu, Hannu Uusitalo
Object Contour Refinement Using Instance Segmentation in Dental Images

Very accurate detection is required to fit a 3D dental model onto color images for tracking the millimetric displacement of each tooth along orthodontic treatment. Detecting tooth boundaries with high accuracy in these images is a challenging task because of their varying quality and high resolution. By training Mask R-CNN on a very large dataset of 170k images of patients' mouths taken with different mobile devices, we obtain reliable teeth instance segmentation, but the boundaries of each tooth are not accurate enough for dental care monitoring. To address this problem, we propose an efficient method for object contour refinement using instance segmentation (CRIS). Instance segmentation provides high-level information on the location and shape of the object to guide and locally refine the contour detection process. We evaluate CRIS on a large dataset of 600 dental images. Our method significantly improves several state-of-the-art contour detectors: Canny (+32.0% in ODS F-score), gPb (+17.8%), Sketch Tokens (+17.3%), Structured Edge (+12.2%), DeepContour (+15.5%), HED (+2.9%), CEDN (+2.2%), RCF (+2.2%), and achieves the best overall result (ODS F-score of 0.819). CRIS can be used with any contour detection algorithm to refine object contours, making the approach promising for other applications requiring very accurate contour detection.

Trung Van Pham, Yves Lucas, Sylvie Treuillet, Laurent Debraux
Correction of Temperature Estimated from a Low-Cost Handheld Infrared Camera for Clinical Monitoring

The use of low-cost cameras for medical applications has its advantages, as it enables affordable and remote evaluation of health problems; however, accuracy is a limiting factor in their use. Previous studies indicate that parameters of object position, such as camera-to-object distance and angle of view, can be used to improve temperature estimation from thermal cameras. Nevertheless, most studies focus on expensive thermal cameras with good accuracy. In this study, an innovative experimental setup is used to study the errors associated with temperature estimation from a low-cost infrared camera, the FlirOne Gen3. In our experiments, images are acquired from multiple points of view (camera-to-object distances and viewing angles) using a handheld thermal camera. Then, using a regression model, a correction is proposed and tested. The results show that the proposed correction improves temperature estimation and enhances thermal accuracy.

Evelyn Gutierrez, Benjamin Castañeda, Sylvie Treuillet
Bayesian Feature Pyramid Networks for Automatic Multi-label Segmentation of Chest X-rays and Assessment of Cardio-Thoracic Ratio

The cardiothoracic ratio (CTR) estimated from chest radiographs is a marker of cardiomegaly, the presence of which is among the criteria for heart failure diagnosis. Existing methods for automatic assessment of CTR are driven by Deep Learning-based segmentation. However, these techniques produce only point estimates of CTR, while clinical decision making typically requires uncertainty estimates. In this paper, we propose a novel method for chest X-ray segmentation and automatic CTR assessment. In contrast to previous art, we propose, for the first time, to estimate CTR with uncertainty bounds. Our method is based on a Deep Convolutional Neural Network with a Feature Pyramid Network (FPN) decoder. We propose two modifications of FPN: replacing batch normalization with instance normalization and injecting dropout, which makes it possible to obtain Monte-Carlo estimates of the segmentation maps at test time. Finally, using the predicted segmentation mask samples, we estimate CTR with uncertainty. In our experiments we demonstrate that the proposed method generalizes well to three different test sets. Finally, we make the annotations produced by two radiologists for all our datasets publicly available.
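
As an illustration of the final step, the sketch below computes CTR (widest cardiac extent over widest thoracic extent) from sampled segmentation masks and derives an uncertainty interval; the label layout (0 = background, 1 = lungs, 2 = heart) is a hypothetical assumption.

```python
# Sketch: CTR with uncertainty bounds from Monte-Carlo mask samples.
import numpy as np

def ctr_from_mask(mask):
    heart_cols = np.where((mask == 2).any(axis=0))[0]
    thorax_cols = np.where((mask > 0).any(axis=0))[0]
    heart_width = heart_cols.max() - heart_cols.min() + 1
    thorax_width = thorax_cols.max() - thorax_cols.min() + 1
    return heart_width / thorax_width

def ctr_with_uncertainty(mask_samples):
    # mask_samples: label maps from repeated MC-dropout forward passes.
    ctrs = np.array([ctr_from_mask(m) for m in mask_samples])
    lo, hi = np.percentile(ctrs, [2.5, 97.5])
    return ctrs.mean(), (lo, hi)   # point estimate and 95% interval
```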

Roman Solovyev, Iaroslav Melekhov, Timo Lesonen, Elias Vaattovaara, Osmo Tervonen, Aleksei Tiulpin
Deep-Learning for Tidemark Segmentation in Human Osteochondral Tissues Imaged with Micro-computed Tomography

Three-dimensional (3D) semi-quantitative grading of pathological features in articular cartilage (AC) offers significant improvements in basic research on osteoarthritis (OA). We have earlier developed a 3D protocol for imaging AC and its structures, which includes staining the sample with a contrast agent (phosphotungstic acid, PTA) and subsequent scanning with micro-computed tomography. The protocol was designed to provide X-ray attenuation contrast to visualize AC structure. However, it has one major disadvantage: the loss of contrast at the tidemark (calcified cartilage interface, CCI). Accurate segmentation of the CCI can be very important for understanding the etiology of OA and for ex-vivo evaluation of tidemark condition at early OA stages. In this paper, we present the first application of Deep Learning to PTA-stained osteochondral samples that makes it possible to perform tidemark segmentation in a fully automatic manner. Our method is based on U-Net trained using a combination of binary cross-entropy and soft-Jaccard losses. On cross-validation, this approach yielded intersection over union of 0.59, 0.70, 0.79, 0.83 and 0.86 within 15 µm, 30 µm, 45 µm, 60 µm and 75 µm padded zones around the tidemark, respectively. Our codes and the dataset of 35 PTA-stained human AC samples are made publicly available together with the segmentation masks to facilitate the development of biomedical image segmentation methods.
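
The loss named in the abstract can be written compactly; below is a minimal sketch of a combined binary cross-entropy and soft-Jaccard loss in PyTorch, with the balance between the two terms as an assumption.

```python
# Sketch: BCE + soft-Jaccard loss for binary segmentation.
import torch

def bce_soft_jaccard_loss(logits, target, alpha=0.5, eps=1e-7):
    bce = torch.nn.functional.binary_cross_entropy_with_logits(logits, target)
    probs = torch.sigmoid(logits)
    intersection = (probs * target).sum()
    union = probs.sum() + target.sum() - intersection
    soft_jaccard = (intersection + eps) / (union + eps)
    # Maximizing IoU is equivalent to minimizing its log-complement.
    return alpha * bce - (1 - alpha) * torch.log(soft_jaccard)
```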

Aleksei Tiulpin, Mikko Finnilä, Petri Lehenkari, Heikki J. Nieminen, Simo Saarakkala
Quadratic Tensor Anisotropy Measures for Reliable Curvilinear Pattern Detection

A wide range of applications requires the analysis of biomedical images as a fundamental task for extracting meaningful information and enabling high-throughput measurements. We present a new method for the detection of curve-like structures in biomedical images that exploits local phase vectors and structural anisotropy information at various directions. We introduce an oriented Gaussian derivative quadrature filter, chosen not only for estimating the local phase vectors, which include line features, but also for its immunity to inhomogeneous intensity and its capability to enhance curved structures of various diameters, leading to more reliable Hessian analysis. A novel Hessian-tensor-based measure function is proposed to detect curvilinear patterns by incorporating the anisotropic indices (coherence and linearity) of curved features, producing a uniform and strong response. The responses are maximized over multiple orientations to achieve rotational invariance and to detect target structures with different widths and illuminations. Evaluation of the proposed method on the extraction of retinal vessels and leaf venation patterns shows its superior performance against state-of-the-art methods.

Mohsin Challoob, Yongsheng Gao

Biometrics and Identification

Frontmatter
Exposing Presentation Attacks by a Combination of Multi-intrinsic Image Properties, Convolutional Networks and Transfer Learning

Nowadays, the adoption of face recognition for biometric authentication systems is widespread, mainly because the face is one of the most accessible biometric characteristics. Techniques intended to deceive these systems using a forged biometric sample, such as a printed paper or a recorded video of a genuine access, are known as presentation attacks. Presentation attack detection is a crucial step in preventing this kind of unauthorized access to restricted areas or devices. In this paper, we propose a new method that combines intrinsic image properties with deep neural networks to detect presentation attack attempts. Exploring depth, salience and illumination properties along with a Convolutional Neural Network, the proposed method produces robust and discriminative features, which are then classified to detect presentation attack attempts. In a very challenging cross-dataset scenario, the proposed method outperforms state-of-the-art methods on two of three evaluated datasets.

Rodrigo Bresan, Carlos Beluzo, Tiago Carvalho
Multiview 3D Markerless Human Pose Estimation from OpenPose Skeletons

Although marker-based systems for human motion estimation provide very accurate tracking of the human body joints (at mm precision), these systems are often intrusive or even impossible to use depending on the circumstances, e.g. markers cannot be put on an athlete during competition. Instrumenting an athlete with the appropriate number of markers requires a lot of time, and markers may fall off during the analysis, leading to incomplete data, new data capturing sessions and hence a waste of time and effort. We therefore present a novel multiview video-based markerless system that uses 2D joint detections per view (from OpenPose) to estimate their corresponding 3D positions, while tackling the people association problem in the process to allow tracking of multiple persons at the same time. Our proposed system performs the tracking in real time at 20–25 fps. Our results show a standard deviation between 9.6 and 23.7 mm for the lower body joints based on the raw measurements only. After filtering the data, the standard deviation drops to a range between 6.6 and 21.3 mm. Our proposed solution can be applied to a large number of applications, ranging from sports analysis to virtual classrooms, where submillimeter precision is not necessarily required but the use of markers is impractical.
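
At the core of any such system is multi-view triangulation of per-view 2D detections; a minimal two-view sketch using OpenCV's DLT-based triangulation is shown below (the paper's multi-person association logic is out of scope here).

```python
# Sketch: triangulating one joint's 3D position from 2D OpenPose
# detections in two calibrated views.
import cv2
import numpy as np

def triangulate_joint(P1, P2, xy1, xy2):
    """P1, P2: 3x4 camera projection matrices; xy1, xy2: 2D joint (pixels)."""
    pts1 = np.asarray(xy1, dtype=np.float64).reshape(2, 1)
    pts2 = np.asarray(xy2, dtype=np.float64).reshape(2, 1)
    X_h = cv2.triangulatePoints(P1, P2, pts1, pts2)  # homogeneous 4x1
    return (X_h[:3] / X_h[3]).ravel()                # Euclidean 3D point
```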

Maarten Slembrouck, Hiep Luong, Joeri Gerlo, Kurt Schütte, Dimitri Van Cauwelaert, Dirk De Clercq, Benedicte Vanwanseele, Peter Veelaert, Wilfried Philips
Clip-Level Feature Aggregation: A Key Factor for Video-Based Person Re-identification

In the task of video-based person re-identification, features of persons in the query and gallery sets are compared to search for the best match. Most existing methods aggregate frame-level features using a temporal method to generate clip-level features, instead of sequence-level representations. In this paper, we propose a new method that aggregates clip-level features to obtain sequence-level representations of persons, consisting of two parts: an Average Aggregation Strategy (AAS) and Raw Feature Utilization (RFU). AAS makes use of all frames in a video sequence to generate a better representation of a person, while RFU investigates how the batch normalization operation influences feature representations in person re-identification. The experimental results demonstrate that our method can boost the performance of existing models for better accuracy. In particular, we achieve 87.7% rank-1 and 82.3% mAP on the MARS dataset without any post-processing, which outperforms the existing state of the art.

Chengjin Lyu, Patrick Heyer-Wollenberg, Ljiljana Platisa, Bart Goossens, Peter Veelaert, Wilfried Philips
Towards Approximating Personality Cues Through Simple Daily Activities

The goal of this work is to investigate the potential of using simple activity and motion patterns in a smart environment to approximate personality cues via machine learning techniques. Towards this goal, we present a novel framework for personality recognition, inspired by both Computer Vision and Psychology. Results show a correlation between several behavioral features and personality traits, as well as insights into which types of everyday tasks induce stronger personality display. We experiment with Support Vector Machines, Random Forests and Gaussian Process classification, achieving promising predictive ability for personality traits. The obtained results show consistency to a good degree, opening the path for applications in psychology, the game industry, ambient assisted living, and other fields.

Francesco Gibellini, Sebastiaan Higler, Jan Lucas, Migena Luli, Morris Stallmann, Dario Dotti, Stylianos Asteriadis
Person Identification by Walking Gesture Using Skeleton Sequences

When coping with the person identification problem, previous approaches either directly take raw RGB as input or use more sophisticated devices to capture other information. However, most of these approaches are sensitive to changes in environment and clothing; small variations may lead to identification failure. Recent research shows that "gait" (i.e., a person's manner of walking) is a unique trait of a human being. Motivated by this, we propose a novel method to identify people by their gaits. To capture the characteristics of individual gait, we utilize skeletal information, which is more robust to variations in environment and appearance. To use the skeletal data effectively, we analyze the spatial relationships of joints and transform the 3D skeleton coordinates into relative distances and angles between joints, and then use a bidirectional long short-term memory network to explore the temporal information of the skeleton sequences. Results show that our proposed method outperforms previous methods on the BIWI and IAS-Lab datasets with a 10.33% accuracy improvement on average.
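
A minimal sketch of this pipeline follows: skeleton sequences are converted to view-invariant pairwise-distance features (the paper also uses joint angles, omitted here for brevity) and classified with a bidirectional LSTM; the layer sizes are illustrative assumptions.

```python
# Sketch: skeleton-sequence gait features + BiLSTM identity classifier.
import torch
import torch.nn as nn

def skeleton_to_features(seq):
    """seq: (T, J, 3) joint coordinates -> (T, J*(J-1)/2) invariant features."""
    T, J, _ = seq.shape
    iu = torch.triu_indices(J, J, offset=1)
    feats = []
    for t in range(T):
        d = torch.cdist(seq[t], seq[t])        # pairwise joint distances
        feats.append(d[iu[0], iu[1]])          # keep the upper triangle
    return torch.stack(feats)

class GaitBiLSTM(nn.Module):
    def __init__(self, in_dim, n_ids, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_ids)

    def forward(self, x):            # x: (B, T, F)
        out, _ = self.lstm(x)
        return self.fc(out[:, -1])   # identity logits from the last step
```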

Chu-Chien Wei, Li-Huang Tsai, Hsin-Ping Chou, Shih-Chieh Chang
Verifying Kinship from RGB-D Face Data

We present a kinship verification (KV) approach based on Deep Learning applied to RGB-D facial data. To work around the lack of an adequate 3D face database with kinship annotations, we provide an online platform where participants upload videos containing their own faces and those of their relatives. These videos are captured with ordinary smartphone cameras. We process them to reconstruct the recorded faces in three-dimensional space, generating a normalized dataset which we call Kin3D. We also combine depth information from the normalized 3D reconstructions with 2D images, composing a set of RGB-D data. Following approaches from related works, images are organized into four categories according to their respective type of kinship. For classification, we use a Convolutional Neural Network (CNN) and a Support Vector Machine (SVM) for comparison. The CNN was tested both on a widely used 2D kinship verification database (KinFaceW-I and II) and on our Kin3D for comparison with related works. Results indicate that adding depth information improves the model's performance, increasing classification accuracy up to 90%. To the best of our knowledge, this is the first database containing depth information for kinship verification. We provide a baseline performance to stimulate further evaluation by the research community.

Felipe Crispim, Tiago Vieira, Bruno Lima
VA-StarGAN: Continuous Affect Generation

Recent advances in Generative Adversarial Networks have shown impressive results for the task of facial affect synthesis. The most successful architecture is StarGAN, which is effective but can only generate a discrete number of expressions. However, dimensional emotion representations, usually valence (indicating how positive or negative an emotional state is) and arousal (measuring the power of the emotion activation), are more appropriate for representing the subtle emotions appearing in everyday human-computer interactions. In this paper, we adapt StarGAN for continuous emotion synthesis and propose VA-StarGAN: we use a correlation-based loss instead of the usual MSE; we adapt the discriminator network to account for continuous output; we exploit the in-the-wild Aff-Wild and AffectNet databases; and we propose a trick for generating the target domain when training the generator. Qualitative experiments illustrate the generation of realistic images, whilst comparison with state-of-the-art approaches shows the superiority of our method. Quantitative experiments (in which the synthesized images are used for data augmentation when training Deep Neural Networks) further validate our approach.

Dimitrios Kollias, Stefanos Zafeiriou
Fast Iris Segmentation Algorithm for Visible Wavelength Images Based on Multi-color Space

Iris recognition for eye images acquired in visible wavelengths is receiving increasing attention. In visible-wavelength environments, many factors may cover or affect the iris region, which makes the iris segmentation step more difficult and challenging. In this paper, we propose a novel and fast segmentation algorithm for eye images acquired in visible-wavelength environments by considering color information from multiple color spaces. Existing color spaces such as RGB, YCbCr, and HSV are analyzed, and an appropriate set of color models is selected for the segmentation process. To accurately localize the iris region, a set of convenient techniques is applied to detect and remove non-iris regions such as the pupil, specular reflections, eyelids, and eyelashes. Our experimental results and comparative analysis on the UBIRIS v2 database demonstrate the efficiency of our approach in terms of segmentation accuracy and execution time.

Shaaban Sahmoud, Hala N. Fathee
A Local Flow Phase Stretch Transform for Robust Retinal Vessel Detection

This paper presents a new method for reliably detecting the retinal vessel tree using a local flow phase stretch transform (LF-PST). A local flow evaluator is proposed to increase the local contrast and the coherence of the local orientation of the vessel tree. This is achieved by incorporating information about the local structure and direction of vessels, estimated by introducing a second curvature moment evaluation matrix (SCMEM). The SCMEM evaluates vessel patterns as the only features having linearly coherent curvature. We present an oriented phase stretch transform to capture retinal vessels of various diameters and directions. The proposed method exploits the phase angle of the transform, which includes the structural features of lines and curved patterns. The LF-PST produces several phase maps, in which the vessel structure is characterized along various directions. To produce an orientation-invariant response, all phases are linearly combined. The proposed method is tested on the publicly available DRIVE and IOSTAR databases with different imaging modalities and achieves encouraging segmentation results, outperforming state-of-the-art benchmark methods.

Mohsin Challoob, Yongsheng Gao
Evaluation of Unconditioned Deep Generative Synthesis of Retinal Images

Retinal images have become increasingly important in the clinical diagnostics of several eye and systemic diseases. To assist medical doctors in this work, automatic and semi-automatic diagnosis methods can be used to increase the efficiency of diagnostic and follow-up processes, as well as to enable wider disease screening programs. However, training advanced machine learning methods for improved retinal image analysis typically requires large and representative retinal image data sets. Even when large data sets of retinal images are available, the occurrence of different medical conditions within them is unbalanced. Hence, there is a need to enrich existing data sets by data augmentation and by introducing noise that is essential for building robust and reliable machine learning models. One way to overcome these shortcomings relies on generative models for synthesizing images. To study the limits of retinal image synthesis, this paper focuses on deep generative models, including a generative adversarial network and a variational autoencoder, that synthesize images from noise without conditioning on any information regarding the retina. The models are trained with the Kaggle EyePACS retinal image set, and to quantify the image quality in a no-reference manner, the generated images are compared with the retinal images of the DiaRetDB1 database using common similarity metrics.

Sinan Kaplan, Lasse Lensu, Lauri Laaksonen, Hannu Uusitalo

Image Analysis

Frontmatter
Dynamic Texture Representation Based on Hierarchical Local Patterns

A novel and effective operator, named HIerarchical LOcal Pattern (HILOP), is proposed to efficiently exploit relationships between local neighbors in a pair of adjacent hierarchical regions located around a center pixel of a textural image. Instead of being thresholded by the value of the central pixel as usual, the gray-scale value of a local neighbor in one hierarchical area is compared to that of all neighbors in the other region. In order to capture shape and motion cues for dynamic texture (DT) representation, HILOP is applied to investigate hierarchical relationships in the plane-images of a DT sequence. The obtained histograms are then concatenated to form a robust descriptor with high performance on the DT classification task. Experimental results on various benchmark datasets have validated the effectiveness of our proposal.

Thanh Tuan Nguyen, Thanh Phuong Nguyen, Frédéric Bouchara
Temporal-Clustering Based Technique for Identifying Thermal Regions in Buildings

Nowadays, moisture and thermal leaks in buildings are detected manually by an operator, who roughly delimits the critical regions in thermal images. The use of artificial intelligence (AI) techniques can, however, greatly improve manual thermal analysis by automatically providing more precise and objective results. This paper presents a temporal-clustering-based technique that carries out the segmentation of a set of thermal orthoimages (STO) of a wall taken at different times. The algorithm has two stages: region labelling and consensus. In order to delimit regions with similar temporal temperature variation, three clustering algorithms are applied to the STO, yielding three respective labelled images. In the second stage, a consensus algorithm is applied to the labelled images. The method thus delimits regions with different thermal evolutions over time, each characterized by a temperature consensus vector. The approach has been tested on real scenes using a 3D thermal scanner. A case study, composed of 48 thermal orthoimages taken at 30-minute intervals over 24 hours, is presented.

Antonio Adán, Juan García, Blanca Quintana, Francisco J. Castilla, Víctor Pérez
Distance Weighted Loss for Forest Trail Detection Using Semantic Line

Unlike structured urban roads, forest trails do not have a defined shape or appearance and have ambiguous boundaries, making them challenging to detect. In this work, we propose to train a deep convolutional encoder-decoder network with a novel distance-weighted loss function for end-to-end learning of unstructured forest trails. The forest trail is annotated with a "semantic line" representing the trail, and an L1 distance map is derived from the binarized ground truth. We use this distance map to weight the loss function, guiding the focus of the network onto the forest trail. The proposed loss function penalizes low activations around the ground truth and high activations in areas further away from the trail. It is compared against other commonly used loss functions by evaluating performance on the publicly available IDSIA forest trail dataset. The proposed method leads to higher trail detection accuracy, with 2.52%, 4.69% and 8.18% improvements in mean intersection over union (mIoU) over mean squared error, Jaccard loss and cross-entropy, respectively.
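
A minimal sketch of such a distance-weighted loss follows, assuming a simple scheme that up-weights the annotated line and far background; the paper's exact weighting function is not given in the abstract.

```python
# Sketch: L1 (taxicab) distance map from the binarized "semantic line"
# ground truth, used to re-weight a pixelwise BCE loss.
import numpy as np
import torch
from scipy.ndimage import distance_transform_cdt

def distance_weights(gt_mask, near_weight=2.0):
    """gt_mask: (H, W) binary trail annotation (0/1) -> (H, W) weights."""
    # Taxicab distance of every background pixel to the trail.
    dist = distance_transform_cdt(1 - gt_mask, metric='taxicab')
    d = dist.astype(np.float32) / max(dist.max(), 1)
    # High weight on the line itself (penalizes misses there) and growing
    # weight with distance (penalizes spurious activations far away);
    # the ambiguous band right next to the line gets low weight.
    return np.where(gt_mask > 0, near_weight, d)

def weighted_bce(logits, target, weights):
    w = torch.as_tensor(weights, dtype=logits.dtype, device=logits.device)
    loss = torch.nn.functional.binary_cross_entropy_with_logits(
        logits, target, reduction='none')
    return (w * loss).mean()
```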

Shyam Prasad Adhikari, Hyongsuk Kim
Localization of Map Changes by Exploiting SLAM Residuals

Simultaneous Localization and Mapping is widespread in both robotics and autonomous driving. This paper proposes a novel method to identify changes in maps constructed by SLAM algorithms without feature-to-feature comparison. We use ICP-like algorithms to match frames and pose graph optimization to solve the SLAM problem. Finally, we analyze the residuals to localize possible alterations of the map. The concept was tested with 2D LIDAR SLAM problems in simulated and real-life cases.

Zoltan Rozsa, Marcell Golarits, Tamas Sziranyi
Initial Pose Estimation of 3D Object with Severe Occlusion Using Deep Learning

During the last decade, augmented reality (AR) has gained explosive attention and demonstrated high potential in educational and training applications. As a core technique, AR requires a tracking method to obtain the 3D pose of a camera or an object. Hence, providing fast, accurate, robust, and consistent tracking methods has been a main research topic in the AR field. Fortunately, tracking the camera pose using a relatively small and less-textured known object placed in the scene has been successfully mastered through various types of model-based tracking (MBT) methods. However, MBT methods require a good initial camera pose estimator, and estimating an initial camera pose from partially visible objects remains an open problem. Severe occlusions likewise challenge initial camera pose estimation. Thus, in this paper, we propose a deep learning method to estimate an initial camera pose from a partially visible object that may also be severely occluded. The proposed method handles such challenging scenarios by relying on information from detected subparts of the target object to be tracked. Specifically, we first detect subparts of the target object using a state-of-the-art convolutional neural network (CNN). The object detector returns two-dimensional bounding boxes, associated classes, and confidence scores. We then use the bounding box and class information to train a deep neural network (DNN) that regresses to the camera's 6-DoF pose. After initial pose estimation, we use a tweaked version of an existing MBT method to keep tracking the target object in real time on a mobile platform. Experimental results demonstrate that the proposed method can accurately estimate initial camera poses from objects that are partially visible and/or severely occluded. Finally, we analyze the performance of the proposed method in more detail by comparing the estimation errors when different numbers of subparts are detected.

Jean-Pierre Lomaliza, Hanhoon Park
Automatic Focal Blur Segmentation Based on Difference of Blur Feature Using Theoretical Thresholding and Graphcuts

Focal blur segmentation is one of the interesting topics in computer vision. With recent improvements in camera devices, multiple focal blur images with different focal settings can be obtained in a single shot. Utilizing the information of multiple focal blur images is expected to improve segmentation performance. We propose an automatic focal blur segmentation method that uses a pair of focal blur images with different focal settings. Difference-of-blur features can be obtained from an image pair focused on an object and on the background, respectively. A theoretical threshold separates the object and background in the difference-of-blur feature space. The proposed method consists of (i) theoretical thresholding in the blur feature space and (ii) energy minimization based on Graphcuts using color and blur features. We evaluate the proposed method using 12 image pairs of single objects and 48 image pairs of flowers. In this evaluation, the averaged Informedness of the initial and final segmentations is 0.897 and 0.972 for the single-object images, and 0.730 and 0.827 for the flower images, respectively.

Natsuki Takayama, Hiroki Takahashi
Feature Map Augmentation to Improve Rotation Invariance in Convolutional Neural Networks

Whilst it is a trivial task for the human vision system to recognize and detect objects with good accuracy, making computer vision algorithms achieve the same feat remains an active area of research. A human vision system recognizes objects seen once with high accuracy despite alterations to their appearance by various transformations such as rotations, translations, scale, distortions and occlusion, making it a state-of-the-art spatially invariant biological vision system. To make computer algorithms such as Convolutional Neural Networks (CNNs) spatially invariant, one popular practice is to introduce variations in the data set through data augmentation. This achieves good results but comes with increased computational cost. In this paper, we address rotation transformations and, instead of using data augmentation, propose a novel method that allows CNNs to improve rotation invariance by augmenting feature maps. This is achieved by creating a rotation transformer layer called the Rotation Invariance Transformer (RiT) that can be placed at the output end of a convolution layer. Incoming features are rotated by a given set of rotation parameters, which are then passed to the next layer. We test our technique on the benchmark CIFAR10 and MNIST datasets in a setting where our RiT layer is placed between the feature extraction and classification layers of the CNN. Our results show promising improvements in the network's ability to be rotation invariant across classes, with no increase in model parameters.
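
A minimal sketch of a rotation-transformer layer in this spirit, restricted to exact 90-degree rotations for simplicity; how RiT combines the rotated copies is an assumption here, not the paper's stated design.

```python
# Sketch: augmenting feature maps with rotated copies inside the network.
import torch
import torch.nn as nn

class RotationTransformer(nn.Module):
    def __init__(self, quarter_turns=(1, 2, 3)):
        super().__init__()
        self.quarter_turns = quarter_turns

    def forward(self, x):                     # x: (B, C, H, W) feature maps
        # torch.rot90 over the spatial dims keeps the operation exact
        # and parameter-free.
        rotated = [x] + [torch.rot90(x, k, dims=(2, 3))
                         for k in self.quarter_turns]
        return torch.cat(rotated, dim=0)      # stacked along the batch axis
```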

Dinesh Kumar, Dharmendra Sharma, Roland Goecke
Automatic Optical Inspection for Millimeter Scale Probe Surface Stripping Defects Using Convolutional Neural Network

Surface defect inspection is a crucial step in the production process of IC probes. The traditional way of identifying defective IC probes relies mostly on human visual examination through a microscope screen. However, this approach is affected by subjective factors and misjudgments of inspectors, and its accuracy and efficiency are not sufficiently stable. Therefore, we propose an automatic optical inspection system that incorporates the ResNet-101 deep learning architecture into the faster region-based convolutional neural network (Faster R-CNN) to detect stripping-gold defects on the IC probe surface. The training samples were collected through our multi-function investigation platform IMSLAB. To circumvent the challenge of insufficient images in our datasets, we introduce data augmentation using cycle generative adversarial networks (CycleGAN). The proposed system was evaluated on 133 probes. The experimental results revealed that our method attains high accuracy in stripping defect detection: the overall mean average precision (mAP) was 0.732, and the defective IC probe classification accuracy was 97.74%.

Yu-Chieh Ting, Daw-Tung Lin, Chih-Feng Chen, Bor-Chen Tsai

Image Restoration, Compression and Watermarking

Frontmatter
A SVM-Based Zero-Watermarking Technique for 3D Videos Traitor Tracing

The watermarking layer has a crucial role in a collusion-secure fingerprinting framework, since the hidden information, or identifier, directly attached to user identification, is implanted in the media as a watermark. In this paper, we propose a new zero-watermarking technique for 3D videos based on a Support Vector Machine (SVM) classifier. The proposed scheme makes two major contributions. The first is the protection of both the 2D video frames and the depth maps, simultaneously and independently: robust features are extracted from Temporally Informative Representative Images (TIRIs) of both the 2D video frames and the depth maps to construct the master shares, and the relationship between the identifier and the extracted master shares is generated by performing an Exclusive OR (XOR) operation. The second contribution uses the SVM and the XOR operation to estimate the watermark. Compared to other zero-watermarking techniques, the proposed scheme achieves good robustness and transparency even for long watermarks, which makes it suitable for a traitor tracing framework.
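
The XOR step is simple to state in code; the sketch below shows master/owner share construction and identifier recovery, with the robust feature extraction from the TIRIs abstracted away (the random arrays are stand-ins).

```python
# Sketch: the XOR relationship at the heart of zero-watermarking.
# The media itself is never modified; only shares are stored.
import numpy as np

def make_owner_share(master_share_bits, identifier_bits):
    return np.bitwise_xor(master_share_bits, identifier_bits)

def recover_identifier(master_share_bits, owner_share_bits):
    # XOR is its own inverse: (M ^ ID) ^ M == ID.
    return np.bitwise_xor(owner_share_bits, master_share_bits)

master = (np.random.rand(256) > 0.5).astype(np.uint8)  # stand-in features
ident = (np.random.rand(256) > 0.5).astype(np.uint8)   # user identifier
owner = make_owner_share(master, ident)
assert np.array_equal(recover_identifier(master, owner), ident)
```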

Karama Abdelhedi, Faten Chaabane, Chokri Ben Amar
Design of Perspective Affine Motion Compensation for Versatile Video Coding (VVC)

The fundamental motion model of conventional block-based motion compensation in High Efficiency Video Coding (HEVC) is a translational motion model. However, in the real world, the motion of an object is a combination of many kinds of motion. In Versatile Video Coding (VVC), block-based 4-parameter and 6-parameter affine motion compensation (AMC) is being applied, but AMC is still limited in accurately capturing the complex motions of natural video. In this paper, we design a perspective affine motion compensation (PAMC) method which improves coding efficiency while maintaining low computational complexity compared with existing AMC. Because the perspective motion model does not constrain a block to remain a rectangle of fixed shape, the proposed PAMC shows effective encoding performance particularly for test sequences containing irregular object distortions or rapid dynamic motions. Our proposed algorithm is implemented on VTM 2.0. The experimental results show that the BD-rate reduction of the proposed technique reaches up to 0.30%, 0.76%, and 0.04% for the random access (RA) configuration and 0.45%, 1.39%, and 1.87% for the low delay P (LDP) configuration on the Y, U, and V components, respectively. Meanwhile, the increase in encoding complexity is within an acceptable range.
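
For reference, the 4-parameter affine model used by VVC's AMC derives each sample's motion vector from two control-point motion vectors, mv_0 (top-left) and mv_1 (top-right), of a block of width W; the 6-parameter model adds a third control point at the bottom-left. This is the standard VVC formulation as best recalled; the paper's perspective extension is not reproduced here.

```latex
% 4-parameter affine motion model (two control-point MVs, block width W).
\begin{aligned}
mv_x(x,y) &= \frac{mv_{1x}-mv_{0x}}{W}\,x \;-\; \frac{mv_{1y}-mv_{0y}}{W}\,y \;+\; mv_{0x},\\[2pt]
mv_y(x,y) &= \frac{mv_{1y}-mv_{0y}}{W}\,x \;+\; \frac{mv_{1x}-mv_{0x}}{W}\,y \;+\; mv_{0y}.
\end{aligned}
```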

Young-Ju Choi, Young-Woon Lee, Byung-Gyu Kim
Investigation of Coding Standards Performances on Optically Acquired and Synthetic Holograms

Digital holography needs efficient coding tools that facilitate storage and transmission of this type of data in order to reach practical applications. This paper presents an experimental analysis of the performance of different coding tools for the compression of digital holograms. In the experiments, a dedicated compression architecture is employed to transform the holographic data into a representation suitable for the encoders and to perform an objective quality evaluation of the obtained results. Several state-of-the-art image and video codecs are evaluated on different reference datasets comprising different types of digital holograms. The evaluation is carried out on the reconstructed images with different metrics, and the obtained results are critically analyzed and discussed.

Roberto Corda, Cristian Perra, Daniele Giusto
Natural Images Enhancement Using Structure Extraction and Retinex

Variational Retinex model-based methods for low-light image enhancement have been widely studied in recent years. In this paper, we present an enhanced variational Retinex method for low-light natural image enhancement, based on an initial smoother illumination component obtained with a structure extraction technique. A Bregman splitting algorithm is then introduced to estimate the illumination and reflectance components. De-blocking and illumination component correction are applied to the enhanced reflectance to produce the final enhanced image. Moreover, the estimated smoother illumination component allows enhanced images to preserve edge details. Experimental results and comparisons demonstrate that the presented variational Retinex method can effectively enhance image quality and maintain image color.

Xiaoyu Du, Youshen Xia
Unsupervised Desmoking of Laparoscopy Images Using Multi-scale DesmokeNet

The presence of surgical smoke in laparoscopic surgery reduces the visibility of the operative field. In order to ensure better visualization, this paper proposes an unsupervised deep learning approach for desmoking laparoscopic images. The network builds upon generative adversarial networks (GANs) and converts laparoscopic images from the smoke domain to the smoke-free domain. It comprises a new generator architecture with an encoder-decoder structure composed of multi-scale feature extraction (MSFE) blocks at each encoder block. The MSFE blocks capture features at multiple scales to obtain a robust deep representation map and help reduce the smoke component in the image. Further, a structure-consistency loss is introduced to preserve structure in the desmoked images. The proposed network, called Multi-scale DesmokeNet, has been evaluated on laparoscopic images obtained from the Cholec80 dataset. The quantitative and qualitative results show the efficacy of the proposed Multi-scale DesmokeNet in comparison with other state-of-the-art desmoking methods.

V. Vishal, Varun Venkatesh, Kshetrimayum Lochan, Neeraj Sharma, Munendra Singh
VLW-Net: A Very Light-Weight Convolutional Neural Network (CNN) for Single Image Dehazing

Camera imaging is one of the most important application areas of computer image and video processing. However, computational cost is usually the main reason preventing many state-of-the-art image processing algorithms from being applied in practical applications, including camera imaging. This paper proposes a very light-weight end-to-end CNN (VLW-Net) for single-image haze removal. We propose a new Inception structure; by combining it with a reformulated atmospheric scattering model, the proposed network is at least six times more light-weight than the state of the art. We conduct experiments on both synthesized and realistic hazy image datasets, and the results demonstrate superior performance in terms of network size, PSNR, SSIM and subjective image quality. Moreover, the proposed network can be seamlessly applied to underwater image enhancement, where we observe a clear improvement over the state of the art.
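
For context, this is the standard atmospheric scattering model on which dehazing networks, including reformulated variants like the one named in this abstract, are built; the paper's specific reformulation is not detailed here.

```latex
% I: observed hazy image, J: haze-free scene radiance, A: global
% atmospheric light, t(x) = e^{-\beta d(x)}: transmission, decaying
% with scene depth d. Dehazing inverts the first equation.
I(x) = J(x)\,t(x) + A\bigl(1 - t(x)\bigr),
\qquad
J(x) = \frac{I(x) - A}{t(x)} + A .
```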

Chenguang Liu, Li Tao, Yeong-Taeg Kim
An Improved GAN Semantic Image Inpainting

Image inpainting fills in missing regions based on the remaining image data. Although existing methods that use deep generative models to infer the missing content produce realistic images, the results are sometimes unsatisfactory due to numerical issues caused by unbalanced terms in the proposed cost functions. In this paper, we propose a loss that generates more plausible results. Experiments on two datasets show that our method predicts information in large missing regions and achieves pixel-level photorealism, significantly outperforming the state-of-the-art methods [24] and [25]. Having improved semantic image inpainting, we apply the method to laparoscopic images that suffer from glare. The modified technique again outperforms its rivals. Moreover, it is faster than classical PDE-based inpainting techniques and, more importantly, its running time is almost independent of the size of the missing area, both critical issues in medical image processing.

Panagiotis-Rikarnto Siavelis, Nefeli Lamprinou, Emmanouil Z. Psarakis

Tracking, Mapping and Scene Analysis

Frontmatter
CUDA Implementation of a Point Cloud Shape Descriptor Method for Archaeological Studies

In this work we present a new approach to studying shape descriptors of archaeological objects using an implementation of the smoothed-points shape descriptor (SPSD) method, which is based on the mesh-free numerical simulation method smoothed-particle hydrodynamics. SPSD can describe the textural or morphological properties of a surface by obtaining a property field descriptor based on the point shape descriptors and a smoothing function over a neighborhood of each point. The neighborhood size depends on a smoothing distance function, which drives the field descriptor to focus either on small local details or on larger details over big surfaces. SPSD is designed to provide real-time scientific visualization of point cloud shape descriptors to assist in the field study of archaeological artifacts. It also has the potential to provide quantitative values (e.g. morphological properties) for artifact analysis and classification (computational and archaeological). Due to the real-time visualization requirement, SPSD is implemented in CUDA using an octree to resolve the neighborhood particle interactions for each point of the cloud.

David Arturo Soriano Valdez, Patrice Delmas, Trevor Gee, Patricio Gutierrez, Jose Luis Punzo-Diaz, Rachel Ababou, Alfonso Gastelum Strozzi
Red-Green-Blue Augmented Reality Tags for Retail Stores

In this paper, we introduce a new Augmented Reality (AR) tag to enhance detection rates, accuracy and user experience in marker-based AR technologies. The tag is a colour-printed card divided into three colour channels (red, blue, and green) that label three components: (1) an oriented marker, (2) a bar-code and (3) a graphic image, respectively. The oriented marker is used for tag detection and orientation identification, the bar-code stores and retrieves numerical information (IDs of the models), and the texture image provides users with a view of what the tag displays. When our new AR tags are placed in front of the camera, the corresponding 3D graphics (models of figures or products) appear directly on top of them. We can also rotate the tags to rotate the 3D graphics, and move the camera to zoom in/out or view them from a different angle. The embedded bar-code can be a 1D or 2D bar-code; the currently popular QR code can be used. Conveniently, QR codes include position detection patterns that can be used to identify the orientation of the code; thus, the oriented marker is not needed for QR codes, and one channel is freed up for presenting the initially displayed image. Experiments have been carried out to assess the robustness of the proposed tags. The results show that our tags and their orientations (marker stored in the blue colour channel) are relatively easy to detect using commodity webcams. The embedded QR code (painted in blue) is readable in most test cases; compared to an ordinary black-and-white QR tag, our embedded QR code has a detection rate of 95%. The image texture, stored in the red and green channels, is relatively visible; however, the blue channel is missing, which makes it look visually incorrect in some cases. Application-wise, this could be used in many AR applications such as shopping. Thanks to the large storage capacity of QR codes, this AR tag is capable of storing and displaying virtual products of a much wider variety: the user can see a 3D figure, then zoom and rotate it using intuitive on-hand controls.

Minh Nguyen, Huy Le, Wei Qi Yan
Guided Stereo to Improve Depth Resolution of a Small Baseline Stereo Camera Using an Image Sequence

Using calibrated synchronised stereo cameras significantly simplifies multi-image 3D reconstruction, because they produce point clouds for each frame pair, which reduces multi-image 3D reconstruction to a relatively simple process of pose estimation followed by point cloud merging. Several synchronized stereo cameras are available on the market for this purpose; however, a key problem is that they often come as fixed-baseline units. This is a problem since the baseline determines the range and resolution of the acquired 3D data. This work deals with the fairly common scenario of trying to acquire a 3D reconstruction from a sequence of images when the baseline of the camera is too small. Given such a sequence, in many cases it is possible to match each image with another in the sequence that provides a more appropriate baseline. Is there still value in having calibrated stereo pairs, then? Ignoring the calibrated stereo pairs reduces the problem to monocular 3D reconstruction, which is more complex and has known issues such as scale ambiguity. This work attempts to solve the problem by proposing a guided stereo strategy that refines the coarse depth estimates from calibrated narrow stereo pairs with frames that are further away. Our experimental results are promising: they show that the problem is solvable provided there are appropriate frames in the sequence to supplement the depth estimates from the original narrow stereo pairs.

Trevor Gee, Georgy Gimel’farb, Alexander Woodward, Rachel Ababou, Alfonso Gastelum Strozzi, Patrice Delmas
SuperNCN: Neighbourhood Consensus Network for Robust Outdoor Scenes Matching

In this paper, we present a framework for computing dense keypoint correspondences between images under strong scene appearance changes. Traditional methods, based on nearest-neighbour search in the feature descriptor space, perform poorly when environmental conditions vary, e.g. when images are taken at different times of day or in different seasons. Our method improves keypoint correspondence finding in such difficult conditions. First, we use Neighbourhood Consensus Networks to build a spatially consistent matching grid between two images at a coarse scale. Then, we apply a SuperPoint-like corner detector to achieve pixel-level accuracy. Both parts use features learned with domain adaptation to increase robustness against strong scene appearance variations. The framework has been tested on the RobotCar Seasons dataset, showing a large improvement on the pose estimation task under challenging environmental conditions.

Grzegorz Kurzejamski, Jacek Komorowski, Lukasz Dabala, Konrad Czarnota, Simon Lynen, Tomasz Trzcinski
Using Normal/Abnormal Video Sequence Categorization to Efficient Facial Expression Recognition in the Wild

Facial expression recognition in real-world conditions, with large variations in illumination, pose, resolution, and occlusion, is a very challenging task. The majority of literature approaches that deal with these challenges do not take into account the varying quality of different videos. Unlike these approaches, this paper suggests treating video sequences according to their quality. Using the Isolation Forest (IF) algorithm, video sequences are categorized into two categories: normal videos that exhibit clear illumination and a frontal face pose, and abnormal videos that present poor illumination, varied face poses, or occluded faces. Two independent facial expression classifiers for the normal and abnormal videos are built using the Random Forest (RF) algorithm. The experiments demonstrate that processing normal and abnormal videos independently can improve the efficiency of facial expression recognition in the Wild.
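
A minimal sketch of this two-stage normal/abnormal pipeline with scikit-learn follows; the per-video feature extraction is abstracted into X, and all data shown are random stand-ins.

```python
# Sketch: Isolation Forest splits sequences into normal/abnormal,
# then a separate Random Forest expression classifier per category.
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

X = np.random.rand(200, 64)        # stand-in per-video features
y = np.random.randint(0, 7, 200)   # stand-in expression labels

iso = IsolationForest(random_state=0).fit(X)
is_normal = iso.predict(X) == 1    # +1 = inlier ("normal" video)

clf_normal = RandomForestClassifier().fit(X[is_normal], y[is_normal])
clf_abnormal = RandomForestClassifier().fit(X[~is_normal], y[~is_normal])

def predict(x):
    x = x.reshape(1, -1)
    clf = clf_normal if iso.predict(x)[0] == 1 else clf_abnormal
    return clf.predict(x)[0]
```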

Taoufik Ben Abdallah, Radhouane Guermazi, Mohamed Hammami
Distributed Multi-class Road User Tracking in Multi-camera Network For Smart Traffic Applications

Reliable tracking of road users is one of the important tasks in smart traffic applications. In these applications, a network of cameras is often used to extend coverage. However, efficient use of information from cameras that observe the same road user from different viewpoints is seldom explored. In this paper, we present a distributed multi-camera tracker which efficiently uses information from all cameras with overlapping views to accurately track various classes of road users. Our method is designed for deployment on smart camera networks, so that most computer vision tasks are executed locally on the smart cameras and only concise high-level information is sent to a fusion node for global joint tracking. We evaluate the performance of our tracker on a challenging real-world traffic dataset in a Turn Movement Count (TMC) application, achieving high accuracies of 93% and 83% on vehicles and cyclists respectively. Moreover, performance testing in anomaly detection shows that the proposed method provides reliable detection of abnormal vehicle and pedestrian trajectories.

Nyan Bo Bo, Maarten Slembrouck, Peter Veelaert, Wilfried Philips
Vehicles Tracking by Combining Convolutional Neural Network Based Segmentation and Optical Flow Estimation

Object tracking is an important proxy task towards action recognition. Recent successful CNN models for detection and segmentation, such as Faster R-CNN and Mask R-CNN, lead to an effective approach to the tracking problem: tracking-by-detection. This very fast type of tracker takes into account only the Intersection-over-Union (IOU) between bounding boxes to match objects, without any other visual information. On the other hand, the lack of visual information in the IOU tracker, combined with missed detections from the CNN detectors, creates fragmented trajectories. Inspired by the work of Luc et al., which predicts future segmentations using optical flow, we propose an enhanced tracker based on tracking-by-detection and optical flow estimation for vehicle tracking scenarios. Our solution generates new detections or segmentations by translating the CNN detectors' results backward and forward along optical flow vectors, which fills in the gaps in trajectories. The qualitative results show that our solution achieves stable performance with different types of flow estimation methods. We then match the generated results with fragmented trajectories using SURF features. The DAVIS dataset is used to evaluate the best way to generate new detections, and the entire pipeline is tested on the DETRAC dataset. The results show that our method significantly reduces trajectory fragmentation.

Tuan-Hung Vu, Jacques Boonaert, Sebastien Ambellouis, Abdelmalik Taleb Ahmed
Real-Time Embedded Person Detection and Tracking for Shopping Behaviour Analysis

Shopping behaviour analysis through counting and tracking of people in shop-like environments offers valuable information for store operators and provides key insights into the store's layout (e.g. frequently visited spots). Instead of using extra staff for this, automated on-premise solutions are preferred. These automated systems should be cost-effective, preferably run on lightweight embedded hardware, work in very challenging situations (e.g. handling occlusions), and preferably work in real time. We solve this challenge by implementing a real-time TensorRT-optimized YOLOv3-based pedestrian detector on a Jetson TX2 hardware platform. By combining the detector with a sparse optical flow tracker, we assign a unique ID to each customer and tackle the problem of losing partially occluded customers. Our detector-tracker solution achieves an average precision of 81.59% at a processing speed of 10 FPS. Besides valuable statistics, heat maps of frequently visited spots are extracted and used as an overlay on the video stream.

Robin Schrijvers, Steven Puttemans, T. Callemein, Toon Goedemé
Learning Target-Specific Response Attention for Siamese Network Based Visual Tracking

Recently, Siamese network based visual tracking methods have shown great potential in balancing tracking accuracy and computational efficiency. These methods use two-branch convolutional neural networks (CNNs) to generate a response map between the target exemplar and each of the candidate patches in the search region. However, since these methods do not fully exploit the target-specific information contained in the CNN features when computing the response map, they are less effective at coping with target appearance variations and background clutter. In this paper, we propose a Target-Specific Response Attention (TSRA) module to enhance the discriminability of these methods. In TSRA, a channel-wise cross-correlation operation produces a multi-channel response map, where different channels correspond to different semantic information. TSRA then uses an attention network to dynamically re-weight the multi-channel response map at every frame. Moreover, we introduce a shortcut connection strategy to generate a residual multi-channel response map for more discriminative tracking. Finally, we integrate the proposed TSRA into the classical Siamese tracker SiamFC to propose a new tracker called TSRA-Siam. Experimental results on three popular benchmark datasets show that TSRA-Siam outperforms the baseline SiamFC by a large margin and obtains competitive performance compared with several state-of-the-art trackers.
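
The channel-wise cross-correlation that produces the multi-channel response map can be sketched with a grouped convolution, as done in other Siamese trackers; this is a generic formulation, not necessarily the authors' exact implementation.

```python
# Sketch: depthwise (channel-wise) cross-correlation between an exemplar
# embedding and a search-region embedding; each channel correlates
# independently, yielding the multi-channel response map TSRA re-weights.
import torch
import torch.nn.functional as F

def channelwise_xcorr(exemplar, search):
    """exemplar: (B, C, h, w); search: (B, C, H, W) -> (B, C, H-h+1, W-w+1)."""
    B, C, h, w = exemplar.shape
    # Fold batch into channels and use grouped conv so every channel of
    # every sample is correlated with its own kernel.
    s = search.view(1, B * C, *search.shape[2:])
    k = exemplar.view(B * C, 1, h, w)
    out = F.conv2d(s, k, groups=B * C)
    return out.view(B, C, *out.shape[2:])
```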

Penghui Zhao, Haosheng Chen, Yanjie Liang, Yan Yan, Hanzi Wang
Backmatter
Metadata
Title
Advanced Concepts for Intelligent Vision Systems
Edited by
Jacques Blanc-Talon
Patrice Delmas
Prof. Wilfried Philips
Dan Popescu
Paul Scheunders
Copyright year
2020
Electronic ISBN
978-3-030-40605-9
Print ISBN
978-3-030-40604-2
DOI
https://doi.org/10.1007/978-3-030-40605-9