
2020 | Book

Pattern Recognition

5th Asian Conference, ACPR 2019, Auckland, New Zealand, November 26–29, 2019, Revised Selected Papers, Part I

Editors: Shivakumara Palaiahnakote, Prof. Gabriella Sanniti di Baja, Liang Wang, Prof. Dr. Wei Qi Yan

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

This two-volume set constitutes the proceedings of the 5th Asian Conference on Pattern Recognition, ACPR 2019, held in Auckland, New Zealand, in November 2019.
The 9 full papers presented in this volume were carefully reviewed and selected from 14 submissions. They cover topics such as: classification; action and video and motion; object detection and anomaly detection; segmentation, grouping and shape; face and body and biometrics; adversarial learning and networks; computational photography; learning theory and optimization; applications, medical and robotics; computer vision and robot vision; pattern recognition and machine learning; multi-media and signal processing; and interaction.

Table of Contents

Frontmatter

Classification

Frontmatter
Integrating Domain Knowledge: Using Hierarchies to Improve Deep Classifiers

One of the most prominent problems in machine learning in the age of deep learning is the availability of sufficiently large annotated datasets. For specific domains, e.g. animal species, a long-tail distribution means that some classes are observed and annotated insufficiently. Additional labels can be prohibitively expensive, e.g. because domain experts need to be involved. However, there is more information available that is, to the best of our knowledge, not exploited accordingly. In this paper, we propose to make use of preexisting class hierarchies like WordNet to integrate additional domain knowledge into classification. We encode the properties of such a class hierarchy into a probabilistic model. From there, we derive a novel label encoding and a corresponding loss function. On the ImageNet and NABirds datasets our method offers a relative improvement of 10.4% and 9.6% in accuracy over the baseline, respectively. After less than a third of the training time, it is already able to match the baseline’s fine-grained recognition performance. Both results show that our suggested method is efficient and effective.
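The paper's label encoding is derived from a probabilistic model over the class hierarchy. As a rough, hedged illustration of the general idea (not the authors' exact formulation), the sketch below spreads label mass over a class and its ancestors in a toy parent-pointer hierarchy; the decay factor and normalization are assumptions for this example only.

```python
import numpy as np

def hierarchical_soft_labels(class_idx, parents, num_classes, decay=0.5):
    """Toy soft-label encoding: the true class gets weight 1, every ancestor
    gets an exponentially decayed weight, and the vector is normalized.
    Illustrative stand-in, not the paper's probabilistic encoding."""
    target = np.zeros(num_classes, dtype=np.float32)
    weight, node = 1.0, class_idx
    while node is not None:
        target[node] += weight
        weight *= decay
        node = parents.get(node)          # None once the root is reached
    return target / target.sum()

# Tiny WordNet-like hierarchy: 0 = animal (root), 1 = bird, 2 = sparrow
parents = {2: 1, 1: 0, 0: None}
print(hierarchical_soft_labels(2, parents, num_classes=3))
```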

Clemens-Alexander Brust, Joachim Denzler
Label-Smooth Learning for Fine-Grained Visual Categorization

Fine-Grained Visual Categorization (FGVC) is challenging due to the high similarity among categories and the large within-category variance. Existing work tackles this problem by designing self-localization modules in an end-to-end DCNN to learn semantic part features. However, the model efficiency of this strategy decreases significantly as the number of categories grows, because more parts are needed to offset the impact of the additional categories. In this paper, we propose a label-smooth learning method that improves the model's applicability to large numbers of categories by maximizing its prediction diversity. Based on the similarity among fine-grained categories, a KL divergence between the uniform and prediction distributions is established to reduce the model's confidence on the ground-truth category, while raising its confidence on similar categories. By minimizing it, information from similar categories is exploited for model learning, thus diminishing the effects caused by the increase in categories. Experiments on five benchmark datasets of mid-scale (CUB-200-2011, Stanford Dogs, Stanford Cars, and FGVC-Aircraft) and large-scale (NABirds) categories show a clear advantage of the proposed label-smooth learning and demonstrate its comparable or state-of-the-art performance. Code is available at https://github.com/Cedric-Mo/LS-for-FGVC .
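The central ingredient, a KL term between the uniform distribution and the model's prediction added to the usual cross-entropy, can be sketched in a few lines of PyTorch; the weighting factor alpha below is an illustrative assumption, not the paper's setting.

```python
import torch
import torch.nn.functional as F

def label_smooth_loss(logits, targets, alpha=0.1):
    """Cross-entropy plus a KL(uniform || prediction) term that discourages
    over-confidence on the ground-truth class; alpha is illustrative."""
    log_probs = F.log_softmax(logits, dim=1)
    ce = F.nll_loss(log_probs, targets)
    uniform = torch.full_like(log_probs, 1.0 / logits.size(1))
    # KL(u || p) = sum_k u_k * (log u_k - log p_k)
    kl = (uniform * (uniform.log() - log_probs)).sum(dim=1).mean()
    return ce + alpha * kl

logits = torch.randn(8, 200, requires_grad=True)   # e.g. 200 fine-grained classes
targets = torch.randint(0, 200, (8,))
label_smooth_loss(logits, targets).backward()
```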

Xianjie Mo, Tingting Wei, Hengmin Zhang, Qiong Huang, Wei Luo
ForestNet – Automatic Design of Sparse Multilayer Perceptron Network Architectures Using Ensembles of Randomized Trees

In this paper, we introduce a mechanism for designing the architecture of a Sparse Multi-Layer Perceptron network for classification, called ForestNet. Networks built using our approach are capable of handling high-dimensional data and learning representations of both visual and non-visual data. The proposed approach first builds an ensemble of randomized trees in order to gather information on the hierarchy of features and their separability among the classes. Subsequently, this information is used to design the architecture of a sparse network for a specific data set and application. The number of neurons is automatically adapted to the dataset. The proposed approach was evaluated using two non-visual and two visual datasets. For each dataset, 4 ensembles of randomized trees with different sizes were built. In turn, per ensemble, a sparse network architecture was designed using our approach and a fully connected network with the same architecture was also constructed. The sparse networks defined using our approach consistently outperformed their respective tree ensembles, achieving statistically significant improvements in classification accuracy. While we do not beat state-of-the-art results, given our network size and the lack of data augmentation techniques, our method exhibits very promising results, as the sparse networks performed similarly to their fully connected counterparts with a reduction of more than 98% of connections in the visual tasks.

Dalia Rodríguez-Salas, Nishant Ravikumar, Mathias Seuret, Andreas Maier
Clustering-Based Adaptive Dropout for CNN-Based Classification

Dropout has been widely used to improve the generalization ability of deep networks, while current dropout variants rarely adapt the dropout probabilities of the network hidden units or weights dynamically to their contributions to network optimization. In this work, a clustering-based dropout based on the network characteristics of features, weights or their derivatives is proposed, where the dropout probabilities for these characteristics are updated self-adaptively according to the corresponding clustering group to differentiate their contributions. Experimental results on the Fashion-MNIST and CIFAR10 databases and the FER2013 and CK+ expression databases show that the proposed clustering-based dropout achieves better accuracy than the original dropout and various dropout variants, and the most competitive performance compared with state-of-the-art algorithms.
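As a rough sketch of the idea, grouping hidden units by one characteristic (here their mean activation magnitude over a batch) and giving each group its own dropout probability, one might write the following; the clustering feature, the number of clusters and the probabilities are assumptions for illustration, and the paper's self-adaptive update of the probabilities is not reproduced here.

```python
import numpy as np
from sklearn.cluster import KMeans

def clustered_dropout(activations, n_clusters=3, p_low=0.2, p_high=0.6):
    """Drop units in low-contribution clusters more aggressively (illustrative
    probabilities, not the adaptively updated ones from the paper)."""
    magnitude = np.abs(activations).mean(axis=0)              # one value per unit
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(
        magnitude.reshape(-1, 1))
    # Rank clusters by mean magnitude; lowest magnitude gets the highest dropout.
    order = np.argsort([magnitude[labels == c].mean() for c in range(n_clusters)])
    probs = np.linspace(p_high, p_low, n_clusters)
    p_per_unit = np.empty_like(magnitude)
    for rank, cluster in enumerate(order):
        p_per_unit[labels == cluster] = probs[rank]
    keep = np.random.rand(*activations.shape) >= p_per_unit   # broadcast over batch
    return activations * keep / (1.0 - p_per_unit)            # inverted-dropout scaling

h = np.random.randn(32, 128)                                  # 32 samples, 128 units
h_dropped = clustered_dropout(h)
```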

Zhiwei Wen, Zhiwei Ke, Weicheng Xie, Linlin Shen

Action and Video and Motion

Frontmatter
Real-Time Detection and Tracking Using Hybrid DNNs and Space-Aware Color Feature: From Algorithm to System

Object detection and tracking are vital for video analysis. With the development of Deep Neural Networks (DNNs), multiple object tracking is now typically performed on the detection results from a DNN. However, DNN-based detection is computation-intensive. In order to accelerate multiple object detection and tracking for real-time applications, we present a framework that imports tracking knowledge into detection, allowing a less accurate but faster DNN to be used for detection while recovering the accuracy loss. By combining different DNNs with accuracy-speed trade-offs using space-aware color information, our framework achieves significant speedup (6.8×) and maintains high accuracy. Targeting NVIDIA Xavier, we further optimize the implementation at the system and platform levels.

Liang Feng, Hiroaki Igarashi, Seiya Shibata, Yuki Kobayashi, Takashi Takenaka, Wei Zhang
Continuous Motion Numeral Recognition Using RNN Architecture in Air-Writing Environment

Air-writing, defined as character tracing in a three-dimensional free space through hand gestures, is the way forward for peripheral-independent, virtual interaction with devices. While single unistroke character recognition is fairly simple, continuous writing recognition becomes challenging owing to the absence of delimiters between characters. Moreover, stray hand motion while writing adds noise to the input, making accurate recognition difficult. The key to accurate recognition of air-written characters lies in noise elimination and character segmentation from continuous writing. We propose a robust and hardware-independent framework for multi-digit unistroke numeral recognition in an air-writing environment. We present a sliding window based method which isolates a small segment of the spatio-temporal input from the air-writing activity for noise removal and digit segmentation. Recurrent Neural Networks (RNNs) show great promise in dealing with temporal data and form the basis of our architecture. Recognition of digits which have other digits as their sub-shapes is challenging. Capitalizing on how digits are commonly written, we propose a novel priority scheme to determine digit precedence. We only use sequential coordinates as input, which can be obtained from any generic camera, making our system widely accessible. Our experiments were conducted on English numerals using a combination of the MNIST and Pendigits datasets along with our own air-written English numerals dataset (ISI-Air Dataset). Additionally, we have created a noise dataset to classify noise. We observe a drop in accuracy with an increase in the number of digits written in a single continuous motion because of noise generated between digit transitions. However, under standard conditions, our system produced an accuracy of 98.45% and 82.89% for single and multiple digit English numerals, respectively.

Adil Rahman, Prasun Roy, Umapada Pal
Using Motion Compensation and Matrix Completion Algorithm to Remove Rain Streaks and Snow for Video Sequences

Current outdoor surveillance equipment and cameras are vulnerable to rain, snow, and other inclement weather, which reduces the performance of surveillance systems. In this paper, we propose a method to detect and remove rain streaks and even snow artifacts from video sequences, using motion compensation and a low-rank matrix completion method. First, we adopt optical flow estimation between consecutive frames to get a warped frame and obtain an initial binary rain map. We further use the morphological component analysis method to dilate the tiny rain streaks. Then we employ online dictionary learning for sparse representation and an SVM classifier to refine the rain map by discarding parts which are not rain streaks. Finally, we reconstruct the video sequence by using low-rank matrix completion techniques. Experimental results demonstrate that the proposed algorithm performs better both qualitatively and quantitatively in terms of PSNR/SSIM.

Yutong Zhou, Nobutaka Shimada
Assessing the Impact of Video Compression on Background Subtraction

Video cameras are deployed in large numbers for surveillance. Video is compressed and transmitted over bandwidth-limited communication channels. Research has increasingly focused on automated analysis algorithms to understand video. This raises the question “Does compression have an effect on automated analysis algorithms?”. This is a vital question in understanding the reliability of automated analysis algorithms. In this paper, we evaluated the performance of a background subtraction algorithm on videos acquired using a typical surveillance camera under different scenarios. The videos in the dataset are collected from an IP-based surveillance camera and compressed using different bitrates and quantization levels. The experimental results provide insights into the different compression settings and their impact on the expected performance of background subtraction algorithms. Furthermore, we train a classifier that utilizes video quality to predict the performance of an algorithm.

Poonam Beniwal, Pranav Mantini, Shishir K. Shah

Object Detection and Anomaly Detection

Frontmatter
Learning Motion Regularity for Temporal Video Segmentation and Anomaly Detection

Detecting anomalous events in a long video sequence is a challenging task due to the subjective definition of “anomalous” as well as the duration of such events. Anomalous events are usually short-lived and occur rarely. We propose a semi-supervised solution to detect such events. Our method is able to capture the video segment where the anomaly happens via the analysis of the interactions between spatially co-located interest points. The evolution of their motion characteristics is modeled and abrupt changes are used to temporally segment the videos. Spatiotemporal and motion features are then extracted to model standard events and identify the anomalous segments using a one-class classifier. Quantitative and qualitative experiments on a publicly available anomaly detection dataset capturing real-world scenarios show that the proposed method outperforms state-of-the-art approaches.
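The final step, fitting a one-class model to descriptors of normal segments and flagging segments outside its support, can be illustrated with scikit-learn; the random arrays below merely stand in for the spatiotemporal and motion features described above.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Placeholder descriptors: one row per temporally segmented video chunk.
normal_features = np.random.randn(500, 64)   # training data: normal segments only
test_features = np.random.randn(20, 64)      # unseen segments to score

clf = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(normal_features)
scores = clf.decision_function(test_features)
anomalous = scores < 0                        # below zero: outside the "normal" region
print(anomalous)
```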

Fatima Zohra Daha, Shishir K. Shah
A New Forged Handwriting Detection Method Based on Fourier Spectral Density and Variation

The use of handwriting words for person identification, in contrast to biometric features, is gaining importance in the field of forensic applications. As a result, handwriting forgery has become part of such crimes and is hence challenging for researchers. This paper presents a new method for detecting forged handwriting words, because the width and amplitude of spectral distributions exhibit unique properties for forged handwriting words compared to blurred, noisy and normal handwriting words. The proposed method studies the spectral density and variation of input handwriting images through clustering of high and low frequency coefficients. The extracted features, which are invariant to rotation and scaling, are passed to a neural network classifier for the classification of forged handwriting words against other types of handwriting words (blurred, noisy and normal handwriting words). Experimental results on our own dataset, which consists of four handwriting word classes, and two benchmark datasets, namely a caption and scene text classification dataset and a forged IMEI number dataset, show that the proposed method outperforms existing methods in terms of classification rate.

Sayani Kundu, Palaiahnakote Shivakumara, Anaica Grouver, Umapada Pal, Tong Lu, Michael Blumenstein
Robust Pedestrian Detection: Faster Deployments with Fusion of Models

Pedestrian detection has a wide range of real-world critical applications including security and the management of emergency scenarios. In critical applications, detection recall and precision are both essential to ensure the correct detection of all pedestrians. The development and deployment of vision-based object detection models is a time-consuming task, depending on long training and fine-tuning processes to achieve top performance. We propose an alternative approach, based on a fusion of pre-trained off-the-shelf state-of-the-art object detection models, and exploit base model divergences to quickly deploy robust ensembles with improved performance. Our approach promotes model reuse and does not require additional learning algorithms, making it suitable for rapid deployments of critical systems. Experimental results, conducted on the PASCAL VOC07 test dataset, reveal mean average precision (mAP) improvements over the base detection models, regardless of the set of models selected. Improvements in mAP were observed starting from just two detection models and reached 3.53% for a fusion of four detection models, resulting in an absolute fusion mAP of 83.65%. Moreover, the hyperparameters of our ensemble model may be adjusted to set an appropriate tradeoff between precision and recall to fit different application requirements.
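One simple way to fuse off-the-shelf detectors, pooling all their boxes and resolving duplicates with non-maximum suppression, is sketched below; the paper's ensemble additionally exploits divergences between base models and tunable hyperparameters, which this minimal version does not capture.

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all as [x1, y1, x2, y2]."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def fuse_detections(detections_per_model, iou_thr=0.5):
    """Pool boxes from several pre-trained detectors and keep the highest-scoring
    box among overlapping ones (plain NMS over the union of detections)."""
    boxes = np.vstack([d["boxes"] for d in detections_per_model])
    scores = np.concatenate([d["scores"] for d in detections_per_model])
    keep, order = [], scores.argsort()[::-1]
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(best)
        order = rest[iou(boxes[best], boxes[rest]) < iou_thr]
    return boxes[keep], scores[keep]
```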

Chan Tong Lam, Jose Gaspar, Wei Ke, Xu Yang, Sio Kei Im
Perceptual Image Anomaly Detection

We present a novel method for image anomaly detection, where algorithms that use samples drawn from some distribution of “normal” data aim to detect out-of-distribution (abnormal) samples. Our approach includes a combination of an encoder and a generator for mapping an image distribution to a predefined latent distribution and vice versa. It leverages Generative Adversarial Networks to learn these data distributions and uses perceptual loss for the detection of image abnormality. To accomplish this goal, we first introduce a new similarity metric, which expresses the perceived similarity between images and is robust to changes in image contrast. Second, we introduce a novel approach for the selection of weights of a multi-objective loss function (image reconstruction and distribution mapping) in the absence of a validation dataset for hyperparameter tuning. After training, our model measures the abnormality of the input image as the perceptual dissimilarity between it and the closest generated image of the modeled data distribution. The proposed approach is extensively evaluated on several publicly available image benchmarks and achieves state-of-the-art performance.

Nina Tuluptceva, Bart Bakker, Irina Fedulova, Anton Konushin

Segmentation, Grouping and Shape

Frontmatter
Deep Similarity Fusion Networks for One-Shot Semantic Segmentation

One-shot semantic segmentation is a new challenging task extended from traditional semantic segmentation, which aims to predict unseen object categories for each pixel given only one annotated sample. Previous works employ oversimplified operations to fuse the features from the query image and the support image, while neglecting to incorporate multi-scale information that is essential for the one-shot segmentation task. In this paper, we propose a novel one-shot based architecture, the Deep Similarity Fusion Network (DSFN), to tackle this issue. Specifically, a new similarity feature generator is proposed to generate multi-scale similarity feature maps, which can provide both contextual and spatial information for the following modules. Then, a similarity feature aggregator is employed to fuse feature maps of different scales in a coarse-to-fine manner. Finally, a simple yet effective convolutional module is introduced to create the final segmentation mask. Extensive experiments on PASCAL-5^i demonstrate that DSFN outperforms state-of-the-art methods by a large margin with a mean IoU of 47.7%.

Shuchang Lyu, Guangliang Cheng, Qimin Ding
Seeing Things in Random-Dot Videos

The human visual system correctly groups features and can even interpret random-dot videos induced by imaging natural dynamic scenes. Remarkably, this happens even if perception completely fails when the same information is presented frame by frame. We study this property of surprising dynamic perception with the first goal of proposing a new detection and spatio-temporal grouping algorithm for such signals when, per frame, the information on objects is both random and sparse. The algorithm is based on temporal integration and statistical tests of unlikeliness, the a contrario framework. The striking similarity in performance of the algorithm to the perception by human observers, as witnessed by a series of psychophysical experiments, leads us to see in it a simple computational Gestalt model of human perception.

Thomas Dagès, Michael Lindenbaum, Alfred M. Bruckstein
Boundary Extraction of Planar Segments from Clouds of Unorganised Points

Planar segment detection in 3D point clouds is of importance for 3D registration, segmentation or analysis. General methods for planarity detection just detect a planar segment and label the 3D points; a boundary of the planar segment is typically not considered; the spatial position and scope of coplanar 3D points are neglected. This paper proposes a method for detecting planar segments and extracting boundaries for such segments, all from clouds of unorganised points. Altogether, this aims at describing a set of 3D points completely: if a planar segment is detected, not only the plane’s normal and distance from the origin to the plane are detected, but also the planar segment’s boundary. By analysing Hough voting (from 3D space into a Hough space), we deduce a relationship between a planar segment’s boundary and voting cells. Cells that correspond to the planar segment’s boundary are located. Six linear functions are fitted to the voting cells, and four vertices are computed based on the coefficients of the fitted functions. The bounding box of the planar segment is determined and used to represent the spatial position and scope of coplanar 3D points. The proposed method is tested on synthetic and real-world 3D point clouds. Experimental results demonstrate that the proposed method directly detects planar segments’ boundaries from clouds of unorganised points. No knowledge about the local or global structure of point clouds is required for applying the proposed technique.

Zezhong Xu, Cheng Qian, Xianju Fei, Yanbing Zhuang, Shibo Xu, Reinhard Klette
Real-Time Multi-class Instance Segmentation with One-Time Deep Embedding Clustering

In recent years, instance segmentation research has been considered as an extension of object detection and semantic segmentation, which can provide pixel-level annotations on detected objects. Several approaches for instance segmentation exploit an object detection network to generate bounding boxes and segment each bounding box with a segmentation network. However, these approaches consume more time because their framework consists of two independent networks. On the other hand, some approaches based on clustering transform each pixel into a unique representation and produce instance masks by postprocessing. Nevertheless, most clustering approaches have to cluster all instances of each class individually, which contributes to additional time consumption. In this research, we propose a fast clustering method called one-time clustering with a single network, aiming at reducing time consumption in multi-class instance segmentation. Moreover, we present a class-sensitive loss function that allows the network to generate unique embeddings which contain class and instance information. With these informative embeddings, we can cluster them only once instead of clustering for each class as in other clustering approaches. Our approach is up to 6x faster than the state-of-the-art UPSNet [1], which appeared in CVPR 2019, at the cost of about 25% lower AP on the Cityscapes dataset. It achieves significantly faster speed and great segmentation quality while having an acceptable AP performance.

Yu-Chi Chen, Chia-Yuan Chang, Pei-Yung Hsiao, Li-Chen Fu

Face and Body and Biometrics

Frontmatter
Multi-person Pose Estimation with Mid-Points for Human Detection under Real-World Surveillance

This paper introduces the design and usage of a multi-person pose estimation system. The system is developed targeting some challenging issues in real-world surveillance such as (i) low image resolution, and (ii) people captured in crowded situations. Under such conditions, we evaluated the system’s performance on human detection by comparing it to other state-of-the-art algorithms. The leading results achieved by the proposed system are due to several features of its design: (i) training and inference of mid-points, where a mid-point is the center of two body region points defined in the human pose, and (ii) core-of-pose, which is an association of a plurality of body region points and is used as the root of each individual person when parsing multiple people in crowded situations. The proposed system is also fast and has the potential for industrial use.

Yadong Pan, Shoji Nishimura
Gaze from Head: Gaze Estimation Without Observing Eye

We propose a gaze estimation method based not on eye observation but on head motion. The proposed method builds on physiological studies of eye-head coordination: the gaze direction is estimated from observations of head motion by using an eye-head coordination model trained on preliminarily collected data of gaze directions and head pose sequences. We collected gaze-head datasets from people who walked around in real and VR environments, constructed eye-head coordination models from those datasets, and evaluated them quantitatively. In addition, we confirmed that there was no significant difference between the models from the real and VR datasets in their estimation accuracy.

Jun’ichi Murakami, Ikuhisa Mitsugami
Interaction Recognition Through Body Parts Relation Reasoning

Person-person mutual action recognition (also referred to as interaction recognition) is an important research branch of human activity analysis. It began with solutions based on carefully designed local points and hand-crafted features, and then progressed to deep learning architectures, such as CNNs and LSTMs. These solutions often consist of complicated architectures and mechanisms to embed the relationships between the two persons into the architecture itself, to ensure the interaction patterns can be properly learned. Our contribution in this work is a simpler yet very powerful architecture, named Interaction Relational Network, which utilizes minimal prior knowledge about the structure of the data. We drive the network to learn how to relate the body parts of the interacting persons, in order to better discriminate among the possible interactions. By breaking down the body parts through the frames as sets of independent joints, and with a few augmentations to our architecture to explicitly extract meaningful extra information from each pair of joints, our solution is able to achieve state-of-the-art performance on the traditional interaction recognition dataset SBU, and also on the mutual actions from the large-scale dataset NTU RGB+D.

Mauricio Perez, Jun Liu, Alex C. Kot
DeepHuMS: Deep Human Motion Signature for 3D Skeletal Sequences

3D human motion indexing and retrieval is an interesting problem due to the rise of several data-driven applications aimed at analyzing and/or re-utilizing 3D human skeletal data, such as data-driven animation, analysis of sports bio-mechanics, human surveillance, etc. Spatio-temporal articulations of humans, noisy/missing data, different speeds of the same motion, etc. make it challenging, and several of the existing state-of-the-art methods use hand-crafted features along with optimization-based or histogram-based comparison in order to perform retrieval. Further, they demonstrate it only for very small datasets and a few classes. We make a case for using a learned representation that should recognize the motion as well as enforce a discriminative ranking. To that end, we propose a 3D human motion descriptor learned using a deep network. Our learned embedding is generalizable and applicable to real-world data, addressing the aforementioned challenges, and further enables sub-motion searching in its embedding space using another network. Our model exploits the inter-class similarity using trajectory cues, and performs far superior in a self-supervised setting. State-of-the-art results on all these fronts are shown on two large-scale 3D human motion datasets - NTU RGB+D and HDM05.

Neeraj Battan, Abbhinav Venkat, Avinash Sharma

Adversarial Learning and Networks

Frontmatter
Towards a Universal Appearance for Domain Generalization via Adversarial Learning

Domain generalization aims to learn a generalized feature space across multiple source domains so as to adapt the representation to an unseen target domain. In this paper, we focus on two main issues of domain generalization for image classification. First, instead of focusing on aligning data distributions among source domains, we aim to leverage the generalization capability by explicitly referring to image content and style from different source domains. We propose a novel appearance generalizer to learn a universal appearance, which captures domain-invariant content but exhibits various styles. Second, to tackle the class-discriminative issue, we resort to the prominent adversarial learning and impose an additional constraint on the discriminator to boost the class-discriminative capability. Experimental results on several benchmark datasets show that the proposed method outperforms existing methods significantly.

Yujui Chen, Tse-Wei Lin, Chiou-Ting Hsu
Pre-trained and Shared Encoder in Cycle-Consistent Adversarial Networks to Improve Image Quality

Images generated by the Cycle-Consistent Adversarial Network (CycleGAN) become blurry, especially in areas with complex edges, because edge information is lost in the downsampling of the encoders. To solve this problem, we design a new model called ED-CycleGAN based on the original CycleGAN. The key idea is using a pre-trained encoder: we first train an Encoder-Decoder Block (ED-Block) in order to get a difference map, which we call an edge map and which is produced by subtracting the output of the block from its input. Then, the encoder part of a generator in CycleGAN shares the parameters with the trained encoder of ED-Block, and these are frozen during training. Finally, by adding the output of the generator to the edge map, higher quality images can be produced. This structure performs excellently on the “Apple2Orange”, “Summer2Winter” and “blond-hair2brown-hair” datasets. We use SSIM and PSNR to evaluate the quality of the results, and our method achieved the highest evaluation scores among CycleGAN, UNIT and DiscoGAN.

Runtong Zhang, Yuchen Wu, Keiji Yanai
MobileGAN: Compact Network Architecture for Generative Adversarial Network

In this paper we introduce a compact neural network architecture for the generator of a generative adversarial network (GAN), which reduces the number of weight parameters so that mobile devices can afford to download these parameters via cellular networks. The network architecture of a GAN generator usually consists of a fully-connected layer on top and succeeding convolutional layers with upscaling which extend picture characteristics to a larger image. Consequently, the GAN generator network is highly enlarged, and the size of its weight parameters becomes tens or hundreds of MBs, which is not preferable for transmission over cellular networks or running on mobile or embedded devices. Our approach, named MobileGAN, is based on a layer decomposition technique like MobileNets. On the generator side, all convolutional layers except the bottom layer that outputs an RGB image are decomposed into pixelwise and depthwise convolutional layers to reduce the number of weight parameters. Also, the first fully-connected layer is decomposed into a couple of 1-dimensional convolution layers for a similar purpose. On the other hand, our approach does not modify any layers in the discriminator network in order to maintain the picture quality of output images generated by the decomposed generator network. We evaluated the performance of the MobileGAN generator network by Inception Score and Fréchet Inception Distance. As a result, we confirmed that MobileGAN can reduce the size of the weight parameters to as little as 20.5% of the ResNet GAN generator network with slight score degradation.
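The decomposition applied to the generator follows the depthwise-separable pattern of MobileNets; a minimal PyTorch sketch of replacing a dense convolution with a depthwise plus pixelwise (1×1) pair is given below, with channel sizes chosen purely for illustration.

```python
import torch
import torch.nn as nn

class SeparableConv2d(nn.Module):
    """Depthwise convolution followed by a 1x1 (pixelwise) convolution, as a
    parameter-saving stand-in for a dense Conv2d in a generator."""
    def __init__(self, in_ch, out_ch, kernel_size=3, padding=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=padding, groups=in_ch)
        self.pixelwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pixelwise(self.depthwise(x))

dense = nn.Conv2d(256, 256, 3, padding=1)
separable = SeparableConv2d(256, 256)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(dense), count(separable))   # the separable pair has far fewer weights
```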

Tomoyuki Shimizu, Jianfeng Xu, Kazuyuki Tasaka
Denoising and Inpainting of Sea Surface Temperature Image with Adversarial Physical Model Loss

This paper proposes a new approach for meteorology: estimating sea surface temperatures (SSTs) by using deep learning. SSTs are essential information for ocean-related industries but are hard to measure. Although multi-spectral imaging sensors on meteorological satellites are used for measuring SSTs over a wide area, they cannot measure sea temperature in regions covered by clouds, so most of the temperature data will be partially occluded. In meteorology, data assimilation with physics-based simulation is used for interpolating occluded SSTs and can generate physically-correct SSTs that match observations by satellites, but it requires a huge computational cost. We propose a low-cost learning-based method using pre-computed data-assimilation SSTs. Our restoration model employs an adversarial physical model loss that evaluates the physical correctness of generated SST images, and restores SST images in real time. Experimental results with satellite images show that the proposed method can reconstruct physically-correct SST images without occlusions.

Nobuyuki Hirahara, Motoharu Sonogashira, Hidekazu Kasahara, Masaaki Iiyama

Computational Photography

Frontmatter
Confidence Map Based 3D Cost Aggregation with Multiple Minimum Spanning Trees for Stereo Matching

Stereo matching is a challenging problem due to the mismatches caused by difficult environmental conditions. In this paper, we propose an enhanced version of our previous work, denoted as 3DMST-CM, to handle challenging cases and obtain a high-accuracy disparity map based on the ambiguity of image pixels. We develop a module of distinctiveness analysis to classify pixels into distinctive and ambiguous pixels. Then distinctive pixels are utilized as anchor pixels to help match ambiguous pixels accurately. The experimental results demonstrate the effectiveness of our method, which reaches state-of-the-art performance on the Middlebury 3.0 benchmark.

Yuhao Xiao, Dingding Xu, Guijin Wang, Xiaowei Hu, Yongbing Zhang, Xiangyang Ji, Li Zhang
Optical Flow Assisted Monocular Visual Odometry

This paper proposes a novel deep learning based approach for monocular visual odometry (VO) called FlowVO-Net. Our approach utilizes a CNN to extract motion information between two consecutive frames and employs a Bi-directional convolutional LSTM (Bi-ConvLSTM) for temporal modelling. ConvLSTM can encode not only temporal information but also spatial correlation, and the bidirectional architecture enables it to learn the geometric relationship from both preceding and succeeding frames of an image sequence. Besides, our approach jointly predicts optical flow as an auxiliary task in a self-supervised way by measuring photometric consistency. Experimental results indicate competitive performance of the proposed FlowVO-Net compared to state-of-the-art methods.

Yiming Wan, Wei Gao, Yihong Wu
Coarse-to-Fine Deep Orientation Estimator for Local Image Matching

Convolutional neural networks (CNNs) have become a mainstream method for keypoint matching in addition to image recognition, object detection, and semantic segmentation. Learned Invariant Feature Transform (LIFT) is a pioneering CNN-based method. It performs keypoint detection, orientation estimation, and feature description in a single network. Among these processes, orientation estimation is needed to obtain invariance to rotation changes. However, unlike the feature point detector and feature descriptor, the orientation estimator has not been considered important for accurate keypoint matching, nor has it been well researched even after LIFT was proposed. In this paper, we propose a novel coarse-to-fine orientation estimator that improves matching accuracy. First, the coarse orientation estimator estimates orientations to make the rotation error as small as possible even if large rotation changes exist between an image pair. Second, the fine orientation estimator further improves matching accuracy with the orientation estimated by the coarse orientation estimator. By using the proposed two-stage CNNs, we can accurately estimate orientations, improving matching performance. Experimental results with the HPatches benchmark show that our method can achieve a more accurate precision-recall curve than single CNN-based orientation estimators.

Yasuaki Mori, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi
Geometric Total Variation for Image Vectorization, Zooming and Pixel Art Depixelizing

We propose an original method for vectorizing an image or zooming it at an arbitrary scale. The core of our method relies on the resolution of a geometric variational model and therefore offers theoretic guarantees. More precisely, it associates a total variation energy to every valid triangulation of the image pixels. Its minimization induces a triangulation that reflects image gradients. We then exploit this triangulation to precisely locate discontinuities, which can then simply be vectorized or zoomed. This new approach works on arbitrary images without any learning phase. It is particularly appealing for processing images with low quantization like pixel art and can be used for depixelizing such images. The method can be evaluated with an online demonstrator, where users can reproduce results presented here or upload their own images.

Bertrand Kerautret, Jacques-Olivier Lachaud

Learning Theory and Optimization

Frontmatter
Information Theory-Based Curriculum Learning Factory to Optimize Training

We present a new system to optimize feature extraction from 2D-topological data like images in the context of deep learning, using correlation among training samples and curriculum learning optimization (CLO). The system treats every sample as a 2D random variable, where each pixel contained in the sample is modelled as an independent and identically distributed (i.i.d.) realization. With this modelling, we utilize information-theoretic and statistical measures of random variables to rank individual training samples and relationships between samples to construct a syllabus. The rank of each sample is then used when the sample is fed to the network during training. Comparative evaluation of multiple state-of-the-art networks, including ResNet, GoogleNet, and VGG, on benchmark datasets demonstrates that a syllabus which ranks samples using measures such as the joint entropy between adjacent samples can improve learning and significantly reduce the number of training steps required to achieve the desired training accuracy. We present results indicating that our approach can produce robust feature maps that in turn contribute to a reduction of loss by as much as a factor of 9 compared to conventional, no-curriculum training.
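To illustrate the kind of measure used to build the syllabus, the joint entropy of two grayscale images can be estimated from their joint histogram; the sketch below ranks adjacent samples by that measure (the bin count, the toy data, and presenting low-entropy pairs first are assumptions for this example, and the full system combines several such statistics).

```python
import numpy as np

def joint_entropy(img_a, img_b, bins=64):
    """Estimate the joint entropy (in bits) of two equally sized images by
    treating their pixel values as draws from a joint distribution."""
    hist, _, _ = np.histogram2d(img_a.ravel(), img_b.ravel(), bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

# Toy "dataset" of 8-bit images; score adjacent pairs and order the syllabus.
images = [np.random.randint(0, 256, (32, 32)) for _ in range(10)]
scores = [joint_entropy(images[i], images[i + 1]) for i in range(len(images) - 1)]
syllabus = np.argsort(scores)       # e.g. present low joint-entropy pairs first
```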

Henok Ghebrechristos, Gita Alaghband
A Factorization Strategy for Tensor Robust PCA

Many kinds of real-world data, e.g., color images, videos, etc., are represented by tensors and may often be corrupted by outliers. Tensor robust principal component analysis (TRPCA) serves as a tensorial modification of the fundamental principal component analysis (PCA) which performs well in the presence of outliers. The recently proposed TRPCA model [12] based on the tubal nuclear norm (TNN) has attracted much attention due to its superiority in many applications. However, TNN is computationally expensive, limiting the application of TRPCA to large tensors. To address this issue, we propose a new TRPCA model by adopting a factorization strategy within the framework of tensor singular value decomposition (t-SVD). An algorithm based on the non-convex augmented Lagrangian method (ALM) is developed with a convergence guarantee. The effectiveness and efficiency of the proposed algorithm are demonstrated through extensive experiments on both synthetic and real datasets.
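For reference, the TNN-based TRPCA model of [12] that this work builds on splits an observed tensor into a low-tubal-rank component plus a sparse outlier component; in standard notation (the generic formulation, not the paper's factorized variant): $$\min_{\mathcal{L},\,\mathcal{S}} \; \|\mathcal{L}\|_{\mathrm{TNN}} + \lambda \|\mathcal{S}\|_{1} \quad \text{s.t.} \quad \mathcal{X} = \mathcal{L} + \mathcal{S}$$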

Andong Wang, Zhong Jin, Jingyu Yang
Speeding up of the Nelder-Mead Method by Data-Driven Speculative Execution

The performance of machine learning algorithms depends considerably on their hyperparameter configurations. Previous studies reported that the Nelder-Mead (NM) method, known as a local search method, requires a small number of evaluations to converge and that these properties enable considerable success in the hyperparameter optimization (HPO) of machine learning algorithms for image recognition. However, most evaluations using the NM method need to be executed sequentially, which requires a large amount of time. To alleviate the problem that the NM method cannot be computed in parallel, we propose a data-driven speculative execution method based on the statistical features of the NM method. We analyze the behaviors of the NM method on several benchmark functions and experimentally demonstrate that the NM method tends to take certain specific operations. The experimental results show that the proposed method reduced the elapsed time by approximately 50% and the number of evaluations by approximately 60% compared to naïve speculative execution.

Shuhei Watanabe, Yoshihiko Ozaki, Yoshiaki Bando, Masaki Onishi
Efficient Bayesian Optimization Based on Parallel Sequential Random Embeddings

Bayesian optimization, which offers efficient parameter search, suffers from high computation cost if the parameters have high dimensionality, because the search space expands and more trials are needed. One existing solution is an embedding method that enables the search to be restricted to a low-dimensional subspace, but this method works well only when the number of embedding dimensions closely matches the number of effective dimensions, i.e., the dimensions that affect the function value. However, in practical situations, the number of effective dimensions is unknown, and embedding into a low-dimensional subspace to save computation cost often results in a search in a subspace of lower dimension than the effective dimensions. This study proposes a Bayesian optimization method that uses random embedding and remains efficient even if the embedded dimension is lower than the effective dimension. By conducting parallel searches in an initially low-dimensional space and performing multiple cycles in which the search space is incrementally improved, the optimal solution can be found efficiently. Experiments on benchmark problems show the effectiveness of the proposed method.
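The random-embedding idea underlying the method is to optimize over a low-dimensional variable z and map it into the original space through a fixed random matrix; a hedged toy sketch (a plain Gaussian embedding with box clipping, without the paper's parallel searches and incremental cycles) follows.

```python
import numpy as np

D, d = 100, 5                       # ambient and embedded dimensions (illustrative)
A = np.random.randn(D, d)           # fixed random embedding matrix
f = lambda x: float(np.sum((x[:3] - 0.5) ** 2))   # toy objective, 3 effective dims

def g(z):
    """Objective seen by the optimizer: search in R^d, evaluate in R^D."""
    return f(np.clip(A @ z, -1.0, 1.0))           # clip back into the feasible box

print(g(np.zeros(d)))
```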

Noriko Yokoyama, Masahiro Kohjima, Tatsushi Matsubayashi, Hiroyuki Toda

Applications, Medical and Robotics

Frontmatter
Fundus Image Classification and Retinal Disease Localization with Limited Supervision

Image classification using deep convolutional neural networks (DCNNs) has a competitive performance compared with other state-of-the-art methods. Fundus image classification into disease types is also a promising application domain of DCNNs. Typically, a fundus image classifier is trained using fundus images with labels showing disease types. Such training data is relatively easy to obtain, but a massive number of training samples is required to achieve adequate classification performance. If the classifier can concentrate on the evidential regions attached to training images, it is possible to boost the performance with a limited number of training samples. However, such regions are very hard to obtain, especially for fundus image classification, because only a professional ophthalmologist can provide such regions and selecting them through a GUI is very time-consuming. To boost the classification performance with only light ophthalmologist intervention, we propose a new method: first, we show evidential heatmaps produced by the DCNN to ophthalmologists, and then obtain their feedback in the form of selecting images with reasonable evidential regions. This intervention is far easier for ophthalmologists than drawing evidential regions. Experiments using fundus images revealed that our method improved accuracy from 90.1% to 94.5% in comparison with the existing method. We also found that the attention regions generated by our process are closer to the ground-truth attention regions provided by ophthalmologists.

Qier Meng, Yohei Hashimoto, Shin’ichi Satoh
OSTER: An Orientation Sensitive Scene Text Recognizer with CenterLine Rectification

Scene texts in China are arbitrarily arranged in two forms: horizontal and vertical. These two forms of text exhibit distinctive features, making it difficult to recognize them simultaneously. Besides, recognizing irregular scene text is still a challenging task due to its various shapes and distorted patterns. In this paper, we propose an orientation sensitive network aiming at distinguishing between Chinese horizontal and vertical texts. The learned orientation is then passed into an attention selective network to adjust the attention maps of the sequence recognition model, leading it to work on each type of text respectively. In addition, a lightweight centerline rectification network is adopted, which makes irregular texts more readable while requiring no redundant labels. A synthetic dataset named SCTD is released to support our training and to evaluate the proposed model. Extensive experiments show that the proposed method is capable of recognizing arbitrarily-aligned scene texts accurately and efficiently, achieving state-of-the-art performance on a number of public datasets.

Zipeng Feng, Chen Du, Yanna Wang, Baihua Xiao
On Fast Point Cloud Matching with Key Points and Parameter Tuning

Nowadays, three-dimensional point cloud processing plays a very important role in a wide range of areas: autonomous driving, robotics, cartography, etc. Three-dimensional point cloud registration pipelines have high computational complexity, mainly because of the cost of point feature signature calculation. By selecting keypoints, i.e. data points that are interesting in some way, and using only them for registration, one can significantly reduce the number of points for which feature signatures are needed, and hence the running time of registration pipelines. Consequently, keypoint detectors have a prominent role in an efficient processing pipeline. In this paper, we analyze the usefulness of various keypoint detection algorithms and investigate whether and when it is worth using a keypoint detector for registration. We define the goodness of a keypoint detection algorithm based on the success and quality of registration. Most keypoint detection methods require manual tuning of their parameters for best results. Here we revisit the most popular methods for keypoint detection in 3D point clouds and perform automatic parameter tuning with goodness of registration and run time as primary objectives. We compare keypoint-based registration to registration with randomly selected points and, as a baseline, using all data points. In contrast to former work, we use point clouds of different sizes, with and without noise, and register objects of different sizes.

Dániel Varga, Sándor Laki, János Szalai-Gindl, László Dobos, Péter Vaderna, Bence Formanek
Detection of Critical Camera Configurations for Structure from Motion Using Random Forest

This paper presents an approach for the detection of critical camera configurations in unorganized image sets with (approximately) known internal camera parameters. Critical configurations are caused by an insufficient distance between cameras compared to the distance to the observed scene and can cause problems in triangulation-based structure from motion applications. We give a summary of existing techniques and propose a new approach for the detection of image pairs with a critical camera configuration based on classification using a random forest. To this end, several features characterizing the quality of the reconstructed 3D points as well as the estimated camera poses have been defined and evaluated for various configurations. The proposed approach is integrated into the structure from motion framework COLMAP, demonstrating its potential on an independent real-world dataset.

Mario Michelini, Helmut Mayer

Computer Vision and Robot Vision

Frontmatter
Visual Counting of Traffic Flow from a Car via Vehicle Detection and Motion Analysis

Visual traffic counting has so far been carried out by static cameras on streets or by aerial pictures from the sky. This work initiates a new approach to counting traffic flow by using the large population of vehicle driving recorders. Vehicles are mainly counted by a camera that moves along a route on the opposite lane. Vehicle detection is first performed in video frames using the deep learning detector YOLO3, and then vehicle trajectories are counted in the spatial-temporal space called the motion profile. Motion continuity, direction, and detection misses are considered to avoid counting oncoming vehicles multiple times. This method has been tested on naturalistic driving videos lasting for hours. The counted vehicle numbers can be interpolated as a flow of the opposite lanes from a patrol vehicle for traffic control. Mobile counting of traffic is more flexible than traffic monitoring by cameras at street corners.

Kevin Kolcheck, Zheyuan Wang, Haiyan Xu, Jiang Yu Zheng
Directly Optimizing IoU for Bounding Box Localization

Object detection has seen remarkable progress in recent years with the introduction of Convolutional Neural Networks (CNNs). Object detection is a multi-task learning problem where both the positions of the objects in the images and their classes need to be correctly identified. The idea is to maximize the overlap between the ground-truth bounding boxes and the predictions, i.e. the Intersection over Union (IoU). In current work in this domain, IoU is approximated by using the Huber loss as a proxy, but this indirect method does not leverage the IoU information and treats the bounding box as four independent, unrelated regression terms. This is not true for a bounding box, where the four coordinates are highly correlated and hold a semantic meaning when taken together. Direct optimization of the IoU is not possible due to its non-convex and non-differentiable nature. In this paper, we formulate a novel loss, namely the Smooth IoU, which directly optimizes the IoU for the bounding boxes. This loss has been evaluated on the Oxford IIIT Pets, Udacity self-driving car, PASCAL VOC, and VWFS Car Damage datasets and has shown performance gains over the standard Huber loss.
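The quantity being optimized, the IoU of an axis-aligned predicted box against the ground truth, can be written down directly; a minimal PyTorch sketch of a plain (1 − IoU) loss over [x1, y1, x2, y2] boxes is given below, leaving aside the smoothing that gives the paper's loss its name.

```python
import torch

def iou_loss(pred, target, eps=1e-7):
    """1 - IoU for axis-aligned boxes given as [x1, y1, x2, y2]; differentiable
    wherever the boxes overlap."""
    x1 = torch.max(pred[:, 0], target[:, 0]); y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2]); y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    return (1.0 - iou).mean()

pred = torch.tensor([[10., 10., 50., 50.]], requires_grad=True)
gt = torch.tensor([[12., 8., 48., 55.]])
iou_loss(pred, gt).backward()
```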

Mofassir Ul Islam Arif, Mohsan Jameel, Lars Schmidt-Thieme
Double Refinement Network for Room Layout Estimation

Room layout estimation is the challenge of segmenting a cluttered room image into floor, walls and ceiling. We apply a Double Refinement Network (DRN), which has been successfully used for monocular depth map estimation. Our method is the first not to use an encoder-decoder architecture for room layout estimation. ResNet50 was utilized as the backbone for the network instead of the VGG16 commonly used for this task, allowing the network to be more compact and faster. We introduce a special layout scoring function and a layout ranking algorithm for the key point and edge outputs. Our method achieved the lowest pixel and corner errors on the LSUN dataset. The input image resolution is 224 × 224.

Ivan Kruzhilov, Mikhail Romanov, Dmitriy Babichev, Anton Konushin
Finger Tracking Based Tabla Syllable Transcription

In this paper, a new vision-based automated tabla syllable transcription algorithm is proposed. The syllable played on the tabla depends on the manner in which various fingers strike the tabla. In the proposed approach, the fingers are first tracked using SIFT features in all frames. The paths of the fingers for various syllables are analyzed and rules are formulated. These rules are then used to create a visual signature for different syllables. Finally, the visual signature of each frame is compared to the visual signature of the base frame and the frame is labeled with that syllable. Based on this, the various signatures are classified into different syllables. Using the proposed method, we are able to transcribe tabla syllables with 97.14% accuracy.

Raghavendra H. Bhalerao, Varsha Kshirsagar, Mitesh Raval
Dual Templates Siamese Tracking

In recent years, Siamese networks, which are based on an appearance similarity comparison strategy, have attracted great attention in the visual object tracking domain due to their balanced accuracy and speed. However, most Siamese network based tracking algorithms do not consider template updates. A fixed template may cause tracking matching errors and can even cause failure of object tracking. In view of this deficiency, we propose an algorithm based on a dual-template Siamese network, using a difference hash algorithm to determine the template update timing. First, we keep the initial frame target with a stable response map score as the base template z_r, and use the difference hash algorithm to determine the dynamic template z_t. We analyze the candidate target region and the two template matching results, and the resulting response maps are fused, which ensures more accurate tracking results. Experimental results on the OTB-2013, OTB-2015 and VOT2016 datasets show that the proposed algorithm achieves satisfactory results.
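The update trigger relies on a difference hash of the tracked patch; a compact version of the standard dHash, together with a Hamming-distance test, is sketched below (the hash size and threshold are illustrative choices, not the paper's values).

```python
import cv2
import numpy as np

def dhash(gray_patch, size=8):
    """Standard difference hash: resize to (size+1) x size and compare each
    pixel with its right neighbour, giving a size*size bit fingerprint."""
    small = cv2.resize(gray_patch, (size + 1, size), interpolation=cv2.INTER_AREA)
    return (small[:, 1:] > small[:, :-1]).flatten()

def should_update_template(base_patch, current_patch, max_hamming=10):
    """Trigger a dynamic-template update once appearance drift exceeds a
    Hamming-distance threshold between the two hashes."""
    distance = int(np.count_nonzero(dhash(base_patch) != dhash(current_patch)))
    return distance > max_hamming
```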

Zhiqiang Hou, Lilin Chen, Lei Pu
Structure Function Based Transform Features for Behavior-Oriented Social Media Image Classification

Social media has become an essential part of people's lives, reflecting their day-to-day activities including emotions, feelings, threats and so on. This paper presents a new method for the automatic classification of behavior-oriented social media images of a person into classes like Bullying, Threatening, Neuroticism-Depression, Neuroticism-Sarcastic, Psychopath and Extraversion. The proposed method first finds facial key points for extracting features based on a face detection algorithm. Then the proposed method labels face regions as foreground and regions other than the face as background, to define the context between foreground and background information. To extract context, the proposed method explores Structure Function Based Transform (SFBT) features, which study variations in pixel values. To increase the discriminative power of the context features, the proposed method performs clustering to integrate the strength of the features. The extracted features are then fed to a Support Vector Machine (SVM) for classification. Experimental results on a dataset of six classes show that the proposed method outperforms existing methods in terms of confusion matrix and classification rate.

Divya Krishnani, Palaiahnakote Shivakumara, Tong Lu, Umapada Pal, Raghavendra Ramachandra
Single Image Reflection Removal Based on GAN with Gradient Constraint

When we take a picture through glass windows, the photographs are often degraded by undesired reflections. Separating the reflection layer and the background layer is an important problem for enhancing image quality. However, single-image reflection removal is a challenging process because of the ill-posed nature of the problem. In this paper, we propose a single-image reflection removal method based on a generative adversarial network. Our network is an end-to-end trained network with four types of losses: pixel loss, feature loss, adversarial loss and gradient constraint loss. We propose a novel gradient constraint loss in order to separate the background layer and the reflection layer clearly. The gradient constraint loss is applied in the gradient domain and minimizes the correlation between the background and reflection layers. Owing to the novel loss and our new synthetic dataset, our reflection removal method outperforms state-of-the-art methods in PSNR and SSIM, especially on real-world images.

Ryo Abiko, Masaaki Ikehara
SSA-GAN: End-to-End Time-Lapse Video Generation with Spatial Self-Attention

In our daily lives, we routinely predict how objects will move in the near future. But how do we make such predictions? In this paper, to address this problem, we propose a GAN-based network to predict the near future for fluid object domains such as cloud and beach scenes. Our model takes one frame and predicts future frames. Inspired by the self-attention mechanism [25], we propose introducing a spatial self-attention mechanism into the model. The self-attention mechanism calculates the response at a certain position as a weighted sum of the features at all positions, which enables us to train the model efficiently in one-stage learning. In the experiments, we show that our model is comparable with the state-of-the-art method, which performs two-stage learning.
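The spatial self-attention block follows the familiar non-local/SAGAN-style formulation in which the response at each position is a weighted sum of the features at all positions; a compact PyTorch sketch of such a module is given below, with channel sizes chosen only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialSelfAttention(nn.Module):
    """Self-attention over spatial positions with a learned residual gate gamma."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).view(b, -1, h * w).permute(0, 2, 1)    # B x HW x C'
        k = self.key(x).view(b, -1, h * w)                       # B x C' x HW
        attn = F.softmax(torch.bmm(q, k), dim=-1)                 # B x HW x HW
        v = self.value(x).view(b, -1, h * w)                      # B x C x HW
        out = torch.bmm(v, attn.permute(0, 2, 1)).view(b, c, h, w)
        return self.gamma * out + x                                # residual connection

x = torch.randn(2, 64, 16, 16)
print(SpatialSelfAttention(64)(x).shape)    # torch.Size([2, 64, 16, 16])
```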

Daichi Horita, Keiji Yanai
Semantic Segmentation of Railway Images Considering Temporal Continuity

In this paper, we focus on the semantic segmentation of images taken from a camera mounted on the front end of trains, for measuring and managing rail-side facilities. Improving the efficiency of, and perhaps automating, such tasks is crucial as they are currently done manually. We aim to realize this by capturing information about the railway environment through the semantic segmentation of train front-view camera images. Specifically, assuming that the lateral movement of trains is smooth, we propose a method that uses information from multiple frames to take temporal continuity into account during semantic segmentation. Based on the densely estimated optical flow between sequential frames, the weighted mean of the class likelihoods of pixels corresponding to those of the focused frame is calculated. We also construct a new dataset consisting of train front-view camera images and annotations for semantic segmentation. The proposed method outperforms a conventional single-frame semantic segmentation model, and the use of class likelihoods for the frame combination also proved effective.
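The frame-combination step, warping the previous frame's per-pixel class likelihoods along the estimated optical flow and averaging them with the current frame's likelihoods, can be sketched as follows; the fixed weight and channel-by-channel OpenCV remapping are simplifying assumptions for this illustration.

```python
import numpy as np
import cv2

def fuse_likelihoods(prev_probs, curr_probs, flow, prev_weight=0.4):
    """Warp the previous frame's class likelihoods (H x W x K) to the current
    frame using the backward flow (current -> previous), then blend them."""
    h, w, k = curr_probs.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    warped = np.stack(
        [cv2.remap(prev_probs[..., c].astype(np.float32), map_x, map_y,
                   cv2.INTER_LINEAR) for c in range(k)], axis=-1)
    return prev_weight * warped + (1.0 - prev_weight) * curr_probs
```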

Yuki Furitsu, Daisuke Deguchi, Yasutomo Kawanishi, Ichiro Ide, Hiroshi Murase, Hiroki Mukojima, Nozomi Nagamine
Real Full Binary Neural Network for Image Classification and Object Detection

We propose the Real Full Binary Neural Network (RFBNN), a method that reduces the memory footprint and computational cost of deep neural networks. It achieves performance similar to other BNNs in image classification and object detection while reducing computation and memory size. In RFBNN, the weight filters are approximated as binary values by applying a sign function, and these real binary weights are applied to all layers. Therefore, RFBNN can be efficiently implemented on CPU, FPGA and GPU. Results of all experiments show that the proposed method works successfully on various tasks such as image classification and object detection. All weights in our networks take values only in {1, −1}, and unlike other BNNs, there is no scaling factor. Compared to recent network binarization methods, BC, BWN and BWRN, we reduce the memory size and computation costs while achieving the same or better performance.
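A minimal sketch of weight binarization with a sign function. The straight-through estimator for the backward pass and the gradient clipping rule are standard BNN practice assumed here, not details given in the abstract.

```python
import torch
import torch.nn as nn

class BinarizeSTE(torch.autograd.Function):
    """sign() in the forward pass, straight-through gradient in backward."""
    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        return torch.sign(w)          # weights become {+1, -1} (0 maps to 0)

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        return grad_out * (w.abs() <= 1).float()  # pass gradients only inside [-1, 1]

class BinaryConv2d(nn.Conv2d):
    """Conv layer whose real-valued weights are binarized on the fly,
    with no per-channel scaling factor."""
    def forward(self, x):
        w_bin = BinarizeSTE.apply(self.weight)
        return self._conv_forward(x, w_bin, self.bias)

x = torch.rand(1, 3, 32, 32)
print(BinaryConv2d(3, 16, 3, padding=1)(x).shape)  # torch.Size([1, 16, 32, 32])
```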

Youngbin Kim, Wonjun Hwang
Progressive Scale Expansion Network with Octave Convolution for Arbitrary Shape Scene Text Detection

Scene text detection is a challenging problem due to image clutter and the high variability of text shapes. Many methods have been proposed for multi-oriented and arbitrary-shape text detection, in which the storage and computation costs of deep neural networks remain a concern. In this paper, we first introduce Octave Convolution into scene text detection to enlarge the receptive fields and reduce the spatial redundancy of networks. We then combine Octave Convolution with a state-of-the-art arbitrary-shape text detector, PSENet, which predicts kernels of different scales for each text instance. Experimental results on several benchmarks show that the proposed method improves both detection performance and speed in detecting multi-oriented and arbitrary-shape texts. Furthermore, our method achieves state-of-the-art performance on these benchmarks.

Shi Yan, Wei Feng, Peng Zhao, Cheng-Lin Liu
SFLNet: Direct Sports Field Localization via CNN-Based Regression

In this paper we propose a novel approach to building a single-shot regressor, called SFLNet, that directly predicts a parameter set relating a sports field seen in an input frame to its metric model. This problem is challenging due to the huge intra-class variance of sports fields and the large number of free parameters to be predicted. To address these issues, we propose to train our regressor in combination with semantic segmentation in a multi-task learning framework. We also introduce an additional module that exploits the spatial consistency of sports fields, which boosts both regression and segmentation performance. SFLNet can be learned from a training dataset that can be semi-automatically built from human-annotated point-to-point correspondences. To our knowledge, this work is the first attempt to solve the sports field localization problem relying only on an end-to-end deep learning framework. Experiments on our new dataset based on basketball games validate our approach over baseline methods.

Shuhei Tarashima
Trained Model Fusion for Object Detection Using Gating Network

Most transfer learning approaches in computer vision adapt a source domain to a target domain one-to-one. However, this scenario is difficult to apply to real applications such as video surveillance systems. Because such systems have many cameras installed at different locations, each regarded as a source domain, it is difficult to identify the proper source domain. In this paper, we introduce a new transfer learning scenario that has various source domains and one target domain, assuming video surveillance system integration. We also propose a novel method for automatically producing a high-accuracy model by fusing models trained on the various source domains. In particular, we show how to apply a gating network to fuse source domains for object detection tasks, which is a new approach. We demonstrate the effectiveness of our method through experiments on traffic surveillance datasets.
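The abstract does not describe the gating network in detail; the sketch below shows one simple way such a gate could work, predicting a softmax weight per source-domain detector from the input image and averaging their detection scores. The tiny gate architecture and the assumption of aligned candidate boxes across detectors are purely illustrative.

```python
import torch
import torch.nn as nn

class GatingFusion(nn.Module):
    """Predict a softmax weight per source-domain model from the input image,
    then fuse the models' detection scores with those weights."""
    def __init__(self, num_models):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, num_models),
        )

    def forward(self, image, per_model_scores):
        # per_model_scores: (num_models, num_boxes) confidences for the same
        # candidate boxes, a simplifying assumption for this sketch.
        weights = torch.softmax(self.gate(image), dim=1)[0]      # (num_models,)
        return (weights.unsqueeze(1) * per_model_scores).sum(0)  # fused scores

fusion = GatingFusion(num_models=3)
img = torch.rand(1, 3, 128, 128)
scores = torch.rand(3, 10)          # 3 detectors, 10 candidate boxes each
print(fusion(img, scores).shape)    # torch.Size([10])
```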

Tetsuo Inoshita, Yuichi Nakatani, Katsuhiko Takahashi, Asuka Ishii, Gaku Nakano
Category Independent Object Transfiguration with Domain Aware GAN

Object transfiguration aims to translate the domain of objects in an image. In this paper, we address a new task: category-independent object transfiguration, which enables objects to be transfigured even for object categories not included in the training data. Conventional methods are based on the premise that the object categories in the test images are contained in the training images. They can therefore learn the transfer regions and magnitudes implicitly and estimate them accurately in test images. However, when an image containing object categories not included in the training data is input, this premise breaks down; consequently, undesired regions are converted with undesired magnitudes. To tackle this problem, we propose a domain region and magnitude aware GAN that explicitly predicts the transfer region and magnitude, and translates images so that the predicted region and magnitude are consistent before and after translation. Experimental results show that our method translates object domains more realistically and accurately than the state-of-the-art method.

Kaori Kumagai, Yukito Watanabe, Takashi Hosono, Jun Shimamura, Atsushi Sagata
New Moments Based Fuzzy Similarity Measure for Text Detection in Distorted Social Media Images

Capturing or filming images with a cellphone and sharing them on social media is part and parcel of the day-to-day activities of humans. When an image is forwarded several times on social media, it may be heavily distorted because it passes through several different devices. This work deals with text detection in such distorted images. In this work, we consider images passed through three mobile devices on the WhatsApp social media platform, which results in four images (including the original image). Unlike existing methods that aim at developing new detection approaches, we utilize the results produced by existing ones to improve performance. The proposed method extracts Hu moments from the texts detected in the images and applies fuzzy logic. The similarity between the text detection results given by three existing text detection methods is studied to determine the best pair of texts. The same similarity estimation is then used in a novel way to remove extra background or non-texts and to restore missing text information. Experimental results on our own dataset and on benchmark datasets of natural scene images, namely MSRA-TD500, ICDAR2017-MLT, Total-Text, CTW1500 and COCO, show that the proposed method outperforms the existing methods.
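A small sketch of comparing two detected text regions via Hu moments and turning the distance into a fuzzy similarity in [0, 1]; the Gaussian membership function and the log-scaling of the moments are common choices assumed here, not necessarily the paper's.

```python
import numpy as np
import cv2

def hu_moments(region):
    """Log-scaled Hu moments of a binary text-region mask."""
    hu = cv2.HuMoments(cv2.moments(region)).flatten()
    return -np.sign(hu) * np.log10(np.abs(hu) + 1e-12)

def fuzzy_similarity(region_a, region_b, sigma=2.0):
    """Map the Euclidean distance between Hu-moment vectors to [0, 1]
    with a Gaussian membership function (1 = identical shapes)."""
    d = np.linalg.norm(hu_moments(region_a) - hu_moments(region_b))
    return float(np.exp(-(d ** 2) / (2 * sigma ** 2)))

# Toy usage: two synthetic binary masks standing in for detected text regions.
a = np.zeros((40, 120), np.uint8); a[10:30, 20:100] = 255
b = np.zeros((40, 120), np.uint8); b[12:28, 25:95]  = 255
print(fuzzy_similarity(a, b))
```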

Soumyadip Roy, Palaiahnakote Shivakumara, Umapada Pal, Tong Lu, Michael Blumenstein
Fish Detection Using Convolutional Neural Networks with Limited Training Data

Due to the effects of global climate change on marine biology and aquaculture, researchers have started to investigate the deep ocean environment and the living circumstances of rare fish species. One major issue in the related research is the difficulty of acquiring sufficient image data. This paper presents a method for underwater fish detection using limited training data. Current convolutional neural network based techniques perform well on object detection and segmentation but require a large collection of image data. The proposed network structure is based on the U-Net model, modified with various encoders, convolutional layers, and residual blocks, to achieve highly accurate detection and segmentation results. It provides better mIoU than other improved U-Net variants with a small amount of training data. Experiments carried out on fish tank scenes and the underwater environment demonstrate the effectiveness of the proposed technique compared to other state-of-the-art detection networks.

Shih-Lun Tseng, Huei-Yung Lin
A New U-Net Based License Plate Enhancement Model in Night and Day Images

The new trend of smart city development opens up many challenges. One such issue is automatic vehicle driving and detection for toll fee payment at night or in limited-light environments. This paper presents a new method for enhancing license plates captured in limited or low light conditions so that license plate detection methods can be extended to images captured at night. Due to the popularity of Convolutional Neural Networks (CNN) in solving complex issues, we explore a U-Net CNN for enhancing the contrast of license plate pixels. Since the difference between pixels that represent license plates and pixels that represent the background is small due to the low-light effect, we exploit the special property of U-Net, which extracts the context and symmetry of license plate pixels, to separate them from background pixels irrespective of content. This process results in image enhancement. To validate the enhancement results, we apply text detection methods and evaluate the proposed system based on their results. Experimental results on our newly constructed dataset, which includes images captured at night and in low or limited light conditions, and on the benchmark UCSD dataset, which includes very poor quality and high quality images captured in daylight, show that the proposed method outperforms the existing methods. In addition, the text detection results from different methods show that the proposed enhancement is effective and robust for license plate detection.

Pinaki Nath Chowdhury, Palaiahnakote Shivakumara, Ramachandra Raghavendra, Umapada Pal, Tong Lu, Michael Blumenstein
Integration of Biologically Inspired Pixel Saliency Estimation and IPDA Filters for Multi-target Tracking

The ability to visually locate and maintain targets is fundamental to modern autonomous systems. By augmenting a novel biologically-inspired target saliency estimator with Integrated Probability Data Association (IPDA) filters and linear prediction techniques, this paper demonstrates the reliable detection and tracking of small and weak-signature targets in cluttered environments. The saliency estimator performs an adaptive, spatio-temporal tone mapping and directional filtering that enhances local contrast and extracts consistent motion, strengthened by the IPDA mechanism which incrementally confirms targets and further removes stochastic false alarms. This joint technique was applied to mid-wave infra-red imagery of maritime scenes where heavy sea clutter distracts significantly from true vessels. Once initialised, the proposed method is shown to successfully maintain tracks of dim targets as small as $$2\times 1$$ pixels with 100% accuracy and zero false positives. On average, this method scored a sensitivity of 93.2%, with 100% precision, which surpasses well-established techniques in the literature. These results are very encouraging, especially for applications that require no misses in highly cluttered environments.

Daniel Griffiths, Laleh Badriasl, Tony Scoleri, Russell S. A. Brinkworth, Sanjeev Arulampalam, Anthony Finn
The Effectiveness of Noise in Data Augmentation for Fine-Grained Image Classification

Recognizing images from subcategories with subtle differences remains a challenging task due to the limited quantity and diversity of training samples. Existing data augmentation methods either rely on models trained with fully annotated data or involve a human in the loop, which is labor-intensive. In this paper, we propose a simple approach that leverages large amounts of noisy images from the Web for fine-grained image classification. Beginning with a deep model that takes image patches as input for feature representation, the maximum entropy learning criterion is first introduced to improve score-based patch selection. Then a noise removal procedure is designed to verify the usefulness of noisy images in the augmented data for classification. Extensive experiments on standard, augmented, and combined datasets with and without noise validate the effectiveness of our method. In general, we achieve comparable results on benchmark datasets, e.g., CUB-Birds, Stanford Dogs, and Stanford Cars, with only 50 augmented noisy samples per category.

Nanyu Sun, Xianjie Mo, Tingting Wei, Dabin Zhang, Wei Luo
Attention Recurrent Neural Networks for Image-Based Sequence Text Recognition

Image-based sequence text recognition is an important research direction in the field of computer vision. In this paper, we propose a new model called Attention Recurrent Neural Networks (ARNNs) for image-based sequence text recognition. ARNNs embed the attention mechanism seamlessly into recurrent neural networks (RNNs) through an attention gate. The attention gate generates a gating signal that is end-to-end trainable, which empowers ARNNs to adaptively focus on important information. The proposed attention gate can be applied to any recurrent network, e.g., the standard RNN, Long Short-Term Memory (LSTM), and the Gated Recurrent Unit (GRU). Experimental results on several benchmark datasets demonstrate that ARNNs consistently improve upon previous approaches on image-based sequence text recognition tasks.
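The abstract does not give the gate equations, so the sketch below only illustrates the general idea of an end-to-end trainable attention gate attached to a recurrent cell: the gate is computed from the current input and the previous hidden state and modulates the input before the recurrent update. The sigmoid gate on a vanilla RNNCell is an illustrative choice, not the ARNN formulation.

```python
import torch
import torch.nn as nn

class AttentionGatedRNNCell(nn.Module):
    """Vanilla RNN cell whose input is modulated by a learned attention gate
    computed from the current input and the previous hidden state."""
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.attn = nn.Linear(input_size + hidden_size, input_size)
        self.cell = nn.RNNCell(input_size, hidden_size)

    def forward(self, x_t, h_prev):
        gate = torch.sigmoid(self.attn(torch.cat([x_t, h_prev], dim=1)))
        return self.cell(gate * x_t, h_prev)    # attend, then recur

cell = AttentionGatedRNNCell(input_size=32, hidden_size=64)
h = torch.zeros(8, 64)
for x_t in torch.rand(5, 8, 32):                # sequence of length 5
    h = cell(x_t, h)
print(h.shape)  # torch.Size([8, 64])
```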

Guoqiang Zhong, Guohua Yue
Nighttime Haze Removal with Glow Decomposition Using GAN

In this paper, we investigate the problem of single-image haze removal at nighttime. Glow effects are inherent in nighttime scenes due to multiple light sources with various colors and prominent glows. As the glow obscures the color and shape of objects near light sources, it is important to handle the glow effect in nighttime haze images. Although convolutional neural networks have brought impressive improvements to daytime haze removal, it has been hard to train such networks in a supervised manner for nighttime scenes because of the difficulty of collecting training samples. To this end, we propose a nighttime haze removal algorithm with a glow decomposition network as a learning-based layer separation technique. Once the generative adversarial network removes the glow effect from the input image, the atmospheric light and the transmission map are obtained, and eventually the haze-free image. To verify the effectiveness of the proposed method, experiments are conducted on both real and synthesized nighttime haze images. The results show that our proposed method produces haze-removed images with better quality and fewer artifacts than those from previous studies.

Beomhyuk Koo, Gyeonghwan Kim
Finding Logo and Seal in Historical Document Images - An Object Detection Based Approach

Logos and seals serve the purpose of authenticating a document and referring to its source. This strategy was also prevalent in the medieval period. Different algorithms exist for the detection of logos and seals in document images. A close look at the present state-of-the-art methods reveals that they were focused on the detection of logos and seals in contemporary document images. However, such methods are likely to underperform when dealing with historical documents. This is because historical documents present additional challenges such as extra noise, bleed-through effects, blurred foreground elements and low contrast. The proposed method frames logo and seal detection as an object detection problem. Using a deep-learning technique, it counters the aforementioned problems and avoids the need for any pre-processing stage, such as layout analysis and/or binarization, in the system pipeline. The experiments were conducted on historical images from the 12th to the 16th century, and the results obtained were very encouraging for detecting logos in historical document images. To the best of our knowledge, this is the first attempt at logo detection in historical document images using an object-detection based approach.

Sukalpa Chanda, Prashant Kumar Prasad, Anders Hast, Anders Brun, Lasse Martensson, Umapada Pal
KIL: Knowledge Interactiveness Learning for Joint Depth Estimation and Semantic Segmentation

Depth estimation and semantic segmentation are two important yet challenging tasks in the field of pixel-level scene understanding. Previous works often solve the two tasks as parallel decoding/modeling processes, but do not adequately consider the strongly correlated relationships between them. In this paper, given an input image, we propose to learn the knowledge interactiveness of depth estimation and semantic segmentation for jointly predicting results in an end-to-end way. In particular, the key Knowledge Interactiveness Learning (KIL) module can effectively mine and leverage the connections and complementarities of these two tasks. Furthermore, the network parameters can be jointly optimized to boost the final predictions of depth estimation and semantic segmentation with a coarse-to-fine strategy. Extensive experiments on the SUN-RGBD and NYU Depth-V2 datasets demonstrate state-of-the-art performance of the proposed unified framework on both depth estimation and semantic segmentation tasks.

Ling Zhou, Chunyan Xu, Zhen Cui, Jian Yang
Road Scene Risk Perception for Intelligent Vehicles Using End-to-End Affordance Learning and Visual Reasoning

A key goal of intelligent vehicles is to provide a safer and more efficient method of transportation. One important aspect of intelligent vehicles is understanding the road scene using vehicle-mounted camera images. Perceiving the level of driving risk of a given road scene enables intelligent vehicles to drive more efficiently without compromising safety. Existing road scene understanding methods, however, do not explicitly or holistically model this notion of driving risk. This paper proposes a new perspective on scene risk perception by modeling end-to-end road scene affordance using a weakly supervised classifier. A subset of images from the BDD100k dataset was relabeled to evaluate the proposed model. Experimental results show that the proposed model is able to correctly classify three different levels of risk. Furthermore, saliency maps were used to demonstrate that the proposed model is capable of visually reasoning about the underlying causes of its decisions. By understanding risk holistically, the proposed method is intended to complement existing advanced driver assistance systems and autonomous vehicles.

Jayani Withanawasam, Ehsan Javanmardi, Kelvin Wong, Mahdi Javanmardi, Shunsuke Kamijo
Image2Height: Self-height Estimation from a Single-Shot Image

This paper analyzes a self-height estimation method from a single-shot image using a convolutional architecture. To estimate the height from which the image was captured, the method utilizes the object-related scene structure contained in a single image, in contrast to SLAM methods, which use geometric calculation on sequential images. Therefore, a variety of application domains, from wearable computing (e.g., estimation of the wearer's height) to the analysis of archived images, can be considered. This paper shows that (1) fine-tuning from a pretrained object-recognition architecture also contributes to self-height estimation and that (2) not only visual features but also their locations in an image are fundamental to the self-height estimation task. We verify these two points through a comparison of different learning conditions, such as preprocessing and initialization, as well as visualization and sensitivity analysis using a dataset obtained in indoor environments.

Kei Shimonishi, Tyler Fisher, Hiroaki Kawashima, Kotaro Funakoshi
Segmentation of Foreground in Image Sequence with Foveated Vision Concept

The human visual system has no difficulty detecting moving objects. Designing an automated method for detecting foreground in videos captured in varied and complicated scenes, however, is a challenge. The topic has attracted much research due to its wide range of video-based applications. We propose a foveated model that mimics the human visual system for the detection of foreground in image sequences. It is a two-step framework simulating the awareness of motion followed by the extraction of detailed information. In the first step, region proposals are extracted based on the similarity of intensity and motion features with respect to a pre-generated archetype. Through the integration of the similarity measures, each image frame is segregated into background and foreground points. Large foreground regions are preserved as region proposals (RPs). In the second step, analysis is performed on each RP in order to obtain the accurate shape of the moving object. Photometric and textural features are extracted and matched against another archetype. We propose a probabilistic refinement scheme: if an RP contains a point initially labeled as background, it is converted to a foreground point when its features are more similar to those of neighboring foreground points than to those of neighboring background points. Both archetypes are updated immediately based on the segregation result. We compare our method with some well-known and recently proposed algorithms on various video datasets.
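A naive sketch of the refinement rule stated in the abstract: inside a region proposal, a pixel initially labeled background is flipped to foreground when its feature vector is, on average, closer to neighbouring foreground features than to neighbouring background features. The Euclidean distance, the neighbourhood radius, and the brute-force loops are illustrative simplifications; a real implementation would be vectorized.

```python
import numpy as np

def refine_labels(features, labels, radius=2):
    """features: (H, W, D) per-pixel descriptors; labels: (H, W) 0=bg, 1=fg.
    Flip a background pixel to foreground when it is closer on average
    to its foreground neighbours than to its background neighbours."""
    h, w, _ = features.shape
    refined = labels.copy()
    for y in range(h):
        for x in range(w):
            if labels[y, x] == 1:
                continue
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            nb_feat = features[y0:y1, x0:x1].reshape(-1, features.shape[2])
            nb_lab = labels[y0:y1, x0:x1].reshape(-1)
            if nb_lab.sum() == 0:
                continue  # no foreground neighbours to compare against
            d = np.linalg.norm(nb_feat - features[y, x], axis=1)
            if d[nb_lab == 1].mean() < d[nb_lab == 0].mean():
                refined[y, x] = 1
    return refined

feat = np.random.rand(20, 20, 3)
lab = (np.random.rand(20, 20) > 0.5).astype(int)
print(refine_labels(feat, lab).sum())
```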

Kwok Leung Chan
Bi-direction Feature Pyramid Temporal Action Detection Network

Temporal action detection in long untrimmed videos is still a challenging task in video content analysis. Many existing approaches contain two stages, which first generate action proposals and then classify them. The main drawback of these approaches is that operations are repeated in the proposal extraction and classification stages. In this paper, we propose a novel Bi-direction Feature Pyramid Temporal Action Detection (BFPTAD) network based on 1D temporal convolutional and deconvolutional layers to detect action instances directly in long untrimmed videos. We use the top-down pathway to add semantic information to the shallow feature maps, and then use the bottom-up pathway to add location information to the deep feature maps. We evaluate our network on the THUMOS14 and ActivityNet benchmarks. Our approach significantly outperforms other state-of-the-art methods, increasing mAP@IoU = 0.5 from 44.2% to 52.2% on THUMOS14.
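A compact sketch of the bi-directional pathway idea over a 1D temporal feature sequence: a top-down step upsamples the deeper map and adds it to the shallower one, and a bottom-up step pools the refined shallow map back into the deeper one. The two-level depth, channel sizes, and pooling/interpolation choices are illustrative, not the BFPTAD architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiDirectionalTemporalPyramid(nn.Module):
    """Two-level temporal pyramid over a 1D feature sequence (N, C, T)."""
    def __init__(self, channels=64):
        super().__init__()
        self.down1 = nn.Conv1d(channels, channels, 3, stride=2, padding=1)
        self.down2 = nn.Conv1d(channels, channels, 3, stride=2, padding=1)
        self.smooth = nn.Conv1d(channels, channels, 3, padding=1)

    def forward(self, x):
        c1 = self.down1(x)                       # shallow map, longer in time
        c2 = self.down2(c1)                      # deep map, shorter in time
        # Top-down: inject semantics from the deep map into the shallow map.
        p1 = c1 + F.interpolate(c2, size=c1.shape[-1], mode="linear",
                                align_corners=False)
        # Bottom-up: inject localization from the refined shallow map back.
        p2 = c2 + F.max_pool1d(self.smooth(p1), kernel_size=2)
        return p1, p2

feats = torch.rand(2, 64, 128)                   # (batch, channels, time)
p1, p2 = BiDirectionalTemporalPyramid()(feats)
print(p1.shape, p2.shape)
```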

Jiang He, Yan Song, Haiyu Jiang
Hand Segmentation for Contactless Palmprint Recognition

Extracting a palm region at a fixed location from an input hand image is a crucial task in palmprint recognition for realizing reliable person authentication under unconstrained conditions. A palm region can be extracted from a fixed position using the gaps between fingers. Hence, an accurate and robust hand segmentation method is indispensable for extracting a palm region from an image with a complex background taken in various environments. This paper proposes a hand segmentation method for contactless palmprint recognition. The proposed method employs a new CNN architecture consisting of an encoder-decoder CNN model with a pyramid pooling module. Through a set of experiments on a hand image dataset, we demonstrate that the proposed method exhibits efficient performance on hand segmentation.
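A minimal sketch of a PSPNet-style pyramid pooling module of the kind the abstract attaches to the encoder-decoder; the bin sizes and channel projection are the usual defaults assumed here rather than details taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Pool the feature map at several grid sizes, project each pooled map,
    upsample back, and concatenate with the original features."""
    def __init__(self, channels, bins=(1, 2, 3, 6)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(channels, channels // len(bins), 1))
            for b in bins
        )

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = [F.interpolate(stage(x), size=(h, w), mode="bilinear",
                                align_corners=False) for stage in self.stages]
        return torch.cat([x] + pooled, dim=1)

feat = torch.rand(1, 256, 32, 32)
print(PyramidPooling(256)(feat).shape)   # torch.Size([1, 512, 32, 32])
```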

Yusei Suzuki, Hiroya Kawai, Koichi Ito, Takafumi Aoki, Masakazu Fujio, Yosuke Kaga, Kenta Takahashi
Selecting Discriminative Features for Fine-Grained Visual Classification

Fine-grained visual classification is a challenging task because of intra-class variation and inter-class similarity. Most fine-grained models focus predominantly on discriminative region localization, which can effectively address intra-class variation, but they ignore global information and the problem of inter-class similarity, which easily leads to overfitting to specific samples. To address these issues, we develop an end-to-end model based on selecting discriminative features for fine-grained visual classification without the help of part or bounding box annotations. In order to accurately select discriminative features, we integrate effective information from different receptive fields to enhance feature quality; the features of discriminative regions detected by anchors and the feature of the whole image are then jointly processed for classification. In addition, we propose a new loss function that optimizes the model to find discriminative regions and prevents overfitting to particular samples, which simultaneously addresses the problems of intra-class variation and inter-class similarity. Comprehensive experiments show that the proposed approach is superior to state-of-the-art methods on the CUB-200-2011, Stanford Cars and FGVC-Aircraft datasets.

Qin Xu, Linyang Li, Qian Chen, Bin Luo
Backmatter
Metadata
Title
Pattern Recognition
Edited by
Shivakumara Palaiahnakote
Prof. Gabriella Sanniti di Baja
Liang Wang
Prof. Dr. Wei Qi Yan
Copyright year
2020
Electronic ISBN
978-3-030-41404-7
Print ISBN
978-3-030-41403-0
DOI
https://doi.org/10.1007/978-3-030-41404-7