
2020 | Book

Pattern Recognition

5th Asian Conference, ACPR 2019, Auckland, New Zealand, November 26–29, 2019, Revised Selected Papers, Part II

Edited by: Shivakumara Palaiahnakote, Prof. Gabriella Sanniti di Baja, Liang Wang, Prof. Dr. Wei Qi Yan

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

This two-volume set constitutes the proceedings of the 5th Asian Conference on Pattern Recognition, ACPR 2019, held in Auckland, New Zealand, in November 2019.
The 9 full papers presented in this volume were carefully reviewed and selected from 14 submissions. They cover topics such as: classification; action and video and motion; object detection and anomaly detection; segmentation, grouping and shape; face and body and biometrics; adversarial learning and networks; computational photography; learning theory and optimization; applications, medical and robotics; computer vision and robot vision; pattern recognition and machine learning; multi-media and signal processing and interaction.

Table of contents

Frontmatter

Pattern Recognition and Machine Learning

Frontmatter
Margin Constraint for Low-Shot Learning

Low-shot learning aims to recognize novel visual categories from limited examples, mimicking the human visual system, and remains a challenging research problem. In this paper, we introduce a margin constraint into the loss function for low-shot learning to enhance the model’s discriminative power. Additionally, we adopt the novel categories’ normalized feature vectors directly as the corresponding classification weight vectors, in order to provide instant classification performance on the novel categories without retraining. Experiments show that our method provides better generalization and outperforms previous methods on low-shot learning benchmarks.

Xiaotian Wu, Yizhuo Wang
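The abstract above does not spell out its exact margin formulation; purely as an illustration, the following is a minimal PyTorch sketch of an additive cosine-margin softmax loss of the general kind described, together with one possible reading of "normalized feature vectors as classification weights". The margin m, scale s, and function names are hypothetical, not the authors' choices.

```python
import torch
import torch.nn.functional as F

def cosine_margin_loss(features, weights, labels, m=0.35, s=30.0):
    """Additive cosine-margin softmax loss (illustrative sketch).

    features: (N, D) embedding vectors
    weights:  (C, D) classification weight vectors
    labels:   (N,) ground-truth class indices
    """
    # Cosine similarity between L2-normalised features and class weights.
    cos = F.linear(F.normalize(features), F.normalize(weights))  # (N, C)
    # Subtract the margin m from the target-class cosine only.
    one_hot = F.one_hot(labels, num_classes=weights.size(0)).float()
    logits = s * (cos - m * one_hot)
    return F.cross_entropy(logits, labels)

def novel_class_weight(novel_features):
    # One way to obtain a retraining-free weight for a novel category:
    # the normalised mean of its few example embeddings.
    return F.normalize(novel_features.mean(dim=0), dim=0)
```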
Enhancing Open-Set Face Recognition by Closing It with Cluster-Inferred Gallery Augmentation

In open-set face recognition (as opposed to closed-set face recognition) it is possible that the identity of a given query is not present in the gallery set. In that case, the identity of the query can only be correctly classified as “unknown” when the similarity with the gallery faces is below a threshold that was determined a priori. However, in many use-cases, the set of queries contains multiple instances of the same identity, whether or not this identity is represented in the gallery. Thus, the set of query faces lends itself to identity clustering that could yield representative instances for unknown identities. By augmenting the gallery with these instances, we can make an open-set face recognition problem more closed. In this paper, we show that this method of Cluster-Inferred Gallery Augmentation (CIGA) does indeed improve the quality of open-set face recognition. We evaluate the addition of CIGA for both a private dataset of images taken in a school context and the public LFW dataset, showing a significant improvement in both cases. Moreover, an implementation of the suggested approach along with our experiments is made publicly available at https://gitlab.com/florisdf/acpr2019.

Floris De Feyter, Kristof Van Beeck, Toon Goedemé
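The authors' implementation is available at the repository linked above; the sketch below is only a simplified NumPy illustration of the gallery-augmentation idea (match/cluster thresholds and the greedy clustering routine are hypothetical stand-ins, not the paper's method).

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def augment_gallery(gallery, queries, match_thr=0.5, cluster_thr=0.7):
    """Add centroids of clustered 'unknown' query embeddings to the gallery.

    gallery: (G, D) known-identity embeddings; queries: (Q, D) query embeddings.
    Returns the augmented gallery (original entries first).
    """
    gallery, queries = l2norm(gallery), l2norm(queries)
    sims = queries @ gallery.T                        # cosine similarities
    unknown = queries[sims.max(axis=1) < match_thr]   # no gallery match

    centroids = []                                    # greedy cosine clustering
    for q in unknown:
        for i, c in enumerate(centroids):
            if q @ l2norm(c) > cluster_thr:
                centroids[i] = c + q                  # grow this cluster
                break
        else:
            centroids.append(q.copy())                # start a new cluster

    if centroids:
        gallery = np.vstack([gallery, l2norm(np.stack(centroids))])
    return gallery
```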
Action Recognition in Untrimmed Videos with Composite Self-attention Two-Stream Framework

With the rapid development of deep learning algorithms, action recognition in video has achieved many important research results. One problem in action recognition, Zero-Shot Action Recognition (ZSAR), which classifies new categories without any positive examples, has recently attracted considerable attention. Another difficulty in action recognition is that untrimmed data may seriously affect model performance. We propose a composite two-stream framework with a pre-trained model. Our proposed framework includes a classifier branch and a composite feature branch. A graph network model is adopted in each of the two branches, which effectively improves the feature extraction and reasoning ability of the framework. In the composite feature branch, 3-channel self-attention modules are constructed to weight each frame of the video and give more attention to the key frames. Each self-attention channel outputs a set of attention weights to focus on a particular stage of the video, and each set of attention weights corresponds to a one-dimensional vector. The 3-channel self-attention modules can infer key frames from multiple aspects. The output sets of attention weight vectors form an attention matrix, which effectively enhances the attention on key frames that are strongly correlated with the action. This model can also perform action recognition under zero-shot conditions, and has good recognition performance for untrimmed video data. Experimental results on relevant datasets confirm the validity of our model.

Dong Cao, Lisha Xu, HaiBo Chen
Representation Learning for Style and Content Disentanglement with Autoencoders

Many approaches have been proposed to disentangle style and content in image representations. Most existing methods aim to create new images from combinations of separated style and content. These approaches have shown impressive results in creating new images by swapping or mixing features, but they have not shown the results of the disentanglement itself for each factor. In this paper, to present the disentanglement results for each factor, we propose a two-branch autoencoder framework. Our proposed framework, which incorporates a content branch and a style branch, enables complementary and effective end-to-end learning. To demonstrate our framework in a qualitative assessment, we show the disentanglement results for style and content respectively on handwritten digit datasets: MNIST-M and EMNIST-M. Furthermore, for a quantitative assessment, we applied our method to a classification task. In this experiment, our disentangled content images from the digit datasets achieve competitive performance in classification. Our code is available at https://github.com/najaemin92/disentanglement-pytorch.

Jaemin Na, Wonjun Hwang
Residual Attention Encoding Neural Network for Terrain Texture Classification

Terrain texture classification plays an important role in computer vision applications such as robot navigation and autonomous driving. Traditional methods based on hand-crafted features often have sub-optimal performance due to their inefficiency in modeling complex terrain variations. In this paper, we propose a residual attention encoding network (RAENet) for terrain texture classification. Specifically, RAENet incorporates a stack of residual attention blocks (RABs) and an encoding block (EB). By generating attention feature maps jointly with residual learning, an RAB differs from commonly used blocks that only combine features of the current layer with those of the preceding layer. An RAB connects all preceding layers to the current layer, which not only minimizes the information loss in the convolution process but also enhances the weights of the features that help distinguish between different classes. The EB then adopts an orderless encoder to keep invariance to spatial layout in order to extract feature details before classification. The effectiveness of RAENet is evaluated on two terrain texture datasets. Experimental results show that RAENet achieves state-of-the-art performance.

Xulin Song, Jingyu Yang, Zhong Jin
The Shape of Patterns Tells More
Using Two Dimensional Hough Transform to Detect Circles

For image processing applications, an initial step is usually extracting features from the target image. Those features can be lines, curves, circles, circular arcs and other shapes. The Hough transform is a reliable and widely used method for straight line and circle detection, especially when the image is noisy. However, the Hough transform techniques for detecting lines and circles differ: detecting circles usually requires a three-dimensional parameter space, while detecting straight lines only requires two dimensions. Higher-dimensional parameter transforms suffer from high storage and computational requirements. However, in the two-dimensional Hough transform space, straight lines and circles yield patterns with different shapes. By analysing the shape of patterns within the Hough transform space it is possible to reconstruct the circles in image space. This paper proposes a new circle detection method based on analysing the pattern shapes within a two-dimensional line Hough transform space. The method has been evaluated on a simulation of detecting multiple circles and on a group of real-world images. The evaluation shows that our method is able to detect multiple circles in an image with mild noise.

Yuan Chang, Donald Bailey, Steven Le Moan
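The circle-from-line-space analysis is the paper's own contribution; for background only, here is a minimal NumPy sketch of the standard two-dimensional (rho, theta) line Hough accumulation whose patterns such a method would analyse. The function name and angular resolution are illustrative assumptions.

```python
import numpy as np

def hough_line_space(edge_points, shape, n_theta=180):
    """Accumulate edge points into the 2D (rho, theta) line Hough space.

    edge_points: iterable of (x, y) edge coordinates
    shape: (height, width) of the source image
    """
    h, w = shape
    rho_max = int(np.ceil(np.hypot(h, w)))
    thetas = np.deg2rad(np.arange(n_theta))
    cos_t, sin_t = np.cos(thetas), np.sin(thetas)
    acc = np.zeros((2 * rho_max + 1, n_theta), dtype=np.int32)
    for x, y in edge_points:
        # Each edge point votes for one rho per theta; offset keeps rho >= 0.
        rhos = np.round(x * cos_t + y * sin_t).astype(int) + rho_max
        acc[rhos, np.arange(n_theta)] += 1
    return acc  # a circle in image space leaves a characteristic band here
```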
A Spatial Density and Phase Angle Based Correlation for Multi-type Family Photo Identification

Due to changes in the mindset and living style of humans, the number of diversified marriages is increasing all around the world, irrespective of race, color, religion and culture. As a result, it is challenging for the research community to identify multi-type family photos, namely, normal family (family of the same race, religion or culture) and multi-culture family (family of different culture, religion or race), from family and non-family photos (images with friends, colleagues, etc.). In this work, we present a new method that combines spatial density information with phase angle for multi-type family photo classification. The proposed method uses three facial key points, namely, left eye, right eye and nose, for features based on the color, roughness and wrinkles of faces, which are prominent for extracting unique cues for classification. The correlations between features of Left & Right Eyes, Left Eye & Nose and Right Eye & Nose are computed for all the faces in an image. This results in feature vectors for the respective spatial density and phase angle information. Furthermore, the proposed method fuses the feature vectors and feeds them to a Convolutional Neural Network (CNN) for classification of the above three-class problem. Experiments conducted on our database, which contains three classes, namely, multi-cultural, normal and non-family images, and on the benchmark databases (due to Maryam et al. and Wang et al.), which contain two classes (family and non-family images), show that the proposed method outperforms the existing methods in terms of classification rate for all three databases.

Anaica Grouver, Palaiahnakote Shivakumara, Maryam Asadzadeh Kaljahi, Bhaarat Chetty, Umapada Pal, Tong Lu, G. Hemantha Kumar
Does My Gait Look Nice? Human Perception-Based Gait Relative Attribute Estimation Using Dense Trajectory Analysis

Relative attributes play an important role in object recognition and image classification tasks. These attributes provide high-level semantic explanations for describing and relating objects to each other instead of using direct labels for each object. In the current study, we propose a new method utilizing relative attribute estimation for gait recognition. First, we propose a robust gait motion representation system based on dense trajectories (DTs) extracted from video footage of gait, which is more suitable for gait attribute estimation than existing heavily body shape-dependent appearance-based features, such as gait energy images (GEI). Specifically, we used a Fisher vector (FV) encoding framework and histograms of optical flows (HOFs) computed with individual DTs. We then compiled a novel gait dataset containing 1,200 videos of walking subjects and annotations of gait relative attributes based on the subjective perception of pairs of subjects’ gaits. To estimate relative attributes, we trained a set of ranking functions for the relative attributes using the Rank-SVM method. These ranking functions estimated a score indicating the strength of the presence of each attribute for each walking subject. The experimental results revealed that the proposed method was able to represent gait attributes well, and that the proposed gait motion descriptor achieved better generalization performance than GEI for gait attribute estimation.

Allam Shehata, Yuta Hayashi, Yasushi Makihara, Daigo Muramatsu, Yasushi Yagi
Scene-Adaptive Driving Area Prediction Based on Automatic Label Acquisition from Driving Information

Technology for autonomous vehicles has attracted much attention for reducing traffic accidents, and the demand for its realization is increasing year by year. For safe driving on urban roads by an autonomous vehicle, it is indispensable to predict an appropriate driving path even if various objects exist in the environment. For predicting the appropriate driving path, it is necessary to recognize the surrounding environment. Semantic segmentation is widely studied as one of the surrounding-environment recognition methods and has been utilized for drivable area prediction. However, the driver’s operation, which is important for predicting the preferred drivable area (scene-adaptive driving area), is not considered in these methods. In addition, it is important to consider the movement of surrounding dynamic objects for predicting the scene-adaptive driving area. In this paper, we propose an automatic label assignment method based on actual driving information, and a scene-adaptive driving area prediction method using semantic segmentation and Convolutional LSTM (Long Short-Term Memory). Experiments on actual driving information demonstrate that the proposed methods can both acquire the labels automatically and predict the scene-adaptive driving area successfully.

Takuya Migishima, Haruya Kyutoku, Daisuke Deguchi, Yasutomo Kawanishi, Ichiro Ide, Hiroshi Murase
Enhancing the Ensemble-Based Scene Character Recognition by Using Classification Likelihood

Research on scene character recognition has been popular because of its potential in many applications, including automatic translation, signboard recognition, and reading assistance for the visually impaired. Scene character recognition is challenging and difficult owing to various environmental factors at the time of image capture and the complex design of characters. Current OCR systems have not achieved practical accuracy for arbitrary scene characters, although some effective methods have been proposed in the past. In order to enhance existing recognition systems, we propose a hierarchical recognition method utilizing the classification likelihood and image pre-processing methods. It is shown that the accuracy of our latest ensemble system is improved from 80.7% to 82.3% by adopting the proposed methods.

Fuma Horie, Hideaki Goto, Takuo Suganuma
Continual Learning of Image Translation Networks Using Task-Dependent Weight Selection Masks

Continual learning is the sequential training of a single network on multiple tasks. In general, naive continual learning causes severe catastrophic forgetting. To prevent it, several continual learning methods for deep convolutional neural networks (CNNs) have been proposed so far, most of which target image classification tasks. In this paper, we explore continual learning for the task of image translation. We apply Piggyback [1], a continual learning method that uses task-dependent masks to select model weights, to an encoder-decoder CNN so that it can perform different kinds of image translation tasks with only a single network. Through experiments on continual learning of semantic segmentation, image coloring, and neural style transfer, we show that the performance of the continually trained network is comparable to that of networks trained on each of the tasks individually.

Asato Matsumoto, Keiji Yanai
A Real-Time Eye Tracking Method for Detecting Optokinetic Nystagmus

Optokinetic nystagmus (OKN) is an involuntary repeated “beating” of the eye, comprised of sequences of slow tracking (slow phase) and subsequent quick re-fixation events (quick phase) that occur in response to (typically horizontally) drifting stimuli. OKN has a characteristic saw-tooth pattern that we detect here using a state-machine approach applied to the eye-tracking signal. Our algorithm transitions through the slow/quick phases of nystagmus (and a final state) in order to register the start, peak and end points of individual sawtooth events. The method generates duration, amplitude and velocity estimates for candidate events, as well as repetition estimates from the signal. We test the method on a small group of participants. The results suggest that false positive detections occur as single isolated events in feature space. As a result of this observation we apply a simple criterion based on the repetitious “beating” of the eye. The number of true positives is high (94%) and false OKN detections are low (2%). Future work will aim to optimise and rigorously validate the proof-of-concept framework we propose.

Mohammad Norouzifard, Joanna Black, Benjamin Thompson, Reinhard Klette, Jason Turuwhenua
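To make the state-machine idea concrete, here is a toy Python sketch of a slow-phase/quick-phase pass over a horizontal eye-position signal. The velocity thresholds, the assumption that the slow phase drifts in the positive direction, and the function name are all hypothetical and stand in for the paper's actual criteria.

```python
import numpy as np

def detect_sawtooth_events(x, fs, slow_v=(1.0, 20.0), quick_v=40.0):
    """Very simplified slow/quick-phase state machine for OKN-like sawtooth events.

    x: horizontal eye position (deg); fs: sampling rate (Hz).
    slow_v: (min, max) slow-phase speed in deg/s; quick_v: quick-phase speed threshold.
    Returns a list of (start, peak, end) sample indices of candidate events.
    """
    v = np.gradient(x) * fs            # instantaneous velocity in deg/s
    events, state, start, peak = [], "idle", 0, 0
    for i, vi in enumerate(v):
        if state == "idle" and slow_v[0] <= vi <= slow_v[1]:
            state, start = "slow", i               # slow tracking begins
        elif state == "slow" and vi <= -quick_v:
            state, peak = "quick", i               # fast re-fixation begins
        elif state == "quick" and abs(vi) < slow_v[0]:
            events.append((start, peak, i))        # event complete (final state)
            state = "idle"
    return events
```

Repetition criteria, as described in the abstract, could then be applied by requiring several consecutive events before accepting a detection.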
Network Structure for Personalized Face-Pose Estimation Using Incrementally Updated Face-Shape Parameters

This paper proposes a deep learning method for face-pose estimation with an incremental personalization mechanism to update the face-shape parameters. Recent advances in machine learning technology have also led to outstanding performance in applications of computer vision. However, network-based algorithms generally rely on an off-line training process that uses a large dataset, and a trained network (e.g., one for face-pose estimation) usually works in a one-shot manner, i.e., each input image is processed one by one with a static network. On the other hand, we expect a great advantage from having sequential observations, rather than just single-image observations, in many practical applications. In such cases, the dynamic use of multiple observations will contribute to improving system performance. The face-pose estimation method proposed in this paper, therefore, focuses on an incremental personalization mechanism. The method consists of two parts: a pose-estimation network and an incremental estimation of the face-shape parameters (shape-estimation network). Face poses are estimated from input images and face-shape parameters through the pose-estimation network. The shape parameters are estimated as the output of the shape-estimation network and iteratively updated over a sequence of image observations. Experimental results suggest the effectiveness of using face-shape parameters in face-pose estimation. We also describe the incremental refinement of face-shape parameters using a shape-estimation network.

Makoto Sei, Akira Utsumi, Hirotake Yamazoe, Joo-Ho Lee
Optimal Rejection Function Meets Character Recognition Tasks

In this paper, we propose an optimal rejection method for rejecting ambiguous samples with a rejection function. This rejection function is trained together with a classification function under the framework of Learning-with-Rejection (LwR). The highlights of LwR are: (1) the rejection strategy is not heuristic but has a strong foundation in machine learning theory, and (2) the rejection function can be trained on an arbitrary feature space which is different from the feature space used for classification. The latter suggests we can choose a feature space which is more suitable for rejection. Although past research on LwR focused only on its theoretical aspects, we propose to utilize LwR for practical pattern classification tasks. Moreover, we propose to use features from different CNN layers for classification and rejection. Our extensive experiments on notMNIST classification and character/non-character classification demonstrate that the proposed method achieves better performance than traditional rejection strategies.

Xiaotong Ji, Yuchen Zheng, Daiki Suehiro, Seiichi Uchida
Comparing the Recognition Accuracy of Humans and Deep Learning on a Simple Visual Inspection Task

In this paper, we investigate the number of training samples required for deep learning techniques to achieve better inspection accuracy than a human on a simple visual inspection task. We also examine whether there are differences in terms of finding anomalies when deep learning techniques outperform human subjects. To this end, we design a simple task that can be performed by non-experts, which requires participants to distinguish between normal and anomalous symbols in images. We automatically generated a large number of training samples containing normal and anomalous symbols for the task. The results show that the deep learning techniques required several thousand training samples to detect the locations of the anomalous symbols and tens of thousands to divide these symbols into segments. We also confirmed that deep learning techniques have both advantages and disadvantages compared with humans in the task of identifying anomalies.

Naoto Kato, Michiko Inoue, Masashi Nishiyama, Yoshio Iwai
Improved Gamma Corrected Layered Adaptive Background Model

This paper proposes a method for pixel-based background subtraction with improved gamma correction and a layered adaptive background model (IGLABM). The main problems of background subtraction are background oscillation and shadow. To solve these problems, we previously proposed the gamma corrected layered adaptive background model (GLABM); however, the performance of GLABM is not sufficient for real scenes. We hence improve the gamma estimation and preprocessing step of GLABM in this study using the covariance matrix of each pixel. We demonstrate the performance of the proposed improved method by comparing it with GLABM and other pixel-based background subtraction methods.

Kousuke Sakamoto, Hiroki Yoshimura, Masashi Nishiyama, Yoshio Iwai
One-Shot Learning-Based Handwritten Word Recognition

One-shot and few-shot learning algorithms have emerged as techniques that can imitate a human’s ability to learn from very few examples. This is an advantage over traditional deep networks, which require a lot of training samples and lack robustness due to their excessively domain-specific discriminators. In this paper, we explore a one-shot learning approach to recognizing handwritten words, using Siamese networks to classify the handwritten images at the word level. The Siamese network’s ability to compute similarities between two images is learned using a supervised metric, but the fully trained Siamese network can be used to classify new data that has not previously been used to train the network. The model learns to discriminate inputs from a small labelled support set. By using a convolutional architecture we were able to achieve robust results. We also expect that training the system over larger distributions of data will result in improved general handwritten word classification. An accuracy as high as 92.4% was obtained while performing 5-way one-shot word recognition on a publicly available dataset, which is quite high in comparison to the state-of-the-art methods.

Asish Chakrapani Gv, Sukalpa Chanda, Umapada Pal, David Doermann
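As an illustration of the general Siamese one-shot setup described above (not the authors' architecture), the following PyTorch sketch shows shared-weight twin branches, a similarity head over the absolute feature difference, and 5-way one-shot prediction by picking the most similar support image. All layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    """Toy Siamese network for word images; both branches share all weights."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten(), nn.Linear(64 * 16, 128))
        self.head = nn.Linear(128, 1)   # similarity score from |f(a) - f(b)|

    def forward(self, a, b):
        return self.head(torch.abs(self.encoder(a) - self.encoder(b)))

def one_shot_predict(model, query, support):
    """5-way one-shot: return the index of the most similar support image.

    query: (1, 1, H, W) image; support: (5, 1, H, W) one image per class.
    """
    with torch.no_grad():
        scores = model(query.expand(support.size(0), -1, -1, -1), support)
    return scores.squeeze(1).argmax().item()
```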
First-Person View Hand Parameter Estimation Based on Fully Convolutional Neural Network

In this paper, we propose a real-time framework that can not only estimate the location of hands within an RGB image but also simultaneously estimate their corresponding 3D joint coordinates and determine whether each hand is left or right. Most recent methods for hand pose analysis from monocular images only focus on the 3D coordinates of hand joints, which cannot give the full story to users or applications. Moreover, to meet the demands of applications such as virtual reality or augmented reality, a first-person viewpoint hand pose dataset is needed to train our proposed CNN. Thus, we collect a synthetic RGB dataset captured from an egocentric view with the help of Unity, a 3D engine. The synthetic dataset is composed of hands with various postures, skin colors and sizes. We provide 21 joint annotations, including 3D coordinates, 2D locations, and the corresponding hand side (left or right), for each hand within an image.

En-Te Chou, Yun-Chih Guo, Ya-Hui Tang, Pei-Yung Hsiao, Li-Chen Fu
Dual-Attention Graph Convolutional Network

Graph convolutional networks (GCNs) have shown a powerful ability to represent text structure and effectively facilitate the task of text classification. However, challenges still exist in adapting GCNs to learn discriminative features from texts, mainly due to the graph variants incurred by textual complexity and diversity. In this paper, we propose a dual-attention GCN to model the structural information of various texts as well as tackle the graph-variant problem by embedding two types of attention mechanisms, i.e. connection-attention and hop-attention, into the classic GCN. To encode various connection patterns between neighbour words, connection-attention adaptively imposes different weights on the neighbourhood of each word, which captures short-term dependencies. On the other hand, hop-attention applies scaled coefficients to different scopes during the graph diffusion process to make the model learn more about the distribution of context, which captures long-term semantics in an adaptive way. Extensive experiments are conducted on five widely used datasets to evaluate our dual-attention GCN, and the achieved state-of-the-art performance verifies the effectiveness of the dual-attention mechanisms.

Xueya Zhang, Tong Zhang, Wenting Zhao, Zhen Cui, Jian Yang
Chart-Type Classification Using Convolutional Neural Network for Scholarly Figures

Text-to-speech conversion by smart speakers is expected to help visually handicapped people who are near total blindness to read documents. This research considers a situation in which such text-to-speech conversion is applied to scholarly documents. Usually, a page in a scholarly document consists of multiple regions, i.e. ordinary text, mathematical expressions, tables, and figures. In this paper, we propose a method which classifies the chart type of scholarly figures using a convolutional neural network. The method classifies an input figure image into line charts or others. We evaluated the accuracy of the method using a scholarly figure dataset collected from actual academic papers. The classification accuracy of the proposed method reached 97%. We also compared the performance of the proposed method with that of hand-crafted features and a support vector machine. The results suggest that the proposed CNN classification outperforms the conventional approach.

Takeo Ishihara, Kento Morita, Nobu C. Shirai, Tetsushi Wakabayashi, Wataru Ohyama
Handwritten Digit String Recognition for Indian Scripts

In many documents digits/numerals may touch each other, and hence digit string recognition is necessary because segmenting individual numerals from a touching string is difficult. In this paper, we propose a digit string recognition system for four popular Indian scripts. Here we consider strings of the Kannada, Oriya, Tamil and Telugu scripts for our experiments. This paper has two contributions: (i) we have developed four datasets of digit strings, one for each of these scripts. Each dataset has 20000 numeral string samples for training and 30000 samples for testing. As no such dataset is available, it will be helpful to the community. (ii) We apply an RNN-free CNN (Convolutional Neural Network) and CTC (Connectionist Temporal Classification) based architecture for numeral string recognition. Unlike a normal text string, a digit string has no contextual information among the digits, and hence a digit may be followed by an arbitrary digit. Because of this behavior we apply a CNN and CTC based architecture without an RNN for numeral string recognition. We tested our scheme on the different test datasets and results are provided.

Hongjian Zhan, Pinaki Nath Chowdhury, Umapada Pal, Yue Lu
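For readers unfamiliar with CTC training of an RNN-free CNN, the PyTorch sketch below shows the standard usage of `nn.CTCLoss` for digit strings. The assumed feature shape of the hypothetical `cnn` (one score vector of 11 classes, blank plus ten digits, per image column) is an illustrative choice, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Class 0 is the CTC blank; classes 1..10 stand for the digits 0..9 of one script.
ctc = nn.CTCLoss(blank=0, zero_infinity=True)

def ctc_step(cnn, images, targets, target_lengths):
    """images: (N, 1, H, W); targets: 1D tensor of concatenated digit labels."""
    feats = cnn(images)                                # assumed shape (N, T, 11)
    log_probs = feats.log_softmax(2).permute(1, 0, 2)  # CTC expects (T, N, C)
    input_lengths = torch.full((images.size(0),), feats.size(1), dtype=torch.long)
    return ctc(log_probs, targets, input_lengths, target_lengths)
```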
Spatial-Temporal Graph Attention Network for Video-Based Gait Recognition

Gait is an attractive feature for human identification at a distance. It can be regarded as a temporal signal, while the human body shape can be regarded as a signal in the spatial domain. In the proposed method, we extract discriminative features from video sequences in the spatial and temporal domains with only one network, the Spatial-Temporal Graph Attention Network (STGAN). In the spatial domain, we design one branch to select distinguished regions and enhance their contribution, which makes the network focus on these regions. We also construct another branch, a Spatial-Temporal Graph (STG), to discover the relationship between frames and the variation of a region in the temporal domain. The proposed method can extract gait features in the two domains, and the two branches of the model can be trained end to end. The experimental results on two popular datasets, CASIA-B and OU-ISIR Treadmill-B, show that the proposed method noticeably improves gait recognition.

Xinhui Wu, Weizhi An, Shiqi Yu, Weiyu Guo, Edel B. García
Supervised Interactive Co-segmentation Using Histogram Matching and Bipartite Graph Construction

The identification and retrieval of images of the same or similar objects finds application in various tasks that are of prime importance in image processing and computer vision. Accurate and fast extraction of the object of interest from several images is essential for the construction of 3D models and for image retrieval applications. The joint partitioning of multiple images having the same or similar objects of interest into background and foreground parts is referred to as co-segmentation. This article proposes a novel and efficient interactive co-segmentation method based on the computation of a global energy function and a local smooth energy function. Computation of the global energy function from the scribbled regions of the images is based on histogram matching. This is used to estimate the probability of each region belonging either to the foreground or the background. The local smooth energy function is used to estimate the probability of regions having similar colour appearance. To further improve the quality of the segmentation, a bipartite graph is constructed using the segments. The algorithm has been implemented on the iCoseg and MSRC benchmark data sets. The extensive experimental results show significant improvement in performance compared to many state-of-the-art unsupervised co-segmentation and supervised interactive co-segmentation methods, both in computational time and accuracy.

Harsh Bhandari, Sarbani Palit, Bhabatosh Chanda
Using Deep Convolutional LSTM Networks for Learning Spatiotemporal Features

This paper explores the use of convolutional LSTMs to simultaneously learn spatial and temporal information in videos. A deep network of convolutional LSTMs allows the model to access the entire range of temporal information at all spatial scales. We describe our experiments involving convolutional LSTMs for lipreading that demonstrate the model is capable of selectively choosing which spatiotemporal scales are most relevant for a particular dataset. The proposed deep architecture holds promise in other applications where spatiotemporal features play a vital role, without having to specifically tailor the design of the network to the particular spatiotemporal features of the problem. Our model has performance comparable with the current state of the art, achieving 83.4% on the Lip Reading in the Wild (LRW) dataset. Additional experiments indicate convolutional LSTMs may be particularly data hungry, considering the large performance increases when fine-tuning on LRW after pretraining on larger datasets like LRS2 (85.2%) and LRS3-TED (87.1%). However, a sensitivity analysis providing insight into the relevant spatiotemporal features allows certain convolutional LSTM layers to be replaced with 2D convolutions, decreasing computational cost without performance degradation and indicating their usefulness in accelerating the architecture design process when approaching new problems.

Logan Courtney, Ramavarapu Sreenivas
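A convolutional LSTM cell replaces the fully connected gate transforms of a standard LSTM with convolutions, so the hidden state stays a feature map and spatial structure is preserved across time. Below is a minimal PyTorch sketch of such a cell (a generic textbook formulation, not the paper's exact architecture); kernel size and channel counts are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell: all gates are convolutions over feature maps."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state                                   # hidden and cell feature maps
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        i, f, o, g = i.sigmoid(), f.sigmoid(), o.sigmoid(), g.tanh()
        c = f * c + i * g                              # update cell state
        h = o * c.tanh()                               # update hidden state
        return h, (h, c)

# Stacking such cells and iterating over frames gives every layer access to
# both spatial and temporal context: h, state = cell(frame_t, state).
```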
Two-Stage Fully Convolutional Networks for Stroke Recovery of Handwritten Chinese Character

In this paper, we propose a method to recover strokes from offline handwritten Chinese characters. The proposed method employs a fully convolutional network (FCN) to estimate the writing order of connected components in offline Chinese character images and a multi-task FCN to estimate the writing order and directions of strokes in each connected component. The online dataset CASIA-OLHWDB1.0 from the CASIA database is used as the training set. Because the network produces discontinuous strokes, we refine the estimated writing orders using a graph cut (GC), in which the estimated directions are used for calculation of the smoothness term. Experimental results on the CASIA-OLHWDB1.0tst test dataset demonstrate the effectiveness of our method.

Yujung Wang, Motoharu Sonogashira, Atsushi Hashimoto, Masaaki Iiyama
Text Like Classification of Skeletal Sequences for Human Action Recognition

Human Action Recognition (HAR) has many applications in surveillance, gaming, animation and Active and Assisted Living (AAL). Several actions performed in daily life are composed of various poses arranged sequentially in time. Recognition of such actions is a difficult and challenging task. The classification approach proposed in this paper considers an analogy between actions and text, where an action is considered as a sentence and a single pose as a word. In the first stage, the poses are grouped based on their similarity and are then assigned labels. These labels are used for constructing label sequences representing motion. We propose Hierarchical Agglomerative Clustering (HAC) for clustering poses. Once the actions are modelled as the spatio-temporal evolution of key poses, we classify the actions using the Hidden Markov Model (HMM) and Hyper-dimensional Computing (HDC) classifiers. The experiments are performed on different datasets using both classifiers and the results are indicative of the effectiveness of the proposed approach in comparison with state-of-the-art methods.

Akansha Tyagi, Ashish Patel, Pratik Shah
Background Subtraction Based on Encoder-Decoder Structured CNN

Background subtraction is commonly adopted for detecting moving objects in an image sequence. It is an important and fundamental computer vision task and has a wide range of applications. We propose a background subtraction framework with a deep learning model. Pixels are labeled as background or foreground by an encoder-decoder structured Convolutional Neural Network (CNN). The encoder part produces a high-level feature vector. Then, the decoder part uses the feature vector to generate a binary segmentation map, which can be used to identify moving objects. The background model is generated from the image sequence. Each frame of the image sequence and the background model are input to the CNN for pixel classification. Background subtraction results can be erroneous as videos may be captured in various complex scenes, so the background model must be updated. Therefore, we propose a feedback scheme to perform pixelwise background model updating. For the training of the CNN, the input images and the corresponding ground truths are drawn from the benchmark dataset Change Detection 2014. The results show that our proposed architecture outperforms many well-known traditional and deep learning background subtraction algorithms.

Jingming Wang, Kwok Leung Chan
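The following PyTorch sketch illustrates the general frame-plus-background encoder-decoder idea from the abstract above; the layer sizes, the channel-concatenation of frame and background model, and the class name are illustrative assumptions rather than the authors' network.

```python
import torch
import torch.nn as nn

class BGSubNet(nn.Module):
    """Toy encoder-decoder: frame + background model in, foreground mask out."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                  # 6-channel input: frame | background
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid())

    def forward(self, frame, background):
        z = self.encoder(torch.cat([frame, background], dim=1))
        return self.decoder(z)          # per-pixel foreground probability map
```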
Multi Facet Face Construction

Generating a multi-faceted view from a single image has been a challenging problem for decades. Recent developments in technology enable us to tackle this problem effectively. Previously, several Generative Adversarial Network (GAN) based models have been used to deal with this problem as a linear framework: a generator (generally an encoder-decoder) followed by a discriminator. Such structures helped to some extent, but are not powerful enough to tackle this problem effectively. In this paper, we propose a GAN based dual-architecture model called DUO-GAN. In the proposed model, we add a second pathway in addition to the linear framework of GAN with the aim of better learning the embedding space. In this model, we propose two learning paths, which compete with each other in a parameter-sharing manner. Furthermore, the proposed two-pathway framework primarily trains multiple sub-models, which combine to give realistic results. The experimental results of DUO-GAN outperform state-of-the-art models in the field.

Hamed Alqahtani, Manolya Kavakli-Thorne

Multi-media and Signal Processing and Interaction

Frontmatter
Automated 2D Fetal Brain Segmentation of MR Images Using a Deep U-Net

Fetal brain segmentation is a difficult task yet an important step in studying brain development in utero. In contrast to adult studies, automatic fetal brain extraction remains challenging and has seen limited research, mainly due to the arbitrary orientation of the fetus, possible movement and a lack of annotated data. This paper presents a deep learning method for 2D fetal brain extraction from Magnetic Resonance Imaging (MRI) data using a convolutional neural network inspired by the U-Net architecture [1]. We modified the network to suit our segmentation problem by adding deeper convolutional layers, allowing the network to capture finer textural information, and by using more robust functions to avoid overfitting and to deal with imbalanced foreground (brain) and background (non-brain) samples. Experimental results using 200 normal fetal brains consisting of over 11,000 2D images showed that the proposed method produces Dice and Jaccard coefficients of 92.8 ± 6.3% and 86.7 ± 7.8%, respectively, providing a significant improvement over the original U-Net and its variants.

Andrik Rampun, Deborah Jarvis, Paul Griffiths, Paul Armitage
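The Dice and Jaccard coefficients reported above are standard overlap measures between a predicted binary mask and the ground truth; for reference, a minimal NumPy computation looks like this (assuming non-empty masks).

```python
import numpy as np

def dice_and_jaccard(pred, truth):
    """Overlap metrics for binary brain masks (boolean or {0,1} arrays)."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    inter = np.logical_and(pred, truth).sum()
    dice = 2.0 * inter / (pred.sum() + truth.sum())
    jaccard = inter / np.logical_or(pred, truth).sum()
    return dice, jaccard
```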
EEG Representations of Spatial and Temporal Features in Imagined Speech and Overt Speech

Imagined speech is an emerging paradigm for intuitive control of brain-computer interface based communication systems. Although the decoding performance of imagined speech is improving with actively proposed architectures, the fundamental question of what component is actually being decoded remains open. Considering that imagined speech refers to an internal mechanism of producing speech, it may naturally resemble the distinct features of overt speech. In this paper, we investigate the close relation of the spatial and temporal features between imagined speech and overt speech using electroencephalography signals. Based on the common spatial pattern feature, we acquired averaged thirteen-class classification accuracies of 16.2% and 59.9% (chance rate = 7.7%) for imagined speech and overt speech, respectively. Although overt speech showed significantly higher classification performance compared to imagined speech, we found potentially similar common spatial patterns for identical classes of imagined speech and overt speech. Furthermore, in the temporal feature, we examined the analogous grand-averaged potentials of the highly distinguished classes in the two speech paradigms. Specifically, the correlation of the amplitude between imagined speech and overt speech was 0.71 in the class with the highest true positive rate. The similar spatial and temporal features of the two paradigms may provide a key to the bottom-up decoding of imagined speech, implying the possibility of robust classification of multiclass imagined speech. It could be a milestone towards comprehensive decoding of speech-related paradigms, considering their underlying patterns.

Seo-Hyun Lee, Minji Lee, Seong-Whan Lee
GAN-based Abnormal Detection by Recognizing Ungeneratable Patterns

One approach to image anomaly detection is to build a model that generates normal images and to flag ungeneratable images as abnormal. Recently, multiple studies have reported the effectiveness of applying generative adversarial networks (GANs) to this task. A GAN trained on normal images is unable to generate abnormal images, which are not included in the manifold of normal images; hence, the generated image shows a substantial difference from the original image. Most of the previous studies measure this difference by a pixel-wise residual loss, where the essential difference between normal and abnormal is often smeared out by the inevitable pixel-level reconstruction error of the generator. In this report, a new GAN-based semi-supervised anomaly detection model, AnnoGAN, is proposed, in which the pixel-wise residual loss is replaced by a classifier network that is trained to recognize the essential difference between normal and abnormal. The proposed model achieves state-of-the-art performance in image anomaly detection on public datasets.

Soto Anno, Yuichi Sasaki
Modality-Specific Learning Rate Control for Multimodal Classification

Multimodal machine learning is an approach to performing tasks with inputs containing multiple expressions for a single subject. There are many recent reports on multimodal machine learning using the framework of deep neural networks. Conventionally, a common learning rate has been used for the network of all modalities. This has led, however, to a decrease in the overall accuracy due to overfitting in some modality models in cases when the convergence rate and generalization performance differ among modalities. In this paper, we propose a method that solves this problem by constructing a model within the framework of multitask learning, which simultaneously learns modality-specific classifiers as well as a multimodal classifier, to detect overfitting in each modality and carry out early stopping separately. We evaluated the accuracy of the proposed method using several datasets and demonstrated that it improves classification accuracy.

Naotsuna Fujimori, Rei Endo, Yoshihiko Kawai, Takahiro Mochizuki
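One simple way to realize modality-specific treatment in PyTorch is to give each modality branch its own parameter group (and hence its own learning rate) and to stop updating a branch once its validation loss plateaus. The sketch below is an illustrative assumption of such a setup; the module names, learning rates, and plateau rule are hypothetical and not the paper's method.

```python
import torch

def build_optimizer(audio_net, video_net, fusion_head):
    # Per-modality parameter groups allow modality-specific learning rates.
    return torch.optim.Adam([
        {"params": audio_net.parameters(),   "lr": 1e-4},
        {"params": video_net.parameters(),   "lr": 1e-3},
        {"params": fusion_head.parameters(), "lr": 1e-3},
    ])

def freeze_if_overfitting(module, val_losses, patience=3):
    """Stop updating one modality branch once its validation loss stops improving."""
    if len(val_losses) > patience and min(val_losses[-patience:]) > min(val_losses):
        for p in module.parameters():
            p.requires_grad = False
```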
3D Multi-frequency Fully Correlated Causal Random Field Texture Model

We propose a fast novel multispectral texture model with an analytical solution for both parameter estimation as well as unlimited synthesis. This Gaussian random field type of model combines a principal random field containing measured multispectral pixels with an auxiliary random field resulting from a given function whose argument is the principal field data. The model can serve as a stand-alone texture model or a local model for more complex compound random field or bidirectional texture function models. The model can be beneficial not only for texture synthesis, enlargement, editing, or compression but also for high accuracy texture recognition.

Michal Haindl, Vojtěch Havlíček
A Segment Level Approach to Speech Emotion Recognition Using Transfer Learning

Speech emotion recognition (SER) is a non-trivial task considering that the very definition of emotion is ambiguous. In this paper, we propose a speech emotion recognition system that predicts emotions for multiple segments of a single audio clip unlike the conventional emotion recognition models that predict the emotion of an entire audio clip directly. The proposed system consists of a pre-trained deep convolutional neural network (CNN) followed by a single layered neural network which predicts the emotion classes of the audio segments. The predictions for the individual segments are finally combined to predict the emotion of a particular clip. We define several new types of accuracies while evaluating the performance of the proposed model. The proposed model attains an accuracy of 68.7% surpassing the current state-of-the-art models in classifying the data into one of the four emotional classes (angry, happy, sad and neutral) when trained and evaluated on IEMOCAP audio-only dataset.

Sourav Sahoo, Puneet Kumar, Balasubramanian Raman, Partha Pratim Roy
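To illustrate the segment-level idea described above, the sketch below slices a waveform into fixed-length segments and combines per-segment class probabilities into a clip-level prediction by simple averaging. The segment and hop lengths, and averaging as the combination rule, are hypothetical choices for illustration, not the paper's exact scheme.

```python
import numpy as np

def split_into_segments(signal, sr, seg_seconds=2.0, hop_seconds=1.0):
    """Slice a mono waveform into fixed-length, overlapping segments."""
    seg, hop = int(seg_seconds * sr), int(hop_seconds * sr)
    return [signal[s:s + seg] for s in range(0, len(signal) - seg + 1, hop)]

def clip_emotion(segment_probs):
    """Combine per-segment class probabilities into one clip-level prediction.

    segment_probs: (n_segments, n_classes) softmax outputs of the segment model.
    Returns the index of the predicted emotion class for the whole clip.
    """
    return int(np.argmax(segment_probs.mean(axis=0)))
```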
A Decomposition Based Multi-objective Genetic Programming Algorithm for Classification of Highly Imbalanced Tandem Mass Spectrometry

Preprocessing tandem mass spectra to classify signal and noise peaks plays a crucial role in improving the accuracy of most peptide identification algorithms. As a CID tandem mass spectra dataset is highly imbalanced, with a high noise ratio and a small number of signal peaks (a low signal-to-noise ratio), a classification strategy which is able to maintain the performance trade-off between the minority (signal) and the majority (noise) class accuracies prior to peptide identification is required. Therefore, this paper proposes a Multi-Objective Genetic Programming (MOGP) approach based on the idea of MOEA/D, named MOGP/D, to evolve a Pareto front of classifiers along the optimal trade-off surface that offers the best compromises between objectives. In comparison with an NSGA-II based MOGP method, called NSGP, as the signal-to-noise ratio decreases, MOGP/D produces better solutions in the region of interest (the centre of the Pareto front) according to the hypervolume indicator on the training sets. Moreover, the best compromise solution achieved by the proposed method is compared with the best single-objective GP and the best of NSGP, and the results show that MOGP/D retains a reasonable number of signal peaks and filters more noise peaks compared to the other two methods. To further evaluate the effectiveness of MOGP/D, the preprocessed MS/MS data is submitted to widely used de novo sequencing software, PEAKS, to identify the peptides. The results show that the proposed multi-objective GP method improves the reliability of peptide identification compared to the single-objective GP.

Samaneh Azari, Bing Xue, Mengjie Zhang, Lifeng Peng
Automatic Annotation Method for Document Image Binarization in Real Systems

The accuracy of optical character recognition (OCR) has significantly improved recently through the use of deep learning. However, when OCR is used in real applications, the shortage of annotated images often makes training difficult. To solve this problem, there are automatic annotation methods. However, many of these methods are based on active learning, and operators need to confirm the generated annotation candidates. I propose a practical automatic annotation method for binarization, which is one of the components of OCR. The purpose of the proposed method is to automatically confirm the quality of annotation candidates. The method consists of three simple processes. First, a text region is cropped from the whole image. Second, binarization is applied to the cropped image at all thresholds. Third, all binarized cropped images are recognized and the recognition results are matched against a database of correct characters. If the characters match, the cropped binary image is correctly binarized, and the method selects that cropped binarized image as an annotation for binarization. The cropping coordinates and the correct character database (DB) can be obtained from a practical OCR system: because users of such a system usually input corrections for OCR misrecognitions, the system can obtain the correct characters and coordinates. The experimental results indicate that the annotations generated with the proposed method can improve the performance of deep-learning-based binarization. As a result, the normalized edit distance between the recognized text and the ground truth text can be reduced by 38.56% on the Find it! receipt image dataset.

Ryosuke Odate
Meaning Guided Video Captioning

Current video captioning approaches often suffer from the problem of missing objects in the video to be described, while generating captions semantically similar to the ground truth sentences. In this paper, we propose a new approach to video captioning that can describe objects detected by object detection and generate captions having similar meaning to the correct captions. Our model relies on S2VT, a sequence-to-sequence model for video captioning. Given a sequence of video frames, the encoding RNN takes each frame as well as the objects detected in the frame in order to incorporate the information of the objects in the scene. The decoding RNN outputs are then fed into an attention layer and then to a decoder for generating captions. The generated caption is compared with the ground truth by a learned metric so that vector representations of generated captions are semantically similar to those of the ground truth. Experimental results on the MSVD dataset demonstrate that the performance of the proposed approach is much better than the model without the proposed meaning-guided framework, showing the effectiveness of the proposed model. Code is publicly available at https://github.com/captanlevi/Meaning-guided-video-captioning- .

Rushi J. Babariya, Toru Tamaki
Infant Attachment Prediction Using Vision and Audio Features in Mother-Infant Interaction

Attachment is a deep and enduring emotional bond that connects one person to another across time and space. Our early attachment styles are established in childhood through the interaction between infants and caregivers. There are two attachment types, secure and insecure. The attachment experience affects personality development, particularly a sense of security, and research shows that it influences the ability to form stable relationships throughout life. It is also an important aspect of assessing the quality of parenting. Therefore, attachment has been widely studied in psychology research. It is usually assessed by Ainsworth’s Strange Situation Assessment (SSA) through tedious observation. As far as we know, there is no computational method to predict infant attachment type. We use Still-Face Paradigm (SFP) video and audio as input to predict attachment types through machine learning methods. In the present work, we recruited 64 infant-mother pairs, collected videos of the SFP when the babies were 5–8 months of age, and identified their attachment types, secure or insecure, by SSA when those infants were almost 2 years old. For the visual part, we extract motion features and apply an RNN with LSTM units for classification. For the audio part, speech enhancement is conducted as data pre-processing, and pitch frequency, short-time energy and Mel Frequency Cepstral Coefficient feature sequences are extracted. Then an SVM is deployed to explore the patterns in them. The experiments show that our method is able to discriminate between the two classes of subjects with good accuracy.

Honggai Li, Jinshi Cui, Li Wang, Hongbin Zha
Early Diagnosis of Alzheimer’s Disease Based on Selective Kernel Network with Spatial Attention

Alzheimer’s disease (AD) is a neurodegenerative disorder which leads to memory and behaviour impairment. Early discovery and diagnosis can delay the progress of this disease. In this paper, we propose a new deep learning method, a selective kernel network with attention (SKANet), for early diagnosis of AD using magnetic resonance imaging. Generally, deep learning methods for high-accuracy recognition are based on deep neural network structures that stack a myriad of convolutional layers. The structure of SKANet is constructed similarly to that of ResNeXt by repeating residual blocks with the same topology and using group convolution to save computational costs. Different from ResNeXt, the primary convolution is replaced by a selective kernel convolution to adaptively adjust the receptive field based on the input information. Then, an attention mechanism is added to the bottom of the block to emphasize important features and suppress unnecessary ones for a more accurate representation of the network. The block is termed the selective kernel with attention block and consists of a sequence of operations in the following order: a convolution with kernel size 1×1, a selective kernel convolution, a convolution with kernel size 1×1, and a spatial attention mechanism. The effectiveness of the proposed model is verified on the Alzheimer’s Disease Neuroimaging Initiative dataset. Our experimental results show the superiority of the proposed model for the early diagnosis of AD. The classification accuracy of AD and mild cognitive impairment reaches up to 98.82%.

Huanhuan Ji, Zhenbing Liu, Wei Qi Yan, Reinhard Klette
Detection of One Dimensional Anomalies Using a Vector-Based Convolutional Autoencoder

Anomaly detection is important for significant real-life applications such as network intrusion and credit card fraud detection. Existing anomaly detection methods only partially learn the features, which is not adequate for accurate detection of anomalies. In this study we propose a vector-based convolutional autoencoder (V-CAE) for one-dimensional anomaly detection. The core of our model is a linear autoencoder, which is used to construct a low-dimensional manifold of feature vectors for normal data. At the same time, we use a vector-based convolutional neural network (V-CNN) to extract features from the vector data before and after the linear autoencoder, which enables the model to learn deep features for efficient anomaly detection. This unsupervised learning method uses only normal data in the training phase. We use a combined anomaly score calculated from two reconstruction errors: (i) the error between the input and output of the whole architecture and (ii) the error between the input and output of the linear encoder. Compared with nine state-of-the-art methods, our proposed V-CAE shows effective and stable results, with an AUC of 0.996 in estimating anomalies on several benchmark datasets.

Qien Yu, Muthusubash Kavitha, Takio Kurita
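The combined anomaly score described above can be illustrated with a short sketch; the mean-squared-error form and the weighting coefficient alpha are hypothetical choices standing in for whatever combination the paper uses.

```python
import numpy as np

def anomaly_score(x, x_hat, z, z_hat, alpha=0.5):
    """Combine two reconstruction errors into one anomaly score.

    x, x_hat: input vector and output of the whole architecture.
    z, z_hat: input and output of the inner linear autoencoder.
    alpha weights the two terms (an illustrative choice here).
    """
    e_outer = np.mean((x - x_hat) ** 2)   # error (i): whole architecture
    e_inner = np.mean((z - z_hat) ** 2)   # error (ii): linear autoencoder
    return alpha * e_outer + (1.0 - alpha) * e_inner
```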
Detection of Pilot’s Drowsiness Based on Multimodal Convolutional Bidirectional LSTM Network

Pilot drowsiness causes various aviation accidents, such as aircraft crashes and deviations from the flight path, and endangers passenger safety. Therefore, detecting the pilot’s drowsiness is one of the critical issues in preventing serious aircraft accidents and predicting the pilot’s mental state. Conventional studies have investigated physiological signals such as brain signals, electrodermal activity (EDA), electrocardiogram (ECG), and respiration (RESP) for detecting pilot drowsiness. However, these studies do not yet provide sufficient performance to prevent sudden aviation accidents, because they detect the mental state only after drowsiness has occurred and only determine whether the pilot is drowsy or not. To overcome these limitations, in this paper, we propose a multimodal convolutional bidirectional LSTM network (MCBLN) to detect not only whether the pilot is drowsy but also the drowsiness level, using fused physiological signals (electroencephalography (EEG), EDA, ECG, and RESP) in the pilot’s environment. We acquired the physiological signals in a simulated aircraft environment across seven participants. The proposed MCBLN extracts features considering the spatial-temporal correlation between EEG signals and peripheral physiological measures (PPMs) (EDA, ECG, RESP) to detect the current pilot’s drowsiness level. Our proposed method achieved a grand-averaged classification accuracy of 45.16% (±1.01) for 9 levels of drowsiness. Also, we obtained 84.41% (±1.34) classification accuracy for the binary drowsy/non-drowsy decision across all participants. Hence, we have demonstrated the possibility of not only drowsiness detection but also 9-level drowsiness estimation in the pilot’s aircraft environment.

Baek-Woon Yu, Ji-Hoon Jeong, Dae-Hyeok Lee, Seong-Whan Lee
Deriving Perfect Reconstruction Filter Bank for Focal Stack Refocusing

This paper presents a digital refocusing method that transforms a captured focal stack directly into a new focal stack under different focus settings. Assuming Lambertian scenes with no occlusions, this paper theoretically shows that there exists a set of filters that perfectly reconstructs the focal stack under a Gaussian aperture from that captured under a Cauchy one. The perfect reconstruction filters are derived in a linear and space-invariant form using a layered scene representation. Numerical simulations using synthetic focal stacks showed that the root mean squared errors are quite small, less than 10⁻⁹, indicating that the derived filters allow perfect reconstruction.

Asami Ito, Akira Kubota, Kazuya Kodama
Skeleton-Based Labanotation Generation Using Multi-model Aggregation

Labanotation is a well-known notation system for effective dance recording and archiving. Using computer technology to generate Labanotation automatically is a challenging but meaningful task, as existing methods cannot fully utilize the spatial characteristics of human motion or distinguish subtle differences between similar human movements. In this paper, we propose a method based on multi-model aggregation for Labanotation generation. Firstly, two types of features are extracted, the joint feature and the Lie group feature, which reinforce the representation of human motion data. Secondly, a two-branch network architecture based on a Long Short-Term Memory (LSTM) network and LieNet is introduced to conduct effective human movement recognition. LSTM is capable of modeling long-term dependencies in the temporal domain, and LieNet is a powerful network for spatial analysis based on the Lie group structure. Within the architecture, the joint feature and the Lie group feature are fed into the LSTM model and the LieNet model respectively for training. Furthermore, we utilize score fusion methods to fuse the output class scores of the two branches, which performs better than either of the single models, due to the complementarity between LSTM and LieNet. In addition, skip connections are applied in the structure of LieNet, which simplifies the training procedure and improves the convergence behavior. Evaluations on a standard motion capture dataset demonstrate the effectiveness of the proposed method and its superiority compared with previous works.

Ningwei Xie, Zhenjiang Miao, Jiaji Wang
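
A minimal sketch of the late score fusion step described above, assuming each branch outputs raw class scores and that a weighted average of softmax probabilities is used; the 0.5/0.5 weighting is an assumption, not necessarily the fusion rule evaluated in the paper.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=1, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=1, keepdims=True)

def fuse_scores(lstm_scores, lienet_scores, w=0.5):
    """Weighted-average late fusion of the two branches' class scores."""
    fused = w * softmax(lstm_scores) + (1.0 - w) * softmax(lienet_scores)
    return fused.argmax(axis=1)  # predicted Labanotation class per sample

preds = fuse_scores(np.random.randn(8, 20), np.random.randn(8, 20))
```
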
Genetic Programming-Based Simultaneous Feature Selection and Imputation for Symbolic Regression with Incomplete Data

Symbolic regression via genetic programming has been used successfully for empirical modeling from given data sets. However, real-world data sets might contain missing values. Although there are different approaches to dealing with incomplete data sets for classification, symbolic regression with missing values has rarely been investigated. Similarly, only a few studies have been conducted on feature selection for symbolic regression, and none of them addresses the incompleteness issue. In this work, a genetic programming-based method for simultaneous imputation and feature selection is developed. This method selects the predictive features for the incomplete features whilst constructing their imputation models. Such models are designed to be suitable for data sets with mixed numerical and categorical features. The performance of the proposed method is compared with state-of-the-art, widely used imputation methods from three aspects: imputation accuracy, feature selection effectiveness, and symbolic regression performance.

Baligh Al-Helali, Qi Chen, Bing Xue, Mengjie Zhang
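
A minimal sketch of how imputation accuracy on artificially masked entries could be scored in such a comparison; scikit-learn's SimpleImputer stands in for the method being evaluated (it is not the authors' GP-based imputer), and the 20% missingness rate is an assumption.

```python
import numpy as np
from sklearn.impute import SimpleImputer

rng = np.random.default_rng(0)
X_true = rng.normal(size=(100, 5))

# Artificially mask 20% of the entries to create an incomplete data set.
mask = rng.random(X_true.shape) < 0.2
X_missing = X_true.copy()
X_missing[mask] = np.nan

# Baseline imputer (mean imputation); a GP-based imputer would replace this step.
X_imputed = SimpleImputer(strategy="mean").fit_transform(X_missing)

# Imputation accuracy measured as RMSE on the masked entries only.
rmse = np.sqrt(np.mean((X_imputed[mask] - X_true[mask]) ** 2))
print(f"imputation RMSE on masked entries: {rmse:.3f}")
```
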
A Generative Adversarial Network Based Ensemble Technique for Automatic Evaluation of Machine Synthesized Speech

In this paper, we propose a method to automatically compute a speech evaluation metric, the Virtual Mean Opinion Score (vMOS), for speech generated by Text-to-Speech (TTS) models in order to analyse its human-ness. In contrast to the currently used manual speech evaluation techniques, the proposed method uses an end-to-end neural network to calculate the vMOS, which is qualitatively similar to the manually obtained Mean Opinion Score (MOS). A Generative Adversarial Network (GAN) and a binary classifier are trained on real natural speech with known MOS. The vMOS is then calculated by averaging the scores obtained from the two networks. In this work, the input to the GAN’s discriminator is conditioned on the speech generated by off-the-shelf TTS models so as to get closer to natural speech. It is shown that the proposed model can be trained with a minimal amount of data, as its objective is to generate only the evaluation score and not speech. The proposed method has been tested on speech synthesized by state-of-the-art TTS models and reports vMOS values of 0.6675, 0.4945, and 0.4890 for Wavenet2, Tacotron, and Deepvoice3, respectively, while the vMOS for natural speech is 0.6682 on a scale from 0 to 1. These vMOS scores correspond to, and are qualitatively explained by, their manually calculated MOS scores.

Jaynil Jaiswal, Ashutosh Chaubey, Sasi Kiran Reddy Bhimavarapu, Shashank Kashyap, Puneet Kumar, Balasubramanian Raman, Partha Pratim Roy
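
The vMOS itself is described as the average of the two networks' scores; a minimal sketch under that reading, assuming both networks output a human-ness score in [0, 1] per utterance.

```python
def virtual_mos(discriminator_score, classifier_score):
    """vMOS as the mean of the GAN discriminator's and the binary classifier's
    human-ness scores, each assumed to lie in [0, 1]."""
    return 0.5 * (discriminator_score + classifier_score)

# e.g. a synthesized clip scored 0.70 by the discriminator and 0.63 by the classifier
print(virtual_mos(0.70, 0.63))  # 0.665
```
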
Large-Scale Font Identification from Document Images

Identification of the font in document images has many applications in modern character recognition systems. Visual font recognition is a challenging yet popular problem in pattern recognition, as many designers try to identify a font for their designs from available images. However, the identification problem is particularly difficult because of the large number of possible fonts that can be used for a particular design. Moreover, a database can contain multiple fonts with similar visual features, making it difficult to identify the exact font class. In this paper, we explore the font recognition problem for a database of 10,000 fonts using a convolutional neural network (CNN) architecture. To the best of our knowledge, no previous approach has explored the font identification problem with this many classes. We performed extensive experiments to quantify our results for synthetic as well as natural document images. We achieved 63.45% top-1 accuracy and 70.76% top-3 accuracy at the character level, and also observed 57.18% top-1 accuracy and 62.11% top-3 accuracy at the word level, even in the presence of rotation and scaling, which demonstrates the effectiveness of the proposed method.

Subhankar Ghosh, Prasun Roy, Saumik Bhattacharya, Umapada Pal
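
A generic sketch of how the reported top-1/top-3 accuracies over 10,000 font classes could be computed from class scores; this is standard evaluation code, not the authors' pipeline.

```python
import numpy as np

def top_k_accuracy(scores, labels, k=3):
    """scores: (n_samples, n_fonts) class scores; labels: (n_samples,) true font ids."""
    topk = np.argsort(scores, axis=1)[:, -k:]          # indices of the k highest scores
    hits = (topk == labels[:, None]).any(axis=1)
    return hits.mean()

scores = np.random.randn(1000, 10000)                  # 10,000 candidate fonts
labels = np.random.randint(0, 10000, size=1000)
print(top_k_accuracy(scores, labels, k=1), top_k_accuracy(scores, labels, k=3))
```
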
Aggregating Motion and Attention for Video Object Detection

Video object detection plays a vital role in a wide variety of computer vision applications. To deal with challenges such as motion blur, varying viewpoints/poses, and occlusions, we need to solve the problem of temporal association across frames. One of the most typical solutions for maintaining frame association is exploiting optical flow between consecutive frames. However, using optical flow alone may lead to poor alignment across frames due to the gap between optical flow and high-level features. In this paper, we propose an Attention-Based Temporal Context module (ABTC) for more accurate frame alignment. We first extract two kinds of features for each frame using the ABTC module and a Flow-Guided Temporal Coherence module (FGTC). The features are then integrated and fed to the detection network for the final result. The ABTC and FGTC modules are complementary and work together to obtain higher detection quality. Experiments on the ImageNet VID dataset show that the proposed framework performs favorably against state-of-the-art methods.

Ruyi Zhang, Zhenjiang Miao, Cong Ma, Shanshan Hao
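
A minimal sketch of attention-weighted aggregation of neighbouring-frame features around a reference frame; the cosine-similarity attention used here is an assumption for illustration, not the exact ABTC module.

```python
import torch
import torch.nn.functional as F

def aggregate_with_attention(ref_feat, support_feats):
    """ref_feat: (C, H, W) reference-frame feature map.
    support_feats: (T, C, H, W) features from neighbouring frames.
    Weights each support frame, per location, by cosine similarity to the reference."""
    ref = F.normalize(ref_feat, dim=0)                    # (C, H, W)
    sup = F.normalize(support_feats, dim=1)               # (T, C, H, W)
    sim = (sup * ref.unsqueeze(0)).sum(dim=1)             # (T, H, W) similarity maps
    weights = torch.softmax(sim, dim=0).unsqueeze(1)      # (T, 1, H, W)
    return (weights * support_feats).sum(dim=0)           # aggregated (C, H, W) feature

out = aggregate_with_attention(torch.randn(256, 38, 50), torch.randn(5, 256, 38, 50))
```
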
Convexity Preserving Contraction of Digital Sets

Convexity is one of the useful geometric properties of digital sets in digital image processing. Various applications require deforming digital convex sets while preserving their convexity. In this article, we consider the contraction of such digital sets by removing digital points one by one. To this end, we use tools from combinatorics on words to detect a set of removable points and to define such a convexity-preserving contraction of a digital set as an operation of rewriting its boundary word. In order to choose one of the removable points at each contraction step, we present three geometrical strategies, which are related to vertex angle and area changes. We also show experimental results of applying the methods to repair some non-convex digital sets, which were obtained by rotating convex digital sets.

Lama Tarsissi, David Coeurjolly, Yukiko Kenmochi, Pascal Romon
Parallax-Tolerant Video Stitching with Moving Foregrounds

The parallax artifacts introduced by the movement of objects across different views in the overlapping area drastically degrade the video stitching quality. To alleviate such visual artifacts, this paper extends our earlier video stitching framework [1] by employing a deep learning based object detection algorithm for parallax detection and an optical flow estimation algorithm for parallax correction. Given a set of multi-view overlapping videos, geometric look-up tables (G-LUT), which map the input video frames to the panorama domain, are generated by stitching a reference frame from the multi-view input videos. We propose to use a deep learning based approach to detect the moving objects in the overlapping area and thereby identify the G-LUT control points affected by parallax. To compute the optimal locations of these parallax-affected G-LUT control points, we propose to use patch-match based optical flow (CPM-flow). Adjusting G-LUT control points in the overlapping area may cause unwanted geometric distortions in the non-overlapping area. Therefore, the G-LUT control points in close proximity to moving objects are also updated to ensure a smooth geometric transition between the overlapping and non-overlapping areas. Experimental results on challenging video sequences with very narrow overlapping areas (~3% to ~10%) demonstrate that the video stitching framework with the proposed parallax minimization scheme can significantly suppress the parallax artifacts caused by moving objects. In comparison to our previous work, the computational time is reduced by ~26% with the proposed scheme, while the stitching quality is also marginally improved.

Muhammad Umer Kakli, Yongju Cho, Jeongil Seo
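
A minimal sketch of correcting parallax-affected G-LUT control points by sampling a dense optical flow field at their locations; OpenCV's Farneback flow stands in for CPM-flow here, and the function and variable names are hypothetical.

```python
import cv2
import numpy as np

def correct_control_points(ref_gray, tgt_gray, control_points):
    """control_points: (N, 2) array of (x, y) G-LUT control points in the overlap area.
    Returns flow-corrected point locations. Farneback flow stands in for CPM-flow."""
    flow = cv2.calcOpticalFlowFarneback(ref_gray, tgt_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)  # (H, W, 2) dense flow
    xs = control_points[:, 0].astype(int)
    ys = control_points[:, 1].astype(int)
    displacement = flow[ys, xs]                                    # sample flow at the points
    return control_points + displacement

pts = np.array([[120.0, 80.0], [300.0, 95.0]])
ref = np.random.randint(0, 255, (480, 640), dtype=np.uint8)
tgt = np.random.randint(0, 255, (480, 640), dtype=np.uint8)
print(correct_control_points(ref, tgt, pts))
```
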
Prototype-Based Interpretation of Pathological Image Analysis by Convolutional Neural Networks

The recent success of convolutional neural networks (CNNs) has attracted much attention to applying computer-aided diagnosis systems to digital pathology. However, the basis of a CNN’s decision is incomprehensible to humans due to its complexity, which reduces the reliability of its decisions. We improve the interpretability of decisions made by the CNN by presenting them as co-occurrences of interpretable components that typically appear in parts of images. To this end, we propose a prototype-based interpretation method and define prototypes as these components. The method comprises the following three approaches: (1) presenting typical parts of images as multiple components, (2) allowing humans to interpret the components visually, and (3) making decisions based on the co-occurrence relations of the multiple components. Concretely, we first encode image patches using the encoder of a variational auto-encoder (VAE) and cluster the encoded image patches to obtain prototypes. We then decode the prototypes into images using the VAE’s decoder to make them visually interpretable. Finally, we calculate weighted combinations of the prototype occurrences for image-level classification. The weights enable us to ascertain which prototypes contributed to decision-making. We verified both the interpretability and the classification performance of our method through experiments using two types of datasets. The proposed method showed a significant advantage for interpretation by displaying the association between class-discriminative components in an image and the prototypes.

Kazuki Uehara, Masahiro Murakawa, Hirokazu Nosato, Hidenori Sakanashi
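
A minimal sketch of the prototype pipeline described above: cluster encoded patches to obtain prototypes, decode them for visual inspection, and classify images from prototype-occurrence histograms. The toy encoder/decoder, the number of prototypes, and the logistic-regression classifier are stand-ins, not the authors' VAE or weighting scheme.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Placeholder stand-ins for a trained VAE encoder/decoder (assumptions for illustration).
encode = lambda patch: patch.reshape(-1)[:8]          # pretend 8-D latent code
decode = lambda latent: latent.reshape(2, 4)          # pretend decoded prototype image

patches = [np.random.rand(4, 4) for _ in range(200)]  # image patches from a training set
latents = np.stack([encode(p) for p in patches])

# 1) Cluster the encoded patches; the cluster centres act as prototypes.
km = KMeans(n_clusters=16, n_init=10).fit(latents)
prototype_images = [decode(c) for c in km.cluster_centers_]   # decoded for inspection

# 2) Represent each image by the occurrence histogram of its patches' prototypes.
def occurrence_features(patch_latents):
    return np.bincount(km.predict(patch_latents), minlength=km.n_clusters)

# 3) A linear model over the histograms; its weights indicate which prototypes
#    contributed to the image-level decision (logistic regression is an assumption).
X = np.stack([occurrence_features(latents[i:i + 10]) for i in range(0, 200, 10)])
y = np.random.randint(0, 2, size=len(X))              # dummy image-level labels
clf = LogisticRegression(max_iter=1000).fit(X, y)
```
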
Design of an Optical Filter to Improve Green Pepper Segmentation Using a Deep Neural Network

Image segmentation is a challenging task in the computer vision field. In this paper, we aim to distinguish green peppers from large amounts of green leaves by using hyperspectral information. Our key aim is to design a novel optical filter that identifies the bands where peppers differ substantially from green leaves. We design the optical filter as a learnable weight placed in front of an RGB filter with fixed weights, and classify green peppers in an end-to-end manner. Our work consists of two stages. In the first stage, we obtain the optical filter parameters by training the optical filter and a small neural network simultaneously at the pixel level of the hyperspectral data. In the second stage, we apply the learned optical filter and the RGB filter successively to a hyperspectral image to obtain an RGB image. We then use a SegNet-based network to obtain better segmentation results at the image level. Our experimental results demonstrate that this two-stage method performs well on a small dataset and that the optical filter helps to improve segmentation accuracy.

Jun Yu, Xinzhi Liu, Pan Wang, Toru Kurihara
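
A minimal sketch of placing a learnable per-band optical filter in front of a fixed RGB response, as described above; the band count, the sigmoid transmittance constraint, and the random RGB response matrix are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LearnableOpticalFilter(nn.Module):
    """Per-band transmittance learned jointly with the downstream classifier."""
    def __init__(self, n_bands=81):
        super().__init__()
        self.log_t = nn.Parameter(torch.zeros(n_bands))            # learnable parameters
        # Fixed (non-trainable) 3 x n_bands RGB camera response; random stand-in here.
        self.register_buffer("rgb_response", torch.rand(3, n_bands))

    def forward(self, hsi):
        # hsi: (batch, n_bands, H, W) hyperspectral cube
        t = torch.sigmoid(self.log_t)                              # transmittance in (0, 1)
        filtered = hsi * t.view(1, -1, 1, 1)                       # apply optical filter per band
        # Integrate the filtered spectrum through the fixed RGB filter -> (batch, 3, H, W)
        return torch.einsum("cb,nbhw->nchw", self.rgb_response, filtered)

rgb = LearnableOpticalFilter()(torch.rand(2, 81, 32, 32))  # would feed a SegNet-style network
```
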
Eye Contact Detection from Third Person Video

Eye contact is fundamental to human communication and social interaction; therefore, much effort has been made to develop automated eye-contact detection using image recognition techniques. However, existing methods use first-person videos (FPV), which require participants to wear cameras. In this work, we develop a novel eye contact detection algorithm for videos taken from a normal viewpoint (third-person video), assuming scenes of conversation or social interaction. Our system is highly affordable since it does not require special hardware or recording setups; moreover, it can use pre-recorded videos such as YouTube and home videos. In designing the algorithm, we first develop DNN-based one-sided gaze estimation algorithms that output whether one subject is looking at another. Eye contact is then detected at the frames where a pair of one-sided gazes occurs. To verify the proposed algorithm, we generated a third-person eye contact video dataset using publicly available videos from YouTube. As a result, the proposed algorithm achieved 0.775 precision and 0.671 recall, while the existing method achieved 0.484 precision and 0.061 recall.

Yuki Ohshima, Atsushi Nakazawa
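
Under the description above, the eye-contact rule itself is simple: declare eye contact in a frame when both one-sided gaze detections fire. A minimal sketch of that frame-wise rule:

```python
def detect_eye_contact(gaze_a_to_b, gaze_b_to_a):
    """gaze_x_to_y: list of per-frame booleans from the one-sided gaze estimator,
    True when x is looking at y. Eye contact = both directions in the same frame."""
    return [a and b for a, b in zip(gaze_a_to_b, gaze_b_to_a)]

print(detect_eye_contact([True, True, False, True], [False, True, False, True]))
# -> [False, True, False, True]
```
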
Semi-supervised Early Event Detection

Early event detection is one of the key problems in the field of event detection due to its timeliness in widespread applications. The objective of early detection (ED) is to identify the specified event in a video sequence as early as possible, before it ends. This paper introduces semi-supervised learning to ED, which is the first attempt to utilize domain knowledge in this field. In this setting, some domain knowledge in the form of pairwise constraints is available. In particular, we treat the segments of complete events as must-link constraints. Furthermore, segments that do not overlap with the event are paired with the complete events as cannot-link constraints. Thus, a new algorithm termed semi-supervised ED (SemiED) is proposed, which makes better early detections for videos. The SemiED algorithm is a convex quadratic programming problem, which can be solved efficiently. We also discuss the computational complexity of SemiED to evaluate its effectiveness. The superiority of the proposed method is validated on two video-based datasets.

Liping Xie, Chen Gong, Jinxia Zhang, Shuo Shan, Haikun Wei
A Web-Based Augmented Reality Approach to Instantly View and Display 4D Medical Images

In recent years, developments in non-invasive and painless medical imaging, such as Computed Tomography (CT) and magnetic resonance imaging (MRI), have improved the process of disease diagnosis and clarification, including tumours, cysts, injuries, and cancers. Full-body scanners with superior spatial resolution provide essential details of complicated anatomical structures for effective diagnostics. However, it is challenging for a physician to glance over a large dataset of hundreds or even thousands of images (2D “slices” of the body). Consider a case in which a doctor wants to view a patient’s CT or MRI scans for analysis: he needs to review and compare many layers of 2D image stacks (many 2D slices make a 3D stack). If the patient is scanned multiple times (in three consecutive months, for instance) to confirm the growth of a tumour, the dataset becomes 4D (a timestamp is added). The manual analysis process is time-consuming, troublesome, and labour-intensive. The innovation of Augmented Reality (AR) over the last few decades allows us to address this problem. In this paper, we propose an AR technique that assists the doctor in instantly accessing and viewing a patient’s set of medical images quickly and easily. The doctor can use an optical head-mounted display such as the Google Glass, a VR headset such as the Samsung Gear VR, or a general smartphone such as the Apple iPhone X. He looks at a palm-sized AR tag on the patient’s document embedded with a QR code, and the smart device detects and downloads the patient’s data using the decrypted QR code and displays layers of CT or MRI images right on top of the AR tag. Looking in and out from the tag allows the doctor to see the layers above or below the current viewing layer. Moreover, shifting the viewing orientation left or right allows the doctor to see the same layer of images but at a different timestamp (e.g. the previous or next monthly scan). Our results demonstrate that this technique enhances the diagnostic process and saves cost and time in medical practice.

Huy Le, Minh Nguyen, Wei Qi Yan
Group Activity Recognition via Computing Human Pose Motion History and Collective Map from Video

In this paper, we propose a deep learning based approach that exploits multi-person pose estimation from an image sequence to predict individual actions as well as the collective activity of a group scene. We first apply multi-person pose estimation to extract pose information from the image sequence. We then propose a novel representation called pose motion history (PMH), which aggregates the spatio-temporal dynamics of multi-person human joints in the whole scene into a single stack of feature maps. Individual pose motion history stacks (Indi-PMH) are then cropped from the whole-scene stack and fed into a CNN model to obtain individual action predictions. Based on these individual predictions, we construct a collective map that encodes both the positions and the actions of all individuals in the group scene into a feature map stack. The final group activity prediction is determined by fusing the results of two classification CNNs: one takes the whole-scene pose motion history stack as input, and the other takes the collective map stack as input. We evaluate the proposed approach on the challenging Volleyball dataset, and it provides very competitive performance compared to state-of-the-art methods.

Hsing-Yu Chen, Shang-Hong Lai
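
A minimal sketch of a motion-history-style aggregation of per-frame joint heatmaps into a single stack; the exponential-decay maximum used here is an assumed aggregation rule, and the paper's PMH construction may differ.

```python
import numpy as np

def pose_motion_history(joint_heatmaps, decay=0.8):
    """joint_heatmaps: (T, J, H, W) per-frame heatmaps for J joints.
    Returns a (J, H, W) stack where recent joint activity dominates older activity."""
    pmh = np.zeros(joint_heatmaps.shape[1:])
    for t in range(joint_heatmaps.shape[0]):
        pmh = np.maximum(decay * pmh, joint_heatmaps[t])  # keep the freshest strong response
    return pmh

pmh = pose_motion_history(np.random.rand(10, 17, 90, 160))  # e.g. 17 COCO joints
```
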
DeepRoom: 3D Room Layout and Pose Estimation from a Single Image

Though many deep learning approaches have significantly boosted the accuracy of room layout estimation, existing methods follow the long-established traditional pipeline: they replace the front-end model with a CNN and still rely heavily on post-processing for layout reasoning. In this paper, we propose a geometry-aware framework with pure deep networks that estimates the 2D as well as the 3D layout in sequence. We decouple the task of layout estimation into two stages, first estimating the 2D layout representation and then the parameters of the 3D cuboid layout. Moreover, with such a two-stage formulation, the outputs of the deep networks are explainable and also extensible to other training signals, jointly and separately. Our experiments demonstrate that the proposed framework provides not only competitive 2D layout estimation but also 3D room layout estimation in real time without post-processing.

Hung Jin Lin, Shang-Hong Lai
Multi-task Learning for Fine-Grained Eye Disease Prediction

Recently, deep learning techniques have been widely used for medical image analysis. While there exists some work on deep learning for ophthalmology, there is little work on multi-disease prediction from retinal fundus images. Also, most of the work is based on small datasets. In this work, given a fundus image, we focus on three tasks related to eye disease prediction: (1) predicting one of the four broad disease categories – diabetic retinopathy, age-related macular degeneration, glaucoma, and melanoma, (2) predicting one of the 320 fine disease sub-categories, (3) generating a textual diagnosis. We model these three tasks under a multi-task learning setup using ResNet, a popular deep convolutional neural network architecture. Our experiments on a large dataset of 40658 images across 3502 patients provide ~86% accuracy for task 1, ~67% top-5 accuracy for task 2, and ~32 BLEU for the diagnosis captioning task.

Sahil Chelaramani, Manish Gupta, Vipul Agarwal, Prashant Gupta, Ranya Habash
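
A minimal sketch of a shared backbone with three task heads matching the task sizes above (4 coarse classes, 320 fine sub-categories, and a caption-feature stub); the ResNet-50 variant, head design, and feature dimension are assumptions, and the call resnet50(weights=None) assumes a recent torchvision.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class EyeDiseaseMultiTask(nn.Module):
    """Shared ResNet-50 trunk with task-specific heads (a sketch, not the authors' model)."""
    def __init__(self, n_coarse=4, n_fine=320, caption_dim=512):
        super().__init__()
        backbone = resnet50(weights=None)
        self.trunk = nn.Sequential(*list(backbone.children())[:-1])  # up to global pooling
        feat_dim = backbone.fc.in_features                           # 2048 for ResNet-50
        self.coarse_head = nn.Linear(feat_dim, n_coarse)     # 4 broad disease categories
        self.fine_head = nn.Linear(feat_dim, n_fine)         # 320 fine sub-categories
        self.caption_head = nn.Linear(feat_dim, caption_dim) # features for a text decoder (stub)

    def forward(self, x):
        f = self.trunk(x).flatten(1)
        return self.coarse_head(f), self.fine_head(f), self.caption_head(f)

coarse, fine, cap = EyeDiseaseMultiTask()(torch.randn(2, 3, 224, 224))
```
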
EEG-Based User Identification Using Channel-Wise Features

During the last decades, biometric signals such as the face, fingerprints, and the iris have been widely employed to identify individuals. Recently, electroencephalogram (EEG)-based user identification has received much attention. Up to now, most research has focused on deep learning-based approaches, which involve high storage, power, and computing resources. In this paper, a novel EEG-based user identification method is presented that provides real-time and accurate recognition with low computing resources. The main novelty is to describe the unique EEG pattern of an individual by fusing temporal single-channel features with channel-wise information. The channel-wise features are defined by symmetric matrices, the elements of which are calculated as the Pearson correlation coefficient between each pair of channels. The channel-wise features are input to a multi-layer perceptron (MLP) for classification. To assess the validity of the proposed identification method, two well-known datasets were chosen, on which the proposed method achieves best average accuracies of 98.55% and 99.84% on the EEGMMIDB and DEAP datasets, respectively. The experimental results demonstrate the superiority of the proposed method in modeling the unique pattern of an individual’s brainwaves.

Longbin Jin, Jaeyoung Chang, Eunyi Kim
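
A minimal sketch of the channel-wise feature described above: the Pearson correlation between every pair of channels, with the upper triangle of the symmetric matrix flattened and fed to an MLP; the epoch length, channel count, and MLP size are assumptions for illustration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def channel_wise_features(eeg_epoch):
    """eeg_epoch: (n_channels, n_samples). Returns the flattened upper triangle of the
    symmetric Pearson correlation matrix between all channel pairs."""
    corr = np.corrcoef(eeg_epoch)            # (n_channels, n_channels), symmetric
    iu = np.triu_indices_from(corr, k=1)     # skip the diagonal of ones
    return corr[iu]

# Toy example: 20 epochs, 32 channels, 160 samples each, 4 users (dummy labels).
epochs = np.random.randn(20, 32, 160)
X = np.stack([channel_wise_features(e) for e in epochs])
y = np.random.randint(0, 4, size=20)
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500).fit(X, y)
```
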
Backmatter
Metadata
Title
Pattern Recognition
Edited by
Shivakumara Palaiahnakote
Prof. Gabriella Sanniti di Baja
Liang Wang
Prof. Dr. Wei Qi Yan
Copyright Year
2020
Electronic ISBN
978-3-030-41299-9
Print ISBN
978-3-030-41298-2
DOI
https://doi.org/10.1007/978-3-030-41299-9