## About this book

This three-volume set, CCIS 771, 772, and 773, constitutes the refereed proceedings of the CCF Chinese Conference on Computer Vision, CCCV 2017, held in Tianjin, China, in October 2017.
The 174 revised full papers presented in the three volumes were carefully reviewed and selected from 465 submissions. The papers are organized in the following topical sections: biological vision inspired visual method; biomedical image analysis; computer vision applications; deep neural network; face and posture analysis; image and video retrieval; image color and texture; image composition; image quality assessment and analysis; image restoration; image segmentation and classification; image-based modeling; object detection and classification; object identification; photography and video; robot vision; shape representation and matching; statistical methods and learning; video analysis and event recognition; visual salient detection.

## Table of contents

### An Evolution Perception Mechanism for Organizing and Displaying 3D Shape Collections

How to display the best views of shape collections for non-professional users is becoming a challenging problem in computer vision. To solve this problem, an evolution perception mechanism (EPM) for organizing and displaying shape collections is proposed. On the one hand, an evolution tree based on Quartet analysis is constructed to organize shapes effectively from a global perspective. On the other hand, a shape snapshot method based on the view features of shapes is presented for each shape in the evolution tree, so as to show the details of the shape from a local perspective. Moreover, interactive display methods using the evolution trees are provided to guide users in exploring the shape collections. Experimental results show that EPM provides effective displays that are close to how users view shapes.

Lingling Zi, Xin Cong, Yanfei Peng

### C-CNN: Cascaded Convolutional Neural Network for Small Deformable and Low Contrast Object Localization

Traditionally, normalized cross correlation (NCC) based or shape based template matching methods are used in machine vision to locate an object for robotic pick-and-place or other automatic equipment. For stability, well-designed LED lighting must be mounted to keep the lighting conditions uniform and stable. Even so, these algorithms are not robust when detecting small, blurred, or heavily deformed targets in industrial environments. In this paper, we propose a convolutional neural network (CNN) based object localization method, called C-CNN (cascaded convolutional neural network), to overcome the disadvantages of the conventional methods. Our C-CNN method first applies a shallow CNN that densely scans the whole image, so that most of the background regions are rejected by the network. Then two CNNs are adopted to further evaluate the passed windows and the windows around them. Finally, a relatively deep model, net-4, is applied to adjust the passed windows, and the adjusted windows are regarded as the final positions. The experimental results show that our method achieves real-time detection at 14 FPS and is robust with a small amount of training data. The detection accuracy is much higher than that of traditional methods and state-of-the-art methods.
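For readers unfamiliar with the NCC baseline that the abstract contrasts against, the following is a minimal sketch of zero-mean normalized cross-correlation template matching in plain Python. It is not the C-CNN method and not the authors' code; the function names `ncc` and `match_template` and the list-of-lists image layout are assumptions made for illustration.

```python
import math

def ncc(patch, template):
    """Zero-mean normalized cross-correlation of two equal-size 2D lists."""
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # flat patch: correlation is undefined, treat as no match
    return sum(x * y for x, y in zip(da, db)) / denom

def match_template(image, template):
    """Exhaustive sliding-window search; returns (row, col, score) of the best match."""
    th, tw = len(template), len(template[0])
    best = (0, 0, -2.0)  # any real NCC score lies in [-1, 1]
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            patch = [row[c:c + tw] for row in image[r:r + th]]
            score = ncc(patch, template)
            if score > best[2]:
                best = (r, c, score)
    return best

image = [
    [0, 0, 0, 0],
    [0, 0, 1, 9],
    [0, 0, 9, 1],
    [0, 0, 0, 0],
]
template = [[1, 9], [9, 1]]
row, col, score = match_template(image, template)
```

Because the score subtracts the mean and divides by the norms, it is unchanged by a uniform brightness gain or offset, but it has no built-in tolerance for blur or deformation, which is the weakness the abstract points to.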

Xiaojun Wu, Xiaohao Chen, Jinghui Zhou

### Skeleton-Based 3D Tracking of Multiple Fish From Two Orthogonal Views

This paper proposes a skeleton-based method for tracking multiple fish in 3D space. First, skeleton analysis is performed to simplify each object into a feature point representation according to the shape characteristics of fish. Next, based on the obtained feature points, object association and matching are carried out to acquire the motion trajectories of the fish in 3D space. This process relies on top-view tracking that is supplemented by side-view detection. By fully exploiting the shape and motion information of fish, the proposed method is able to handle the frequent occlusions that occur during tracking and achieves good tracking performance.

Zhiming Qian, Meiling Shi, Meijiao Wang, Tianrui Cun

### A Retinal Adaptation Model for HDR Image Compression

High dynamic range (HDR) images are usually used to capture more information about natural scenes, because the light intensity of real-world scenes commonly varies over a very large range. The human visual system is able to perceive this huge range of intensity thanks to its visual adaptation mechanisms. In this paper, we propose a new visual adaptation model based on the cone- and rod-adaptation mechanisms in the retina. The input HDR scene is first processed in two separate channels (i.e., cone and rod channels) with different adaptation parameters. Then, a simple receptive field model is applied to enhance the local contrast of the visual scene and improve the visibility of details. Finally, the compressed HDR image is obtained by recovering the fused luminance distribution to the RGB color space. Experimental results suggest that the proposed retinal adaptation model can effectively compress the dynamic range of HDR images and preserve local details well.

Xuan Pu, Kaifu Yang, Yongjie Li

### A Novel Orientation Algorithm for a Bio-inspired Polarized Light Compass

Many animals, such as honey bees and tarantulas, can navigate with polarized light, the key to which is extracting compass information from the skylight polarization pattern. Many groups have conducted research on bionic polarization sensors and obtained good orientation results under open-sky conditions. However, if the sky is obscured by leaves or buildings, the skylight polarization pattern is greatly affected. This paper presents an unsupervised method for polarization navigation when the sky is partly blocked. First, we introduce the core components of the polarized light compass and the method for measuring the skylight polarization pattern. Then, an unsupervised method is used to extract the sky region according to the single-scattering Rayleigh model. Finally, we calculate the solar meridian vector and heading angle using the pixels of the sky region. Results show that polarization navigation, featuring high anti-interference and no accumulation error, is suitable for outdoor autonomous navigation.

Guoliang Han, Xiaoping Hu, Xiaofeng He, Junxiang Lian, Lilian Zhang, Yujie Wang

### A Brain MR Images Segmentation and Bias Correction Model Based on Students t-Mixture Model

Accurate segmentation of magnetic resonance images is an essential step in quantitative brain image analysis. However, due to the existence of bias field and noise, many segmentation methods struggle to produce accurate results. The finite mixture model is one of the most widely used methods for MR image segmentation; however, it is sensitive to noise and cannot deal with images with intensity inhomogeneity. In order to reduce the effect of noise, we introduce a robust Markov Random Field by incorporating new spatial information, which is constructed from posterior probabilities and prior probabilities. The bias field is modeled as a linear combination of a set of orthogonal basis functions and coupled into the model, so that the method can estimate the bias field while segmenting images. Our statistical results on both synthetic and clinical images show that the proposed method can obtain more accurate results.

Yunjie Chen, Qing Xu, Shenghua Gu

### Define Interior Structure for Better Liver Segmentation Based on CT Images

Liver segmentation has important applications in preoperative planning and intraoperative guidance. In this paper we introduce a new approach that defines the interior structure (hepatic veins) before segmenting the liver from nearby organs. We assume that cells of the liver should lie within a certain distance of the hepatic veins. Therefore, a clear segmentation of the hepatic veins will facilitate the segmentation of liver voxels. Based on this idea, we build a probabilistic model that adopts four main features of the liver cells and implement it on the open source platform 3DMed. We also test the accuracy of this method with four groups of CT data. The results are similar to those of human experts.

Xiaoyu Zhang, Yixiong Zheng, Bin Zheng

### GPU Accelerated Image Matching with Cascade Hashing

The SIFT feature is widely used in image matching. However, matching massive numbers of images is time consuming because the SIFT feature is a high-dimensional vector. In this paper, we propose a GPU accelerated image matching method with improved Cascade Hashing. Firstly, we propose a disk-memory-GPU data exchange strategy and optimize the load order of data, so that the proposed method can deal with big data. Then, we parallelize the Cascade Hashing method on the GPU. An improved parallel reduction and an improved parallel hashing ranking are proposed to fulfill this task. Finally, extensive experiments show that our image matching is about 20 times faster than SiftGPU, nearly one hundred times faster than CPU Cascade Hashing, and hundreds of times faster than CPU Kd-Tree based matching.

Tao Xu, Kun Sun, Wenbing Tao

### Improved Single Image Dehazing with Heterogeneous Atmospheric Light Estimation

Images captured in foggy or hazy weather conditions are often degraded by the scattering of atmospheric particles, which seriously reduces the performance of outdoor computer vision systems. Single image haze removal has been considered an efficient dehazing approach in recent years. The key to this type of approach is the estimation of atmospheric light. In this paper, an improved single image dehazing algorithm with heterogeneous atmospheric light estimation is presented to enhance the quality of hazy images. First, the heterogeneous atmospheric light is calculated with max-pooling. Second, a haze-free image is recovered with the estimated atmospheric light based on the dark channel prior. The experimental results on a variety of hazy images demonstrate that the proposed method outperforms state-of-the-art approaches in terms of dehazing effect and algorithm cost.
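As background for the dark channel prior mentioned in this abstract, here is a minimal sketch in plain Python of the two standard building blocks: the dark channel (minimum over color channels and a local window) and the inversion of the usual haze model I = J·t + A·(1 − t). It does not reproduce the paper's max-pooling estimate of heterogeneous atmospheric light; the function names, the window radius, and the lower bound `t_min` are assumptions for illustration.

```python
def dark_channel(img, radius=1):
    """img: H x W list of (r, g, b) tuples in [0, 1]. Returns the H x W dark channel."""
    h, w = len(img), len(img[0])
    # Per-pixel minimum over the three color channels.
    min_rgb = [[min(px) for px in row] for row in img]
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            ys = range(max(0, y - radius), min(h, y + radius + 1))
            xs = range(max(0, x - radius), min(w, x + radius + 1))
            # Minimum over the local window, clipped at the image border.
            out[y][x] = min(min_rgb[j][i] for j in ys for i in xs)
    return out

def recover_channel(i, a, t, t_min=0.1):
    """Invert I = J*t + A*(1 - t) for one channel value; t is clamped from below."""
    return (i - a) / max(t, t_min) + a

img = [
    [(0.5, 0.6, 0.7), (0.9, 0.8, 0.95)],
    [(0.3, 0.4, 0.5), (1.0, 1.0, 1.0)],
]
dc_local = dark_channel(img, radius=0)
dc_window = dark_channel(img, radius=1)
restored = recover_channel(0.55, a=0.9, t=0.5)
```

Clamping the transmission from below keeps the division stable in dense haze, where t approaches zero and noise would otherwise be amplified.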

Yi Lai, Ying Liu

### Spatiogram and Fuzzy Logic Based Multi-modality Fusion Tracking with Online Update

Multi-modality fusion tracking is an interesting but challenging task. Many previous works consider only the fusion of different features from the same spectral image, or of identical features from different spectral images, which makes them quite distinct from each other and difficult to integrate naturally. In this study, we propose a unified tracking framework that naturally integrates multiple different modalities via an innovative use of spatiogram and fuzzy logic. Specifically, each modal target and its candidate are first represented by a second-order spatiogram, and their similarity is measured. Next, a novel objective function is built by integrating all modal similarities, and a joint target center-shift formula is then derived from the objective function. Finally, the optimal target location is obtained recursively by applying the mean shift procedure. In addition, a model update scheme based on a particle filter is developed to capture appearance variations. Our framework allows the modalities to be raw pixels or other features extracted from a single image or from different spectral images, and provides the flexibility to add or remove modalities arbitrarily. Tracking results on the combination of infrared gray-HOG and visible gray-LBP clearly demonstrate the excellence of the proposed tracker.

Canlong Zhang, Zhixin Li, Zhiwen Wang, Ting Han

### A Novel Real-Time Tracking Algorithm Using Spatio-Temporal Context and Color Histogram

Spatio-temporal context (STC) is one of the most important features for describing motion in videos. STC-based tracking algorithms have achieved good performance in real-time tracking. However, it is very difficult to perform precise tracking in complex situations such as heavy occlusion, illumination changes, and pose variation. In this paper, we propose a real-time single-object tracking method that is robust to target variation during tracking. Experiments on some challenging sequences highlight a significant improvement in tracking accuracy over the state-of-the-art methods.

Yong Wu, Zemin Cai, Jianhuang Lai, Jingwen Yan

### A New Visibility Model for Surface Reconstruction

In this paper, we propose a new visibility model for scene reconstruction. To produce surface meshes with sufficient scene detail, we introduce a new visibility model that keeps the relaxed visibility constraints and takes the distribution of points into consideration, which makes it effective at preserving details. To strengthen the robustness to noise, a new likelihood energy term is introduced into the binary labeling problem of Delaunay tetrahedra, and its implementation is presented. The experimental results show that our method performs well in terms of detail preservation.

Yang Zhou, Shuhan Shen, Zhanyi Hu

### Global-Local Feature Attention Network with Reranking Strategy for Image Caption Generation

In this paper, a novel framework, named global-local feature attention network with reranking strategy (GLAN-RS), is presented for the image captioning task. Rather than adopting only unitary visual information as in the classical models, GLAN-RS explores an attention mechanism to capture local convolutional salient image maps. Furthermore, we adopt a reranking strategy to adjust the priority of the candidate captions and select the best one. The proposed model is verified using the MSCOCO benchmark dataset across seven standard evaluation metrics. Experimental results show that GLAN-RS significantly outperforms state-of-the-art approaches such as M-RNN and Google NIC, with an improvement of 20% in terms of BLEU4 score and 13 points in terms of CIDEr score.

Jie Wu, Siya Xie, Xinbao Shi, Yaowen Chen

### Dust Image Enhancement Algorithm Based on Color Transfer

Increasingly frequent dust weather has seriously affected the quality of captured images. Therefore, dust image quality enhancement has become an important research topic in computer vision. Compared with pictures taken on sunny days, dust images have obvious problems such as low definition, poor lighting, and a yellowish hue. Existing methods cannot always handle these problems. Hence, this paper proposes a dust image enhancement algorithm based on color transfer. First, the scene gist feature is applied to select, from a database of clear images, the target image with the highest similarity to the input image; then, the color information of the target image is passed to the input image through the color transfer algorithm; finally, the contrast limited adaptive histogram equalization algorithm is used to restore the definition of the dust image. The experimental results show that our algorithm can effectively enhance the quality of dust images of different degrees, recover more image details, and resolve the definition and hue correction problems. We propose an absolute color difference index to evaluate this method. The experiments demonstrate that the proposed algorithm outperforms the state-of-the-art models.

Hao Liu, Ce Li, Yuqi Wan, Yachao Zhang

### Chinese Sign Language Recognition with Sequence to Sequence Learning

In this paper, we formulate Chinese sign language recognition (SLR) as a sequence to sequence problem and propose an encoder-decoder based framework to handle it. The proposed framework is based on the convolutional neural network (CNN) and the recurrent neural network (RNN) with long short-term memory (LSTM). Specifically, the CNN is adopted to extract the spatial features of input frames. Two LSTM layers are cascaded to implement the encoder-decoder structure. The encoder-decoder can learn not only the temporal information of the input features but also the context model of sign language words. We feed the image sequences captured by Microsoft Kinect 2.0 into the network to build an end-to-end model. Moreover, we set up another model that uses skeletal coordinates as the input of the encoder-decoder framework. In the recognition stage, a probability combination method is proposed to fuse these two models to get the final prediction. We validate our method on a self-built dataset, and the experimental results demonstrate the effectiveness of the proposed method.

Chensi Mao, Shiliang Huang, Xiaoxu Li, Zhongfu Ye

### ROI Extraction Method of Infrared Thermal Image Based on GLCM Characteristic Imitate Gradient

Automatic inspection of photovoltaic power stations with UAV (unmanned aerial vehicle) vision is of great significance for effectively capturing high-definition images and quickly detecting fault areas, which can reduce the risk of false detection and lower the cost of manual operation. However, due to the complexity of the photovoltaic power station environment, disturbances often occur in the images, leading to misjudgment of the fault area. We propose a region extraction method based on the gray level co-occurrence matrix (GLCM) and textural features. The target area is extracted from the image by extracting feature images, imitating the gradient, and filling regions. This method effectively combines the textural features of images with the edge features of the gradient images. The algorithm proposed in this paper is compared with the grab cut method on the basis of labeled image segmentation. The mean precision and mean recall of the proposed imitation gradient image extraction method are higher than those of the grab cut algorithm, and the recall value, F index, and J index are also better. The new algorithm constructs a filling model by using the gradient image and a small number of texture features calculated by the GLCM method. The advantages of the proposed algorithm are fewer interactive tags and fewer manual labels. Therefore, image detection of the fault area can be better realized.
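To make the GLCM concrete, the sketch below computes a normalized gray level co-occurrence matrix for a single pixel offset and one classic texture feature (contrast) in plain Python. The offset, the number of gray levels, and the choice of contrast as the example feature are assumptions for illustration; the abstract does not state which GLCM features or offsets the authors use.

```python
def glcm(image, levels, dr=0, dc=1):
    """Normalized co-occurrence matrix of gray-level pairs at offset (dr, dc).

    image: 2D list of integers in range(levels).
    """
    counts = [[0] * levels for _ in range(levels)]
    h, w = len(image), len(image[0])
    for r in range(h):
        for c in range(w):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < h and 0 <= c2 < w:
                counts[image[r][c]][image[r2][c2]] += 1
    total = sum(sum(row) for row in counts)
    if total == 0:
        return [[0.0] * levels for _ in range(levels)]
    return [[v / total for v in row] for row in counts]

def glcm_contrast(p):
    """Contrast feature: sum over (i, j) of p[i][j] * (i - j)^2."""
    n = len(p)
    return sum(p[i][j] * (i - j) ** 2 for i in range(n) for j in range(n))

image = [
    [0, 0, 1],
    [1, 1, 0],
]
p = glcm(image, levels=2)       # horizontal neighbor pairs
contrast = glcm_contrast(p)
```

Contrast grows when neighboring pixels differ strongly in gray level, so it is low over smooth panel surfaces and high near edges or textured clutter.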

Hui Shen, Li Zhu, Xianggong Hong, Weike Chang

### Learning Local Feature Descriptors with Quadruplet Ranking Loss

In this work, we propose a novel deep convolutional neural network (CNN) with quadruplet ranking loss to learn local feature descriptors. The proposed model receives quadruplets of two corresponding patches and two non-corresponding patches, and outputs features that are compared using the $L_2$ norm and have a dimension (256d) close to that of SIFT, so they can be easily applied to practical vision tasks as a plug-in replacement for SIFT-like features. Moreover, the proposed model mitigates a problem of the commonly used triplet loss, namely that the margin separating corresponding and non-corresponding pairs varies across samples, and improves the capacity to utilize limited training data. Experiments show that our model outperforms some state-of-the-art methods such as TN-TG and PN-Net.

Dalong Zhang, Lei Zhao, Duanqing Xu, Dongming Lu

### Three-Dimensional Reconstruction of Ultrasound Images Based on Area Iteration

The ultrasonic C scan technique has been extensively applied in nondestructive testing (NDT) in recent years. Working from the original ultrasonic C scanning data, this paper presents a 3D reconstruction algorithm for ultrasonic C scanning images based on area iteration. Because the distance between two neighboring slices is small, the change in area between two neighboring images is also small. Therefore, we design an iterative rule on the area of the object to detect and extract the target contour in each image. After preprocessing the ultrasonic C image, we extract the contour of the object scanned by the ultrasonic transducer. Then, we reconstruct the 3D ultrasonic C scanning image using the marching cubes algorithm. In experiments, we apply the proposed method to ultrasonic scanning data of a tissue-mimicking phantom that includes a liver model, and we compare the proposed method with other methods. The results show that the proposed method displays the edge information of the scanned object more accurately and improves the precision of region detection, and that it is especially suitable when the edge information is incomplete.

Biying Chen, Haijiang Zhu, Jinglin Zhou, Ping Yang, Longbiao He

### A Generation Method of Insulator Region Proposals Based on Edge Boxes

The generation of region proposals is the foundation of object detection. In the object detection task, the steady increase in the complexity of classifiers may improve detection quality, but at the cost of increased computation time. One approach to overcome the tension between high detection quality and low computational complexity is the use of “region proposals”. High-quality insulator region proposals also play an important role in the analysis of transmission line inspection images. This paper applies Edge Boxes to the localization of insulators in inspection images, taking the characteristics of insulators’ edge images into account and combining these characteristics with Edge Boxes. As a result, more insulator region proposals are produced. The experimental results show that our method effectively reduces the interference area while producing high-quality region proposals at high speed.

Zhenbing Zhao, Lei Zhang, Yincheng Qi, Yuying Shi

### Robust Tramway Detection in Challenging Urban Rail Transit Scenes

With the rapid development of light rail transit, tramway detection based on video analysis is becoming a prerequisite for driver assistance systems. The system should be capable of automatically detecting the trackway using an on-board camera in order to determine the train driving limit. However, due to the diversity of ground types, weather conditions, and illumination situations, this goal is very challenging. This paper presents a real-time tramway detection method that can effectively deal with various challenging real-world scenarios in urban rail transit environments. It first uses an adaptive multi-level threshold to segment the ROI of the trolley track, where a local cumulative histogram model is used to estimate the threshold parameters. It then uses the region growing method to reduce the impact of environmental noise and predict the trend of the tramway. Our experiments show that the method can correctly detect the tramway even in many undesirable situations and uses little enough computation time to meet real-time requirements.

Cheng Wu, Yiming Wang, Changsheng Yan

### Relevance and Coherence Based Image Caption

The attention-based image caption framework has been widely explored in recent years. However, most techniques generate next word conditioned on previous words and current visual contents, while the relationship between the semantic and visual contents is not considered. In this paper, we present a novel framework which can explore the relevance and coherence at the same time. The relevance tries to explore the relationship between the semantic and visual contents in a semantic-visual embedding space, and the coherence is introduced to maximize the probability of generating the next word according to previous words and the current visual contents. The performance of our model is tested with three benchmark datasets: Flickr8k, Flickr30k and MS COCO. The experimental results show that the proposed approach can improve the performance of attention-based image caption method.

Tao Zhang, Wei Wang, Liang Wang, Qinghua Hu

### An Efficient Recoloring Method for Color Vision Deficiency Based on Color Confidence and Difference

People with color vision deficiency (CVD) cannot distinguish some colors under normal conditions. Recoloring is a color adaptation procedure for people with CVD. This paper proposes an efficient recoloring method. First, key color confidence is introduced to judge and recolor unrecognizable colors sequentially instead of simultaneously, which improves the execution speed. Second, color differences between one color and other colors with higher confidence are taken as the unrecognizability benchmark for this color, without any pre-defined parameters. Third, these differences are also used in an iterative recoloring procedure to maximize the contrast between different colors in the recolored image. Experimental results show that our proposed method outperforms state-of-the-art recoloring methods in terms of subjective quality, objective quality, and execution speed.

Qing Xu, Xiangdong Zhang, Liang Zhang, Guangming Zhu, Juan Song, Peiyi Shen

### A Saliency Based Human Detection Framework for Infrared Thermal Images

In this paper, a novel saliency framework for crowd detection in infrared thermal images is proposed. In order to obtain the optimal classifier from a large amount of data, the training process consists of the following four steps: (a) a saliency contrast algorithm is employed to detect the regions of interest; (b) standard HOG features of the selected areas of interest are extracted to represent the human object; (c) the extracted features, which are prepared for training, are optimized based on a visual attention map; (d) a support vector machine (SVM) algorithm is applied to compute the classifier. Finally, humans can be detected precisely once the high-saliency areas of an image are input into the classifier. In order to evaluate our algorithm, we constructed an infrared thermal image database collected by a real-time inspection system. The experimental results demonstrate that our method outperforms previous state-of-the-art methods for human detection in infrared thermal images, and that visual attention techniques can effectively represent prior knowledge for feature optimization in a practical system.

Xinbo Wang, Dahai Yu, Jianfeng Han, Guoshan Zhang

### Integrating Color and Depth Cues for Static Hand Gesture Recognition

Recognizing static hand gestures in complex backgrounds is a challenging task. This paper presents a static hand gesture recognition system using both color and depth information. Firstly, the hand region is extracted from the complex background based on depth segmentation and a skin-color model. The Moore-Neighbor tracing algorithm is then used to obtain the hand gesture contour. The k-curvature method is used to locate fingertips and determine the number of fingers, and the angles between fingers are then computed as features. The appearance-based features are fed into a decision tree model for hand gesture recognition. Experiments have been conducted on two gesture recognition datasets. Experimental results show that the proposed method achieves high recognition accuracy and strong robustness.
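The k-curvature step can be illustrated with a short sketch: for a contour point P_i, the angle between the vectors from P_i to P_(i−k) and from P_i to P_(i+k) is small at sharp protrusions such as fingertips. The code below is a generic illustration in plain Python, not the authors' implementation; the function name, the closed-contour wrap-around, and the 30-degree threshold used in the example are assumptions.

```python
import math

def k_curvature_angle(contour, i, k):
    """Angle (radians) at contour[i] between the vectors to contour[i-k] and contour[i+k].

    contour: list of (x, y) points on a closed contour; indices wrap around.
    """
    n = len(contour)
    px, py = contour[i]
    ax, ay = contour[(i - k) % n]
    bx, by = contour[(i + k) % n]
    v1 = (ax - px, ay - py)
    v2 = (bx - px, by - py)
    norm = math.hypot(v1[0], v1[1]) * math.hypot(v2[0], v2[1])
    if norm == 0.0:
        return math.pi  # degenerate case: treat as flat, never a fingertip
    cos_t = (v1[0] * v2[0] + v1[1] * v2[1]) / norm
    # Clamp against rounding before acos.
    return math.acos(max(-1.0, min(1.0, cos_t)))

# A narrow spike at index 1: candidate fingertip.
spike = [(-1, -10), (0, 0), (1, -10), (0, -20)]
tip_angle = k_curvature_angle(spike, 1, 1)

# A square corner at index 1: too blunt to be a fingertip.
square = [(1, 0), (0, 0), (0, 1), (1, 1)]
corner_angle = k_curvature_angle(square, 1, 1)

is_tip = tip_angle < math.radians(30)  # threshold is an illustrative assumption
```

A small angle alone also fires in the valleys between fingers, so practical systems usually add a convexity check (for example, the sign of the cross product of the two vectors) to keep only peaks.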

Jiaming Li, Yulan Guo, Yanxin Ma, Min Lu, Jun Zhang

### Accurate Joint Template Matching Based on Tree Propagating

Given a single template image, it is a big challenge to match all the target images accurately only by pairwise template matching. To handle this case, this paper introduces an accurate template matching method based on tree structure building to jointly match a set of target images. Our method aims to select well matched results for template updating and rescue badly matched images via the tree matching propagating. First, a novel similarity measure is given to evaluate the pairwise matching results. Then the joint matching is under an iterative framework which contains two main steps: (1) tree structure growing; and (2) matching propagating. When all the target images are included in the tree structure, the matching process is finished. Finally, experimental results demonstrate the improvement of the proposed method.

Qian Kou, Yang Yang, Shaoyi Du, Zhuo Chen, Weile Chen, Dexing Zhong

### Robust Visual Tracking Using Oriented Gradient Convolution Networks

Convolutional networks have been successfully applied to visual tracking to extract useful features. However, deep networks are time-consuming to train offline and usually extract features from raw pixels. In this paper, we propose a two-layer convolutional network based on oriented gradients. The first layer is constructed by convolving the filters with an oriented gradient image of the input, which is robust to illumination variation and motion blur. Then, all of the feature maps of the simple layer are stacked into a complex feature map as the target representation. The complex feature map encodes local structure features that are robust to occlusion. The proposed approach is tested on nine challenging sequences in comparison with nine state-of-the-art trackers, and the results show that the proposed tracker achieves a mean overlap rate of 0.75, which outperforms the second-best tracker by 26%.

Qi Xu, Huabin Wang, Jian Zhou, Liang Tao

### Novel Hybrid Method for Human Posture Recognition Based on Kinect V2

In recent years, human posture recognition based on Kinect has gradually received more attention. However, current methods have drawbacks, such as low recognition accuracy and a small number of recognizable postures. This paper proposes a novel method that uses image processing techniques, a BP neural network, and the skeleton data and depth data captured by Kinect v2 to recognize postures. We distinguish four types of postures (sitting cross-legged, kneeling or sitting, standing, and other postures) by using the natural ratios of human body parts, and separate the kneeling and sitting postures by calculating the 3D spatial relation of the feature points. Finally, we apply the BP neural network to recognize the lying and bending postures. The experimental results indicate that our method is robust and fast, with a recognition accuracy of 98.98%.

Bo Li, Baoxing Bai, Cheng Han, Han Long, Lin Zhao

### Visibility Estimation Using a Single Image

In this paper, we propose a novel method for visibility estimation using only a single image as input. An assumption is proposed: the extinction coefficient of light is approximately a constant in clear atmosphere. Using this assumption with the theory of atmospheric radiation, the extinction coefficient in clear atmosphere can be estimated. Based on the dark channel prior, ratio of two extinction coefficients in current and clear atmosphere is calculated. By multiplying the clear extinction coefficient and the ratio, we can estimate the extinction coefficient of the input image and then obtain the visibility value. Compared with other methods that require the explicit extraction of the scene, our method needs no constraint and performs well in various types of scenes, which might open a new trend of visibility measurement in meteorological research. Moreover, the actual distance information can also be estimated as a by-product of this method.
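The last step of the pipeline, going from an extinction coefficient to a visibility value, is commonly done with the Koschmieder relation V = −ln(ε)/β, which gives V ≈ 3.912/β for a contrast threshold ε = 0.02. The sketch below shows that conversion together with the distance estimate d = −ln(t)/β from a transmission value t = exp(−β·d). The threshold value, the function names, and the numbers in the example are assumptions for illustration and are not taken from the paper.

```python
import math

def visibility_from_extinction(beta, contrast_threshold=0.02):
    """Koschmieder relation: distance at which contrast falls to the threshold.

    beta: extinction coefficient in 1/m. Returns visibility in meters.
    """
    return -math.log(contrast_threshold) / beta

def distance_from_transmission(t, beta):
    """Invert t = exp(-beta * d) to get the scene distance d in meters."""
    return -math.log(t) / beta

# Hypothetical numbers: a clear-air coefficient and a current-to-clear ratio.
beta_clear = 1.0e-4          # 1/m, assumed value
ratio = 10.0                 # assumed output of the dark-channel-based step
beta_now = ratio * beta_clear

visibility_m = visibility_from_extinction(beta_now)
depth_m = distance_from_transmission(math.exp(-1.0), beta=0.5)
```

With a 5% threshold, as used in some meteorological conventions, the constant becomes about 3.0 instead of 3.912, so the threshold should be stated whenever visibility figures are compared.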

Qin Li, Bin Xie

### A Text-Line Segmentation Method for Historical Tibetan Documents Based on Baseline Detection

Text-line segmentation is an important task in historical Tibetan document recognition. Historical Tibetan document images usually contain touching or overlapping characters between consecutive text-lines, making text-line segmentation a difficult task. In this paper, we present a text-line segmentation method based on baseline detection. The initial positions for the baseline of each line are obtained by template matching, pruning algorithms, and a closing operation. The baseline is estimated using dynamic tracing within the pixel points of each line and the context information between pixel points. The overlapping or touching areas are cut by finding the minimum-width stroke. Finally, text-lines are extracted based on the estimated baseline and the cut position of the touching area. The proposed algorithm has been evaluated on a dataset of historical Tibetan document images. Experimental results show the effectiveness of the proposed method.

Yanxing Li, Longlong Ma, Lijuan Duan, Jian Wu

### Low-Light Image Enhancement Based on Constrained Norm Estimation

Low-light images often suffer from poor quality and low visibility. Improving the quality of low-light images is becoming a highly desired subject in both computational photography and computer vision applications. This paper proposes an effective method, called LieCNE, that constrains the illumination map t by estimating the norm and constructing the constraint coefficients. More specifically, we estimate the initial illumination map by finding the maximum value of the R, G and B channels and optimize it by norm estimation. We propose a function $$t^\gamma$$ containing the exponent $$\gamma$$ in order to optimize the enhancement effect under different illumination conditions. Finally, a new evaluation criterion is also proposed: we use the similarity with the true value to determine the enhancement effect. Experimental results show that LieCNE exhibits better performance under a variety of lighting conditions in enhancement results and image spillover prevention.

Tan Zhao, Hui Ding, Yuanyuan Shang, Xiuzhuang Zhou
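
The first and last steps of the pipeline in the abstract (initial illumination map from the per-pixel channel maximum, then a gamma-adjusted map) can be sketched as below. This is an illustration under stated assumptions, not the LieCNE method itself: the norm-based refinement of the map is omitted, and recovering the enhanced image by dividing each pixel by the adjusted map is a common Retinex-style choice that the abstract does not confirm. The function names, the `eps` guard and the default `gamma` are hypothetical.

```python
def initial_illumination(image):
    """Per-pixel maximum over the R, G and B channels.

    image: list of rows, each row a list of (r, g, b) tuples in [0, 1].
    """
    return [[max(pixel) for pixel in row] for row in image]


def enhance(image, gamma=0.8, eps=1e-3):
    """Divide each pixel by t**gamma (assumed Retinex-style recovery)."""
    t = initial_illumination(image)
    enhanced = []
    for row, t_row in zip(image, t):
        out_row = []
        for pixel, t_value in zip(row, t_row):
            # eps avoids division by zero on fully black pixels.
            scale = max(t_value, eps) ** gamma
            # Clip to 1.0 so bright channels do not overflow the valid range.
            out_row.append(tuple(min(c / scale, 1.0) for c in pixel))
        enhanced.append(out_row)
    return enhanced


if __name__ == "__main__":
    dark = [[(0.10, 0.20, 0.05), (0.02, 0.01, 0.03)]]
    print(initial_illumination(dark))
    print(enhance(dark, gamma=0.8))
```

A smaller `gamma` brightens less aggressively, which is one way to read the abstract's remark about adapting to different illumination conditions.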

### FPGA Architecture for Real-Time Ultra-High Definition Glasses-Free 3D System

This paper presents an FPGA architecture for a real-time ultra-high definition (UHD) glasses-free 3D system that addresses the system's high bandwidth requirement and high complexity. The Video + Depth (V + D) video format and the depth-image-based rendering (DIBR) technique are supported by the system to reduce the bandwidth requirement. In addition, an asymmetric shift-sensor camera setup is introduced to reduce the hardware cost as well as the complexity of the system. We also simplify the microlens array weight equations so as to reduce the complexity of calculating subpixel rearrangement coefficients for glasses-free 3D image creation. Experimental results show that the proposed architecture can support 4K resolution for real-time UHD glasses-free 3D display.

Ran Liu, Mingming Liu, Dehao Li, Yanzhen Zhang, Yangting Zheng

### Quantum Video Steganography Protocol Based on MCQI Quantum Video

A secure quantum video steganography protocol with large payload, based on the video strip encoding method called MCQI (Multi-Channel Quantum Images), is proposed in this paper. The new protocol embeds quantum video as secret information for covert communication. As a result, its capacity is greatly expanded compared with previous quantum steganography achievements. The new protocol achieves good security and imperceptibility by virtue of the randomization of embedding positions and efficient use of redundant frames. Furthermore, the receiver is able to extract secret information from the stego video without retaining the original carrier video, and can subsequently restore the original quantum video. The simulation and experimental results show that the algorithm not only has good imperceptibility and high security, but also has a large payload.

Zhiguo Qu, Siyi Chen, Sai Ji

### Video Question Answering Using a Forget Memory Network

Visual question answering combines the fields of computer vision and natural language processing. It has received much attention in recent years. Image question answering (Image QA) aims to automatically answer questions about the visual content of an image. Different from Image QA, video question answering (Video QA) needs to explore a sequence of images to answer the question. It is difficult to focus on the local region features that are related to the question in a sequence of images. In this paper, we propose a forget memory network (FMN) for Video QA to solve this problem. When the forget memory network embeds the video frame features, it can select the local region features that are related to the question and forget the features that are irrelevant to the question. Then we use the embedded video and question features to predict the answer from multiple-choice answers. Our proposed approach achieves good performance on the MovieQA [21] and TACoS [28] datasets.

Yuanyuan Ge, Youjiang Xu, Yahong Han

### Three Dimensional Stress Wave Imaging Method of Wood Internal Defects Based on TKriging

In order to detect the size, shape and degree of decay inside wood, a three-dimensional stress wave imaging method based on TKriging is proposed. The method uses sensors hung randomly around the timber to obtain stress wave velocity data sets, and reconstructs the image of the internal defect from those data sets. TKriging first optimizes the structural relationship between the interpolation point and the reference points in space. The searching radius is used to select the reference points accordingly. A top-k query method is introduced to find the k value with relevant points. The values of the estimated points are calculated and a three-dimensional image of the internal defect inside the wood is reconstructed. The results show the effectiveness of the method, and its accuracy rate on samples with one hole is higher than that of the Kriging method.

Xiaochen Du, Hailin Feng, Mingyue Hu, Yiming Fang, Jiajie Li

### A Novel Color Image Watermarking Algorithm Based on QWT and DCT

A novel color image watermarking algorithm is proposed based on quaternion wavelet transform (QWT) and discrete cosine transform (DCT) for copyright protection. The luminance channel Y of the host color image in YCbCr space is decomposed by QWT to obtain four approximation subimages. A binary watermark is embedded into the mid-frequency DCT coefficients of two randomly selected subimages. The experimental results show that the proposed watermarking scheme has good robustness against common image attacks such as adding noise, filtering, scaling, JPEG compression, cropping, image adjusting, small angle rotation and so on.

Shaocheng Han, Jinfeng Yang, Rui Wang, Guimin Jia

### A Novel Camera Calibration Method for Binocular Vision Based on Improved RBF Neural Network

Considering the problems that camera imaging model is complex and operation is complicated, a binocular camera calibration method of RBF neural network based on k-means and gradient method is proposed in this paper. The data center selection method based on the law of clustering error function can obtain hidden nodes and data centers of RBF network accurately. Dynamic learning of data centers, spread constants and weight values based on gradient method can contribute to improving the precision. Experimental results show that the proposed method has high precision and can be well applied in machine vision.

Weike Liu, Ju Huo, Xing Zhou, Ming Yang

### Collaborative Representation Based Neighborhood Preserving Projection for Dimensionality Reduction

Collaborative graph-based discriminant analysis (CGDA) has been recently proposed for dimensionality reduction and classification. It uses available samples to construct sample collaboration via L2 norm minimization-based representation, thus showing great computational efficiency. However, CGDA only constructs the intra-class graph, so it only takes into account local geometry and ignores the separability for samples in different classes. In this paper, we propose a novel method termed as collaborative representation based neighborhood preserving projection (CRNPP) for dimensionality reduction. By incorporating the intra-class and inter-class discriminant information into the graph construction of collaborative representation coefficients, CRNPP not only maintains the same level of time cost as CGDA, but also preserves both global and local geometry of the data simultaneously. In this way, the collaborative relationship of the data from the same class is strengthened while the collaborative relationship of the data from different classes is inhibited in the projection subspace. Experiments on benchmark face databases validate the effectiveness and efficiency of the proposed method.

Miao Li, Lei Wang, Hongbing Ji, Shuangyue Chen, Danping Li

### Iterative Template Matching with Rotation Invariant Best-Buddies Pairs

In this paper, we propose a new template matching method with rotation invariance. Our template matching can not only find the location of the object, but also annotate its rotation angle. The key idea is to first rectify the local rotation patches according to their intensity centroids, and then to find the corresponding patch features between the template and target images under an iterative matching framework. We adopt a coarse-to-fine search, so the patch size must be updated accordingly, which is time-consuming. To tackle this problem, we use the integral image to update the intensity centroid and accelerate the computation. The corresponding feature matching is based on Best-Buddies Pairs (BBPs), which are robust to local non-rigid transforms and outliers. Experimental results demonstrate the effectiveness and robustness of the proposed algorithm.

Zhuo Chen, Yang Yang, Weile Chen, Qian Kou, Dexing Zhong
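
The Best-Buddies Pairs criterion mentioned in the abstract can be illustrated with a short sketch. As defined in the Best-Buddies Similarity literature, a pair is a best-buddies pair when each point is the nearest neighbor of the other. The code below shows only this mutual nearest-neighbor test on hypothetical feature vectors. It does not include the rotation rectification, the integral-image update or the iterative framework from the paper, and the function name is an assumption.

```python
def best_buddies_pairs(template_feats, target_feats):
    """Return index pairs (i, j) that are mutual nearest neighbors.

    template_feats, target_feats: lists of equal-length numeric tuples
    (hypothetical patch features).
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def nearest_index(point, candidates):
        return min(range(len(candidates)),
                   key=lambda k: sq_dist(point, candidates[k]))

    # Nearest target feature for every template feature, and vice versa.
    nn_in_target = [nearest_index(p, target_feats) for p in template_feats]
    nn_in_template = [nearest_index(q, template_feats) for q in target_feats]

    # Keep (i, j) only when i -> j and j -> i agree.
    return [(i, j) for i, j in enumerate(nn_in_target)
            if nn_in_template[j] == i]


if __name__ == "__main__":
    template = [(0.0, 0.0), (10.0, 10.0), (5.0, 5.0)]
    target = [(0.1, 0.0), (9.0, 9.0), (100.0, 100.0)]
    print(best_buddies_pairs(template, target))
```

In the example, the far-away target point (100, 100) never forms a pair, which is the intuition behind the robustness to outliers: an outlier may have a nearest neighbor, but is rarely anyone's nearest neighbor in return.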

### Image Forgery Detection Based on Semantic Image Understanding

Image forensics has been focusing on low-level visual features, paying little attention to the high-level semantic information of the image. In this work, we propose a framework for image forgery detection based on high-level semantics with three components: an image understanding module, the normal rule bank (NR) holding semantic rules that comply with our common sense, and the abnormal rule bank (AR) holding semantic rules that don’t. Ke et al. [1] also proposed a similar framework, but ours has the following advantages. Firstly, the image understanding module is integrated by a dense image caption model, with no need for human intervention and more hierarchical features. Secondly, our proposed framework can generate thousands of semantic rules automatically for NR. Thirdly, besides NR, we also propose to construct AR. In this way, we can frame image forgery detection not only as anomaly detection with NR, but also as a recognition problem with AR. The experimental results demonstrate that our framework is effective and performs better.

Kui Ye, Jing Dong, Wei Wang, Jindong Xu, Tieniu Tan

### Relative Distance Metric Leaning Based on Clustering Centralization and Projection Vectors Learning for Person Re-identification

Existing projection-based person re-identification methods usually suffer from long training time, high dimension of the projection matrix, and low matching rate. In addition, the intra-class instances may be far fewer than the inter-class instances when a training dataset is built. To solve these problems, a novel relative distance metric leaning based on clustering centralization and projection vectors learning (RDML-CCPVL) is proposed. When constructing the training dataset, the images of the same target person are clustered and centralized with FCM. The training datasets are built from these clusters in order to alleviate the imbalanced data problem of the training datasets. In addition, while learning the projection matrix, the resulting projection vectors can be made approximately orthogonal by using an iteration strategy and a conjugate gradient projection vector learning method to update the training datasets. The advantage of this method is its quadratic convergence, which speeds up convergence. Experimental results show that the proposed algorithm has higher efficiency. The matching rate can be significantly improved, and the training time is much shorter than that of most existing person re-identification algorithms.

Tongguang Ni, Zongyuan Ding, Fuhua Chen, Hongyuan Wang

### Multi-context Deep Convolutional Features and Exemplar-SVMs for Scene Parsing

Scene parsing is a challenging task in the computer vision field. The task of scene parsing is to label every pixel in an image with the semantic category to which it belongs. In this paper, we solve this problem by proposing an approach that combines multi-context deep convolutional features with exemplar-SVMs for scene parsing. A convolutional neural network is employed to learn the multi-context deep features, which include global and local image features. In contrast to hand-crafted feature extraction approaches, the convolutional neural network learns features automatically, and the features can better describe images for the task. In order to obtain high class recognition accuracy, our system uses exemplar-SVMs, training a linear SVM classifier for every exemplar in the training set for classification. Finally, multiple cues are integrated into a Markov Random Field framework to infer the parsing result. We apply our system to two challenging datasets, the SIFT Flow dataset and a dataset that we collected ourselves. The experimental results demonstrate that our method can achieve good performance.

Xiaofei Cui, Hanbing Qu, Songtao Wang, Liang Dong, Ziliang Qi

### How Depth Estimation in Light Fields Can Benefit from Angular Super-Resolution?

With the development of consumer light field cameras, light field imaging has become an extensively used method for capturing the 3D appearance of a scene. Depth estimation often requires a densely sampled light field in the angular domain. However, there is an inherent trade-off between the angular and spatial resolution of the light field. Recently, some studies on novel view synthesis or angular super-resolution from a sparse set of views have been introduced. Rather than optimizing the depth maps as conventional approaches do, these approaches focus on maximizing the quality of synthetic views. In this paper, we investigate how depth estimation can benefit from these angular super-resolution methods. Specifically, we compare the quality of the depth estimated using the original sparsely sampled light fields and the reconstructed densely sampled light fields. Experimental results evaluate the enhanced depth maps using different view synthesis approaches.

Mandan Zhao, Gaochang Wu, Yebin Liu, Xiangyang Hao

### Influence Evaluation for Image Tampering Using Saliency Mechanism

Because image content is immediate and easy to understand, the applications of digital images have brought great opportunities to the development of social networks. However, there exist some serious problems. Some visual content has been maliciously tampered with to achieve illegal purposes, while some modifications are benign: just for fun, for enhancing artistic value, or for more effective news dissemination. So, beyond tampering detection, how to evaluate the influence of image tampering is now on the agenda. In this paper, with the help of forensic tools, we study the problem of automatically assessing the influence of image tampering by examining whether the modification affects the dominant visual content, and we utilize a saliency mechanism to assess how harmful the tampering is. The experimental results demonstrate the effectiveness of our method.

Kui Ye, Xiaobo Sun, Jindong Xu, Jing Dong, Tieniu Tan

### Correlation Filters Tracker Based on Two-Level Filtering Edge Feature

In recent years, the correlation filter framework has attracted more attention in visual object tracking, providing a real-time approach. However, as the computational complexity of feature extractors increases, trackers lose the real-time advantage of correlation filters. Moreover, model drift in correlation filters can result in tracking failure. In order to solve these problems, a novel and simple correlation filters tracker based on a two-level filtering edge feature (ECFT) is proposed. ECFT extracts a low-complexity edge feature based on two-level filtering for object representation. To reduce model drift as much as possible, the object model is updated adaptively according to the maximum response value of the correlation filters. Comparative experiments with 7 trackers on 20 challenging sequences show that the ECFT tracker can perform better than the other trackers in terms of AUC and Precision.

Dengzhuo Zhang, Donglan Cai, Yun Gao, Hao Zhou, Tianwen Li

### An Accelerated Superpixel Generation Algorithm Based on 4-Labeled-Neighbors

Superpixels are perceptually meaningful atomic regions that can effectively improve the efficiency of subsequent image processing tasks. Simple linear iterative clustering (SLIC) has been widely used for superpixel calculation due to its outstanding performance. In this paper, we propose an accelerated SLIC superpixel generation algorithm using 4-labeled neighbor pixels, called 4L-SLIC. The main idea of 4L-SLIC is that labels are assigned to a portion of the pixels, while the others that are associated with a certain cluster are constrained by their four adjacent labeled pixels. In this way, the average number of distance calculations per pixel is effectively reduced, and the similarity between adjacent pixels ensures a better segmentation effect. The experimental results confirm that 4L-SLIC achieves a speed-up of 25%–30% over SLIC without a sharp decline in accuracy. Compared with the method published in CVIU 2016, 4L-SLIC has an acceptable increase in time cost while significantly improving segmentation accuracy.

Hongwei Feng, Fang Xiao, Qirong Bu, Feihong Liu, Lei Cui, Jun Feng

### A Method of General Acceleration SRDCF Calculation via Reintroduction of Circulant Structure

Discriminatively learned correlation filters (DCF) have been widely used in the online visual tracking field due to their simplicity and efficiency. These methods utilize a periodic assumption of the training samples to construct a circulant data matrix, which also introduces unwanted boundary effects. Spatially Regularized Correlation Filters (SRDCF) solved this issue by introducing penalization on correlation filter coefficients. However, this breaks the circulant structure used in DCF. We propose Faster SRDCF (FSRDCF) via reintroduction of circulant structure. The circulant structure of training samples in the spatial domain is fully used; more importantly, we exploit the circulant structure of the regularization function in the Fourier domain, which allows the problem to be solved more directly and efficiently. Our approach demonstrates superior performance over other non-spatial-regularization trackers on OTB2013 and OTB2015.

Xiaoxiang Hu, Yujiu Yang
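
The reason circulant structure matters for efficiency can be shown with a small, self-contained check. It demonstrates only the general property the abstract relies on: multiplying by a circulant matrix equals element-wise multiplication in the Fourier domain. It is not the FSRDCF solver, and the helper names are hypothetical. A naive DFT is used to keep the sketch dependency-free; a real tracker would use an FFT.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), for illustration only."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out


def circulant_matvec_direct(c, x):
    """y = C x, where C[i][j] = c[(i - j) % n] (c is the first column)."""
    n = len(c)
    return [sum(c[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]


def circulant_matvec_fourier(c, x):
    """Same product via the convolution theorem: DFT(y) = DFT(c) * DFT(x)."""
    product = [a * b for a, b in zip(dft(c), dft(x))]
    return [v.real for v in dft(product, inverse=True)]


if __name__ == "__main__":
    c = [1.0, 2.0, 3.0, 4.0]
    x = [1.0, 0.0, -1.0, 2.0]
    print(circulant_matvec_direct(c, x))
    print(circulant_matvec_fourier(c, x))
```

Both calls give the same vector up to floating-point error. With an FFT the Fourier route costs O(n log n) instead of O(n^2), which is why losing the circulant structure is expensive.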

### Using Deep Relational Features to Verify Kinship

Kinship verification from facial images is a very challenging research topic. Unlike most previous methods, which focus on calculating a similarity metric, in this work we utilize a convolutional neural network and an autoencoder to learn deep relational features for verifying kinship from facial images. Specifically, we first train a convolutional neural network to extract representative facial features, which are derived from the last fully-connected layer of the network. Then, the facial features of two persons are set as the two ends of an autoencoder respectively, and relational features are extracted from the middle layer of the trained autoencoder. Finally, SVM classifiers are adopted to verify kinship (e.g., Father-Son). Experimental results on two public datasets show the effectiveness of the approach proposed in this work.

Jingyun Liang, Jinlin Guo, Songyang Lao, Jue Li

### Hyperspectral Image Classification Using Spectral-Spatial LSTMs

In this paper, we propose a hyperspectral image (HSI) classification method using spectral-spatial long short term memory (LSTM) networks. Specifically, for each pixel, we feed its spectral values in different channels into the Spectral LSTM one by one to learn the spectral feature. Meanwhile, we first use principal component analysis (PCA) to extract the first principal component from an HSI, and then select local image patches centered at each pixel from it. After that, we feed the row vectors of each image patch into the Spatial LSTM one by one to learn the spatial feature for the center pixel. In the classification stage, the spectral and spatial features of each pixel are fed into softmax classifiers respectively to derive two different results, and a decision fusion strategy is further used to obtain joint spectral-spatial results. Experiments are conducted on two widely used HSIs, and the results show that our method can achieve higher performance than other state-of-the-art methods.

Feng Zhou, Renlong Hang, Qingshan Liu, Xiaotong Yuan

### A Novel Framework for Image Description Generation

Existing image description generation algorithms often fail to cover the rich semantic information in natural images with a single sentence or dense object annotations. In this paper, we propose a novel semi-supervised generative visual sentence generation framework that jointly models a Regions Convolutional Neural Network (RCNN) and an improved Wasserstein Generative Adversarial Network (WGAN) to generate diverse and semantically coherent sentence descriptions of images. In our algorithm, the features of candidate regions are extracted with RCNN and the enriched words are polished by their context with an improved WGAN. The improved WGAN consists of a structured sentence generator and multi-level sentence discriminators. The generator produces sentences recurrently by incorporating region-based visual and language attention mechanisms, while the discriminators assess the quality of the generated sentences. The experimental results on a publicly available dataset show the promising performance of our work against other related works.

Qiang Cai, Ziyu Xue, Xiaoyu Zhang, Xiaobin Zhu, Wei Shao, Lei Wang

### 3D Convolutional Neural Network for Action Recognition

Compared with traditional machine learning methods, deep convolutional neural networks have more powerful learning ability. The convolutional neural network model from deep learning has made remarkable achievements in the field of computer vision. As a branch of neural networks, the 3D convolutional neural network (3D CNN) is a relatively new research area in computer vision. To extract features that contain more information, we develop a novel 3D CNN model for action recognition instead of using traditional 2D inputs. The final feature consists of spatial and temporal information from multiple channels. We demonstrate the efficacy of our method on the KTH dataset.

Junhui Zhang, Li Chen, Jing Tian

### Research on Image Colorization Algorithm Based on Residual Neural Network

In order to colorize the grayscale images efficiently, an image colorization method based on deep residual neural network is proposed. This method combines the classified information and features of the image, uses the whole image as the input of the network and forms a non-linear mapping from grayscale images to the colorful images through the deep network. The network is trained by using the MIT Places Database and ImageNet and colorizes the grayscale images. The experiment result shows that different data sets have different colorization effects on grayscale images, and the complexity of the network determines the colorization effect of grayscale images. This method can colorize the grayscale images efficiently, which has better visual effect.

Pinle Qin, Zirui Cheng, Yuhao Cui, Jinjing Zhang, Qiguang Miao

### Image Retrieval via Balanced and Maximum Variance Deep Hashing

Hashing is a typical approximate nearest neighbor search approach for large-scale data sets because of its low storage space and high computational ability. The higher the variance on each projected dimension is, the more information the binary codes express. However, most existing hashing methods have neglected the variance on the projected dimensions. In this paper, a novel hashing method called balanced and maximum variance deep hashing (BMDH) is proposed to simultaneously learn the feature representation and hash functions. In this work, pairwise labels are used as the supervised information for the training images, which are fed into a convolutional neural network (CNN) architecture to obtain rich semantic features. To acquire effective and discriminative hash codes from the extracted features, an objective function with three restrictions is elaborately designed: (1) similarity-preserving mapping, (2) maximum variance on all projected dimensions, (3) balanced variance on each projected dimension. The competitive performance is acquired using the simple back-propagation algorithm with stochastic gradient descent (SGD) method despite the sophisticated objective function. Extensive experiments on two benchmarks CIFAR-10 and NUS-WIDE validate the superiority of the proposed method over the state-of-the-art methods.

Shengjie Zhang, Yufei Zha, Yunqiang Li, Huanyu Li, Bing Chen

### FFGS: Feature Fusion with Gating Structure for Image Caption Generation

Automatically generating a natural language description of the content of a given image is a challenging task at the intersection of computer vision and natural language processing. The task is challenging because computers not only need to recognize objects, their attributes and the relationships between them in an image, but also need to express these elements in a natural language sentence. This paper proposes a feature fusion with gating structure for image caption generation. First, the pre-trained VGG-19 is used as the image feature extractor. We use the outputs of the FC-7 and CONV5-4 layers as the global and local image features, respectively. Second, the image features and the corresponding sentence are fed into an LSTM to learn their relationship. The global image feature is gated at each time-step before being fed into the LSTM, while the local image feature uses the attention model. Experimental results show that our method outperforms the state-of-the-art methods.

Aihong Yuan, Xuelong Li, Xiaoqiang Lu

### Deep Temporal Architecture for Audiovisual Speech Recognition

Audiovisual Speech Recognition (AVSR) is one of the applications of multimodal machine learning related to speech recognition, lipreading systems and video classification. In recent related work, increasing efforts have been made on Deep Neural Networks (DNNs) for AVSR; moreover, some DNN models including the Multimodal Deep Autoencoder, Multimodal Deep Belief Network and Multimodal Deep Boltzmann Machine perform well in experiments owing to their better generalization and nonlinear transformation. However, these DNN models have several disadvantages: (1) They mainly deal with modal fusion while ignoring temporal fusion. (2) Traditional methods fail to consider the connection among frames in the modal fusion. (3) These models are not end-to-end structures. We propose a deep temporal architecture, which has not only classical modal fusion, but also temporal modal fusion and temporal fusion. Furthermore, overfitting and learning with small sample sizes in AVSR are also studied, and we propose a set of useful training strategies. The experiments show the superiority of our model and the necessity of the training strategies on three datasets: AVLetters, AVLetters2, AVDigits. In the end, we conclude the work.

Chunlin Tian, Yuan Yuan, Xiaoqiang Lu

### Efficient Cross-modal Retrieval via Discriminative Deep Correspondence Model

Cross-modal retrieval has recently drawn much attention due to the widespread existence of multi-modal data, and it generally involves two challenges: how to model the correlations and how to utilize the class label information to eliminate the heterogeneity between different modalities. Most previous works mainly focus on solving the first challenge and often ignore the second one. In this paper, we propose a discriminative deep correspondence model to deal with both problems. By taking the class label information into consideration, our proposed model attempts to seamlessly combine the correspondence autoencoder (Corr-AE) and supervised correspondence neural networks (Super-Corr-NN) for cross-modal matching. The former model can learn the correspondence representations of data from different modalities, while the latter model is designed to discriminatively reduce the semantic gap between the low-level features and high-level descriptions. The extensive experiments tested on three public datasets demonstrate the effectiveness of the proposed approach in comparison with the state-of-the-art competing methods.

Zhikai Hu, Xin Liu, An Li, Bineng Zhong, Wentao Fan, Jixiang Du

### Towards Deeper Insights into Deep Learning from Imbalanced Data

Imbalanced performance usually happens to those classifiers (including deep neural networks) trained on imbalanced training data. These classifiers are more likely to make mistakes on minority class instances than on those majority class ones. Existing explanations attribute the imbalanced performance to the imbalanced training data. In this paper, using deep neural networks, we strive for deeper insights into the imbalanced performance. We find that imbalanced data is a neither sufficient nor necessary condition for imbalanced performance in deep neural networks, and another important factor for imbalanced performance is the distance between the majority class instances and the decision boundary. Based on our observations, we propose a new under-sampling method (named Moderate Negative Mining) which is easy to implement, state-of-the-art in performance and suitable for deep neural networks, to solve the imbalanced classification problem. Various experiments validate our insights and demonstrate the superiority of the proposed under-sampling method.

Jie Song, Yun Shen, Yongcheng Jing, Mingli Song

### Traffic Sign Recognition Based on Deep Convolutional Neural Network

Traffic sign recognition (TSR) is an important component of automated driving systems. It is a rather challenging task to design a high-performance classifier for a TSR system. In this paper, we propose a new method for TSR based on a deep convolutional neural network. In order to enhance the expressive power of the network, a novel structure (dubbed block-layer below) that combines Network-in-Network and residual connections was designed. Our network has 10 layers with parameters (each block-layer is counted as a single layer); the first seven are alternating convolutional layers and block-layers, and the remaining three are fully-connected layers. We trained our TSR network on the German Traffic Sign Recognition Benchmark (GTSRB) dataset. To reduce overfitting, we applied data augmentation to the training images and employed the regularization method known as dropout. We also employed Batch Normalization, a mechanism that has been shown to be efficient for accelerating the training of deep neural networks. To speed up the training, we used an efficient GPU to accelerate the convolutional operation. On the GTSRB test dataset, we achieve an accuracy rate of 98.96%, exceeding the average of human raters.

Shihao Yin, Jicai Deng, Dawei Zhang, Jingyuan Du

### Deep Context Convolutional Neural Networks for Semantic Segmentation

Recent years have witnessed great progress in semantic segmentation using deep convolutional neural networks (DCNNs). This paper presents a novel fully convolutional network for semantic segmentation using multi-scale contextual convolutional features. Since objects in natural images tend to have various scales and aspect ratios, capturing rich contextual information is critical for dense pixel prediction. On the other hand, as the convolutional layers go deeper, the convolutional feature maps of traditional DCNNs gradually become coarser, which may be harmful for semantic segmentation. Based on these observations, we design a deep context convolutional network (DCCNet), which combines the feature maps from different levels of the network in a holistic manner for semantic segmentation. The proposed network allows us to fully exploit local and global contextual information, ranging from an entire scene, through sub-regions, to every single pixel, to perform pixel label estimation. The experimental results demonstrate that our DCCNet (without any postprocessing) outperforms state-of-the-art methods on PASCAL VOC 2012, which is the most popular and challenging semantic segmentation dataset.

Wenbin Yang, Quan Zhou, Yawen Fan, Guangwei Gao, Songsong Wu, Weihua Ou, Huimin Lu, Jie Cheng, Longin Jan Latecki

### Automatic Character Motion Style Transfer via Autoencoder Generative Model and Spatio-Temporal Correlation Mining

The style of motion is essential for virtual character animation, and generating motion styles efficiently is important in computer animation. In this paper, we present an efficient approach to automatically transfer motion style by using an autoencoder generative model and spatio-temporal correlation mining, which allows users to transform an input motion into a new style while preserving its original content. To this end, we introduce a history vector of previous motion frames into the autoencoder generative network and extract the spatio-temporal features of the input motion. Accordingly, the spatio-temporal correlation within motions can be represented by the correlated hidden units in this network. Subsequently, we establish Gram matrix constraints in this feature space to produce the transferred motion with a pre-trained generative model. As a result, various motions with particular semantics can be automatically transferred from one style to another, and extensive experiments demonstrate the outstanding performance of the approach.

Dong Hu, Xin Liu, Shujuan Peng, Bineng Zhong, Jixiang Du

### Efficient Human Motion Transition via Hybrid Deep Neural Network and Reliable Motion Graph Mining

Skeletal motion transition is of crucial importance to simulation in interactive environments. In this paper, we propose a hybrid deep learning framework that allows for flexible and efficient human motion transition from motion capture (mocap) data, and that optimally satisfies diverse user-specified paths. We integrate a convolutional restricted Boltzmann machine with a deep belief network to detect appropriate transition points. Subsequently, a quadruples-like data structure is exploited for motion graph building, which significantly benefits motion splitting and indexing. As a result, various motion clips can be retrieved and transitioned to fulfill the user inputs, while preserving the smooth quality of the original data. The experiments show that the proposed transition approach performs favorably compared to state-of-the-art competing approaches.

Bing Zhou, Xin Liu, Shujuan Peng, Bineng Zhong, Jixiang Du

### End-to-End View-Aware Vehicle Classification via Progressive CNN Learning

This paper investigates how to perform robust vehicle classification in unconstrained environments, in which the appearance of vehicles changes dramatically across viewing angles and the numbers of viewpoint images are not balanced among car models. We propose an end-to-end progressive learning framework, in which the network architecture is reconfigurable, for view-aware vehicle classification. In particular, the proposed network architecture consists of two parts: a general end-to-end progressive CNN architecture for coarse-to-fine or top-down fine-grained recognition tasks, and an end-to-end view-aware vehicle classification framework that combines vehicle classification and viewpoint recognition. We test the technique on a large-scale car dataset, “CompCars”, and experimental results show that our framework can significantly improve the performance of vehicle classification.

Jiawei Cao, Wenzhong Wang, Xiao Wang, Chenglong Li, Jin Tang

### A Fast Method for Scene Text Detection

Text detection is important for many applications such as text retrieval, blind guidance, and industrial automation. At the same time, text detection is a challenging task due to the complexity of the background and the diversity of the font, size and color of the text. In recent years, deep learning has achieved good results in image classification and detection, providing a new approach to text detection. In this paper, we adopt a deep learning based detection method, the Single Shot MultiBox Detector (SSD). However, SSD is a general object detection method that is not specific to text detection and is not fast enough. Our method aims to develop a network for text detection that improves speed and reduces model size. To this end, we design a feature extraction network with the inception module and an additional deconvolution layer. Experiments on the ICDAR2013 benchmark demonstrate that our method is faster than other SSD-based methods with comparable results.

Qing Fang, Yanping Yang, Yali Chen, Xiaoyu Yao

### Automatic Image Description Generation with Emotional Classifiers

Automatically generating a natural sentence describing the content of an image is a hot issue in artificial intelligence that links the domains of computer vision and language processing. Most existing works leverage large object recognition datasets and external text corpora to integrate knowledge between similar concepts. While current works aim to figure out ‘what is it’, ‘where is it’ and ‘what is it doing’ in images, we focus on a less considered but critical concept: ‘how it feels’ in the content of images. We propose to express the feelings contained in images in a more direct and vivid way. To achieve this goal, we extend a pre-trained caption model with an emotion classifier to add abstract knowledge to the original caption. We apply our method to datasets originating from various domains. In particular, we evaluate our method on the newly constructed SentiCap dataset with multiple evaluation metrics. The results show that the newly generated descriptions can summarize the images vividly.

Yan Sun, Bo Ren

### Backmatter
