2016 | Book

Advances in Multimedia Information Processing - PCM 2016

17th Pacific-Rim Conference on Multimedia, Xi'an, China, September 15-16, 2016, Proceedings, Part I

About this book

The two-volume set LNCS 9916 and 9917 constitutes the proceedings of the 17th Pacific-Rim Conference on Multimedia, PCM 2016, held in Xi'an, China, in September 2016.

The 128 papers presented in these proceedings were carefully reviewed and selected from 202 submissions.

The focus of the conference was on multimedia content analysis, multimedia signal processing and communications, and multimedia applications and services.

Table of Contents

Frontmatter
Visual Tracking by Local Superpixel Matching with Markov Random Field

In this paper, we propose a novel method to track non-rigid and/or articulated objects using superpixel matching and a Markov random field (MRF). Our algorithm consists of three stages. First, a superpixel database is constructed by segmenting training frames into superpixels, each represented by multiple features; the appearance information of the target is encoded in this database. Second, each new frame is segmented into superpixels, and its object-background confidence map is derived by comparing its superpixels with their k-nearest neighbors in the superpixel database. Taking context information into account, we utilize the MRF to further improve the accuracy of the confidence map. In addition, local context information is incorporated through a feedback loop to refine superpixel matching. In the last stage, visual tracking is achieved by finding the best candidate via maximum a posteriori estimation based on the confidence map. Experiments show that our method outperforms several state-of-the-art trackers.

Heng Fan, Jinhai Xiang, Zhongmin Chen
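
As an illustration of the second stage, here is a minimal Python sketch (assuming object superpixels in the database are labeled 1 and background superpixels 0; the exponential distance weighting is an assumption, and the MRF refinement and feedback step are omitted) that scores each superpixel of a new frame by its k-nearest neighbors in the superpixel database:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def confidence_map(frame_feats, db_feats, db_labels, k=5):
        """Score each superpixel of a new frame by its k nearest
        neighbors in the training database (1 = object, 0 = background).
        Returns one object-confidence value per superpixel."""
        nn = NearestNeighbors(n_neighbors=k).fit(db_feats)
        dist, idx = nn.kneighbors(frame_feats)
        w = np.exp(-dist)                  # closer neighbors weigh more
        votes = db_labels[idx]             # k labels per superpixel
        return (w * votes).sum(axis=1) / w.sum(axis=1)
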
Saliency Detection Combining Multi-layer Integration Algorithm with Background Prior and Energy Function

In this paper, we propose an improved mechanism for saliency detection. First, based on a novel background prior that takes the four corners of an image as background, the color and spatial contrast of each superpixel are used to obtain a coarse map. Then, we use objectness labels as a foreground prior, based on part of the information in the former map, to construct another map. Further, an energy function is applied to optimize each of the two maps, and a single-layer saliency map is formed by merging them. Finally, to settle the scale problem, we obtain a multi-layer saliency map by presenting an integration algorithm that takes advantage of multiple saliency maps. Quantitative and qualitative experiments on three datasets demonstrate that our method performs favorably against state-of-the-art algorithms.

Chenxing Xia, Hanling Zhang
Facial Landmark Localization by Part-Aware Deep Convolutional Network

Facial landmark localization is a very challenging research task. The localization accuracy of landmarks on different facial parts differs greatly due to texture and shape, yet most existing methods fail to consider the part locations of landmarks. To solve this problem, we propose a novel end-to-end regression framework using a deep convolutional neural network (CNN). Our deep architecture first encodes the image into feature maps shared by all the landmarks. These features are then sent into two independent sub-network modules to regress contour landmarks and inner landmarks, respectively. Extensive evaluations conducted on the 300-W benchmark dataset demonstrate that the proposed deep framework achieves state-of-the-art results.

Keke He, Xiangyang Xue
On Combining Compressed Sensing and Sparse Representations for Object Tracking

The compressed sensing tracking algorithm takes advantage of the target's background information but lacks a feedback mechanism for its results. The ℓ1 sparse tracking algorithm adapts to changes in the target's appearance, but at the cost of losing background information. To enhance the effectiveness and robustness of tracking under distractions such as occlusion and illumination variation, this paper proposes a tracking framework with the ℓ1 sparse representation as the detector and the compressed sensing algorithm as the tracker, and establishes a complementary classifier model. A second-order model updating strategy is proposed to preserve the most representative templates in the ℓ1 sparse representation. This tracking algorithm is shown to be better than prevalent trackers, with precision plots of 77.15 %, 72.33 % and 81.13 % and success plots of 77.67 %, 74.01 % and 81.51 % for the overall, occlusion and illumination-variation cases, respectively.

Hang Sun, Jing Li, Bo Du, Dacheng Tao
Leaf Recognition Based on Binary Gabor Pattern and Extreme Learning Machine

Automatic plant leaf recognition has been a hot research topic in recent years, with encouraging improvements in both recognition accuracy and speed. However, existing algorithms usually extract only certain leaf features (such as shape or texture) or merely adopt traditional neural network algorithms, which limits recognition accuracy and speed, especially for large leaf databases. In this paper, we present a novel leaf recognition method combining feature extraction and machine learning. To overcome the weaknesses of traditional algorithms, we apply the binary Gabor pattern (BGP) and an extreme learning machine (ELM) to recognize leaves. To accelerate recognition, we also extract BGP features from leaf images offline. Unlike traditional classifiers such as BP neural networks and SVMs, our ELM-based method requires setting only one parameter and needs no additional fine-tuning during recognition. Our method is evaluated on several databases of different scales, and comparisons with state-of-the-art methods were conducted to evaluate the combination of BGP and ELM. Visual and statistical results demonstrate its effectiveness.

Huisi Wu, Jingjing Liu, Ping Li, Zhenkun Wen
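
To make the classifier side concrete, below is a minimal ELM sketch in Python (numpy only): the hidden weights are random and fixed, the output weights have a closed-form least-squares solution, and the number of hidden nodes is the single parameter the abstract refers to. Feeding it precomputed BGP features is assumed; BGP extraction itself is not shown.

    import numpy as np

    class ELM:
        """Single-hidden-layer extreme learning machine: random, fixed
        hidden weights; output weights solved in closed form."""
        def __init__(self, n_hidden=500, seed=0):
            self.n_hidden = n_hidden
            self.rng = np.random.default_rng(seed)

        def fit(self, X, Y):                    # Y: one-hot label matrix
            d = X.shape[1]
            self.W = self.rng.standard_normal((d, self.n_hidden))
            self.b = self.rng.standard_normal(self.n_hidden)
            H = np.tanh(X @ self.W + self.b)    # random feature map
            self.beta = np.linalg.pinv(H) @ Y   # least-squares solution
            return self

        def predict(self, X):
            H = np.tanh(X @ self.W + self.b)
            return (H @ self.beta).argmax(axis=1)
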
Sparse Representation Based Histogram in Color Texture Retrieval

Sparse representation is proposed to generate the histogram of feature vectors, namely the sparse representation based histogram (SRBH), in which a feature vector is represented by several basis vectors instead of a single basis vector as in a classical histogram. This amelioration makes SRBH a more accurate representation of feature vectors, which is confirmed by an analysis of reconstruction errors and by an application to color texture retrieval. In color texture retrieval, feature vectors are constructed directly from coefficients of the Discrete Wavelet Transform (DWT). Dictionaries for sparse representation are generated by K-means. A set of sparse representation based histograms from different feature vectors is used for image retrieval, and the chi-squared distance is adopted as the similarity measure. Experimental results, assessed by Precision-Recall and Average Retrieval Rate (ARR) on four widely used natural color texture databases, show that this approach is robust to the number of wavelet decomposition levels and outperforms classical histograms and state-of-the-art approaches.

Cong Bai, Jia-nan Chen, Jinglin Zhang, Kidiyo Kpalma, Joseph Ronsin
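
A short sketch of the two retrieval ingredients named above, under the assumption that each feature vector's sparse coefficient magnitudes are accumulated into the bins of its active dictionary atoms (the authors' exact weighting may differ):

    import numpy as np

    def srbh(codes):
        """Sparse-representation-based histogram: each feature vector
        contributes to several bins (one per active atom), weighted by
        its sparse coefficient magnitudes. codes: n_vectors x n_atoms."""
        h = np.abs(codes).sum(axis=0)
        return h / h.sum()

    def chi2_distance(h1, h2, eps=1e-10):
        """Chi-squared distance used as the similarity measure."""
        return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))
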
Improving Image Retrieval by Local Feature Reselection with Query Expansion

A novel approach related to query expansion is proposed to improve image retrieval performance. It investigates the problem that not all visual features extracted from images are appropriate for similarity matching. To address this issue, we distinguish effective features from noisy features: the former benefit image retrieval, while the latter cause deterioration, since matching noisy features may raise the similarity scores of irrelevant images. In this work, a detailed illustration of effective and noisy features is given, and the problem is solved by selecting effective features to enhance the query feature set while removing noisy features via spatial verification. Experimental results demonstrate that the proposed approach outperforms a number of state-of-the-art query expansion approaches.

Hanli Wang, Tianyao Sun
Sparse Subspace Clustering via Closure Subgraph Based on Directed Graph

Sparse subspace clustering has attracted much attention in the fields of signal processing, image processing, computer vision, and pattern recognition. We propose sparse subspace clustering via closure subgraph (SSC-CG), an algorithm based on directed graphs that accomplishes subspace clustering without the number of subspaces as prior information. In SSC-CG, we use a directed graph to express the relations in the data, instead of the undirected graph used by most previous methods. By finding all strongly connected components with the closure property, we discover all subspaces in the given dataset. Based on these expressive relations, we assign data to subspaces or treat them as noise. Experiments demonstrate that SSC-CG performs well in most conditions.

Yuefeng Ma, Xun Liang
Robust Lip Segmentation Based on Complexion Mixture Model

Lip image analysis plays a vital role in Traditional Chinese Medicine (TCM) and other visual and speech recognition applications. However, when lip images show weak color difference from the surrounding skin or the background is complicated, most current methods struggle to segment the lip regions robustly and accurately. In this paper, we propose a lip segmentation method based on a complexion mixture model to resolve this problem. Specifically, we use the pixel colors of the upper (lip-free) part of the face as training data to build a complexion Gaussian Mixture Model (GMM) for each face image in the Lab color space. Then, by iteratively removing complexion pixels that do not belong to the lip region in the lower part of the face based on the GMM, an initial lip region is obtained. We further build GMMs on the initial lip and non-lip regions, respectively, from which a background probability map is obtained. Finally, we extract the optimal lip contour via a smoothing operation. Experiments are performed on our dataset of 1000 face images, and the results demonstrate the efficacy of the proposed method compared with state-of-the-art lip segmentation methods.

Yangyang Hu, Hong Lu, Jinhua Cheng, Wenqiang Zhang, Fufeng Li, Weifei Zhang
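
A compact sketch of the complexion-modeling step, assuming scikit-learn's GaussianMixture and Lab pixel arrays of shape (n, 3); the iterative removal loop, the lip/non-lip GMMs and the contour smoothing are omitted, and the component count and threshold are placeholder parameters:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def complexion_model(lab_pixels_upper, n_components=3):
        """Fit a GMM to the Lab colors of the upper, lip-free part of
        the face to model this person's complexion."""
        return GaussianMixture(n_components=n_components).fit(lab_pixels_upper)

    def lip_candidates(gmm, lab_pixels_lower, thresh=-12.0):
        """Lower-face pixels with low complexion likelihood are kept
        as lip candidates."""
        return gmm.score_samples(lab_pixels_lower) < thresh
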
Visual BFI: An Exploratory Study for Image-Based Personality Test

This paper positions and explores the topic of image-based personality testing. Instead of responding to text-based questions, subjects are presented with a set of "choose-your-favorite-image" visual questions. With the image options of each question belonging to the same concept, the subjects' personality traits are estimated by observing their preferences among images under several unique concepts. Designing such an image-based personality test involves concept-question identification and image-option selection, and we present a preliminary framework to regularize these two steps in this exploratory study. A demo version of the designed image-based personality test is available at http://www.visualbfi.org/. Subjective as well as objective evaluations demonstrate the feasibility of accurately estimating subjects' personalities within a limited number of visual questions.

Jitao Sang, Huaiwen Zhang, Changsheng Xu
Fast Cross-Scenario Clothing Retrieval Based on Indexing Deep Features

In this paper, we propose a new approach for large-scale daily clothing retrieval. Fast clothing image search across scenarios is challenging due to the large number of clothing images on the Internet and the visual differences between street photos (pictures of people wearing clothing, taken in daily life with complex backgrounds) and online shop photos (pictures of clothing items on people, captured by professionals in more controlled settings). We tackle cross-scenario clothing retrieval through clothing segmentation, based on coarse-to-fine hierarchical superpixel segmentation and pose estimation, to remove the background of the clothing image, and employ deep features to describe various clothing effectively. In addition, to speed up retrieval over large-scale online clothing images, we adopt inverted indexing on the deep features by treating them as a Bag-of-Words model. In this way, we obtain similar clothing items far faster. Experiments demonstrate that our method significantly outperforms state-of-the-art approaches.

Zongmin Li, Yante Li, Yongbiao Gao, Yujie Liu
3D Point Cloud Encryption Through Chaotic Mapping

Three-dimensional (3D) contents such as 3D point clouds, 3D meshes and 3D surface models are growing rapidly and are widely spread throughout industry and daily life. However, few people have considered the privacy preservation of 3D contents. As an attempt towards 3D security, in this paper we propose methods for encrypting 3D point clouds through chaotic mapping. Two encryption schemes are proposed: (1) three random sequences are generated by the logistic chaotic map, and each sequence is sorted to randomly shuffle one coordinate of the 3D point cloud; (2) a random 3×3 invertible rotation matrix and a 3×1 translation vector are generated by the logistic map, and each 3D point is then projected to another random location using this rotation matrix and translation vector in homogeneous coordinates. We test the two encryption schemes on various 3D point clouds, which can be encrypted and decrypted correctly. In addition, we evaluate the encryption results using the Viewpoint Feature Histogram (VFH). The experimental results show that our proposed methods produce nearly unrecognizable encrypted 3D point clouds.

Xin Jin, Zhaoxing Wu, Chenggen Song, Chunwei Zhang, Xiaodong Li
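
A minimal Python sketch of scheme (1): logistic-map sequences, seeded by the secret key, are sorted to obtain one shuffling permutation per coordinate, and decryption inverts the permutations. The seed values shown are placeholders for an actual key.

    import numpy as np

    def logistic_sequence(n, x0, r=3.99):
        """Logistic chaotic map x_{k+1} = r * x_k * (1 - x_k)."""
        seq, x = np.empty(n), x0
        for k in range(n):
            x = r * x * (1.0 - x)
            seq[k] = x
        return seq

    def encrypt_points(pts, key=(0.31, 0.42, 0.53)):
        """Shuffle each coordinate of the N x 3 cloud with the
        permutation induced by sorting one chaotic sequence."""
        enc, perms = pts.copy(), []
        for c, x0 in enumerate(key):
            perm = np.argsort(logistic_sequence(len(pts), x0))
            enc[:, c] = pts[perm, c]
            perms.append(perm)
        return enc, perms

    def decrypt_points(enc, perms):
        pts = enc.copy()
        for c, perm in enumerate(perms):
            inv = np.argsort(perm)          # inverse permutation
            pts[:, c] = enc[inv, c]
        return pts
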
Online Multi-Person Tracking Based on Metric Learning

The correct association of detections and tracklets is the key to online multi-person tracking. A good appearance model can guide data association and plays an important role in it. In this paper, we construct a discriminative appearance model using metric learning, which obtains accurate appearance affinities under human appearance variations. The novel appearance model significantly guides data association. Furthermore, the model is learned incrementally according to the association results, and its parameters are automatically updated to suit subsequent online tracking. Based on an online tracking-by-detection framework, our method achieves reliable tracking of multiple persons even in complex scenes. Experimental evaluation on publicly available datasets shows that the proposed online multi-person tracking method works well.

Changyong Yu, Min Yang, Yanmei Dong, Mingtao Pei, Yunde Jia
A Low-Rank Tensor Decomposition Based Hyperspectral Image Compression Algorithm

Hyperspectral images (HSIs), which are widely known to contain much richer information in the spectral domain, have attracted increasing attention in various fields. In practice, however, since a hyperspectral image contains a large amount of redundant information in both the spatial and spectral domains, the accuracy and efficiency of data analysis are often decreased. Various attempts have been made to solve this problem by image compression. Many conventional compression methods can effectively remove spatial redundancy but ignore the great amount of redundancy in the spectral domain. In this paper, we propose a novel compression algorithm via patch-based low-rank tensor decomposition (PLTD). In this framework, the HSI is divided into local third-order tensor patches. Similar tensor patches are then grouped together to construct a fourth-order tensor, and each cluster is decomposed into a smaller coefficient tensor and dictionary matrices by low-rank decomposition. In this way, redundancy in both the spatial and spectral domains can be effectively removed. Extensive experimental results on various public HSI datasets demonstrate that the proposed method outperforms traditional image compression approaches.

Mengfei Zhang, Bo Du, Lefei Zhang, Xuelong Li
Moving Object Detection with ViBe and Texture Feature

In the field of computer vision, moving object detection in complicated environments is challenging. This study proposes a moving-target detection algorithm combining ViBe and spatial information to address the poor adaptability of ViBe in complex scenes. The CSLBP texture descriptor is improved to describe background features more accurately. An adaptive threshold is introduced, and thresholding on absolute differences is applied to obtain binary string descriptors by comparing pixels from the same region or different images. Then, by adding spatial features to ViBe, a background model based on color and texture features is obtained. Experimental results show that the proposed method addresses the deficiency of ViBe's feature representation and improves its adaptability in complex video scenes with shadows, background interference and slow-moving targets, which in turn improves detection precision.

Yumin Tian, Dan Wang, Peipei Jia, Jinhui Liu
Leveraging Composition of Object Regions for Aesthetic Assessment of Photographs

Automatically evaluating the aesthetic quality of photos is a highly challenging task. In this paper, we propose and investigate a novel method for the aesthetic assessment of photos that integrates the composition of salient object regions into the assessment. Specifically, we first evaluate the objectness of regions in photos by considering the spatial location and shape of the salient object regions. Then, we extract features based on the spatial composition of objects. The proposed features fuse aesthetic rules with the composition of semantic regions. The proposed method is evaluated on a large dataset, and experimental results demonstrate its efficacy.

Hong Lu, Zeping Yao, Yunhan Bai, Zhibin Zhu, Bohong Yang, Lukun Chen, Wenqiang Zhang
Video Affective Content Analysis Based on Protagonist via Convolutional Neural Network

Affective recognition is an important and challenging task in video content analysis. Affective information in videos is closely related to viewers' feelings and emotions, so video affective content analysis has great potential value. However, most previous methods focus on how to effectively extract features from videos for affective analysis, while several issues remain worth investigating: for example, what information is used to express emotions in videos, and which information actually affects audiences' emotions. Taking these issues into account, in this paper we propose a new video affective content analysis method based on protagonist information via a Convolutional Neural Network (CNN). The proposed method is evaluated on the largest video emotion dataset and compared with previous work. The experimental results show that our protagonist-based affective analysis method achieves the best performance in emotion classification and prediction.

Yingying Zhu, Zhengbo Jiang, Jianfeng Peng, Sheng-hua Zhong
Texture Description Using Dual Tree Complex Wavelet Packets

In this work we extend several DWT-based wavelet and wavelet packet feature extraction methods to use the dual-tree complex wavelet transform. This way we aim at alleviating shortcomings of the different algorithms which stem from the use of the underlying DWT. We show that, while some methods benefit significantly from extending them to be based in the dual-tree complex wavelet transform domain (and also provide the best overall results), for other methods there is almost no impact of this extension.

M. Liedlgruber, M. Häfner, J. Hämmerle-Uhl, A. Uhl
Fast and Accurate Image Denoising via a Deep Convolutional-Pairs Network

Most popular image denoising approaches exploit either internal priors or priors learned from external clean images to reconstruct the latent image. However, it is hard for those algorithms to construct perfect connections between clean images and noisy ones. To tackle this problem, we present a deep convolutional-pairs network (DCPN) for image denoising. Observing that deeper networks improve denoising performance, we propose to use deeper networks than those previously employed for low-level image processing tasks. In our method, we build end-to-end mappings directly from a noisy image to its corresponding noise-free image by applying paired deep convolutional layers to image patches. Because these mappings are trained from large data, denoising is much faster than with other methods. DCPN is composed of three convolutional-pairs layers and one transitional layer: two convolutional-pairs layers are used for encoding and the other for decoding. Numerical experiments show that the proposed method outperforms many state-of-the-art denoising algorithms in both speed and performance.

Lulu Sun, Yongbing Zhang, Wangpeng An, Jingtao Fan, Jian Zhang, Haoqian Wang, Qionghai Dai
Traffic Sign Recognition Based on Attribute-Refinement Cascaded Convolutional Neural Networks

Traffic sign recognition is a critical module of intelligent transportation systems. Observing that a subtle difference may cause misclassification when the actual class and the predicted class share the same attributes (such as shape, color and function), we propose a two-stage cascaded convolutional neural network (CNN) framework, called attribute-refinement cascaded CNNs, to train the traffic sign classifier by taking full advantage of attribute supervisory signals. The first-stage CNN is trained with class labels as supervisory signals, while the second-stage CNNs are trained on super classes separately, according to auxiliary attributes of traffic signs, for further refinement. Experiments show that the proposed hierarchical cascaded framework can extract deep information from similar categories, improve the discrimination of the model and increase the classification accuracy of traffic signs.

Kaixuan Xie, Shiming Ge, Qiting Ye, Zhao Luo
Building Locally Discriminative Classifier Ensemble Through Classifier Fusion Among Nearest Neighbors

Many studies on ensemble learning, which combines multiple classifiers, have shown that it is an effective technique for improving the accuracy and stability of a single classifier. In this paper, we propose a novel discriminative classifier fusion method, which applies the local classification results of classifiers among nearest neighbors to build a local classifier ensemble. Through this dynamic selection process, discriminative classifiers are weighted heavily to build a locally discriminative ensemble. Experimental results on several UCI datasets show that our proposed method achieves the best classification performance compared with individual classifiers, majority voting and AdaBoost.

Xiang-Jun Shen, Wen-Chao Zhang, Wei Cai, Ben-Bright B. Benuw, He-Ping Song, Qian Zhu, Zheng-Jun Zha
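
One plausible instantiation of the idea in Python (the accuracy-based weighting and the validation split are assumptions): each base classifier is weighted by its accuracy on the validation samples nearest to the test point, and the weighted votes decide the label.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def local_ensemble_predict(x, X_val, y_val, classifiers, k=10):
        """Weight each base classifier by its accuracy on the k
        validation samples nearest to x, then take a weighted vote."""
        nn = NearestNeighbors(n_neighbors=k).fit(X_val)
        _, idx = nn.kneighbors(x.reshape(1, -1))
        neigh_X, neigh_y = X_val[idx[0]], y_val[idx[0]]
        votes = {}
        for clf in classifiers:
            w = (clf.predict(neigh_X) == neigh_y).mean()  # local accuracy
            pred = clf.predict(x.reshape(1, -1))[0]
            votes[pred] = votes.get(pred, 0.0) + w
        return max(votes, key=votes.get)
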
Retrieving Images by Multiple Samples via Fusing Deep Features

Most existing image retrieval systems search for similar images given a single input, while querying with multiple images is not trivial. In this paper, we describe a novel image retrieval paradigm in which users input two images as a query to search for images that contain the content of both inputs simultaneously. In our solution, a deep CNN feature is extracted from each query image and then fused into the query feature. Because the roles of the two query images are different and changeable, we propose Feature Weighting by Clustering (FWC), a novel algorithm to weight the two query features: all CNN features in the dataset are clustered, and the weight of each query is obtained from its distance to the mutual nearest cluster. The effectiveness of our algorithm is evaluated on the PASCAL VOC2007 and Microsoft COCO datasets.

Kecai Wu, Xueliang Liu, Jie Shao, Richang Hong, Tao Yang
A Part-Based and Feature Fusion Method for Clothing Classification

Clothing recognition and parsing have attracted substantial attention in the computer vision community and contribute to applications such as scene recognition, event recognition and e-commerce. In our work, a part-based and feature fusion method is proposed to classify clothing in natural scenes. First, clothing is described with a part-based model, in which a Deformable Part-based Model (DPM) and a key-point regression method are used to locate the head-shoulder region and the human torso. Then, the novel Distinctive Efficient Robust Feature (DERF) and four other low-level features are extracted to represent clothing. Finally, a feature fusion strategy is utilized to improve classification performance. Experiments are conducted on a new, well-labeled image dataset, and the results show the efficiency of our proposed method.

Pan Huo, Yunhong Wang, Qingjie Liu
Research on Perception Sensitivity of Elevation Angle in 3D Sound Field

The development of virtual reality and three-dimensional (3D) video has inspired interest in 3D audio, which aims at reconstructing the spatial information of original signals; knowledge of spatial perception sensitivity and the minimum audible angle (MAA) helps improve the accuracy of the reconstructed signals. Measurements and analyses of MAA thresholds are at present limited to the azimuth angle and lack a quantitative analysis of elevation angles, so a complete spatial perception model of the 3D sound field, which could be used for accurate sound field reconstruction, cannot yet be built. To study the perception sensitivity of the elevation angle at different locations in a 3D sound field, subjective listening tests were conducted, and elevation minimum audible angle (EMAA) thresholds at 144 locations in the 3D sound field were measured. The tests followed the quantitative analysis of azimuth minimum audible angle (AMAA) thresholds of the human ear, based on a psychoacoustic model and a manikin. The results show that EMAA thresholds clearly depend on the elevation angle: thresholds vary between 3° and 30°, reach their minimum at the ear plane (elevation angle 0°), increase linearly as the elevation angle departs from the ear plane, and reach relative maxima on both sides (elevation angles −30° and 90°). The EMAA thresholds also depend on the azimuth angle, reaching their minimum around the median plane (azimuth angle 0°).

Yafei Wu, Xiaochen Wang, Cheng Yang, Ge Gao, Wei Chen
Tri-level Combination for Image Representation

The context of objects can provide auxiliary discrimination beyond the objects themselves. However, this effective information has not been fully explored. In this paper, we propose Tri-level Combination for Image Representation (TriCoIR) to address the problem at three levels: object intrinsic, strongly-related context and weakly-related context. The object-intrinsic level excludes external disturbances and focuses on the objects themselves. Strongly-related context is cropped from the input image with a looser bound to contain surrounding context. Weakly-related context is recovered from the image outside the object, as global context. First, strongly- and weakly-related context are constructed from the input image. Second, we make cascade transformations for more intrinsic object information, relying on the consistency between the generated global context and the input image in the regions outside the object. Finally, a joint representation is acquired from these three levels of features. Experiments on two benchmark datasets prove the effectiveness of TriCoIR.

Ruiying Li, Chunjie Zhang, Qingming Huang
Accurate Multi-view Stereopsis Fusing DAISY Descriptor and Scaled-Neighbourhood Patches

In this paper, we present an efficient patch-based multi-view stereo (MVS) reconstruction approach, designed to reconstruct accurate, dense 3D models from high-resolution image sets. Wide-baseline matching is challenging due to the large perspective distortions, increased occluded areas and high-curvature regions that are inevitable in MVS, and correlation window measurements, commonly used as the photometric discrepancy function, are not appropriate for it. We introduce the DAISY descriptor for the photo-consistency optimization of each new patch, which makes our algorithm robust to distortion, occlusion and edge regions compared with many other photometric constraints. Another key to the performance of patch-based MVS is the estimation of patch normals: we estimate the initial normal of every seed patch by fitting quadrics to scaled-neighbourhood patches, to handle the reconstruction of regions with high local curvature. Our approach is demonstrated to perform well on large-scale scenes in terms of both accuracy and completeness.

Fei Wang, Ning An
Stereo Matching Based on CF-EM Joint Algorithm

Cost Filtering (CF) and Energy Minimization (EM) are the two main cost aggregation approaches in stereo matching. Owing to a global smoothness assumption, EM methods can achieve higher matching accuracy; however, they tend to fail in occluded areas, where locally adaptive support-weight CF methods work well. This paper proposes a joint CF-EM stereo matching framework, based on a proof that CF and EM methods can be converted into each other. In this joint framework, we first use a CF method with a fully connected Markov Random Field (F-MRF) model to yield a more robust unary potential. This unary potential is then used as the input to a standard EM method, which computes the final disparity in a locally connected MRF (L-MRF) model. Experimental results demonstrate that our method improves stereo matching accuracy by transferring energy from the F-MRF to the L-MRF.

Baoping Li, Long Ye, Yun Tie, Qin Zhang
Fine-Grained Vehicle Recognition in Traffic Surveillance

Fine-grained vehicle recognition in traffic surveillance plays a crucial part in establishing intelligent transportation systems. The major challenge is that differences among vehicle models are often subtle. In this paper, we propose a part-based method combining global and local features for fine-grained vehicle recognition in traffic surveillance. We develop a novel voting mechanism to unify the preliminary recognition results, which are obtained using Histograms of Oriented Gradients (HOG) and pre-trained convolutional neural networks (CNNs), thereby fully exploiting the discriminative ability of different parts. Besides, we collect a comprehensive public database of 50 common vehicle models with manual part annotations, which is used to evaluate the proposed method and serves as a supporting dataset for related work. The experiments show that the average recognition accuracy of our method approaches 92.3 %, which is 3.4 %–7.1 % higher than state-of-the-art approaches.

Qi Wang, Zhongyuan Wang, Jing Xiao, Jun Xiao, Wenbin Li
Transductive Classification by Robust Linear Neighborhood Propagation

We propose an enhanced label prediction method termed Robust Linear Neighborhood Propagation (R-LNP) for transductive classification. To encode the neighborhood reconstruction error more accurately, we apply the L2,1-norm, which is proved to be very robust to noise, to characterize the manifold smoothing term; the L2,1-norm also enforces the neighborhood reconstruction error to be sparse in rows, i.e., the entries of some rows are zeros. In addition, to enhance robustness when modeling the difference between the initial and predicted labels, we also regularize the label fitting term with a weighted L2,1-norm, so the resulting measures are more accurate. Compared with several transductive label propagation models, our proposed algorithm obtains state-of-the-art performance over extensive representation and classification experiments.

Lei Jia, Zhao Zhang, Weiming Jiang
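
For reference, the L2,1-norm of an n × d matrix M with rows m_i sums the Euclidean norms of the rows, which is why minimizing it drives entire rows of the reconstruction error to zero:

$$\|M\|_{2,1} = \sum_{i=1}^{n} \|m_i\|_2 = \sum_{i=1}^{n} \sqrt{\sum_{j=1}^{d} M_{ij}^{2}}$$
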
Discriminative Sparse Coding by Nuclear Norm-Driven Semi-Supervised Dictionary Learning

In this paper, we propose a Nuclear-norm-driven Semi-Supervised Dictionary Learning (N-SSDL) approach for classification. N-SSDL incorporates the idea of the recent label-consistent KSVD into a label propagation process that propagates label information from labeled to unlabeled data by balancing the neighborhood reconstruction error and the label fitness error. To provide a more reliable distance metric for measuring the neighborhood reconstruction error, we apply the nuclear norm, which is proved suitable for modeling the reconstruction error, where the reconstruction coefficients are computed from the sparsely reconstructed training data rather than the original data. Besides, we use the robust l2,1-norm regularization on the label fitness error so that the measurement is robust to noise and outliers. Extensive simulations on several datasets show that N-SSDL delivers enhanced classification performance over other state-of-the-art methods.

Weiming Jiang, Zhao Zhang, Yan Zhang, Fanzhang Li
Semantically Smoothed Refinement for Everyday Concept Indexing

Instead of occurring independently, semantic concepts tend to co-occur within a single image, so it is intuitive that concept detection accuracy can be enhanced if concept correlation is leveraged in some way. In everyday concept detection for visual lifelogging, where wearable cameras automatically record everyday activities, the captured images usually contain a diversity of concepts, which challenges detection performance. In this paper, a semantically smoothed refinement algorithm is proposed that uses concept correlations exploiting topic-related concept relationships, modeled externally in a user experiment rather than extracted from training data. Initial concept detection results are factorized based on semantic smoothness and adjusted in compliance with the extracted concept correlations. Experiments demonstrate the effectiveness of the refinement algorithm and the extracted correlations.

Peng Wang, Lifeng Sun, Shiqiang Yang, Alan F. Smeaton
A Deep Two-Stream Network for Bidirectional Cross-Media Information Retrieval

Recent developments in deep learning have shown wide applications in traditional vision tasks such as image classification and object detection. However, as a fundamental problem in artificial intelligence connecting computer vision and natural language processing, bidirectional retrieval of images and sentences is not as well studied as those traditional problems, and the results are far from satisfying. In this paper, we learn a cross-media representation model with a deep two-stream network. Previous models generally use image label information to train on the dataset or strictly correspond local features in images and texts. Unlike those models, we learn globalized local features, which reflect the salient objects as well as the details in images and sentences. After mapping the cross-media data into a common feature space, we use a max-margin criterion to update the network. Experiments on the Flickr8k dataset show that our approach achieves superior performance compared with state-of-the-art methods.

Tianyuan Yu, Liang Bai, Jinlin Guo, Zheng Yang, Yuxiang Xie
Prototyping Methodology with Motion Estimation Algorithm

With CPUs, GPUs and other hardware accelerators, heterogeneous systems can increase computing performance in many domains of general-purpose computing. The Open Computing Language (OpenCL) is the first open and free standard for heterogeneous computing across multiple hardware platforms. In this paper, a parallelized Full Search Motion Estimation (FSME) approach exploits the parallelism available in OpenCL-supported devices and in the algorithm. Different from existing GPU-based motion estimation approaches, the proposed approach is implemented on a heterogeneous computing system containing both a CPU and a GPU. In addition, we propose a prototyping framework that directly generates executable code for the target hardware from a high-level description of the application and balances the workload distribution in the heterogeneous system. It greatly reduces the development period of parallel programming and eases access to parallel computing without requiring attention to complex kernel code.

Jinglin Zhang, Jian Shang, Cong Bai
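
For reference, a scalar Python version of the full-search kernel (SAD cost, one block); every candidate displacement is evaluated independently, which is exactly the parallelism an OpenCL implementation maps onto work-items:

    import numpy as np

    def full_search_me(cur, ref, by, bx, block=16, radius=8):
        """Exhaustively match one block of the current frame against a
        (2*radius+1)^2 search window in the reference frame."""
        cur_blk = cur[by:by + block, bx:bx + block].astype(np.int32)
        best, mv = np.inf, (0, 0)
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                y, x = by + dy, bx + dx
                if y < 0 or x < 0 or y + block > ref.shape[0] \
                        or x + block > ref.shape[1]:
                    continue                # candidate outside the frame
                cand = ref[y:y + block, x:x + block].astype(np.int32)
                sad = np.abs(cur_blk - cand).sum()
                if sad < best:
                    best, mv = sad, (dy, dx)
        return mv, best
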
Automatic Image Annotation Using Adaptive Weighted Distance in Improved K Nearest Neighbors Framework

Automatic image annotation is challenging due to the label-image matching, label imbalance and label missing problems. Some research has addressed parts of these problems but has not integrated them. In this paper, an adaptive weighted distance method, which incorporates a CNN (convolutional neural network) feature and multiple handcrafted features, is proposed to handle the label-image matching and label imbalance issues, while the K nearest neighbors framework is improved by using neighborhoods over all labels, which reduces the effects of the label-missing problem. Finally, experiments on three benchmark datasets (Corel-5k, ESP-Game and IAPRTC-12) show that our approach is competitive with state-of-the-art methods.

Jiancheng Li, Chun Yuan
One-Shot-Learning Gesture Segmentation and Recognition Using Frame-Based PDV Features

This paper proposes a novel online gesture segmentation and recognition method for one-shot learning on depth video. In each depth image, we take several random points from the motion region and select a group of relevant points for each random point. The depth difference between each random point and its relevant points is calculated in Motion History Images, and the results are used to generate the random point's feature. We then use a Random Decision Forest to assign a gesture label to each random point and compute a probability distribution vector (PDV) for each frame of the video. Finally, we build a probability distribution matrix (PDM) from the PDVs of sequential frames and perform online segmentation and recognition for one-shot learning. Experimental results show that our method is competitive with state-of-the-art methods.

Tao Rong, Ruoyu Yang
Multi-scale Point Set Saliency Detection Based on Site Entropy Rate

Visual saliency in images has been studied extensively in the literature, but there is little work on point sets. In this paper, we propose an approach based on pointwise site entropy rate to detect the saliency distribution in unorganized point sets and range data, which lack topological information. In our model, a point set is first transformed into a sparsely connected graph, and the model runs random walks on the graph to simulate signal/information transmission. We evaluate point saliency using the site entropy rate (SER), which reflects the average information transmitted from a point to its neighbors. By simulating the diffusion process on each point, multi-scale saliency maps are obtained, which we combine to generate the final result. The effectiveness of the proposed approach is demonstrated by comparisons with other approaches on a range of test models; the experiments show that our model achieves good performance without using any connectivity information.

Yu Guo, Fei Wang, Pengyu Liu, Jingmin Xin, Nanning Zheng
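
In the site-entropy-rate formulation this builds on, with transition probabilities derived from the edge weights w_ij of the graph and π the stationary distribution of the random walk, the saliency of a node i is (roughly) the average information it transmits to its neighbors per step:

$$\mathrm{SER}(i) = -\pi_i \sum_{j} p_{ij}\log p_{ij}, \qquad p_{ij} = \frac{w_{ij}}{\sum_{k} w_{ik}}$$
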
Facial Expression Recognition with Multi-scale Convolution Neural Network

We present a deep convolutional neural network (CNN) architecture for facial expression recognition. Inspired by the fact that regions located around certain facial parts (e.g. mouth, nose, eyes and brows) contain the most representative information about expressions, an architecture that extracts features at different scales from intermediate layers is designed to combine both local and global information. In addition, noticing that, specifically for facial expression recognition, traditional face alignment distorts images and loses expression information, we apply batch normalization to the architecture instead of face alignment and feed the network with original images. Moreover, considering the tiny between-class differences caused by the same facial movements, a triplet-loss learning method is used to train the architecture, which improves the discrimination of the deep features. Experiments show that the proposed architecture achieves superior performance to other state-of-the-art methods on the FER2013 dataset.

Jieru Wang, Chun Yuan
Deep Similarity Feature Learning for Person Re-identification

Person re-identification aims to match the same pedestrians across different camera views and has been applied to many important applications such as intelligent video surveillance. Due to the spatiotemporal uncertainty and visual ambiguity of pedestrian image pairs, person re-identification remains a difficult and challenging problem. The huge success of deep learning has focused attention on the use of deep features for person re-identification. However, for person re-identification, most deep learning methods minimize cross-entropy or triplet-based losses, thereby neglecting the fact that the similarities and differences between image pairs can be considered simultaneously to increase discrimination. In this paper, we propose a novel deep learning method called deep similarity feature learning (DSFL) to extract more effective deep features for image pairs. Extensive experiments on two representative person re-identification datasets (CUHK-03 and GRID) demonstrate the effectiveness and robustness of DSFL.

Yanan Guo, Dapeng Tao, Jun Yu, Yaotang Li
Object Detection Based on Scene Understanding and Enhanced Proposals

This paper studies the role of scene understanding in object detection by proposing a two-stream pipeline model called Scene-Object Network for Detection (SOND). Specifically, SOND combines scene information with an object detection method; the two are separately implemented by two deep ConvNets and merged at the end of the pipeline. Moreover, we present a novel approach that proportionally combines proposals generated by Selective Search and Edge Boxes, reducing the high localization error observed when only Selective Search is used; these enhanced combined proposals are used in both training and testing. Showing that scene information and enhanced proposals indeed help object detection, we achieve competitive results on PASCAL VOC 2007 (74.2 % mAP) and 2012 (71.8 % mAP), surpassing the Fast-RCNN baseline by 4.2 % on VOC 2007 and 3.4 % on VOC 2012.

Zhicheng Wang, Chun Yuan
Video Inpainting Based on Joint Gradient and Noise Minimization

Video inpainting is the process of filling in missing pixels or replacing undesirable pixels in a video. State-of-the-art approaches show limited effectiveness on videos in which the regions to be filled are large and irregular or the occlusions move slowly; inpainted results on such videos are coarse, retain small amounts of noise, or have a mosaic appearance. In this paper, we propose a new inpainting method to fix this problem. We sparsify the gradient fields of recovered frames along the horizontal and vertical directions and minimize the noise remaining after initial inpainting, while still considering the rank minimization of the matrix to be inpainted. An algorithm that uses partial updates for the dual variables and approximates gradient fields, based on an Alternating Direction Minimization strategy, is designed to optimize the formulation efficiently and effectively. Experimental results on both simulated and real video data demonstrate the superior performance of the proposed method over the state of the art in terms of preserving details and guaranteeing the piecewise smoothness of recovered frames.

Yiqi Jiang, Xin Jin, Zhiyong Wu
Head Related Transfer Function Interpolation Based on Aligning Operation

The head related transfer function (HRTF) is the main technique of binaural synthesis, used to reconstruct spatial sound images, and HRTF data can only be obtained by measurement. A high-resolution HRTF database contains so many HRTFs that the measurement workload is too huge to complete, so many researchers concentrate on HRTF interpolation, computing new HRTFs from measured ones. Before interpolating, however, HRTFs should be aligned, because there are time delays between different HRTFs. Some researchers implement the aligning operation based on phase, but this is not appropriate because of the periodicity of phase. Another idea is to align HRTFs by detection, but the time differences are too tiny to detect exactly. Neither method provides good, stable performance. In this paper, we propose a new method to align HRTFs based on correlation. Experiments show that the proposed aligning method improves the accuracy index SDR by up to 18.5 dB and improves accuracy for all positions.

Tingzhao Wu, Ruimin Hu, Xiaochen Wang, Li Gao, Shanfa Ke
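
A minimal sketch of the correlation idea, applied to time-domain head-related impulse responses (the measured counterparts of HRTFs): the delay estimate is the lag maximizing the cross-correlation. A real implementation would refine this to sub-sample precision, which is omitted here.

    import numpy as np

    def align_by_correlation(hrir_ref, hrir):
        """Estimate the integer-sample delay of `hrir` relative to
        `hrir_ref` as the lag maximizing their cross-correlation,
        then shift `hrir` to cancel that delay."""
        corr = np.correlate(hrir, hrir_ref, mode='full')
        lag = int(corr.argmax()) - (len(hrir_ref) - 1)
        return np.roll(hrir, -lag), lag
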
Adaptive Multi-window Matching Method for Depth Sensing SoC and Its VLSI Implementation

This paper presents a full VLSI implementation of an adaptive multi-window matching method for a depth sensing system on a chip (SoC) based on active infrared structured light, which estimates 3D scene depth by matching randomized speckle patterns, akin to the Microsoft Kinect. We present a simple and efficient hardware structure for the adaptive multi-window block-matching disparity estimation algorithm, which facilitates rapid generation of disparity maps in real time. The disparity map is then converted to depth values according to the triangulation principle. We have implemented these ideas in an end-to-end SoC using an FPGA and demonstrate that our depth sensing SoC ensures matching accuracy and improves the details of depth maps, such as small objects and object edges.

Huimin Yao, Chenyang Ge, Liuqing Yang, Yichuan Fu, Jianru Xue
A Cross-Domain Lifelong Learning Model for Visual Understanding

In the study of machine perception of images and video, people expect machines to have a human-like ability of lifelong learning. Starting from anthropomorphic media perception, this paper studies multimedia perception based on lifelong machine learning. An ideal lifelong machine learning system for visual understanding is expected to learn relevant tasks from one or more domains continuously. However, most existing lifelong learning algorithms do not address the domain shift among tasks. In this work, we propose a novel cross-domain lifelong learning model (CD-LLM) to address the domain shift problem in visual understanding. The main idea is to generate a low-dimensional common subspace that captures domain-invariant properties by embedding task subspaces into a Grassmann manifold. With this low-dimensional common subspace, tasks can be projected and model learning performed. Extensive experiments are conducted on competitive cross-domain datasets, and the results show the effectiveness and efficiency of the proposed algorithm on cross-domain visual tasks.

Chunmei Qing, Zhuobin Huang, Xiangmin Xu
On the Quantitative Analysis of Sparse RBMs

With the development of deep neural networks, the restricted Boltzmann machine (RBM) has gradually become one of the essential models in deep learning research. Because of the partition function, model selection, complexity control and exact maximum likelihood learning are intractable in RBMs. One effective measure is approximate inference, adopting an annealed importance sampling (AIS) scheme, solely to evaluate an RBM's performance. At present, there is little quantitative analysis of the discrepancies between different RBM models. We therefore focus on a quantitative evaluation of the generalization performance of various sparse RBM models, including the classical sparse RBM (SpRBM) and the log-sum sparse RBM (LogSumRBM), and discuss the influence and efficiency of the AIS strategy for estimating these sparse RBMs. In particular, we confirm that LogSumRBM is the best among the RBM and sparse RBM models, since it shows smaller deviations in the assessment results on both the MNIST training and test data, which provides some theoretical and empirical guidance for choosing deep learning models in the future.

Yanxia Zhang, Lu Yang, Binghao Meng, Hong Cheng, Yong Zhang, Qian Wang, Jiadan Zhu
An Efficient Solution for Extrinsic Calibration of a Vision System with Simple Laser

Strong demands for the accurate reconstruction of non-cooperative targets have arisen in recent years. Existing methods that combine cameras and laser range finders (LRFs) are inconvenient and costly. We replace the widely used laser range finder with a simple laser, and find that the combination of a camera and a simple calibrated laser is enough to reconstruct the highly accurate 3-D position of the laser spot. In this paper, we propose a method to calibrate the extrinsic parameters between a camera and a simple laser, and show how to use it to reconstruct the laser spot's 3-D position. The experiments show that our proposed method obtains results comparable to state-of-the-art LRF-based methods.

Ya-Nan Chen, Fei Wang, Hang Dong, Xuetao Zhang, Haiwei Yang
A Stepped-RAM Reading and Multiplierless VLSI Architecture for Intra Prediction in HEVC

An efficient hardware architecture for intra prediction in High Efficiency Video Coding is proposed in this paper. The architecture supports all prediction modes and unit sizes. First, a stepped-RAM reading method is proposed to read RAM pixels in one pipeline period. Second, a new reference pixel-mapping method is presented to handle the hardware-consuming reference mapping process in angular prediction. A universal address arbitration unit and a multiplierless prediction calculation unit, which integrates angular, planar and DC prediction, are also included in the design, efficiently reducing hardware cost. Experimental results show that, with TSMC 65 nm CMOS technology, the proposed architecture reaches a high operating clock frequency of 695 MHz and meets the real-time requirement for 2560 × 1440 video at 35 fps. The hardware cost of our proposed architecture is only 42 K gates. Compared with previous architectures, the proposed architecture greatly increases working throughput and reduces hardware cost.

Wei Zhou, Yue Niu, Xiaocong Lian, Xin Zhou, Jiamin Yang
A Sea-Land Segmentation Algorithm Based on Sea Surface Analysis

Ship detection from optical remote sensing imagery is an important and challenging task, and sea-land segmentation is a key step in it. Because of the complex and varied sea surfaces caused by waves, illumination and shadows, traditional sea-land segmentation algorithms often misjudge between land and sea. Thus, a new segmentation scheme based on sea surface analysis is proposed in this paper, in which an adaptive threshold is determined by statistical analysis of different types of patches from the optical remote sensing images. Experimental results show that our algorithm performs better than traditional algorithms.

Guichi Liu, Enqing Chen, Lin Qi, Yun Tie, Deyin Liu
Criminal Investigation Oriented Saliency Detection for Surveillance Videos

Detecting salient regions, namely locating the key regions that contain rich clues, is of great significance for better mining and analyzing crucial information in surveillance videos. Yet, to date, existing saliency detection methods are mainly designed to fit human perception. Nevertheless, what we value most in surveillance videos, i.e. criminal investigation attentive objects (CIAOs) such as pedestrians, human faces, vehicles and license plates, often differs from what is salient to human vision in general situations. In this paper, we propose a criminal investigation oriented saliency detection method for surveillance videos. A criminal investigation attentive model (CIAM) is constructed to score the occurrence probabilities of CIAOs in the spatial domain, using the score to represent saliency and thus making CIAO regions more salient than non-CIAO regions. In addition, we refine the spatial-domain saliency map with motion information in the temporal domain to obtain a spatio-temporal saliency map that is highly distinctive for regions of moving CIAOs, static CIAOs, moving non-CIAOs and static non-CIAOs. Experimental results on surveillance video datasets demonstrate that the proposed method outperforms state-of-the-art saliency detection methods.

Yu Chen, Ruimin Hu, Jing Xiao, Liang Liao, Jun Xiao, Gen Zhan
Deep Metric Learning with Improved Triplet Loss for Face Clustering in Videos

Face clustering in videos partitions a large number of faces into a given number of clusters, such that some measure of distance is minimized within clusters and maximized between clusters. In real-world videos, head pose, facial expression, scale, illumination, occlusion and other uncontrolled factors may dramatically change the appearance of faces. In this paper, we tackle this problem by learning a non-linear metric function with a deep convolutional neural network, mapping the input image to a low-dimensional feature embedding under the visual constraints among face tracks. Our network directly optimizes the embedding space so that Euclidean distances correspond to a measure of semantic face similarity. This is technically realized by minimizing an improved triplet loss function, which pushes the negative face away from the positive pair and requires the distance of the positive pair to be less than a margin. We extensively evaluate the proposed algorithm on a set of challenging videos and demonstrate significant performance improvements over existing techniques.

Shun Zhang, Yihong Gong, Jinjun Wang
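
A numpy sketch of the improved loss as described (squared Euclidean distances between anchor, positive and negative embeddings are an assumption, and the margin values alpha and beta are placeholders):

    import numpy as np

    def improved_triplet_loss(a, p, n, alpha=0.2, beta=0.1):
        """Push the negative away from the positive pair by a margin
        alpha, and additionally require the positive pair itself to be
        closer than beta. a, p, n: (batch, dim) embeddings."""
        d_ap = np.sum((a - p) ** 2, axis=1)
        d_an = np.sum((a - n) ** 2, axis=1)
        return np.mean(np.maximum(0.0, d_ap - d_an + alpha)
                       + np.maximum(0.0, d_ap - beta))
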
Characterizing TCP Performance for Chunk Delivery in DASH

Dynamic Adaptive Streaming over HTTP (DASH) has emerged as an increasingly popular paradigm for video streaming [12], in which a video is segmented into many chunks delivered to users by HTTP request/response over Transmission Control Protocol (TCP) connections. It is therefore intriguing to study the performance of strategies implemented in conventional TCPs, which are not dedicated to video streaming, e.g., whether chunks are efficiently delivered when users interact with the video players. In this paper, we conduct measurement studies on users' chunk request traces in DASH from a representative video streaming provider, to investigate user behaviors in DASH, and on TCP-connection-level traces from CDN servers, to investigate the performance of TCP for DASH. By studying how video chunks are delivered in both the slow start and congestion avoidance phases, our observations reveal the performance characteristics of TCP for DASH as follows: (1) request patterns in DASH have a great impact on the performance of TCP variants, including CUBIC; (2) strategies in conventional TCPs may cause user-perceived quality degradation in DASH streaming; (3) potential improvements to TCP strategies for better delivery in DASH can be further explored.

Wen Hu, Zhi Wang, Lifeng Sun
Where and What to Eat: Simultaneous Restaurant and Dish Recognition from Food Image

This paper considers the problem of simultaneous restaurant and dish recognition from food images. Since restaurants are known for some of their special dishes (e.g., the dish "hamburger" in the restaurant "KFC"), the dish semantics of a food image provide partial evidence for the restaurant's identity. Therefore, instead of exploiting the binary correlation between food images and dish labels as in existing work, we model food images, their dish names and restaurant information jointly, which is expected to enable novel applications such as food-image-based restaurant visualization and recommendation. As a solution, we propose a model named Partially Asymmetric Multi-Task Convolutional Neural Network (PAMT-CNN), which includes a dish pathway and a restaurant pathway to learn the dish semantics and the restaurant identity, respectively. Considering the dependence of the restaurant identity on the dish semantics, PAMT-CNN learns the restaurant's identity under the guidance of the dish pathway, using a partially asymmetric shared network architecture. To evaluate our model, we construct a food image dataset with 24,690 food images, 100 classes of restaurants and 100 classes of dishes. The evaluation results on this dataset validate the effectiveness of the proposed approach.

Huayang Wang, Weiqing Min, Xiangyang Li, Shuqiang Jiang
A Real-Time Gesture-Based Unmanned Aerial Vehicle Control System

Unmanned aerial vehicles (UAVs) play important roles in many fields due to their stability and flexibility. However, controlling a UAV with its remote controller is very difficult, especially for beginners. In this paper, we propose a real-time UAV control system that exploits only the shape and movements of the user's hands. A set of gestures that maps hand actions to all motions of the UAV is designed based on subjective experience assessment. Motion accuracy of 94.898 % is achieved with only 0.19 s of latency on average; compared with other systems, ours reduces latency by 40.625 % and 36.667 %. To the best of our knowledge, our control system is the first to realize all motions of the UAV in actual experiments using hand motions alone.

Leye Wei, Xin Jin, Zhiyong Wu, Lei Zhang
A Biologically Inspired Deep CNN Model

Recently, Deep Convolutional Neural Networks (DCNNs) have achieved state-of-the-art performance on many tasks in image and video analysis. However, devising a good DCNN model is very challenging, as a network designer must make many choices, including the depth, the number of feature maps, the interconnection patterns, and the window sizes of convolution and pooling layers. These choices constitute a huge search space that makes it impractical to discover an optimal network structure by any systematic approach. In this paper, we strive to develop a good DCNN model by borrowing biological guidance from the human visual cortex. By making an analogy between the proposed DCNN model and the human visual cortex, many critical design choices of the proposed model can be determined with simple calculations. Comprehensive experimental evaluations demonstrate that the proposed DCNN model achieves state-of-the-art performance on four widely used benchmark datasets: CIFAR-10, CIFAR-100, SVHN and MNIST.

Shizhou Zhang, Yihong Gong, Jinjun Wang, Nanning Zheng
Saliency-Based Objective Quality Assessment of Tone-Mapped Images

High Dynamic Range (HDR) imaging is a photographic technique that preserves more intensity information than Low Dynamic Range (LDR) imaging. Based on the features of HDR images, we describe a technique for saliency detection in HDR images and further use it to objectively assess the quality of tone-mapping operators. We propose a tone-mapping quality index that agrees more closely with subjective quality scores ranked by human observers, using the saliency map of the HDR image to adjust the differences in structural fidelity. Experimental results demonstrate that scores from our proposed method correlate well with the subjective scores.

Yinchu Chen, Ke Li, Bo Yan
Sparse Matrix Based Hashing for Approximate Nearest Neighbor Search

Binary hashing has been widely studied for approximate nearest neighbor (ANN) search for its compact representation and efficient comparison. Many existing hashing methods aim at improving the accuracy of ANN search but ignore the complexity of generating the binary codes. In this paper, we propose a new unsupervised hashing method based on a sparse matrix, named Sparse Matrix based Hashing (SMH). There are only three kinds of elements in our sparse matrix: +1, −1 and 0. We learn the sparse matrix by optimizing a new pair-wise distance-preserving objective, which preserves the linear relation between Euclidean distances and the corresponding Hamming distances. With the special form of the sparse matrix, the optimization can be solved by a greedy algorithm. Experiments on two large-scale datasets demonstrate that SMH expedites the generation of binary codes and achieves performance competitive with state-of-the-art unsupervised hashing methods.

Min Wang, Wengang Zhou, Qi Tian, Houqiang Li
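
A toy sketch of the encoding step: with a ternary projection matrix (entries in {+1, −1, 0}), each bit is just a signed sum of a few feature dimensions. Learning the matrix from the pair-wise distance-preserving objective is the paper's contribution and is not shown; the random matrix below is only a stand-in.

    import numpy as np

    def smh_codes(X, S):
        """Binary codes from a ternary projection: only additions and
        subtractions are needed, so encoding is cheap. One bit per
        column of S."""
        return (X @ S >= 0).astype(np.uint8)

    # stand-in ternary projection: 128-d features -> 32-bit codes
    rng = np.random.default_rng(0)
    S = rng.choice([1.0, -1.0], size=(128, 32)) * (rng.random((128, 32)) > 0.9)
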
Piecewise Affine Sparse Representation via Edge Preserving Image Smoothing

We present a new image editing method that obtains a sparse representation of images. Previous methods obtain the sparse image representation using a first-order smoothness prior with the l0-norm; however, incorrect structures are preserved due to the so-called staircasing effect, which usually occurs in regions where the image changes gradually. In this paper, we propose a model composed of a data fidelity term and a new regularization that preserves the gradient at salient edges while penalizing the magnitude of the second-order derivative at all other pixels. To obtain the sparse representation, we minimize the model iteratively; in each iteration, the salient edges are re-extracted and the weight of the regularization is increased. Our iterative smoothing scheme yields the sparse representation while avoiding the incorrect structures caused by staircasing. The experiments show that our method outperforms the state of the art.

Xuan Wang, Fei Wang, Yu Guo
Backmatter
Metadata
Title
Advances in Multimedia Information Processing - PCM 2016
Editors
Enqing Chen
Yihong Gong
Yun Tie
Copyright Year
2016
Electronic ISBN
978-3-319-48890-5
Print ISBN
978-3-319-48889-9
DOI
https://doi.org/10.1007/978-3-319-48890-5