S2DLDP with its Application to Palmprint Recognition

In this paper, we apply the sparse two-dimensional local discriminant projections (S2DLDP) algorithm to palmprint recognition and give a precise evaluation of its recognition performance on the public PolyU palmprint database. S2DLDP brings the idea of sparsity to 2DLDP and offers both high computational efficiency and high recognition performance. We run the algorithm with various numbers of non-zero elements and image sizes, and compare it with the LDA, LPP and DLPP algorithms. The best recognition rate obtained by S2DLDP is 99.5%, which is significantly higher than that of the other three methods. Experimental results demonstrate the effectiveness of the S2DLDP algorithm for palmprint recognition.

Ligang Liu, Jianxin Zhang, Yinghua Jiang

A Hierarchical Voting Scheme for Robust Geometric Model Fitting

In this paper, we propose an efficient and robust model fitting method, called Hierarchical Voting scheme based Fitting (HVF), to deal with multiple-structure data. HVF starts from a hierarchical voting scheme, which simultaneously analyses the consensus information of data points and the preference information of model hypotheses. Based on the proposed hierarchical voting scheme, HVF effectively removes “bad” model hypotheses and outliers to improve the efficiency and accuracy of fitting results. Then, HVF introduces a continuous relaxation based clustering algorithm to fit and segment multiple-structure data. The proposed HVF can effectively estimate model instances from the model hypotheses generated by random sampling, which usually include a large proportion of “bad” model hypotheses. Experimental results show that the proposed HVF method is significantly superior to several state-of-the-art fitting methods on both synthetic data and real images.

Fan Xiao, Guobao Xiao, Xing Wang, Jin Zheng, Yan Yan, Hanzi Wang

Multi-kernel Hashing with Semantic Correlation Maximization for Cross-Modal Retrieval

Cross-modal hashing aims to facilitate approximate nearest neighbor search by embedding multimedia data represented in a high-dimensional space into a common low-dimensional Hamming space, and it is a key part of multimedia retrieval. In recent years, kernel-based hashing methods have achieved impressive success in cross-modal hashing. Inspired by this, we present a novel multiple kernel hashing method, in which hash functions are learned in the kernelized space using a sequential optimization strategy. Experimental results on two benchmark datasets verify that the proposed method significantly outperforms some state-of-the-art methods.

Guangfei Yang, Huanghui Miao, Jun Tang, Dong Liang, Nian Wang
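To make the retrieval step in the abstract above concrete, the following minimal sketch ranks database items by Hamming distance to a query code once every modality has been mapped into the same Hamming space. It does not implement the proposed multi-kernel hash learning; the 8-bit codes and item names are made-up illustrations.

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of differing bits between two binary hash codes stored as ints."""
    return bin(code_a ^ code_b).count("1")


def rank_by_hamming(query_code: int, database: dict) -> list:
    """Return database keys sorted by Hamming distance to the query (ties by key)."""
    return sorted(database, key=lambda k: (hamming_distance(query_code, database[k]), k))


# Hypothetical 8-bit codes: an image query matched against text items
# that were hashed into the same Hamming space.
database = {"text_a": 0b10110010, "text_b": 0b10110011, "text_c": 0b01001101}
ranking = rank_by_hamming(0b10110011, database)
```

Because the distance reduces to an XOR followed by a bit count, search in a Hamming space is cheap even for large databases, which is the usual motivation for hashing-based retrieval.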

Preprocessing and Segmentation Algorithm for Multiple Overlapped Fiber Image

In a fiber image recognition system, accurate segmentation is critical for fiber feature extraction and subsequent identification. In fiber images taken with an optical microscope, the ways in which fibers overlap are complicated, which degrades segmentation accuracy. To obtain a concise and clear fiber contour curve for later processing, this paper designs a preprocessing method that combines curve fitting and complex-domain filtering. A novel technique for matching concave points is also applied to the segmentation. The essence of the segmentation is to construct a virtual fiber boundary on the fiber contour obtained through the preprocessing approach presented in this article; the virtual boundary connects the concave points to be matched so that the overlapped fibers are separated into independent individuals. Whether the virtual boundary is successfully restored is judged by the area of the triangle formed by a given concave point, its predecessor (or successor) point and the target point, as well as by the characteristic relationship between concave points in the bounded closed corner set of the fiber contour curves. Experimental results show that overlapped fibers can be segmented without changing any morphological parameter, and that the approach is suitable for complex scenes, such as fibers adhering to or crossing each other, with high segmentation accuracy.

Xiaochun Chen, Yanbin Zhang, Shun Fu, Hu Peng, Xiang Chen

Action Graph Decomposition Based on Sparse Coding

A video can be thought of as a visual document that may be represented along different dimensions, such as frames, objects and other levels of features. Action recognition is one of the most important and popular tasks, and it requires understanding the temporal and spatial cues in videos. What structures do the temporal relationships share within and across classes of actions? What is the best representation for those temporal relationships? We propose a new temporal relationship representation, called action graphs, based on Laplacian matrices and Allen’s temporal relationships. The recognition framework is based on sparse coding, which also mimics the human vision system in representing and inferring knowledge. “Action graphs” are put forward to represent the temporal relationships and, to the best of our knowledge, we are the first to use sparse graph coding for event analysis.

Wengang Feng, Huawei Tian, Yanhui Xiao, Jianwei Ding, Yunqi Tang

An Unsupervised Change Detection Approach for Remote Sensing Image Using Visual Attention Mechanism

In this paper, we propose a novel approach for unsupervised change detection that integrates a visual attention mechanism, which is able to find the real changes between two images. The approach starts by generating a difference image using the differential method. Second, an entropy-based saliency map is generated to highlight the changed regions, which are regarded as salient regions. Third, a fusion image is generated from the difference image and the entropy-based saliency map. Finally, the K-means clustering algorithm is used to segment the fusion image into changed and unchanged classes. To demonstrate the effect of our approach, we compare it with four other state-of-the-art change detection methods on two datasets. Extensive quantitative and qualitative analysis of the change detection results confirms the effectiveness of the proposed approach and shows that it consistently produces promising results on all the datasets without any a priori assumptions.

Lin Wu, Guanghua Feng, Jiangtao Long
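The first and last steps of the pipeline in the abstract above (difference image, then two-class K-means) can be sketched in a few lines. This is only an illustration under simplifying assumptions: the entropy-based saliency map and the fusion step are omitted because the abstract does not define them, the "differential method" is taken to be a per-pixel absolute difference, and the tiny images are made up.

```python
def difference_image(img_a, img_b):
    """Per-pixel absolute difference of two equally sized grayscale images."""
    return [[abs(a - b) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(img_a, img_b)]


def two_means(values, iterations=20):
    """1-D K-means with k=2; returns the (low, high) cluster centres."""
    lo, hi = min(values), max(values)
    for _ in range(iterations):
        low = [v for v in values if abs(v - lo) <= abs(v - hi)]
        high = [v for v in values if abs(v - lo) > abs(v - hi)]
        if not high:  # all pixels identical: nothing to split
            break
        lo, hi = sum(low) / len(low), sum(high) / len(high)
    return lo, hi


def change_mask(img_a, img_b):
    """Label each pixel 1 (changed) or 0 (unchanged) by its nearest centre."""
    diff = difference_image(img_a, img_b)
    lo, hi = two_means([v for row in diff for v in row])
    return [[1 if abs(v - hi) < abs(v - lo) else 0 for v in row] for row in diff]


# Made-up 2x3 images: two pixels change strongly, the rest only by small noise.
before = [[10, 10, 10], [10, 10, 10]]
after = [[10, 12, 90], [11, 95, 10]]
mask = change_mask(before, after)
```

In the approach described above, the clustering would run on the fusion image rather than on the raw difference image.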

An Efficient and Robust Visual Tracking via Color-Based Context Prior Model

Real-time object tracking has been widely applied to time-critical multimedia fields such as surveillance and human-computer interaction. It is a challenge to balance accuracy and speed in tracking. The Spatio-Temporal Context tracker (STC) formulates the spatio-temporal relationship between the object and its surrounding background, and achieves good performance in accuracy and speed. However, its context prior model uses only the grayscale feature, which is not very effective. When the target is not distinct from the context, or the context contains a distractor similar to the target, the STC tracker drifts from the target. To solve this problem, we exploit the standard color histograms of the context to build a discriminative context prior model. More specifically, we use an effective lookup table to compute the prior context model at a low computational cost. Finally, extensive experiments on challenging sequences show the effectiveness and robustness of our proposed method.

Chunlei Yu, Baojun Zhao, Zengshuo Zhang, Maowen Li

Non-local Gradient Minimization Filter and Its Applications for Depth Image Upsampling

In this work, we propose a non-local $L_{0}$ gradient minimization filter. The non-local idea is to restore an unknown pixel using other similar pixels, and the non-local gradient model has been shown to preserve features and structures. We introduce the non-local idea into an $L_{0}$ gradient minimization approach, which is effective at preserving major edges while eliminating low-amplitude structures. An optimization framework is designed for this purpose. Many optimization-based filters lack the joint filtering property, so they cannot be used in problems such as joint denoising and joint upsampling. The proposed filter not only inherits the advantages of the $L_{0}$ gradient minimization filter but also has the joint filtering property, so it can be applied to joint super-resolution. Guided by a high-resolution image, we upsample the low-resolution depth image with the proposed filter. Experimental results demonstrate the effectiveness of our method both qualitatively and quantitatively compared with the state-of-the-art methods.

Hang Yang, Xueqi Sun, Ming Zhu, Kun Wu

Salient Object Detection via Google Image Retrieval

Among the trillions of images available online, there likely exist images whose content is visually similar to the query image (source image). Based on this observation, we propose a novel approach for salient object detection with the help of retrieved images returned by Google image search. We take the regional saliency of the source image as the frequency of occurrences in the retrieved images. The procedure of our saliency estimation approach is as follows. First, given a query (source image), we extract N similar images from the results returned by Google. Then, we conduct matching between the source image and the extracted retrieval images in order to detect the repetitive region among them. Simultaneously, we segment the source image into several distinctive regions using superpixel segmentation and fusion. Finally, we derive the saliency map from the matching on the segmented regions. Experimental results demonstrate that, compared with other methods, the proposed approach consistently achieves higher saliency detection performance in terms of subjective observations and objective evaluations.

Weimin Tan, Bo Yan

Abnormal Gait Detection in Surveillance Videos with FFT-Based Analysis on Walking Rhythm

For abnormal gait detection in surveillance videos, existing methods cannot recognize novel types of anomalies whose prototypes were not included in the training data for supervised machine learning, yet it is impractical to foresee all types of anomalies. This research aims to solve the problem in an unsupervised manner that does not rely on any prior knowledge of abnormal prototypes and avoids time-consuming machine learning over large-scale, high-dimensional features. The intuition is that normal gait is a nearly periodic signal and that anomalies may disturb this periodicity. Hence, the time-varying width-to-height ratio of a walking person is transformed to the frequency domain using the Fast Fourier Transform (FFT), and the standard deviation over the spectrum is used as an indicator of anomalies, which is sensitive to any sudden change that breaks the normally periodic walking rhythm. The experimental results demonstrate its precision.

Anqin Zhang, Su Yang, Xinfeng Zhang, Jiulong Zhang, Weishan Zhang
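The indicator described in the abstract above, the standard deviation over the spectrum of the width-to-height ratio sequence, can be sketched as follows. This is an illustration, not the authors' implementation: a naive DFT stands in for the FFT to keep the example dependency-free, the mean is removed before the transform as an assumption, and both ratio sequences are synthetic. How the indicator is thresholded is not stated in the abstract and is left open here.

```python
import cmath
import math


def magnitude_spectrum(signal):
    """Normalised DFT magnitudes of a real signal after removing its mean."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]
    mags = []
    for k in range(n // 2 + 1):  # non-negative frequencies only
        coeff = sum(c * cmath.exp(-2j * math.pi * k * t / n)
                    for t, c in enumerate(centered))
        mags.append(abs(coeff) / n)
    return mags


def spectrum_std(signal):
    """Population standard deviation of the magnitude spectrum."""
    mags = magnitude_spectrum(signal)
    mean = sum(mags) / len(mags)
    return math.sqrt(sum((m - mean) ** 2 for m in mags) / len(mags))


# Synthetic width-to-height ratios: 64 frames covering 4 regular gait cycles.
periodic = [0.4 + 0.1 * math.sin(2 * math.pi * 4 * t / 64) for t in range(64)]
# The same sequence with an abrupt, made-up disturbance in the last 16 frames.
disturbed = periodic[:48] + [r + 0.3 for r in periodic[48:]]

indicator_periodic = spectrum_std(periodic)
indicator_disturbed = spectrum_std(disturbed)
```

For the regular sequence, all spectral energy sits in the bin of the gait frequency; the disturbance spreads energy into other bins, which changes the statistic.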

Uncooperative Gait Recognition Using Joint Bayesian

Human gait, as a soft biometric, helps to recognize people by the way they walk without requiring subject cooperation. In this paper, we propose a more challenging uncooperative setting, in which the views of the gallery and probe are both unknown and mixed up. Joint Bayesian is adopted to model the view variance. We conduct experiments to evaluate the effectiveness of Joint Bayesian under the proposed uncooperative setting on the OU-ISIR Large Population Dataset (OULP) and the CASIA-B Dataset (CASIA-B). The results confirm that Joint Bayesian significantly outperforms the state-of-the-art methods on both identification and verification tasks, even when the training subjects are different from the test subjects. For further comparison, the uncooperative protocol, experimental results, learning models, and test codes are available.

Chao Li, Kan Qiao, Xin Min, Xiaoyan Pang, Shouqian Sun

Pedestrian Detection via Structure-Sensitive Deep Representation Learning

Pedestrian detection is a fundamental task in a wide range of computer vision applications. Detecting the head-shoulder appearance is an attractive approach to pedestrian detection, especially in scenes with crowds, heavy occlusion or large camera tilt angles. However, the head-shoulder part contains less information than the full human body, so better feature extraction is required to ensure effective detection. This paper proposes a head-shoulder detection method based on the convolutional neural network (CNN). According to the characteristics of the head and shoulders, our method integrates a structure-sensitive ROI pooling layer into the CNN architecture. The proposed CNN is trained in a multi-task scheme with classification and localization outputs. Furthermore, the convolutional layers of the network are pre-trained using a triplet loss to capture better features of the head-shoulder appearance. Extensive experimental results demonstrate that the average accuracy of the proposed method is 89.6% when the IoU threshold is 0.5. Our method obtains results close to those of the state-of-the-art method Faster R-CNN while outperforming it in speed. Even when the number of extracted candidate regions increases, the increase in detection time is negligible. In addition, when the IoU threshold is greater than 0.6, the average accuracy of our method is higher than that of Faster R-CNN, which indicates that our results have higher IoU with the ground truth.

Deliang Huang, Shijia Huang, Hefeng Wu, Ning Liu
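The accuracy figures in the abstract above depend on an IoU (intersection over union) threshold. As a reminder of what that threshold measures, here is a minimal sketch of the standard IoU computation for axis-aligned boxes. The `(x_min, y_min, x_max, y_max)` box layout, the helper names and the example boxes are assumptions made for illustration, not details from the paper.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x_min, y_min, x_max, y_max) with positive area."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def is_match(predicted, ground_truth, threshold=0.5):
    """A detection counts as correct when its IoU reaches the threshold."""
    return iou(predicted, ground_truth) >= threshold


# Hypothetical head-shoulder boxes: the prediction is shifted 2 pixels to the right.
ground_truth = (0, 0, 10, 10)
predicted = (2, 0, 12, 10)
overlap = iou(predicted, ground_truth)  # 80 / 120
```

The same detection passes at a threshold of 0.5 but fails at 0.7, which is why accuracy at higher thresholds says more about localization quality.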

Two-Stage Saliency Fusion for Object Segmentation

This paper proposes an effective two-stage saliency fusion method to generate a fusion map, which is used as a prior for object segmentation. Given multiple saliency maps generated by different saliency models, the first stage produces two fusion maps based on average and min-max statistics, respectively. The second stage performs the Fourier transform (FT) on the two fusion maps and combines the amplitude spectrum of the average fusion map with the phase spectrum of the min-max fusion map to reform the spectrum; the final fusion map is obtained by applying the inverse FT to the reformed spectrum. Finally, object segmentation is performed with graph cut, using the final fusion map as a prior. Extensive experiments on three public datasets demonstrate that the proposed method achieves better object segmentation performance than using individual saliency maps or other fusion methods.

Guangling Sun, Jingru Ren, Zhi Liu, Wei Shan
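The second stage described in the abstract above, taking the amplitude spectrum from one map and the phase spectrum from another before inverting the transform, can be sketched as below. This is a hedged illustration only: the two first-stage fusion maps are taken as given inputs because the abstract does not define the min-max statistic, a naive 2-D DFT is used so the example needs no external library, and the 2 × 2 maps are made up.

```python
import cmath
import math


def dft2(img, inverse=False):
    """Naive 2-D DFT of a list-of-lists image (suitable for tiny demo maps only)."""
    rows, cols = len(img), len(img[0])
    sign = 1 if inverse else -1
    out = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0j
            for y in range(rows):
                for x in range(cols):
                    angle = sign * 2 * math.pi * (u * y / rows + v * x / cols)
                    total += img[y][x] * cmath.exp(1j * angle)
            out[u][v] = total / (rows * cols) if inverse else total
    return out


def fuse_spectra(average_map, minmax_map):
    """Combine the amplitude of average_map's spectrum with the phase of minmax_map's."""
    spec_avg, spec_mm = dft2(average_map), dft2(minmax_map)
    reformed = [[abs(a) * cmath.exp(1j * cmath.phase(m)) for a, m in zip(row_a, row_m)]
                for row_a, row_m in zip(spec_avg, spec_mm)]
    # Both inputs are real, so the reformed spectrum inverts to a real map
    # up to rounding error; keep the real part.
    return [[value.real for value in row] for row in dft2(reformed, inverse=True)]


# Made-up first-stage fusion maps.
average_map = [[0.5, 0.5], [0.5, 0.5]]
minmax_map = [[0.1, 0.9], [0.8, 0.2]]
final_map = fuse_spectra(average_map, minmax_map)
```

A quick sanity check of the construction: when both inputs are the same map, amplitude and phase come from the same spectrum and the output reproduces the input.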

Recognition of Offline Handwritten Mathematical Symbols Using Convolutional Neural Networks

This paper presents a Convolutional Neural Network (CNN) method for recognizing offline handwritten mathematical symbols. We propose a CNN model called HMS-VGGNet, which applies Batch Normalization and Global Average Pooling and uses only very small convolutional filters, specifically 1 × 1 and 3 × 3. HMS-VGGNet uses only offline features of the symbols and achieves state-of-the-art accuracy on the Competition on Recognition of Online Handwritten Mathematical Expressions (CROHME) 2014 test set and the HASYv2 dataset. On the CROHME 2016 test set, our result is only 0.39% lower than that of the CROHME 2016 winner, which used both online and offline features. The models proposed in this paper are accurate yet slim, as our experiments show.

Lanfang Dong, Hanchao Liu

4D ISIP: 4D Implicit Surface Interest Point Detection

In this paper, we propose a new method for detecting 4D spatiotemporal interest points, called 4D-ISIP (4-dimensional implicit surface interest point). We implicitly represent the 3D scene by a 3D volume that stores a truncated signed distance function (TSDF) value in every voxel. The TSDF represents the distance between a spatial point and the object surface, which is a kind of implicit surface representation. The basic idea of 4D-ISIP detection is to detect the points whose local neighborhood has significant variations along both spatial and temporal dimensions. To test our 4D-ISIP detection, we built a system to acquire a 3D human motion dataset using only one Kinect. Experimental results show that our method can detect 4D-ISIPs for different human actions.

Shirui Li, Alper Yilmaz, Changlin Xiao, Hua Li

Integrative Embedded Car Detection System with DPM

In this paper, an embedded system integrating a CPU and an FPGA is presented for car detection with an improved Deformable Part Model (DPM). Original images are processed into a layered multi-resolution HOG feature pyramid on the CPU, transmitted to the FPGA for fast convolution operations, and finally returned to the CPU for statistical matching and display. Based on the architecture of the DPM algorithm and the hardware characteristics of the embedded system, the overall algorithm framework is simplified and optimized. According to mathematical derivation and statistical rules, the feature dimensions and the number of pyramid levels of the model are reduced without sacrificing accuracy, which effectively reduces the amount of calculation and data transmission. The parallel processing and pipeline design of the FPGA are fully exploited to accelerate the convolution computation, which significantly reduces the running time of the program. Several experiments have been carried out on visible images from an unmanned aerial vehicle’s view and from a driver assistance scene, and infrared images captured from an overlooking perspective are also tested and analyzed. The results show that the system achieves good real-time performance and accuracy in different situations.

Wei Zhang, Lian-fa Bai, Yi Zhang, Jing Han

Online Fast Deep Learning Tracker Based on Deep Sparse Neural Networks

Deep learning can learn robust and powerful feature representations from data and has gained significant attention in visual tracking tasks. However, due to their high computational complexity and time-consuming training process, most existing deep learning based trackers require offline pre-training on a large-scale dataset and have low tracking speeds. To address these difficulties, we propose an online deep learning tracker based on Sparse Auto-Encoders (SAE) and the Rectified Linear Unit (ReLU). By combining ReLU with SAE, the deep neural networks (DNNs) obtain sparsity similar to that of DNNs with offline pre-training. This inherent sparsity frees the deep model from the complex pre-training process and makes it well suited to online-only tracking. Meanwhile, data augmentation is applied to the single positive sample to balance the quantities of positive and negative samples, which improves the stability of the model to some extent. Finally, to overcome the randomness and drift of the particle filter, we adopt a local dense sampling search method to generate a local confidence map for locating the target’s position. Moreover, several corresponding update strategies are proposed to improve the robustness of the proposed tracker. Extensive experimental results show the effectiveness and robustness of the proposed tracker in challenging environments against state-of-the-art methods. The proposed tracker not only dispenses with the complicated and time-consuming pre-training process but also achieves fast and robust online tracking.

Xin Wang, Zhiqiang Hou, Wangsheng Yu, Zefenfen Jin

Affine-Gradient Based Local Binary Pattern Descriptor for Texture Classification

We present a novel Affine-Gradient based Local Binary Pattern (AGLBP) descriptor for texture classification. It is very hard to describe complicated textures using a single type of information, as in the Local Binary Pattern (LBP), which uses only the sign of the difference between a pixel and its local neighbors. Our descriptor has three characteristics: (1) To make full use of the information contained in the texture, the Affine-Gradient, which differs from the Euclidean-Gradient and is invariant to affine transformation, is incorporated into AGLBP. (2) An improved method for rotation invariance is proposed, which depends on a reference direction calculated with respect to the local neighbors. (3) A feature selection method that considers both the statistical frequency and the intraclass variance of the training dataset is applied to reduce the dimensionality of the descriptors. Experiments on three standard texture datasets, Outex12, Outex10 and KTH-TIPS2, are conducted to evaluate the performance of AGLBP. The results show that the proposed descriptor performs better than some state-of-the-art rotation-invariant texture descriptors in texture classification.

You Hao, Shirui Li, Hanlin Mo, Hua Li
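For readers unfamiliar with the baseline mentioned in the abstract above, the sketch below computes the basic 8-neighbour LBP code of a 3 × 3 patch, which keeps only the sign of each neighbour-minus-centre difference. It illustrates plain LBP, not the proposed AGLBP; the neighbour ordering and the sample patch are arbitrary choices made for this example.

```python
def lbp_code(patch):
    """Basic 8-neighbour LBP code of the centre pixel of a 3x3 patch."""
    center = patch[1][1]
    # Clockwise neighbour order starting at the top-left pixel
    # (any fixed order works; this one is an arbitrary choice).
    offsets = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0), (1, 0)]
    code = 0
    for bit, (r, c) in enumerate(offsets):
        if patch[r][c] >= center:  # only the sign of the difference is kept
            code |= 1 << bit
    return code


# Made-up grayscale patch.
patch = [[6, 5, 2],
         [7, 6, 1],
         [9, 8, 7]]
code = lbp_code(patch)

# Adding a constant brightness offset leaves every sign, and so the code, unchanged.
brighter = [[value + 10 for value in row] for row in patch]
```

Because the magnitude of each difference is discarded, two patches with very different contrast can share a code, which is the single-information limitation the abstract points to.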

Deep Convolutional Neural Network for Facial Expression Recognition

In this paper, a deep convolutional neural network model and transfer learning are used to address facial expression recognition (FER). First, transfer learning is adopted to convert a face recognition network into a facial expression recognition network. Then, to enhance the classification ability of the proposed model, a modified Softmax loss function (Softmax-MSE) and a double activation layer (DAL) are proposed. We performed experiments on the enhanced SFEW2.0 dataset and the FER2013 dataset, achieving overall classification accuracies of 48.5% and 59.1%, respectively, which is state-of-the-art performance.

Yikui Zhai, Jian Liu, Junying Zeng, Vincenzo Piuri, Fabio Scotti, Zilu Ying, Ying Xu, Junying Gan

A New Framework for Removing Impulse Noise in an Image

The nonlocal means filter (NLMF) and sparse representation based denoising both perform remarkably well in image denoising. To combine the advantages of the two methods, a new image denoising framework is proposed. In this framework, the image containing impulse noise is first processed by NLMF to obtain a good temporary denoised image. From it, a number of patches are extracted to train a redundant dictionary adapted to the target signal. Then, each noisy image patch, in which the impulse noise is replaced by values from the temporary denoised image, is coded sparsely over the dictionary. Finally, a clean image patch is reconstructed by multiplying the redundant dictionary by the coding coefficients. Extensive experiments verify that this denoising framework not only performs better than NLMF or sparse representation used individually, but also brings an obvious improvement in denoising texture images.

Zhou Yingyue, Xu Su, Zang Hongbin, He Hongsen

Joint Classification Loss and Histogram Loss for Sketch-Based Image Retrieval

We study the problem of content-based image retrieval using hand-drawn sketches. The problem is very challenging because the low-level visual features of sketches and images have a large variance. Recent studies show that learning deep features that utilize high-level supervision is a feasible solution to this problem. We propose a new network structure with a joint loss that combines a simple classification loss with a robust histogram loss to learn better deep features for both sketches and images. The joint loss method has nearly no parameters to tune; it can not only learn the difference between image/sketch samples from different semantic classes but also capture the fine-grained similarity between image/sketch samples in the same semantic class. In the experiments, we show that the proposed method obtains excellent performance in real time on the standard sketch-based image retrieval benchmark.

Yongluan Yan, Xinggang Wang, Xin Yang, Xiang Bai, Wenyu Liu

Correlation Based Identity Filter: An Efficient Framework for Person Search

Person search, which addresses the problem of re-identifying a specific query person in whole candidate images without bounding boxes in real-world scenarios, is a new topic in computer vision with many meaningful applications and has attracted much attention. However, it is inherently challenging because annotations of pedestrian bounding boxes are unavailable and the target person has to be found in the whole gallery images. The existence of many visually similar people and the dramatic appearance changes of the same person arising from large cross-camera variations, such as illumination, viewpoint, occlusion and background clutter, also lead to failures in searching for a query person. In this work, we design a Correlation based Identity Filter (CIF) framework for re-identifying the query person directly from the whole image with high efficiency. A regression model is learnt to obtain a correlation filter/template for a given query person, which helps to alleviate the accumulated error caused by doing detection and re-identification separately. The filter is light and can be obtained and applied to search for the query person at high speed by using Block-Circulant Decomposition (BCD) and Discrete Fourier Transform (DFT) techniques. Extensive experiments show that our method has the important practical benefits of being lightweight and highly efficient when searching for a specific person, and that it achieves better accuracy than doing detection and re-identification separately.

Wei-Hong Li, Yafang Mao, Ancong Wu, Wei-Shi Zheng

An Online Approach for Gesture Recognition Toward Real-World Applications

Action recognition is an important research area in computer vision. Recently, the application of deep learning has greatly promoted the development of action recognition. Many networks have achieved excellent performance on popular datasets, but there is still a gap between research and real-world applications. In this paper, we propose an integrated approach for real-time online gesture recognition, aiming to bring deep learning based action recognition methods into real-world applications. Our integrated approach mainly consists of three parts. (1) A gesture recognition network simplified from two-stream CNNs is trained on optical flow images to recognize gestures. (2) To adapt to complicated and changeable real-world environments, target detection and tracking are applied to get a stable target bounding box and eliminate environmental disturbances. (3) Improved optical flow is introduced to remove global camera motion and get a better description of human motion, which improves gesture recognition performance significantly. The integrated approach is tested on real-world datasets and achieves satisfactory recognition performance while guaranteeing real-time processing speed.

Zhaoxuan Fan, Tianwei Lin, Xu Zhao, Wanli Jiang, Tao Xu, Ming Yang

A Novel Pavement Crack Detection Approach Using Pre-selection Based on Transfer Learning

Most existing methods for crack detection in pavement images cannot effectively solve the noise problem caused by complicated pavement textures and intensity inhomogeneity. In this paper, we propose a novel, fully automatic crack detection approach that incorporates a pre-selection process. It starts by dividing images into small blocks and training a deep convolutional neural network to screen out the non-crack regions of a pavement image, which usually cause a lot of noise and errors during crack detection; then an efficient thresholding method based on linear regression is applied to the crack-block regions to find the possible crack pixels; finally, tensor voting-based curve detection is employed to fill the gaps between crack fragments and produce continuous crack curves. We validate the approach on a dataset of 600 (2000 × 4000-pixel) pavement images. The experimental results demonstrate that, with pre-selection, the proposed detection approach achieves very good performance (recall = 0.947, and precision = 0.846).

Kaige Zhang, Hengda Cheng
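The abstract above reports recall and precision separately. For comparison with work that quotes a single figure, the two can be combined into the standard F1 score (their harmonic mean), as in the sketch below. The F1 value is derived here from the reported numbers and is not a result stated by the authors.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
f1 = f1_score(precision=0.846, recall=0.947)
```

With these inputs the score comes to approximately 0.894.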

Learning Local Instance Constraint for Multi-label Classification

Compared to single-label image classification, multi-label image classification outputs an unknown number of objects of different categories for an input image. For image-label relevance in multi-label classification, how to combine local information about objects with global information about label representation is still a challenging problem. In this paper, we propose an end-to-end Convolutional Neural Network (CNN) based method to address this problem. First, we use a CNN to extract hierarchical features of input images, and the dilated convolution operator is adopted to expand receptive fields without additional parameters compared to the common convolution operator. Then, one loss function is used to model local information of instance activations in convolutional feature maps and another to model global information of label representation. Finally, the CNN is trained end-to-end with a multi-task loss. Experimental results show that the proposed proposal-free single-CNN framework with a multi-task loss can achieve state-of-the-art performance compared with existing methods.

Shang Luo, Xiaofeng Wu, Bin Wang, Liming Zhang

Disparity Refinement Using Merged Super-Pixels for Stereo Matching

Traditional disparity refinement methods cannot obtain highly accurate disparity estimates, especially for pixels around depth boundaries and within low-textured regions. To tackle this problem, two novel stereo refinement strategies are proposed: (1) merging super-pixels into stable regions to maintain continuity and accuracy of the same disparity; (2) optimizing the co-operative relations between adjacent regions. With these strategies, we can obtain high-quality and high-density disparity maps. The quantitative evaluation on the Middlebury benchmark shows that our algorithm can significantly refine the results obtained by local and non-local methods.

Jianyu Heng, Zhenyu Xu, Yunan Zheng, Yiguang Liu

Deep Scale Feature for Visual Tracking

Recently, deep learning methods have been introduced to the field of visual tracking and have achieved promising results owing to the properties of their complex features. However, existing deep learning trackers use pre-trained convolution layers that are discriminative for specific objects. Such layers easily make trackers over-fitted and insensitive to object deformation, which makes a tracker a good locator but not a good scale estimator. In this paper, we propose a deep scale feature and an algorithm for robust visual tracking. In our method, the object scale estimator is built from the lower layers of a deep neural network, and we add a specially trained mask after the convolution layers, which filters out potential noise in the tracking sequence. Then, the scale estimator is integrated into a tracking framework together with a locator built from a powerful deep learning classifier. Furthermore, inspired by correlation filter trackers, we propose an online update algorithm to keep our tracker consistent with the changing object in the tracking video. Experimental results on various challenging public tracking sequences show that our proposed framework is effective and produces state-of-the-art tracking performance.

Wenyi Tang, Bin Liu, Nenghai Yu

TCCF: Tracking Based on Convolutional Neural Network and Correlation Filters

With the rapid development of deep learning in recent years, many trackers based on deep learning have been proposed and have achieved great improvements over traditional methods. However, due to the scarcity of training samples, fine-tuning pre-trained deep models easily leads to over-fitting and is expensive. In this paper, we propose a novel algorithm for online visual object tracking that is divided into two separate parts: target location estimation and target scale estimation. Both are implemented with correlation filters independently, using different feature representations. Instead of fine-tuning pre-trained deep models, we update the correlation filters. We also design the desired output of the correlation filters for every training sample, which makes our tracker perform better. Extensive experiments are conducted on the OTB-15 benchmark, and the results demonstrate that our algorithm outperforms the state of the art by a large margin in terms of accuracy and robustness.

Qiankun Liu, Bin Liu, Nenghai Yu

PPEDNet: Pyramid Pooling Encoder-Decoder Network for Real-Time Semantic Segmentation

Image semantic segmentation is a fundamental problem and plays an important role in computer vision and artificial intelligence. Recent deep neural networks have improved the accuracy of semantic segmentation significantly. Meanwhile, the numbers of network parameters and floating-point operations have also increased notably. Real-world applications not only place high requirements on segmentation accuracy but also demand real-time processing. In this paper, we propose a pyramid pooling encoder-decoder network named PPEDNet for both better accuracy and faster processing speed. Our encoder network is based on VGG16 and discards the fully connected layers because of their large number of parameters. To extract context features efficiently, we design a pyramid pooling architecture. The decoder is a trainable convolutional network that upsamples the output of the encoder and fine-tunes the segmentation details. Our method is evaluated on the CamVid dataset, achieving a 7.214% mIOU accuracy improvement while reducing the number of parameters by 17.9% compared with the state-of-the-art algorithm.

Zhentao Tan, Bin Liu, Nenghai Yu

A New Kinect Approach to Judge Unhealthy Sitting Posture Based on Neck Angle and Torso Angle

Sitting posture is closely related to health, and keeping a correct sitting posture is important for avoiding chronic diseases. However, automatic systems for detecting unhealthy sitting posture are rare, especially ones based on computer vision. This paper proposes a new method for judging unhealthy sitting posture based on neck angle and torso angle detection using a Kinect sensor. The method tracks the neck angle and the torso angle as two representative features from the depth image over a given period of time to judge whether the sitting posture is healthy. Experimental results show that the proposed method judges sitting posture effectively for different types of unhealthy sitting. Compared with existing action recognition methods, our method needs only a Kinect sensor and no wearable sensors, and it is time-efficient and robust because it computes only two angles.
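
As a rough illustration of the two-angle idea in the abstract above, the sketch below measures how far the hip-to-shoulder and shoulder-to-head segments lean away from the vertical axis and flags a time window in which too many frames exceed a limit. The joint names, the y-up coordinate convention, the angle definitions and all thresholds are assumptions made for this example; the abstract does not specify them.

```python
import math

def angle_from_vertical(lower, upper):
    """Angle in degrees between the segment lower->upper and the +y (up) axis."""
    vx, vy, vz = (u - l for u, l in zip(upper, lower))
    norm = math.sqrt(vx * vx + vy * vy + vz * vz)
    if norm == 0.0:
        raise ValueError("joints coincide")
    cos_theta = max(-1.0, min(1.0, vy / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos_theta))

def is_unhealthy(frames, neck_limit=20.0, torso_limit=15.0, min_fraction=0.5):
    """frames: list of dicts with 3-D 'hip', 'shoulder' and 'head' positions.

    The limits (degrees) and min_fraction are hypothetical values.
    """
    if not frames:
        raise ValueError("no frames given")
    bad = 0
    for f in frames:
        torso = angle_from_vertical(f["hip"], f["shoulder"])
        neck = angle_from_vertical(f["shoulder"], f["head"])
        if neck > neck_limit or torso > torso_limit:
            bad += 1
    # Unhealthy if enough frames in the window violate either limit.
    return bad / len(frames) >= min_fraction
```

Judging over a window of frames rather than a single frame keeps a brief lean from being reported as an unhealthy posture.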

Leiyue Yao, Weidong Min, Hao Cui

Using Stacked Auto-encoder to Get Feature with Continuity and Distinguishability in Multi-object Tracking

Good feature expression of targets plays an important role in multi-object tracking (MOT). Inspired by the self-learning concept of deep learning methods, this paper proposes an online feature extraction scheme based on a conditional random field (CRF). The CRF model is transformed into a certain number of multi-scale stacked auto-encoders with a new loss function. Features obtained with our method contain both continuous and distinguishable characteristics of targets. The inheritance relationship of stacked auto-encoders between adjacent frames is implemented by an online process. Features extracted by our online scheme are applied to improve the network flow tracking model. Experimental results show that the features obtained by our method achieve better performance than other handcrafted features. The overall tracking performance is improved when our features are used in MOT tasks.

Haoyang Feng, Xiaofeng Li, Peixin Liu, Ning Zhou

Vehicle Detection Based on Superpixel and Improved HOG in Aerial Images

This paper introduces a method for vehicle detection in aerial images. Considering the stable scale and structural features of vehicles in aerial images, the method is divided into three steps: region extraction based on SLIC and contour shrinking, feature extraction with an improved HOG, and classification using an SVM with an RBF kernel. In contrast to the original HOG algorithm, the proposed method employs SLIC to reduce the time cost of the sliding window. A patch acquired from SLIC may contain both a vehicle and other objects. To ensure that the pixels used for feature extraction in each patch belong to the same category, the contour of each patch is shrunk towards its own center after the SLIC segmentation. A 31-dimensional HOG-based feature is used as the feature descriptor. The average time cost of the proposed method is 20.7 ms. The detection results shown in the experiments indicate the effectiveness of the proposed method.

Enlai Guo, Lianfa Bai, Yi Zhang, Jing Han

Scalable Object Detection Using Deep but Lightweight CNN with Features Fusion

Recently, deep Convolutional Neural Networks (CNNs) have become increasingly popular in pattern recognition and have achieved impressive performance on multi-category datasets. Most object detection systems, such as Fast R-CNN and Faster R-CNN, consist of three main parts: CNN feature extraction, region proposal and ROI classification. In this paper, a deep but lightweight CNN with feature fusion is presented, and our work focuses on improving the feature extraction part of the Faster R-CNN framework. Inspired by recent structural innovations such as Inception, HyperNet and multi-scale construction, the proposed network achieves lower computational cost despite its considerable depth. In addition, the network is trained with the help of data augmentation, fine-tuning and batch normalization. To achieve scalability through feature fusion, different sampling methods are used for different layers, and kernels of various sizes extract both global and local features. These features are then fused together so that objects of diverse sizes can be handled. The experimental results show that our method achieves better performance than Faster R-CNN with VGG16 on the VOC2007, VOC2012 and KITTI datasets while maintaining the original speed.

Qiaosong Chen, Shangsheng Feng, Pei Xu, Lexin Li, Ling Zheng, Jin Wang, Xin Deng

Object Tracking with Blocked Color Histogram

Object deformation and blur are challenging problems in visual object tracking. Most existing methods either increase the generalization of the features to reduce their sensitivity to spatial structure, or combine statistical features with spatial structure features. This paper presents a novel approach, the blocked color histogram (BCH), which adds structural characteristics to color histograms in order to increase the robustness of color-histogram-based trackers, especially under deformation or blur. The proposed approach computes the color histogram of every block extracted from a given box. We strengthen the structural characteristics by separating the whole box into several parts, tracking with the color histograms of the individual parts, and then weighting the results; the results show that this improves performance compared with methods that use the whole color histogram. We also use a double-layer structure to speed up the method while keeping the necessary accuracy. The proposed method obtains good scores on VOT2015 and VOT2016.
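
To make the block-wise idea concrete, here is a minimal sketch that splits a box into a grid, builds one quantised color histogram per block, and combines per-block histogram intersections with weights. The grid size, bin count, intersection measure and uniform default weights are assumptions chosen for illustration, not details taken from the abstract.

```python
def block_histograms(patch, grid=2, bins=4):
    """patch: H x W list of (r, g, b) tuples with channel values in 0..255.

    Returns one normalised colour histogram per block, in row-major order.
    """
    h, w = len(patch), len(patch[0])
    hists = []
    for by in range(grid):
        for bx in range(grid):
            hist = [0.0] * (bins ** 3)
            for y in range(by * h // grid, (by + 1) * h // grid):
                for x in range(bx * w // grid, (bx + 1) * w // grid):
                    r, g, b = patch[y][x]
                    # Quantise each channel into `bins` levels.
                    qr, qg, qb = r * bins // 256, g * bins // 256, b * bins // 256
                    hist[(qr * bins + qg) * bins + qb] += 1.0
            total = sum(hist) or 1.0
            hists.append([v / total for v in hist])
    return hists

def blocked_similarity(hists_a, hists_b, weights=None):
    """Weighted sum of per-block histogram intersections (1.0 means identical)."""
    n = len(hists_a)
    weights = weights or [1.0 / n] * n  # uniform weights unless supplied
    scores = [sum(min(p, q) for p, q in zip(ha, hb))
              for ha, hb in zip(hists_a, hists_b)]
    return sum(wt * s for wt, s in zip(weights, scores))
```

With `grid=1` this reduces to a single whole-box histogram, which cannot tell a red-left/blue-right box from its mirror image; with `grid=2` the two boxes no longer match, which is the structural information the blocks add.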

Xiaoyu Chen, Lianfa Bai, Yi Zhang, Jing Han

Deep Networks for Single Image Super-Resolution with Multi-context Fusion

Deep convolutional neural networks have been successfully applied to image super-resolution (SR). In this paper, we propose a multi-context fusion learning based super-resolution model that exploits context information from both smaller and larger image regions. To reduce execution time, our method takes the low-resolution image directly (not an interpolated version) as input in both training and testing, and also incorporates a residual network. The proposed model is extensively evaluated and compared with state-of-the-art SR methods, and the experimental results demonstrate its performance in speed and accuracy.

Zheng Hui, Xiumei Wang, Xinbo Gao

Chinese Handwriting Generation by Neural Network Based Style Transformation

This paper proposes a novel learning-based approach to generate personal style handwritten characters. Given some training characters written by an individual, we first calculate the deformation of corresponding points between the handwritten characters and standard templates, and then learn the transformation of stroke trajectory using a neural network. The transformation can be used to generate handwritten characters of personal style from standard templates of all categories. In training, we use shape context features as predictors, and regularize the distortion of adjacent points for shape smoothness. Experimental results on online Chinese handwritten characters show that the proposed method can generate personal-style samples which appear to be naturally written.

Bi-Ren Tan, Fei Yin, Yi-Chao Wu, Cheng-Lin Liu

Text Detection Based on Affine Transformation

Text detection and recognition play an important role in many computer vision based systems, since text provides explicit content information. In natural scenes, variations in scale, rotation and position are the main challenges for text detection and recognition algorithms. Thus, most text recognition algorithms require a rectified text region. In this paper, we propose a text detection method that provides an accurate text region. The affine parameters are estimated from the external quadrilateral of the text region, and the distorted text region is then rectified according to these parameters. The proposed method provides more accurate detection results for text regions and, in addition, enhances the performance of text recognition. The experiments show the effectiveness of the proposed method.

Xiaoyue Jiang, Jing Liu, Yu Liu, Lin Zhang, Xiaoyi Feng

Visual Servoing of Mobile Robots with Input Saturation at Kinematic Level

This paper addresses the eye-in-hand visual servoing problem of mobile robots with velocity input saturation. A class of continuous and bounded functions is applied for the saturated visual servoing control design. The asymptotic convergence of the pose errors to zero is proven using Lyapunov techniques and LaSalle’s invariance principle. Simulation results are provided to show that the proposed controller can stabilize the mobile robot to the desired pose under the velocity saturation constraints.

Runhua Wang, Xuebo Zhang, Yongchun Fang, Baoquan Li

Fabric Defect Detection Algorithm Based on Multi-channel Feature Extraction and Joint Low-Rank Decomposition

Fabric defect detection plays an important role in the quality control of fabric products. In order to effectively detect defects in fabric images with numerous kinds of defects and complex texture, a novel fabric defect detection algorithm based on multi-channel feature extraction and joint low-rank decomposition is proposed. First, at the feature extraction stage, a multi-channel robust feature (Multi-channel Distinctive Efficient Robust Feature, McDerf) is extracted by simulating the biological visual perception mechanism for multiple gradient orientation maps. Second, a joint low-rank decomposition algorithm is adopted to decompose the feature matrix into a low-rank matrix and a sparse matrix. Finally, to localize the defect region, a threshold segmentation algorithm is used to segment the saliency map generated from the sparse matrix. Compared with existing fabric defect detection algorithms, the experimental results show that the proposed algorithm has better adaptability and detection efficiency for plain and patterned fabric images.

Chaodie Liu, Guangshuai Gao, Zhoufeng Liu, Chunlei Li, Yan Dong

Robust 3D Indoor Map Building via RGB-D SLAM with Adaptive IMU Fusion on Robot

Building a 3D map of an indoor environment is a prerequisite for various applications, ranging from service robots to augmented reality, where RGB-D SLAM is a commonly used technique. To efficiently and robustly build a 3D map via RGB-D SLAM on a robot, that is, with the RGB-D sensor mounted on a moving robot, the following two key issues must be addressed: how to reliably estimate the robot’s pose to align partial models on the fly, and how to design the robot’s movement patterns in large environments to effectively reduce error accumulation and increase building efficiency. To address these two issues, we propose an algorithm that adaptively fuses the IMU information with visual tracking for the first issue, and design two robot movement patterns for the second. Preliminary experiments on a TurtleBot2 robot platform show that our RGB-D SLAM system works well even in difficult situations such as weakly textured spaces or the presence of pedestrians.

Xinrui Meng, Wei Gao, Zhanyi Hu

The Research of Multi-angle Face Detection Based on Multi-feature Fusion

A multi-angle face detection method based on the fusion of Haar-like, HOG and MB-LBP features is proposed. Firstly, three AdaBoost cascade classifiers for initial face region detection are constructed from the Haar-like, MB-LBP and HOG features respectively, and are trained on preprocessed face and non-face samples. Secondly, each test sample is preprocessed with a skin color model and the result is fed to the classifiers, which yield the suspected face regions and their weights. Finally, the refined face regions are selected according to the results of voting and a weighted threshold. The proposed method is implemented on the VS2012 platform using the OpenCV library, and the simulation experiment is carried out on FDDB. To verify its effect, our method is compared with methods based on a single feature. Experimental results show that the proposed method has higher accuracy and better real-time performance.

Mengnan Hu, Yongkang Liu, Rong Wang

Attention-Sharing Correlation Learning for Cross-Media Retrieval

Cross-media retrieval is a challenging research topic with wide application prospects, aiming to retrieve data across different media types using a single-media query. The main challenge of cross-media retrieval is to learn the correlation between different media types in order to address the “media gap”. Close semantic correlation usually lies in specific parts of cross-media data such as image and text, which play the key role in precise correlation mining. However, existing works usually focus on correlation learning at the level of the whole media instance, or adopt patch segmentation but treat the patches indiscriminately. They ignore fine-grained discrimination learning, which limits retrieval accuracy. Inspired by the attention mechanism, this paper proposes the attention-sharing correlation learning network, an end-to-end network that generates a cross-media common representation for retrieval. By sharing common attention weights, the attention of different media types can be learned coordinately. This not only emphasizes the discriminative parts within a single medium but also enhances the fine-grained consistent patterns across media, and thus learns more precise cross-media correlation to improve retrieval accuracy. Experimental comparisons with state-of-the-art methods on 2 widely-used datasets verify the effectiveness of the proposed approach.

Xin Huang, Zhaoda Ye, Yuxin Peng

A Method for Extracting Eye Movement and Response Characteristics to Distinguish Depressed People

Eye movement is an important characteristic in the field of image processing and psychology, which reflects people’s attention bias. How to design a paradigm with the function of psychological discrimination to extract significant eye movement characteristic is a challenging task. In this paper, we present a novel psychology evaluation system with eye tracking. Negative and positive background images from IAPS and Google are chosen based on the Minnesota Multiphasic Personality Inventory (MMPI). Meanwhile, negative and positive face images are used as emotional foreground. The location of the face images is shown on the left or right randomly. In this paradigm, people with different psychological status have different characteristics of eye movement length, fixation points and response time. The experimental results show that these characteristics have significant discriminability and can be used to distinguish depressed and normal people effectively.

Chao Le, Huimin Ma, Yidong Wang

Long-Distance/Environment Face Image Enhancement Method for Recognition

With increasing distance and under the influence of environmental factors such as illumination and haze, face recognition accuracy is significantly lower than for indoor close-up images. To solve this problem, an effective face image enhancement method is proposed in this paper. The algorithm is a nonlinear transformation that combines gamma and logarithm transformations, and it is therefore called G-Log. The G-Log algorithm can perform the following functions: (1) eliminate the influence of illumination; (2) increase image contrast and equalize the histogram; (3) restore high-frequency components and detailed information; (4) improve the visual effect; (5) enhance recognition accuracy. Given a probe image, the procedure of face alignment, enhancement and matching is executed against all gallery images. To compare the effects of different enhancement algorithms, all probe images are processed by the different enhancement methods but by identical face alignment and recognition modules. Experimental results show that the G-Log method achieves the best effect in both matching accuracy and visual effect. The accuracy of long-distance face recognition in uncontrolled environments is greatly improved: after processing by G-Log it reaches 98%, 98% and 95% for 60-, 100- and 150-m images, up from the original 95%, 89% and 70%.
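
The abstract above does not give the G-Log formula, so the following is only a generic sketch of how a gamma correction and a logarithmic mapping can be chained into one nonlinear intensity transform. The composition order, the function name `g_log_like` and the parameters `gamma` and `c` are assumptions made for illustration and should not be read as the published algorithm.

```python
import math

def g_log_like(pixel, gamma=0.6, c=10.0):
    """Map an 8-bit intensity through a gamma step followed by a log step.

    Hypothetical composition: y = log(1 + c * x**gamma) / log(1 + c), x in [0, 1].
    """
    x = min(max(pixel / 255.0, 0.0), 1.0)
    g = x ** gamma                          # gamma < 1 lifts dark regions
    y = math.log1p(c * g) / math.log1p(c)   # log step compresses highlights
    return round(255.0 * y)

def enhance(image, gamma=0.6, c=10.0):
    """Apply the transform to a grayscale image given as a list of rows."""
    return [[g_log_like(p, gamma, c) for p in row] for row in image]
```

The mapping keeps black at 0 and white at 255 and is monotonic in between, so it changes contrast without reordering intensities.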

Zhengning Wang, Shanshan Ma, Mingyan Han, Guang Hu, Shuaicheng Liu

A Method on Recognizing Transmission Line Structure Based on Multi-level Perception

The structure of a transmission line can be recognized by processing images captured during unmanned aerial vehicle (UAV) power inspection. This can be applied to vision-based intelligent UAV inspection and to further analysis for fault diagnosis of transmission lines. To this end, a multi-level perception-based method for transmission line structure recognition is proposed. Firstly, the extracted line segments are split at key points and then merged according to Gestalt perception theory to obtain relatively complete and independent local contours. Next, areas of parallel line segments and of symmetrical and crossing line segments are perceived, and a position restraint mechanism for the transmission line structure is built for preliminary recognition. Finally, the local contour features are used for further recognition. In the experiments, the false rate and the missed rate of the method are verified to be lower.

Yue Liu, Jianxiang Li, Wei Xu, Mingyang Liu

Saliency Detection Based on Background and Foreground Modeling

In this paper, a novel saliency detection algorithm is proposed to fuse both the background and foreground information while detecting salient objects in complex scenes. Firstly, we extract background seeds as well as their spatial information from image borders to construct a background-based saliency map. Then, an optimal contour closure is selected as the foreground region according to the first-stage saliency map. The optimal contour closure can provide a preferable description for salient object. We compute a foreground-based saliency map using the selected foreground region and integrate it with the background-based one. Finally, the unified saliency map is further refined to obtain a more accurate result. Experimental results show that the proposed algorithm can achieve favorable performance compared to the state-of-the-art ones.

Zhengbing Wang, Guili Xu, Yuehua Cheng, Zhengsheng Wang

Scale Estimation and Refinement in Monocular Visual-Inertial SLAM System

The fusion of monocular visual and inertial cues has become popular in robotics, unmanned vehicles and augmented reality. Recent results have shown that optimization-based fusion strategies outperform filtering ones. The visual-inertial ORB-SLAM is optimization-based and has achieved great success. However, it uses all measurements, including outliers, in IMU initialization, and it lacks a termination criterion. In this paper, we aim to resolve these issues. First, we present an approach that estimates scale, gravity and accelerometer bias together, and regards the estimated gravity as an indicator of estimation convergence. Second, we propose a methodology that uses a weight $w$ derived from the robust norm to handle outliers, so that the estimated scale can be refined. We test our approaches on the public EuRoC datasets. Experimental results show that the proposed methods can achieve good scale estimation and refinement.
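
As an illustration of how a weight derived from a robust norm can down-weight outliers while a scale is refined, the sketch below runs iteratively reweighted least squares on a toy one-parameter model b_i ≈ s · a_i. The Huber norm, the threshold `delta` and the toy model are assumptions for this example; the abstract does not say which robust norm or which residual the authors use.

```python
def huber_weight(residual, delta=1.0):
    """IRLS weight of the Huber norm: 1 inside the threshold, delta/|r| outside."""
    r = abs(residual)
    return 1.0 if r <= delta else delta / r

def refine_scale(a, b, delta=1.0, iterations=10):
    """Fit b_i ~ s * a_i, re-weighting the residuals on every iteration."""
    # Start from the ordinary least-squares estimate.
    s = sum(x * y for x, y in zip(a, b)) / sum(x * x for x in a)
    for _ in range(iterations):
        w = [huber_weight(y - s * x, delta) for x, y in zip(a, b)]
        num = sum(wi * x * y for wi, x, y in zip(w, a, b))
        den = sum(wi * x * x for wi, x in zip(w, a))
        s = num / den
    return s
```

On data that follow s = 2 except for one gross outlier, the plain least-squares start is pulled far from 2, while the reweighted estimate moves back towards it because the outlier's weight shrinks as its residual grows.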

Xufu Mu, Jing Chen, Zhen Leng, Songnan Lin, Ningsheng Huang

Crop Disease Image Recognition Based on Transfer Learning

Machine learning has been widely applied to crop disease image recognition. Traditional machine learning needs to satisfy two basic assumptions: (1) the training and test data follow the same distribution; (2) a large number of labeled training samples is available to learn a reliable classification model. However, in many cases these two assumptions cannot be satisfied. In agriculture, there are not enough labeled crop disease images. To solve this problem, this paper proposes a method that introduces transfer learning to crop disease image recognition. Firstly, the double Otsu method is applied to obtain the spot images of five kinds of cucumber and rice diseases. Then, color, texture and shape features of the spot images are extracted. Next, the TrAdaBoost-based method and other baseline methods are used to identify the diseases. Experimental results indicate that the TrAdaBoost-based method can transfer samples between the auxiliary and target domains and achieves better results than the baseline methods. The results also show that transfer learning is helpful for crop disease image recognition when training samples are insufficient.

Sisi Fang, Yuan Yuan, Lei Chen, Jian Zhang, Miao Li, Shide Song

Multi-orientation Scene Text Detection Leveraging Background Suppression

Most state-of-the-art text detection methods are devoted to horizontal text and do not work well on blurred, multi-oriented, low-resolution or small-sized text. In this paper, we propose to localize text by suppressing more non-text background, using a coarse-to-fine strategy to remove non-text pixels from images. Firstly, the fully convolutional network (FCN) framework is used to make a coarse prediction of the text labeling. Secondly, an efficient saliency measure based on background priors is employed to further suppress non-text pixels and generate fine character candidate regions. The remaining character candidate regions are composed into text lines, so that the proposed method can handle multi-orientation text in natural scene images. Two public datasets, MSRA-TD500 and ICDAR2013, are used to evaluate the performance of the proposed method. Experimental results show that our method achieves a high recall rate and competitive performance.

Xihan Wang, Xiaoyi Feng, Zhaoqiang Xia, Jinye Peng, Eric Granger

Sparse Time-Varying Graphs for Slide Transition Detection in Lecture Videos

In this paper, we present an approach for detecting slide transitions in lecture videos by introducing sparse time-varying graphs. Given a lecture video in which multiple cameras record the digital slides, the speaker and the audience, our goal is to find the keyframes where the slide content changes. Specifically, we first partition the lecture video into short segments through feature detection and matching. By constructing a sparse graph at each moment with short video segments as nodes, we formulate the detection problem as a graph inference problem. A set of adjacency matrices between edges, which are sparse and time-varying, is then solved for through a global optimization algorithm. Consequently, the changes between adjacency matrices reflect the slide transitions. Experimental results show that the proposed system achieves better accuracy than other video summarization and slide progression detection approaches.

Zhijin Liu, Kai Li, Liquan Shen, Ping An

Heterogeneous Multi-group Adaptation for Event Recognition in Consumer Videos

Event recognition in consumer videos has attracted much attention from researchers. However, it is a very challenging task, since annotating numerous training samples is time-consuming and labor-intensive. In this paper, we take a large number of loosely labeled Web images and videos from Google and YouTube, represented by different types of features, as heterogeneous source domains for event recognition in consumer videos. We propose a heterogeneous multi-group adaptation method that partitions the loosely labeled Web images and videos into several semantic groups and finds the optimal weight for each group. To learn an effective target classifier, a manifold regularization is introduced into the objective function of Support Vector Regression (SVR) with an $\epsilon$-insensitive loss. The objective function is solved alternately using standard quadratic programming and SVR solvers. Comprehensive experiments on two real-world datasets demonstrate the effectiveness of our method.
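
For readers unfamiliar with the loss named in the abstract above, the standard ε-insensitive loss used in SVR ignores errors smaller than ε and grows linearly beyond that. The short sketch below shows only this loss; the function names and the default `epsilon` are chosen for the example, and the manifold regularization and group weighting of the paper are not modelled.

```python
def epsilon_insensitive_loss(y_true, y_pred, epsilon=0.1):
    """max(0, |y_true - y_pred| - epsilon): zero inside the epsilon tube."""
    return max(0.0, abs(y_true - y_pred) - epsilon)

def mean_loss(targets, predictions, epsilon=0.1):
    """Average epsilon-insensitive loss over paired targets and predictions."""
    losses = [epsilon_insensitive_loss(t, p, epsilon)
              for t, p in zip(targets, predictions)]
    return sum(losses) / len(losses)
```

A prediction that misses its target by less than `epsilon` contributes nothing, which is what lets SVR tolerate small errors on some training samples.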

Mingyu Yao, Xinxiao Wu, Mei Chen, Yunde Jia

Margin-Aware Binarized Weight Networks for Image Classification

Deep neural networks (DNNs) have achieved remarkable successes in many vision tasks. However, due to their dependence on large memory and high-performance GPUs, it is extremely hard to deploy DNNs on low-power devices. Many techniques have recently been proposed for compressing and accelerating deep neural networks. In particular, binarized weight networks, which store each weight using only one bit and replace complex floating-point operations with simple calculations, are attractive from the perspective of hardware implementation. In this paper, we propose a simple strategy to learn better binarized weight networks. Motivated by the phenomenon that the stochastic binarization approach usually converges with real-valued weights close to the two boundaries $\{-1, +1\}$ and gives better performance than deterministic binarization, we construct a margin-aware binarization strategy by adding a weight constraint to the objective function of the deterministic scheme to minimize the margins between the real-valued weights and the boundaries. This constraint can be easily realized by a Binary-L2 regularization without the cost of complex random number generation. Experimental results on the MNIST and CIFAR-10 datasets show that the proposed method yields better performance than recent network binarization schemes and the full-precision network counterpart.
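
One plausible reading of a regularizer that minimizes the margin between real-valued weights and the boundaries -1 and +1 is a squared penalty on | |w| - 1 |. The sketch below uses that form together with sign-based deterministic binarization; the exact penalty, the coefficient `lam` and the step size `lr` are assumptions for illustration, since the abstract does not define Binary-L2 precisely.

```python
def binarize(weights):
    """Deterministic binarization: map each real weight to -1.0 or +1.0."""
    return [1.0 if w >= 0 else -1.0 for w in weights]

def binary_l2_penalty(weights):
    """Assumed penalty: sum of squared distances from |w| to 1."""
    return sum((abs(w) - 1.0) ** 2 for w in weights)

def binary_l2_grad(weights):
    """Gradient of the assumed penalty with respect to each weight."""
    return [2.0 * (abs(w) - 1.0) * (1.0 if w >= 0 else -1.0) for w in weights]

def regularizer_step(weights, lam=0.1, lr=0.5):
    """One gradient step on lam * penalty only (the task loss is omitted)."""
    grad = binary_l2_grad(weights)
    return [w - lr * lam * g for w, g in zip(weights, grad)]
```

Each step pulls a weight towards the boundary that matches its sign, so the real-valued weights drift closer to their binarized values without any random sampling.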

Ting-Bing Xu, Peipei Yang, Xu-Yao Zhang, Cheng-Lin Liu

Multi-template Matching Algorithm Based on Adaptive Fusion

A target recognition method based on adaptive fusion of multiple matching results is proposed in order to take advantage of both gray-level information and feature information in forward-looking infrared (FLIR) target recognition. On the basis of gray-value matching and feature matching, a primary decision based on the proposed adaptive analytic hierarchy process (AHP) and a secondary decision based on the overlap area and local searching are used to generate the final matching result. Experimental results show that the proposed method effectively overcomes the false matching caused by feature template matching and improves the accuracy of target matching; it shows better robustness especially in cases of complicated background, scale difference or viewpoint difference.

Bing Li, Juan Su, Dan Chen, Wei Wu

Combining Object-Based Attention and Attributes for Image Captioning

Image captioning has been a hot topic in computer vision and natural language processing. Recently, researchers have proposed many models for image captioning, which can be classified into two classes: visual attention based models and semantic attribute based models. In this paper, we propose a novel image captioning system that models the relationship between semantic attributes and visual attention. In addition, unlike traditional attention models, which do not use object detectors and instead learn a latent alignment between regions and words, we propose an object attention system that is capable of incorporating the information output by object detectors and can better attend to objects when generating the corresponding words. We evaluate our method on the MS COCO dataset, where our model outperforms many strong baselines.

Cong Li, Jiansheng Chen, Weitao Wan, Tianpeng Li

Intrinsic Image Decomposition: A Comprehensive Review

Image understanding and analysis is one of the important tasks in image processing. Multiple factors influence the appearance of an object in an image. Extracting the intrinsic images from the observed image can effectively eliminate the environmental impact and make image understanding more accurate. The intrinsic images represent the inherent shape, color and texture information of the object. Intrinsic image decomposition recovers the shading image and the reflectance image from a single input image, and it remains a challenging problem because it is severely ill-posed. To deal with this problem, researchers have proposed various algorithms for intrinsic image decomposition. In this paper we survey the recent advances in intrinsic image decomposition. First, we introduce the existing datasets for intrinsic image decomposition. Second, we introduce and analyze the existing intrinsic image decomposition algorithms. Finally, we run the existing algorithms on the intrinsic image datasets, and analyze and summarize the experimental results.

Yupeng Ma, Xiaoyi Feng, Xiaoyue Jiang, Zhaoqiang Xia, Jinye Peng

Multi-instance Multi-label Learning for Image Categorization Based on Integrated Contextual Information

In image categorization, an image is usually represented as a bag of instances associated with multiple labels, which naturally induces the paradigm of multi-instance multi-label learning (MIMLL). Previous research has shown that exploiting connections between labels and regions or correlations among labels significantly improves image categorization accuracy, but most proposed approaches cannot make full use of contextual cues. We therefore propose a new MIMLL method that integrates three distinct types of context into a conditional random field (CRF) framework, which can simultaneously capture the latent probability distribution of instances, the spatial context among adjacent instances and the correlations between instances and labels. We perform image categorization on the MSRC and Corel 1000 data sets to verify our proposal. Compared with traditional and state-of-the-art MIMLL algorithms, our approach obtains superior performance.

Xingyue Li, Shouhong Wan, Chang Zou, Bangjie Yin

Book Page Identification Using Convolutional Neural Networks Trained by Task-Unrelated Dataset

This paper presents a pipeline to make convolutional neural networks (CNNs) trained for another unrelated task available for book page identification. The pipeline has five building blocks: (1) An image segmentation module to separate book page from the background; (2) An image correction module to correct geometry and color distortions; (3) A feature extraction module to extract discriminative image features by a pre-trained CNN; (4) A feature compression module to reduce feature dimensions for speeding up; and (5) A feature matching module to calculate the similarity between a query image and a reference image, and then to find out the most similar reference image. The experimental results on a challenging testing dataset show that the proposed book page identification method achieves a top-5 hit rate of 98.93%.

Leyuan Liu, Yi Zhao, Huabing Zhou, Jingying Chen
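
The feature matching step (5) of the book page identification pipeline above can be illustrated with a minimal sketch. The abstract does not name the similarity measure, so cosine similarity is an assumption here, and the function names, page identifiers, and toy feature vectors are hypothetical. In the real pipeline the vectors would be compressed CNN features.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector matches nothing
    return dot / (norm_a * norm_b)


def top_k_matches(query, references, k=5):
    """Return the ids of the k reference pages most similar to the query."""
    scored = [
        (cosine_similarity(query, feature), page_id)
        for page_id, feature in references.items()
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [page_id for _, page_id in scored[:k]]


# Hypothetical reference features (stand-ins for compressed CNN features).
references = {
    "page_001": [1.0, 0.0, 0.0],
    "page_002": [0.0, 1.0, 0.0],
    "page_003": [0.7, 0.7, 0.0],
}
query = [0.9, 0.1, 0.0]

best = top_k_matches(query, references, k=2)
print(best)
```

A top-5 hit, as reported in the abstract, would correspond to the correct page appearing anywhere in the list returned with `k=5`.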

No Reference Assessment of Image Visibility for Dehazing

Haze degrades the quality and visibility of an image. Many dehazing algorithms have been developed in recent years, but evaluating the performance of dehazing methods remains an unsolved problem. The assessment is difficult because no reference image is available. In this paper, a no-reference image quality indicator is proposed to assess the visibility of a dehazed image. A multi-scale contrast feature is designed to measure image sharpness. Because some dehazing methods often produce under-dehazed results, a dark channel feature is employed to describe the degree of residual haze in the restored image. The two features are fused to obtain the final indicator of image visibility. Experimental results show that the assessment results are highly correlated with human visual perception and objective quality scores, which demonstrates the effectiveness and robustness of the proposed approach.

Manjun Qin, Fengying Xie, Zhiguo Jiang
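
The dark channel feature mentioned in the dehazing abstract above can be sketched as follows. The abstract gives no formula, so this uses the standard dark channel definition (the minimum over color channels followed by a minimum over a local window) and takes its mean as a residual-haze score. The averaging step, the window radius, and all names are assumptions for illustration, not the authors' indicator.

```python
def dark_channel(image, radius=1):
    """Dark channel of an RGB image given as rows of (r, g, b) tuples in [0, 1].

    Each output pixel is the minimum, over a (2*radius+1)-wide window,
    of the per-pixel minimum across the three color channels.
    """
    height, width = len(image), len(image[0])
    min_rgb = [[min(image[r][c]) for c in range(width)] for r in range(height)]
    dark = [[0.0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            r0, r1 = max(0, r - radius), min(height, r + radius + 1)
            c0, c1 = max(0, c - radius), min(width, c + radius + 1)
            dark[r][c] = min(
                min_rgb[rr][cc] for rr in range(r0, r1) for cc in range(c0, c1)
            )
    return dark


def haze_residual_score(image, radius=1):
    """Mean dark channel value; higher values suggest more residual haze."""
    dark = dark_channel(image, radius)
    total = sum(sum(row) for row in dark)
    return total / (len(dark) * len(dark[0]))


# Toy 2x2 images: a saturated "clear" image and a washed-out "hazy" one.
clear = [[(0.9, 0.1, 0.0), (0.2, 0.8, 0.1)],
         [(0.0, 0.5, 0.9), (0.3, 0.0, 0.7)]]
hazy = [[(0.8, 0.8, 0.85), (0.8, 0.82, 0.9)],
        [(0.85, 0.8, 0.8), (0.9, 0.8, 0.8)]]

clear_score = haze_residual_score(clear)
hazy_score = haze_residual_score(hazy)
print(clear_score, hazy_score)
```

The hazy toy image scores higher because haze lifts the darkest channel of every pixel, which is the intuition behind using this feature to detect under-dehazed results.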

Multi-focus Image Fusion Using Dictionary Learning and Low-Rank Representation

In representation learning, low-rank representation (LRR) is a hot research topic in many fields, especially image processing and pattern recognition. Although LRR can capture global structure, its ability to preserve local structure is limited because it lacks dictionary learning. In this paper, we propose a novel multi-focus image fusion method based on dictionary learning and LRR that performs better on both global and local structure. First, the source images are divided into patches using a sliding window, the patches are classified according to their Histogram of Oriented Gradient (HOG) features, and a sub-dictionary is learned for each class with the K-singular value decomposition (K-SVD) algorithm. Second, a global dictionary is constructed by combining these sub-dictionaries, and this global dictionary is used in LRR to obtain the LRR coefficient vector of each patch. Finally, the $$l_1$$-norm and a choose-max fusion strategy are applied to the coefficient vectors, and the fused image is reconstructed from the fused LRR coefficients and the global dictionary. Experimental results demonstrate that the proposed method achieves state-of-the-art performance in both qualitative and quantitative evaluations compared with several classical and recent methods.

Hui Li, Xiao-Jun Wu
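
The final fusion step of the multi-focus abstract above can be sketched under one plausible reading: for each patch position, keep the coefficient vector with the larger l1-norm, then reconstruct the patch as a linear combination of dictionary atoms. The tie-breaking rule, the data layout, and all names are assumptions. The LRR solver and K-SVD training are not shown, and the coefficients below are made up.

```python
def l1_norm(vector):
    """Sum of absolute values of the entries."""
    return sum(abs(x) for x in vector)


def fuse_choose_max(coeffs_a, coeffs_b):
    """Per patch, keep the coefficient vector with the larger l1-norm.

    coeffs_a and coeffs_b hold one coefficient vector per patch for the two
    source images. Ties go to the first image (an arbitrary assumption).
    """
    fused = []
    for vec_a, vec_b in zip(coeffs_a, coeffs_b):
        chosen = vec_a if l1_norm(vec_a) >= l1_norm(vec_b) else vec_b
        fused.append(list(chosen))
    return fused


def reconstruct_patch(dictionary, coeff):
    """Rebuild a patch as a weighted sum of dictionary atoms.

    dictionary is a list of atoms, each a flat list of pixel values.
    """
    patch_size = len(dictionary[0])
    return [
        sum(weight * atom[i] for weight, atom in zip(coeff, dictionary))
        for i in range(patch_size)
    ]


# Toy data: two patches, a two-atom identity dictionary.
coeffs_a = [[0.5, -1.5], [0.1, 0.1]]
coeffs_b = [[0.2, 0.3], [1.0, -0.4]]
dictionary = [[1.0, 0.0], [0.0, 1.0]]

fused = fuse_choose_max(coeffs_a, coeffs_b)
patches = [reconstruct_patch(dictionary, coeff) for coeff in fused]
print(fused)
print(patches)
```

Using the l1-norm as an activity measure reflects the usual intuition that in-focus patches carry more detail and therefore larger coefficients.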

Key-Region Representation Learning for Anomaly Detection

Anomaly detection and localization are of great importance for public safety monitoring. In this paper we focus on individual behavior anomaly detection, which remains a challenging problem due to the complicated dynamics of video data. We approach this problem through feature extraction, because we believe that patterns are easier to classify in feature space. However, unlike many works in video analysis, we extract features only from small key-region patches, which allows our feature extraction module to have a simple architecture and to be more targeted at anomaly detection. Our anomaly detection framework consists of three parts: the main part is an auto-encoder based representation learning module, and the other two parts, a key-region extraction module and a Mahalanobis distance based classifier, are specifically designed for anomaly detection in video. Our work has the following advantages: (1) our anomaly detection framework focuses only on suspicious regions and can detect anomalies with high accuracy and speed; (2) our anomaly detection classifier has a stronger ability to capture the data distribution for anomaly detection.

Wenfei Yang, Bin Liu, Nenghai Yu
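
A Mahalanobis distance based classifier of the kind named in the anomaly detection abstract above can be sketched as follows. The abstract gives no details, so this is a generic version: fit a mean and covariance to features of normal samples, then flag a feature vector whose distance exceeds a threshold. The threshold of 3.0, the small ridge term, and the toy two-dimensional features (standing in for auto-encoder outputs) are all assumptions.

```python
import math


def mean_vector(samples):
    count = len(samples)
    return [sum(s[i] for s in samples) / count for i in range(len(samples[0]))]


def covariance_matrix(samples, mean, ridge=1e-6):
    """Sample covariance with a small ridge on the diagonal for stability."""
    count, dim = len(samples), len(mean)
    cov = [[0.0] * dim for _ in range(dim)]
    for sample in samples:
        diff = [sample[i] - mean[i] for i in range(dim)]
        for i in range(dim):
            for j in range(dim):
                cov[i][j] += diff[i] * diff[j]
    for i in range(dim):
        for j in range(dim):
            cov[i][j] /= count - 1
        cov[i][i] += ridge
    return cov


def solve(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    dim = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(dim):
        pivot = max(range(col, dim), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, dim):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, dim + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * dim
    for r in range(dim - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, dim))
        x[r] = (aug[r][dim] - tail) / aug[r][r]
    return x


def mahalanobis(x, mean, cov):
    """sqrt((x - mean)^T cov^{-1} (x - mean)), computed via a linear solve."""
    diff = [xi - mi for xi, mi in zip(x, mean)]
    weighted = solve(cov, diff)
    return math.sqrt(sum(d * w for d, w in zip(diff, weighted)))


def is_anomalous(x, mean, cov, threshold=3.0):
    return mahalanobis(x, mean, cov) > threshold


# Hypothetical features of normal key-region patches.
normal = [[1.0, 2.0], [1.2, 1.9], [0.8, 2.1], [1.1, 2.2], [0.9, 1.8]]
mean = mean_vector(normal)
cov = covariance_matrix(normal, mean)

near = is_anomalous([1.1, 2.0], mean, cov)
far = is_anomalous([3.0, 0.0], mean, cov)
print(near, far)
```

Unlike a plain Euclidean threshold, the Mahalanobis distance scales each direction by the spread of the normal data, which is one way a classifier can reflect the data distribution.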
