Exemplar-based Deep Discriminative and Shareable Feature Learning for scene image classification
Introduction
Extracting informative, robust, and compact data representations (features) is one of the key factors for good performance in computer vision. Thus, much effort has been devoted to developing efficient and effective features, and existing methods can roughly be divided into two categories: feature engineering and feature learning. In the last decade, feature engineering produced numerous hand-crafted features, such as SIFT [1] and HOG [2], which have dominated the image representation area. However, designing such features is labor-intensive and limited by the designer's ingenuity and prior knowledge. In contrast, to expand the capability and ease of image representation, feature learning methods [3], [4], [5], [6], [7], [8], [9] aim to automatically learn data-adaptive image representations from raw pixel data. However, these methods are generally poor at extracting and organizing the discriminative information in the data, and most of them operate in an unsupervised way without considering class labels, which are crucial for image classification. To compensate for these weaknesses while retaining the advantages of feature learning, we propose to encode, within a feature learning procedure, both the shareable information that exists among groups of classes and the discriminative patterns owned by specific classes.
In this paper, we develop a multiple-layer feature learning framework called Deep Discriminative and Shareable Feature Learning (DDSFL), which hierarchically learns filter banks that transform the pixel values of local image patches into features. As shown in Fig. 1, in each feature learning layer we aim to learn an over-complete filter bank that covers the variances of patches from different classes, while preserving both the shareable correlations among similar classes and the discriminative power of each category. Intuitively, this goal could be reached by randomly selecting training patches, learning a filter bank for each class independently, and concatenating the banks afterwards. However, this raises several problems: (1) some patterns are shared among classes, so repeatedly learning filters for similar patterns is neither memory-compact nor computationally efficient, and the feature dimension grows linearly with the number of classes, which restricts such methods to small datasets; (2) discriminative power can hardly be fully exploited, since class-specific characteristics are generally subtle and only become apparent in comparison with other classes; (3) in most cases, images are dominated by noisy or meaningless patches, so learning filters from randomly sampled patches increases the learning cost and degrades performance.
To learn compact and effective filter banks, each category is forced to only activate a subset of global filters during the learning procedure. Beyond reducing feature dimension, sharing filters can also lead to more robust features. Images belonging to different classes do share some information (e.g. in scene images, both ‘computer room’ and ‘office’ contain ‘computer’ and ‘desk’). The amount of information sharing depends on the similarity between different categories. Hence, we allow filters to be shareable, meaning that the same filters can be activated by a number of categories. We introduce a binary selection vector to adaptively select which filters to share, and among which categories.
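The shareable-filter idea above can be sketched in a few lines. The sketch below is illustrative, not the paper's actual learning algorithm: the filter bank `D` is random rather than learned, and the binary selection matrix `S` (which filters each class activates) is drawn at random, whereas the paper learns it adaptively.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions loosely following the paper (400 filters, 300-d input, 15 classes).
n_filters, dim, n_classes = 400, 300, 15
D = rng.standard_normal((n_filters, dim))   # global filter bank, one filter per row
# Hypothetical binary selection matrix: S[c, k] = True iff class c activates filter k.
# Overlapping rows model filters shared among similar classes.
S = rng.random((n_classes, n_filters)) < 0.3

def encode(x, c):
    """Encode patch x under class c: only its selected subset of filters responds."""
    resp = np.zeros(n_filters)
    active = S[c]
    resp[active] = D[active] @ x
    return resp

x = rng.standard_normal(dim)
r = encode(x, 2)   # responses outside the filters selected by S[2] stay zero
```

Because every class indexes into the same global bank, a filter shared by two similar classes is stored and learned once, so the total feature dimension stays fixed at `n_filters` instead of growing with the number of classes.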
To improve the discrimination power, we force the features from the same category to be close and the features from different categories to be far away (e.g. patches corresponding to bookshelf in ‘office’ can hardly be found in ‘computer room’). However, the local patches from the same categories are very diverse. Therefore, we propose to measure the similarity by forcing a patch to be similar to a subgroup of training samples from the same category instead. Furthermore, not all the local patches from different classes need to be separable. Thus, we relax the discriminative term to allow sharing similar patches across different classes and focus on separating the less similar patches.
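The two relaxations above (comparing a patch against a subgroup of same-class samples, and not penalizing cross-class patches that are genuinely similar) can be captured by a hinge-style cost. This is a hedged sketch of the idea, not the paper's exact objective; the function name, `k`, and `margin` are illustrative choices.

```python
import numpy as np

def discriminative_cost(f, same_class, other_class, k=5, margin=1.0):
    """Sketch of the discriminative idea: a feature f should be close to its
    k nearest same-class exemplars (a subgroup, since patches are diverse),
    while the nearest other-class exemplar should be at least `margin`
    farther away; already-similar cross-class patches incur no penalty."""
    d_same = np.sort(np.linalg.norm(same_class - f, axis=1))[:k].mean()
    d_other = np.linalg.norm(other_class - f, axis=1).min()
    # hinge relaxation: zero cost once the other class is margin-far enough
    return d_same + max(0.0, margin + d_same - d_other)

# A feature near its own class and far from the other class gets a lower cost.
rng = np.random.default_rng(0)
same = rng.standard_normal((20, 8)) * 0.1
other = rng.standard_normal((20, 8)) * 0.1 + 5.0
cost_near = discriminative_cost(np.zeros(8), same, other)
cost_far = discriminative_cost(np.full(8, 5.0), same, other)
```

Using the k nearest same-class samples rather than all of them is what tolerates the large intra-class diversity of local patches.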
To improve the quality of the filters and efficiency of the learning procedures, we propose two exemplar selection schemes to select effective training data. The proposed methods aim to remove the noisy training patches that commonly exist in many different classes, and select the patches that contain both shareable and discriminative patterns as the training data to learn the filter banks.
Furthermore, as demonstrated by many previous deep feature learning works, hierarchically extracting features of increasing visual abstraction helps to capture more abstract and useful information. Inspired by this idea, we extend the single-layer feature learning module to a hierarchical structure. In this paper, we build a three-layer learning framework, as shown in Fig. 2. Specifically, we first learn the first-layer features from small (16×16) raw-pixel image patches. For each higher layer, we convolve the previous layer's features within a larger region (32×32 and 64×64 for the second and third layers, respectively), apply PCA (reducing to 300 dimensions for all layers), and use the reduced data to train the current layer's filter bank (400 filters per layer). Finally, we combine the features learned by all three layers as our DDSFL feature.
The rest of this paper is organized as follows. Section 2 introduces related work on feature engineering, feature learning, and discriminative training. Section 3 describes the details of our DDSFL method, introducing the global unsupervised term, the shareable term, and the discriminative term. Section 4 proposes two exemplar selection methods: Nearest Neighbor based and SVM based selection. In Section 5, we test our method on three widely used scene image classification datasets (Scene 15, UIUC Sports, MIT Indoor) as well as on PASCAL VOC 2012. The experimental results show that our features outperform most existing methods, and that they are strongly complementary to the state-of-the-art Caffe [12] features (ConvNets [3] pre-trained on ImageNet [13]). Finally, Section 6 concludes the paper.
Feature engineering
In the feature engineering area, hand-crafted features including SIFT [1], HOG [2], LBP [14], and GIST [15] (a global feature) have been widely used. Compared to existing feature learning methods, they generally yield better local descriptors, and extra information (e.g. discriminative information) can be better expressed by manually inserting prior knowledge. Even though they are very powerful, designing such features is labor-intensive, and they can hardly capture any information other
Deep discriminative and shareable feature learning
In this section, the hierarchical structure of our Deep Discriminative and Shareable Feature Learning (DDSFL) framework will be briefly introduced. For each single layer DDSFL, its three learning components will be introduced in detail, and an alternating optimization strategy will be provided afterwards (Fig. 3).
Exemplar selection
The quality of the learned features depends not only on the learning structure and parameters, but also on the quality of the input training data. In order to select a training patch set that carries both potentially shareable and discriminative patterns, while excluding common noisy patches, a training data selection procedure is required before feature learning.
For this purpose, one intuitive solution is applying k-means clustering, and using the cluster centroids as the
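This intuitive k-means baseline can be sketched as follows. The sketch is self-contained (a plain Lloyd's k-means in numpy) and only illustrates the centroid-based selection idea, not the paper's Nearest Neighbor or SVM based schemes; the data, cluster count, and selection rule are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Plain Lloyd's k-means, included only to keep the sketch self-contained."""
    C = X[rng.choice(len(X), k, replace=False)].copy()
    for _ in range(iters):
        labels = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(axis=0)
    return C, labels

# Cluster a class's patches and keep, per cluster, the patch nearest the
# centroid as a candidate exemplar (the intuitive baseline described above).
X = rng.standard_normal((200, 32))      # toy stand-in for flattened patches
C, labels = kmeans(X, 10)
exemplars = np.array([((X - c) ** 2).sum(-1).argmin() for c in C])
```

Centroids summarize frequent patterns but cannot tell shareable structure from noise that is common to many classes, which is what motivates the Nearest Neighbor and SVM based selection schemes of Section 4.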
Datasets and experiment settings
We tested our DDSFL method on three widely used scene image classification datasets: Scene 15 [11], UIUC Sports [41], and MIT Indoor [42]. We also tested on the challenging PASCAL VOC 2012 object classification dataset. To make a fair comparison with other types of features, we only utilized gray scale information for all of the images in these datasets.
- Scene 15: This dataset includes 4485 images from 15 outdoor and indoor scene categories; each category contains 200–400 gray-scale images.
Conclusion
In this paper, we propose a hierarchical weakly supervised local feature learning method, called DDSFL, to learn discriminative and shareable filter banks to transform local image patches into different visual level features. In our DDSFL method, we learn a flexible number of shared filters to represent shareable patterns that exist among similar categories. To enhance the discriminative power, we force the features from the same class to be similar, while features from different classes to be
Conflict of interest
None declared.
Acknowledgments
The research is supported by Singapore Ministry of Education (MOE) Tier 1 RG84/12, Singapore Ministry of Education (MOE) Tier 2 ARC28/14, and Singapore A*STAR Science and Engineering Research Council PSF1321202099.
Zhen Zuo received her B.S. degree from Huazhong University of Science and Technology (HUST), Wuhan, China, in 2011. She is currently a Ph.D. student in the School of Electrical and Electronic Engineering, Nanyang Technological University (NTU), Singapore. Her research interests include Computer Vision and Machine Learning.
References (54)

- Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. (2004).
- N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: CVPR, ...
- A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet classification with deep convolutional neural networks, in: NIPS, ...
- Q.V. Le, A. Karpenko, J. Ngiam, A.Y. Ng, ICA with reconstruction cost for efficient overcomplete feature learning, in: ...
- W.Y. Zou, S.Y. Zhu, A.Y. Ng, K. Yu, Deep learning of invariant features via simulated fixations in video, in: NIPS, ...
- et al., A fast learning algorithm for deep belief nets, Neural Comput. (2006).
- A. Coates, H. Lee, A.Y. Ng, An analysis of single-layer networks in unsupervised feature learning, in: AISTATS, ...
- K. Sohn, D.Y. Jung, H. Lee, A.O. Hero, Efficient learning of sparse, distributed, convolutional feature representations, ...
- et al., Learning discriminative hierarchical features for object recognition, Signal Process. Lett. (2014).
- et al., Building the gist of a scene: the role of global image features in recognition, Progr. Brain Res. (2006).
- Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Trans. Pattern Anal. Mach. Intell.
Gang Wang is an Assistant Professor in Electrical and Electronic Engineering at Nanyang Technological University. He is also a Research Scientist at the Advanced Digital Science Center. He received the B.S. degree from Harbin Institute of Technology, China, in 2005 and the Ph.D. degree from the University of Illinois at Urbana-Champaign. His research interests include computer vision and machine learning.
Bing Shuai received the B.E. degree in computer science from Chongqing University, China, in 2010. From 2010 to 2013, he was a master's student studying computer vision at Xiamen University, China. He is currently a Ph.D. candidate in machine learning and computer vision at the School of EEE, Nanyang Technological University, Singapore.
Lifan Zhao received the B.S. degree in electronic engineering from Xidian University, Xi'an, China, in 2010. He is currently working towards the Ph.D. degree in the School of Electrical and Electronic Engineering at Nanyang Technological University, Singapore. His research interests include sparse signal recovery techniques and their applications in source localization, radar imagery, and wireless communications.
Qingxiong Yang is an Assistant Professor in the Department of Computer Science at City University of Hong Kong. He obtained his BEng degree in Electronic Engineering & Information Science from University of Science & Technology of China (USTC) in 2004 and PhD degree in Electrical & Computer Engineering from University of Illinois at Urbana-Champaign in 2010. His research interests reside in Computer Vision and Computer Graphics. He won the best student paper award at MMSP 2010 and best demo at CVPR 2007.