Elsevier

Pattern Recognition

Volume 48, Issue 10, October 2015, Pages 3004-3015
Exemplar based Deep Discriminative and Shareable Feature Learning for scene image classification

https://doi.org/10.1016/j.patcog.2015.02.003

Highlights

  • We propose to encode shareable and discriminative information in feature learning.

  • Two exemplar selection methods are proposed to select effective training data.

  • We build a hierarchical learning scheme to capture multiple visual level information.

  • Our DDSFL outperforms most of the existing features.

  • DDSFL features show great complementary effect to Caffe features.

Abstract

To encode class correlation and class-specific information in image representation, we propose a new local feature learning approach named Deep Discriminative and Shareable Feature Learning (DDSFL). DDSFL hierarchically learns feature transformation filter banks that transform raw pixel image patches into features. The learned filter banks are expected to (1) encode common visual patterns of a flexible number of categories; (2) encode discriminative information; and (3) hierarchically extract patterns at different visual levels. In particular, in each layer of DDSFL, shareable filters are jointly learned for classes that share similar patterns. The filters gain discriminative power by forcing features from the same category to be close and features from different categories to be far apart. Furthermore, we propose two exemplar selection methods that iteratively select training data for more efficient and effective learning. In our experiments, DDSFL achieves very promising performance and also shows a strong complementary effect to the state-of-the-art Caffe features.

Introduction

Extracting informative, robust, and compact data representations (features) is considered one of the key factors for good performance in computer vision. Thus, much effort has been devoted to developing efficient and effective features, and the existing methods can roughly be classified into two categories: feature engineering and feature learning. In the last decade, numerous feature engineering methods have produced hand-crafted features, such as SIFT [1] and HOG [2], that have dominated the image representation area. However, such methods are labor-intensive and limited by the designer's ingenuity and prior knowledge. In contrast, to expand the capability and ease of image representation, feature learning methods [3], [4], [5], [6], [7], [8], [9] aim to automatically learn data-adaptive image representations from raw pixel image data. However, these methods are generally poor at extracting and organizing the discriminative information in the data. Moreover, most of these learning frameworks operate in an unsupervised way and ignore the class label information, which is crucial for image classification. To compensate for these weaknesses while maintaining the advantages of feature learning, we propose to encode, within a feature learning procedure, both the shareable information that exists among groups of classes and the discriminative patterns owned by specific classes.

In this paper, we develop a multi-layer feature learning framework called Deep Discriminative and Shareable Feature Learning (DDSFL), which hierarchically learns transformation filter banks that transform the pixel values of local image patches into features. As shown in Fig. 1, in each feature learning layer we aim to learn an over-complete filter bank that covers the variance of patches from different classes while preserving the shareable correlation among similar classes and the discriminative power of each category. Intuitively, this goal could be reached by randomly selecting training patches, learning a filter bank for each class independently, and then concatenating the banks. However, this approach has several problems: (1) some patterns are shared among classes, so repeatedly learning filters for similar patterns is neither memory-compact nor computationally efficient, and the feature dimension increases linearly with the number of classes, which limits such methods to small datasets; (2) discriminative power can hardly be fully exploited, since class-specific characteristics are generally subtle and not obvious without comparison with other classes; (3) in most cases, images are dominated by noisy or meaningless patches, so learning filters from randomly sampled image patches increases the learning cost and degrades performance.

To learn compact and effective filter banks, each category is forced to only activate a subset of global filters during the learning procedure. Beyond reducing feature dimension, sharing filters can also lead to more robust features. Images belonging to different classes do share some information (e.g. in scene images, both ‘computer room’ and ‘office’ contain ‘computer’ and ‘desk’). The amount of information sharing depends on the similarity between different categories. Hence, we allow filters to be shareable, meaning that the same filters can be activated by a number of categories. We introduce a binary selection vector to adaptively select which filters to share, and among which categories.
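The role of the binary selection vector can be illustrated with a toy sketch. This is not the paper's formulation: the filter values, the selection vectors, and the helper names (`apply_filters`, `class_feature`, `shared_filters`) are assumptions made purely for illustration. In the paper the selection is learned, whereas here it is fixed by hand.

```python
# Toy illustration only: values and helper names are assumptions, not the paper's.

def apply_filters(filters, patch):
    """Linear filter responses: one dot product per filter."""
    return [sum(w * x for w, x in zip(f, patch)) for f in filters]

def class_feature(filters, selection, patch):
    """Keep responses only for the filters that the class's binary vector activates."""
    responses = apply_filters(filters, patch)
    return [r * s for r, s in zip(responses, selection)]

def shared_filters(selection_a, selection_b):
    """Indices of filters activated by both classes."""
    return [i for i, (a, b) in enumerate(zip(selection_a, selection_b)) if a and b]

# A global bank of 4 filters over 3-pixel patches (hand-picked toy values).
FILTERS = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]

# One binary selection vector per class (hand-picked; learned in the real method).
SELECTION = {
    "office": [1, 1, 0, 1],
    "computer room": [0, 1, 1, 1],
}

patch = [0.5, 2.0, 1.0]
office_feat = class_feature(FILTERS, SELECTION["office"], patch)
common = shared_filters(SELECTION["office"], SELECTION["computer room"])
```

In this toy setting the two classes share filters 1 and 3, so the global bank stays at four filters instead of growing to one bank per class.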

To improve the discriminative power, we force features from the same category to be close and features from different categories to be far apart (e.g. patches corresponding to a bookshelf in ‘office’ can hardly be found in ‘computer room’). However, local patches from the same category are very diverse. Therefore, we measure similarity by forcing a patch to be similar only to a subgroup of training samples from the same category. Furthermore, not all local patches from different classes need to be separable. Thus, we relax the discriminative term to allow similar patches to be shared across classes and focus on separating the less similar patches.
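One possible reading of this relaxed discriminative idea is a hinge-style margin between neighbor distances. The sketch below is an assumption, not the paper's objective: it compares the distance to the k-th nearest same-class feature with the distance to the k-th nearest different-class feature, so the closest different-class patches are tolerated rather than pushed away. The function names, `k`, and `margin` are all hypothetical.

```python
import math

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kth_nearest_distance(query, candidates, k):
    """Distance from query to its k-th nearest candidate (k is 1-based)."""
    dists = sorted(distance(query, c) for c in candidates)
    return dists[k - 1]

def relaxed_discriminative_loss(query, same_class, other_class, k=2, margin=1.0):
    """Hinge penalty (assumed form): zero once the k-th nearest different-class
    feature is at least `margin` farther away than the k-th nearest same-class one."""
    d_pos = kth_nearest_distance(query, same_class, k)
    d_neg = kth_nearest_distance(query, other_class, k)
    return max(0.0, margin + d_pos - d_neg)

query = [0.0, 0.0]
same_class = [[0.0, 1.0], [1.0, 0.0], [5.0, 5.0]]  # diverse: one far outlier

# The nearest different-class patch is very close, but only the 2nd nearest counts.
well_separated = relaxed_discriminative_loss(query, same_class, [[0.0, 0.5], [4.0, 0.0], [6.0, 0.0]])
too_close = relaxed_discriminative_loss(query, same_class, [[0.0, 0.5], [1.5, 0.0]])
```

Using the k-th neighbor on the same-class side means the outlier at `[5.0, 5.0]` does not affect the loss, which mirrors the idea of matching only a subgroup of same-class samples.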

To improve the quality of the filters and efficiency of the learning procedures, we propose two exemplar selection schemes to select effective training data. The proposed methods aim to remove the noisy training patches that commonly exist in many different classes, and select the patches that contain both shareable and discriminative patterns as the training data to learn the filter banks.

Furthermore, as many previous deep feature learning works have shown, hierarchically extracting features at increasing visual levels helps to capture more abstract and useful information. Inspired by this idea, we extend the single-layer feature learning module to a hierarchical structure. In this paper, we build a three-layer learning framework, as shown in Fig. 2. Specifically, we first learn the first-layer features from small (16×16) raw pixel image patches. Then, for the higher layers, we convolve the previous layer's features within a larger region (32×32 and 64×64 for the second and third layers, respectively), feed the result to PCA, and use the reduced-dimensional data (we set the dimension to 300 for all layers) as the input to train the current layer's filter bank (we learn 400 filters for each layer). Finally, we combine the features learned by all three layers as our DDSFL feature.
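The dimensions quoted above can be tracked with a small bookkeeping sketch. The region sizes, the PCA dimension of 300, and the 400 filters per layer come from the text. Two points are assumptions made here: that the first layer consumes the 256 raw pixels of a 16×16 patch directly, and that "combining" the three layers means concatenating their outputs. The `Layer` class and `build_layers` helper are hypothetical names.

```python
from dataclasses import dataclass

NUM_FILTERS = 400  # filters learned per layer (from the text)
PCA_DIM = 300      # reduced dimension fed to each higher layer (from the text)

@dataclass
class Layer:
    name: str
    region: int      # side length of the square input region, in pixels
    input_dim: int   # dimension of the vector fed to the filter bank
    output_dim: int  # number of filter responses

def build_layers(regions=(16, 32, 64)):
    """Describe the three layers; the first-layer input size is an assumption."""
    layers = []
    for i, region in enumerate(regions):
        if i == 0:
            input_dim = region * region  # assumed: raw pixels of a 16x16 patch
        else:
            input_dim = PCA_DIM          # PCA-reduced previous-layer responses
        layers.append(Layer(f"layer{i + 1}", region, input_dim, NUM_FILTERS))
    return layers

layers = build_layers()
# Assumed: the final DDSFL feature concatenates all three layers' outputs.
combined_dim = sum(layer.output_dim for layer in layers)
```

Under these assumptions the combined local feature has 3 × 400 = 1200 dimensions regardless of the number of classes.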

The rest of this paper is organized as follows. Section 2 introduces related work, including feature engineering, feature learning, and discriminative training. Section 3 describes the details of our DDSFL method by introducing the global unsupervised term, the shareable term, and the discriminative term. Section 4 proposes two exemplar selection methods: Nearest Neighbor based and SVM based selection. In Section 5, we test our method on three widely used scene image classification datasets (Scene 15, UIUC Sports, and MIT Indoor) and on PASCAL VOC 2012. The experimental results show that our features outperform most of the existing methods and also have a significant complementary effect with the state-of-the-art Caffe [12] features (ConvNets [3] pre-trained on ImageNet [13]). Finally, Section 6 concludes this paper.

Section snippets

Feature engineering

In the feature engineering area, hand-crafted features including SIFT [1], HOG [2], LBP [14] and GIST [15] (a global feature) have been widely used. Compared with existing feature learning methods, they generally yield better local descriptors, and extra information (e.g. discriminative information) can be better expressed by manually inserting prior knowledge. Even though they are very powerful, designing such features is labor-intensive, and they can hardly capture any information other

Deep discriminative and shareable feature learning

In this section, the hierarchical structure of our Deep Discriminative and Shareable Feature Learning (DDSFL) framework will be briefly introduced. For each single layer DDSFL, its three learning components will be introduced in detail, and an alternating optimization strategy will be provided afterwards (Fig. 3).

Exemplar selection

The quality of the learned features depends not only on the learning structure and parameters but also on the quality of the input training data. To select a training patch set that carries both potentially shareable and discriminative patterns while excluding common noisy patches, a training data selection procedure is required before feature learning.

For this purpose, one intuitive solution is applying k-means clustering, and using the cluster centroids as the

Datasets and experiment settings

We tested our DDSFL method on three widely used scene image classification datasets: Scene 15 [11], UIUC Sports [41], and MIT Indoor [42]. We also tested on the challenging PASCAL VOC 2012 object classification dataset. To make a fair comparison with other types of features, we only utilized gray scale information for all of the images in these datasets.

  • Scene 15: This dataset includes 4485 images from 15 outdoor and indoor scene categories; each category contains 200–400 gray scale images.

Conclusion

In this paper, we propose a hierarchical weakly supervised local feature learning method, called DDSFL, to learn discriminative and shareable filter banks to transform local image patches into different visual level features. In our DDSFL method, we learn a flexible number of shared filters to represent shareable patterns that exist among similar categories. To enhance the discriminative power, we force the features from the same class to be similar, while features from different classes to be

Conflict of interest

None declared.

Acknowledgments

The research is supported by Singapore Ministry of Education (MOE) Tier 1 RG84/12, Singapore Ministry of Education (MOE) Tier 2 ARC28/14, and Singapore A*STAR Science and Engineering Research Council PSF1321202099.

Zhen Zuo received her B.S. degree from Huazhong University of Science and Technology (HUST), Wuhan, China, in 2011. She is currently a Ph.D. student in the School of Electrical and Electronic Engineering, Nanyang Technological University (NTU), Singapore. Her research interests include Computer Vision and Machine Learning.

References (54)

  • A. Oliva et al., Building the gist of a scene: the role of global image features in recognition, Progr. Brain Res. (2006)
  • D.G. Lowe, Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vis. (2004)
  • N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: CVPR,...
  • A. Krizhevsky, I. Sutskever, G. Hinton, Imagenet classification with deep convolutional neural networks, in: NIPS,...
  • Q.V. Le, A. Karpenko, J. Ngiam, A.Y. Ng, Ica with reconstruction cost for efficient overcomplete feature learning, in:...
  • W.Y. Zou, S.Y. Zhu, A.Y. Ng, K. Yu, Deep learning of invariant features via simulated fixations in video, in: NIPS,...
  • G. Hinton et al., A fast learning algorithm for deep belief nets, Neural Comput. (2006)
  • A. Coates, H. Lee, A.Y. Ng, An analysis of single-layer networks in unsupervised feature learning, in: AI Statistics,...
  • K. Sohn, D.Y. Jung, H. Lee, A.O. Hero, Efficient learning of sparse, distributed, convolutional feature representations...
  • Z. Zuo et al., Learning discriminative hierarchical features for object recognition, Signal Process. Lett. (2014)
  • J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, Y. Gong, Locality-constrained linear coding for image classification, in:...
  • S. Lazebnik, C. Schmid, J. Ponce, Beyond bags of features: spatial pyramid matching for recognizing natural scene...
  • J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, T. Darrell, Decaf: a deep convolutional activation...
  • J. Deng, A.C. Berg, K. Li, L. Fei-Fei, What does classifying more than 10,000 image categories tell us? in: ECCV,...
  • T. Ojala et al., Multiresolution gray-scale and rotation invariant texture classification with local binary patterns, IEEE Trans. Pattern Anal. Mach. Intell. (2002)
  • Q.V. Le, M.A. Ranzato, R. Monga, M. Devin, K. Chen, G.S. Corrado, J. Dean, A.Y. Ng, Building high-level features using...
  • Q.V. Le, W.Y. Zou, S.Y. Yeung, A.Y. Ng, Learning hierarchical invariant spatio-temporal features for action recognition...
  • A. Wang, J. Lu, G. Wang, J. Cai, T.-J. Cham, Multi-modal unsupervised feature learning for rgb-d scene labeling, in:...
  • P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, Y. LeCun, Overfeat: integrated recognition, localization and...
  • R. Girshick, J. Donahue, T. Darrell, J. Malik, Rich feature hierarchies for accurate object detection and semantic...
  • Z. Jiang, Z. Lin, L.S. Davis, Learning a discriminative dictionary for sparse coding via label consistent k-svd, in:...
  • J. Mairal, F. Bach, J. Ponce, G. Sapiro, A. Zisserman, Supervised dictionary learning, in: NIPS,...
  • M. Yang, L. Zhang, X. Feng, D. Zhang, Fisher discrimination dictionary learning for sparse representation, in: ICCV,...
  • S. Kong, D. Wang, A dictionary learning approach for classification: separating the particularity and the commonality,...
  • Q. Li, J. Wu, Z. Tu, Harvesting mid-level visual concepts from large-scale internet images, in: CVPR,...
  • C. Doersch, A. Gupta, A.A. Efros, Mid-level visual element discovery as discriminative mode seeking, in: NIPS,...
  • J. Sun, J. Ponce, et al., Learning discriminative part detectors for image classification and cosegmentation, in: ICCV,...

    Gang Wang is an Assistant Professor in Electrical and Electronic Engineering at the Nanyang Technological University. He is also a Research Scientist of the Advanced Digital Science Center. He received the B.S. degree from Harbin Institute of Technology, China, in 2005 and the Ph.D. degree from the University of Illinois at Urbana-Champaign, Urbana. His research interests include computer vision and machine learning.

    Bing Shuai received the B.E. degree in computer science from Chongqing University, China, in 2010. From 2010 to 2013, he was a master's student in computer vision at Xiamen University, China. He is currently a Ph.D. candidate in machine learning and computer vision at the School of EEE, Nanyang Technological University, Singapore.

    Lifan Zhao received the B.S. degree in electronic engineering from Xidian University, Xi'an, China, in 2010. He is currently working towards the Ph.D. degree in the School of Electrical and Electronic Engineering at Nanyang Technological University, Singapore. His research interests include sparse signal recovery techniques and their applications in source localization, radar imagery and wireless communications.

    Qingxiong Yang is an Assistant Professor in the Department of Computer Science at City University of Hong Kong. He obtained his BEng degree in Electronic Engineering & Information Science from University of Science & Technology of China (USTC) in 2004 and PhD degree in Electrical & Computer Engineering from University of Illinois at Urbana-Champaign in 2010. His research interests reside in Computer Vision and Computer Graphics. He won the best student paper award at MMSP 2010 and best demo at CVPR 2007.
