Regularized Semi-Supervised Latent Dirichlet Allocation for visual concept learning
Introduction
Visual concept detection is a key problem in image retrieval. It aims at automatically mapping images into predefined semantic concepts (such as indoor, sunset, airplane, and face), so as to bridge the so-called semantic gap between low-level visual features and high-level semantic content of images. Although there have been many studies over the last decades [1], [2], [3], it remains a challenging problem within the multimedia and computer vision communities. Recently, topic models have been introduced to solve this problem and have achieved impressive results [4], [5], [6], [7], [8], [9]. In these applications, each image is treated as a document and represented by a histogram of visual words. A visual word is the analogue of a text word, and is often generated by clustering local descriptors such as SIFT. Topic models cluster co-occurring visual words into topics, which are then used for image classification.
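The bag-of-visual-words representation described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes scikit-learn's KMeans for clustering and that SIFT descriptors have already been extracted; the function names are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptor_sets, n_words=500, seed=0):
    """Cluster local descriptors (e.g. SIFT) from all images into a visual vocabulary."""
    all_desc = np.vstack(descriptor_sets)  # stack descriptors from every image
    km = KMeans(n_clusters=n_words, random_state=seed, n_init=10)
    km.fit(all_desc)
    return km

def image_histogram(km, descriptors):
    """Represent one image as a normalized histogram over the visual words."""
    words = km.predict(descriptors)  # assign each descriptor to its nearest word
    hist = np.bincount(words, minlength=km.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```

Each image's histogram then plays the role of a document's word-count vector in the topic model.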
Among current topic models, Latent Dirichlet Allocation (LDA) [10] is one of the most popular. Classic LDA is an unsupervised model that uses no prior label information. The lack of supervised information usually leads to slow convergence and unsatisfactory performance. Moreover, only the visual words in the training images are modeled in classic LDA. During classification, class labels are simply treated as features extracted from the topic distribution [5]. Since the class label is not part of the model, classic LDA is not well suited to classification problems, resulting in less robust performance in visual concept detection.
To make LDA more effective for classification and prediction problems, Blei et al. introduced the supervised Latent Dirichlet Allocation (sLDA) model [11], [7]. In sLDA, the class label is treated as a response variable, and topics are trained to best fit the corresponding responses; both visual words and class labels are modeled at the same time. Similarly, Wang et al. [6] proposed a semi-latent Dirichlet allocation (Semi-LDA) model for human action recognition. Different from sLDA, Semi-LDA introduces supervised information into its model by associating image class labels with visual words. That is, Semi-LDA assumes that the topic of a visual word is observable and equal to the image class label. Fig. 1 shows the differences between classic LDA, sLDA and Semi-LDA. By modeling the class label, both sLDA and Semi-LDA significantly outperform classic LDA on classification problems. Besides sLDA and Semi-LDA, Pang et al. [12] proposed a supervised topic model called the Travelogue Model, which can extract both local and global topics, with each local topic corresponding to semantics that characterize a few specific locations.
However, all these models (sLDA, Semi-LDA and the Travelogue Model) improve performance in a fully supervised fashion, and therefore require all training images to be labeled. For a large dataset, labeling is labor-intensive and expensive, which restricts fully supervised topic models to only a few concepts. On the other hand, huge amounts of unlabeled images are available on the Internet and are easy to obtain. These unlabeled images contain enough information to help train visual concept classifiers and to avoid overfitting. Therefore, learning visual concept classifiers with a supervised topic model in a semi-supervised manner, so as to exploit large amounts of unlabeled images, is a promising direction to explore.
Although many semi-supervised learning (SSL) algorithms have been developed, few combine semi-supervised properties with topic models to solve the visual concept learning problem. In [8], Zhuang et al. proposed a method called semi-supervised pLSA (Ss-pLSA) for image classification. By introducing category label information into the EM algorithm during training, they can train classifiers with pLSA in a semi-supervised fashion. Although the supervised information effectively speeds up convergence to the desired results, Ss-pLSA does not encode class labels into its model, and amounts to a loosely coupled combination of simple label propagation with an unsupervised pLSA model. Different from [8], the methods in [13], [14], [15] realized semi-supervised topic models in a more consistent fashion by incorporating the manifold assumption into the topic model. They assumed that the latent topic probabilities of images reside on or close to a manifold, and incorporated the manifold structure into the standard EM algorithm as a regularization term. Since the underlying manifold is unknown, they simply used a nearest neighbor graph to approximate it. However, a nearest neighbor graph is based mainly on pairwise Euclidean distances, and is thus very sensitive to data noise. Because it takes only local pairwise relationships into account, a nearest neighbor graph cannot capture the global geometric structure of the manifold well, which degrades performance. Moreover, all these methods use class label information only to guide model learning, without modeling the class label itself. As analyzed above, this decreases the performance of visual concept classifiers.
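The nearest neighbor graphs criticized above are built purely from pairwise Euclidean distances. As a point of reference, a minimal Gaussian-weighted kNN affinity matrix can be sketched as follows (k and sigma are illustrative parameters, not values used in the paper):

```python
import numpy as np

def knn_graph(X, k=5, sigma=1.0):
    """Symmetric kNN affinity matrix from pairwise Euclidean distances.

    X: (n, d) array, one data point per row. Because each point is connected
    only to its k closest neighbors, the graph encodes purely local pairwise
    structure -- the source of the noise sensitivity discussed in the text.
    """
    n = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)  # squared distances
    W = np.zeros((n, n))
    for i in range(n):
        idx = np.argsort(d2[i])[1:k + 1]                 # skip self (distance 0)
        W[i, idx] = np.exp(-d2[i, idx] / (2 * sigma ** 2))
    return np.maximum(W, W.T)                            # symmetrize
```

A single noisy feature vector can change which points fall among the k nearest neighbors, rewiring the graph locally; this is the weakness the low rank graph of Section 2 is meant to address.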
In this paper, we propose a novel semi-supervised topic model called regularized Semi-Supervised Latent Dirichlet Allocation (r-SSLDA) for visual concept learning. Inspired by Wang et al. [16], instead of attempting to introduce a new Bayesian statistical model, we seek a simple and efficient semi-supervised way to learn visual concept classifiers with topic models. Unlike the loosely coupled solution in [8], we consider both semi-supervised properties and topic models simultaneously in a regularization framework. By minimizing the cost function of this framework, we provide a direct solution to the semi-supervised topic model problem. Different from current semi-supervised topic models [8], [13], [14], [15], r-SSLDA encodes class labels into its framework by adopting a supervised LDA model to learn the visual concept classifiers. Meanwhile, instead of using a nearest neighbor graph, r-SSLDA uses the low rank graph (LR-graph) [17] to approximate the manifold. Compared with existing popular graphs (the kNN-graph [18], [19], [20] and the LLE-graph [21], [22]), the LR-graph exploits both global and local properties of the data, and is thus better at capturing the global structure of all the data. Experimental results show that r-SSLDA significantly outperforms classic unsupervised LDA and achieves competitive performance compared with fully supervised LDA while using far fewer labeled images.
The rest of this paper is organized as follows. In Section 2, we detail the construction of the low rank graph. In Section 3, we introduce the regularized semi-supervised LDA framework. Experiments and result analysis follow in Section 4. Section 5 concludes the paper.
Low rank graph construction
Let X = [x_1, x_2, ..., x_n] ∈ R^(d×n) be a set of data points drawn from a manifold M. Each column of X is a data point in R^d. Since the manifold is unknown, we construct a graph from these data points to approximate it. Let G = (V, E) be a graph, where V = {v_1, ..., v_n} is the set of graph vertices (node v_i corresponds to data point x_i), and E is the set of graph edges, associated with a weight matrix W ∈ R^(n×n). For any two neighboring nodes v_i and v_j, W_ij ≠ 0 if they are connected with an edge, otherwise W_ij = 0.
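The snippet stops before the weight definition. In the low rank graph of [17], the weights are typically obtained from the low-rank representation (LRR) problem; the following is a sketch based on the LRR literature, with the trade-off parameter λ and the noise term E being the standard choices rather than values quoted from this paper:

```latex
\min_{Z,\,E} \; \|Z\|_{*} + \lambda \|E\|_{2,1}
\quad \text{s.t.} \quad X = XZ + E,
```

where ||Z||_* is the nuclear norm (the sum of singular values). Given the minimizer Z*, a symmetric weight matrix is formed as W_ij = (|Z*_ij| + |Z*_ji|)/2, so each weight reflects how strongly x_j participates in a globally consistent reconstruction of x_i, rather than depending only on a local pairwise distance as in a kNN graph.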
Framework of regularized semi-supervised LDA
Given an image set X = {x_1, ..., x_n} and a label set C, the first l images are labeled and the others are unlabeled. Let Y = [y_1, ..., y_n] be the label vector of all images. For a labeled image x_i (i ≤ l), y_i is set to one of the elements of C. For an unlabeled image x_i (i > l), y_i can be any value outside C. To simplify our discussion, this paper only considers binary classification with C = {1, −1}. In this case, y_i is set to 1 for positive labeled images and −1 for negative ones.
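Although the snippet is truncated, a graph-regularized topic-model objective of the kind described in Section 1 typically takes the following form (the notation here is illustrative, not quoted verbatim from the paper):

```latex
\max_{\Theta} \; \mathcal{L}(X;\Theta)
\;-\; \frac{\lambda}{2} \sum_{i,j=1}^{n} W_{ij}
\left\| \, p(z \mid x_i) - p(z \mid x_j) \, \right\|^{2},
```

where L(X; Θ) is the data log-likelihood of the (supervised) LDA model, p(z | x_i) is the latent topic distribution of image x_i, W is the low rank graph weight matrix from Section 2, and λ balances the two terms. Images that are strongly connected in the graph are thereby encouraged to have similar topic distributions, which is how label information propagates to the unlabeled images.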
Data preparation
The datasets used in this paper were Caltech 101 and Caltech 256, two popular image datasets in the image classification literature. Compared with Caltech 101, Caltech 256 is more challenging because it contains more complex clutter. In our experiments, only 10 categories were selected, and 200 images were randomly selected from each category: 100 images for training and 100 for testing. Specifically, we chose five categories (leopard, motorbike, watch, airplane and face) from Caltech 101.
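The per-category split described above is straightforward; a minimal sketch follows (the function name and seed handling are hypothetical, not from the paper):

```python
import random

def split_category(image_ids, n_train=100, n_test=100, seed=0):
    """Randomly draw a fixed-size, disjoint train/test split from one category."""
    rng = random.Random(seed)
    sample = rng.sample(list(image_ids), n_train + n_test)
    return sample[:n_train], sample[n_train:]
```

Fixing the seed makes the random selection reproducible across runs, which matters when comparing models on the same split.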
Conclusion
In this work, we developed a novel regularized Semi-Supervised Latent Dirichlet Allocation (r-SSLDA) model for visual concept learning. r-SSLDA considers both semi-supervised properties and topic models simultaneously in a regularization framework. We also introduced the low rank graph into the framework to improve performance. Experiments on Caltech 101 and Caltech 256 showed that r-SSLDA can effectively utilize both labeled and unlabeled images, achieving competitive performance compared with fully supervised LDA while requiring far fewer labeled images.
Acknowledgments
We would like to thank Dr. Yi Ma (Microsoft Research Asia) for his helpful conversations about sparse representation and low rank representation. We also thank anonymous reviewers for their constructive comments. This work is partially supported by the National Science Foundation of China (No. 60933013, No. 61103134), the National Science and Technology Major Project (No. 2010ZX03004-003), the Fundamental Research Funds for the Central Universities (WK210023002, WK2101020003), and the Science
References (33)
- et al., Summarizing tourist destinations by mining user-generated travelogues and photos, Comput. Vis. Image Understanding (2011)
- J. Tang, S. Yan, R. Hong, G. Qi, T. Chua, Inferring semantic concepts from community-contributed images and noisy tags, ...
- et al., Image annotation by graph-based inference with integrated multiple/single instance representations, IEEE Trans. Multimedia (2010)
- et al., Correlative linear neighborhood propagation for video annotation, IEEE Trans. Syst. Man Cybern. Part B (2009)
- et al., Image categorization by learning and reasoning with regions, J. Mach. Learn. Res. (2004)
- R. Fergus, F.-F. Li, P. Perona, A. Zisserman, Learning object categories from Google's image search, in: IEEE ...
- et al., Human action recognition by semi-latent topic models, IEEE Trans. Pattern Anal. Mach. Intell. (Special Issue on Probabilistic Graphical Models in Computer Vision) (2009)
- C. Wang, D. Blei, F.-F. Li, Simultaneous image classification and annotation, in: Proceedings of IEEE Computer Society ...
- L. Zhuang, L. She, Y. Jiang, K. Tang, N. Yu, Image classification via semi-supervised pLSA, in: Proceedings of the ...
- et al., Travelogue enriching and scenic spot overview based on textual and visual topic model, Int. J. Pattern Recognition Artif. Intell. (2011)
- D.M. Blei, A.Y. Ng, M.I. Jordan, Latent Dirichlet allocation, J. Mach. Learn. Res. (2003)
- D.M. Blei, J.D. McAuliffe, Supervised topic models, Adv. Neural Inf. Process. Syst. (2007)
Liansheng Zhuang received the B.Sc. degree and Ph.D. degree from the University of Science and Technology of China (USTC), in 2001 and 2006, respectively. He is now a Lecturer in the School of Information Science and Technology, USTC. His current research interests include computer vision, image & video retrieval, and machine learning. He is a member of the IEEE and ACM.
Haoyuan Gao received the B.Sc. degree from University of Science and Technology of China (USTC) in 2009. He is currently working toward the Master degree from USTC. His research interests include computer vision, image & video retrieval, and machine learning.
Jiebo Luo received the Ph.D. degree from the University of Rochester in 1995. He is a Professor in CS Department, University of Rochester since Fall 2011. Before that he was a Senior Principal Scientist leading research and advanced development at Kodak Research Laboratories, Rochester, New York. His research spans image processing, computer vision, machine learning, data mining, medical imaging, and ubiquitous computing. He has authored more than 150 technical papers and holds 50 US patents. He has been involved in numerous technical conferences, including serving as the program co-chair of ACM Multimedia 2010 and IEEE CVPR 2012. He is the Editor-in-Chief of the Journal of Multimedia, and has served on the editorial boards of the IEEE Transactions on Pattern Analysis and Machine Intelligence, IEEE Transactions on Multimedia, IEEE Transactions on Circuits and Systems for Video Technology, Pattern Recognition, Machine Vision and Applications, and Journal of Electronic Imaging. He is a Fellow of the SPIE, IEEE, and IAPR.
Zhouchen Lin received the Ph.D. degree in Applied Mathematics from Peking University in 2000. He is currently a Full Professor in Peking University. He is also now a Guest Professor to Beijing Jiaotong University, Southeast University and Shanghai Jiaotong University. He is also a Guest Researcher to Institute of Computing Technology, Chinese Academy of Sciences. His research interests include computer vision, computer graphics, image processing, pattern recognition, and machine learning.