Energy-based model of least squares twin Support Vector Machines for human action recognition
Introduction
Human action recognition is an important research area in computer vision and pattern recognition. It has a wide range of applications, such as surveillance systems, human-computer interaction, video retrieval, and gesture recognition. In the past decade, with the growth in video quality and personal video recording, the need for automatic video analysis and event recognition has increased. The difficulty of human action recognition stems from several challenges, such as illumination changes, partial occlusions, and intra-class differences [1].
Recently, the Bag of Words (BoW) representation combined with the support vector machine (SVM) has attracted much interest for human action recognition [2], [3], [4]. In this approach, feature descriptors are extracted from all the training sequences, and a codebook is built by clustering similar features. The cluster centroids, called video words, are the members of this codebook. Each feature descriptor is assigned to a certain video word (cluster centroid). An action video is then represented as a histogram of the number of occurrences of each video word. Finally, classification methods are used to build a model for each action class.
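The codebook and histogram steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the paper: the clustering is a plain k-means written with NumPy, the descriptors are random toy vectors, and the names build_codebook, assign_words, and bow_histogram are hypothetical.

```python
import numpy as np

def assign_words(descriptors, centroids):
    """Index of the nearest centroid (video word) for each descriptor."""
    d2 = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def build_codebook(descriptors, k, n_iter=20, seed=0):
    """Cluster descriptors with plain k-means; the centroids are the video words."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(descriptors), size=k, replace=False)
    centroids = descriptors[start].copy()
    for _ in range(n_iter):
        labels = assign_words(descriptors, centroids)
        for j in range(k):
            members = descriptors[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids

def bow_histogram(descriptors, centroids):
    """Count how often each video word occurs in one video."""
    words = assign_words(descriptors, centroids)
    return np.bincount(words, minlength=len(centroids))

# Toy data (assumption): 3 "videos" with 40, 55, and 30 random 8-D descriptors.
rng = np.random.default_rng(1)
videos = [rng.normal(size=(n, 8)) for n in (40, 55, 30)]
codebook = build_codebook(np.vstack(videos), k=5)
histograms = np.array([bow_histogram(v, codebook) for v in videos])
```

Each row of histograms is the fixed-length representation of one video, regardless of how many descriptors that video produced, which is what allows a standard classifier to be trained on it.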
The support vector machine was originally proposed by Cortes and Vapnik [5] for binary classification. SVM has been successfully applied in a wide spectrum of research areas such as face recognition, object categorization, and biomedicine [6], [7], [8], [9]. The computational complexity of SVM is O(l^3), where l denotes the total size of the training data. This cost restricts the application of SVM to large-scale problem domains. Moreover, since the optimal hyperplane obtained by SVM depends on only a small subset of the samples (the support vectors), it is very sensitive to outliers and noisy samples. Finally, multi-category classification of human actions is usually done by solving many one-versus-rest binary SVM classification tasks. Each binary SVM is trained with all of the patterns, which easily leads to a class imbalance problem.
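The imbalance introduced by the one-versus-rest protocol can be made concrete with a small relabeling example. The class count and the number of videos per class below are hypothetical values chosen for illustration, and one_vs_rest_labels is an illustrative helper, not a function from the paper.

```python
import numpy as np

def one_vs_rest_labels(y, positive_class):
    """Relabel a multi-class vector as +1 (chosen class) and -1 (all others)."""
    return np.where(y == positive_class, 1, -1)

# Hypothetical balanced problem: 6 action classes, 100 videos each.
y = np.repeat(np.arange(6), 100)
for k in range(6):
    yk = one_vs_rest_labels(y, k)
    n_pos = int((yk == 1).sum())
    n_neg = int((yk == -1).sum())
    print(f"class {k}: {n_pos} positive vs {n_neg} negative")
```

Even though the multi-class data are perfectly balanced, every binary task sees one positive sample for every five negative ones, and the ratio worsens as the number of classes grows.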
To deal with these issues, we propose a fast classifier for activity recognition based on Twin Support Vector Machines (TSVM). TSVM was proposed by Jayadeva et al. [10] for binary classification. This method generates two nonparallel hyperplanes by solving two smaller-sized Quadratic Programming Problems (QPPs) such that each hyperplane is close to one class and as far as possible from the other. Solving two smaller-sized QPPs rather than a single larger-sized QPP, as in SVM, makes the learning of TSVM approximately four times faster than the conventional SVM [10]: under a cubic cost in the problem size, two problems of about half the size cost roughly 2(l/2)^3 = l^3/4. Least Squares Twin Support Vector Machine (LS-TSVM) [11] extends TSVM by using a squared loss function instead of the hinge loss, which replaces the convex QPPs in TSVM with systems of linear equations. This formulation leads to an extremely simple and fast algorithm. In the proposed energy-based LS-TSVM (ELS-TSVM), the constraints of LS-TSVM are converted to an energy model, which can reduce the adverse effects of noisy data and outliers. In addition, under the one-versus-rest protocol of ELS-TSVM for multi-class classification, imbalanced datasets do not affect the model learning.
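To illustrate why the least squares formulation is fast, the following is a minimal sketch of the standard linear LS-TSVM of [11], in which each hyperplane is obtained from one linear solve. It is not the ELS-TSVM proposed in this paper, whose energy-based constraints are introduced in Section 3.2. The function names, the default penalty values c1 and c2, the small ridge term reg, and the toy data are all assumptions made for the example.

```python
import numpy as np

def fit_lstsvm(A, B, c1=1.0, c2=1.0, reg=1e-8):
    """Linear LS-TSVM: one linear system per hyperplane.

    A holds the class +1 samples (m1 x n), B the class -1 samples (m2 x n).
    Returns ((w1, b1), (w2, b2)).
    """
    E = np.hstack([A, np.ones((A.shape[0], 1))])  # augmented matrix [A e]
    F = np.hstack([B, np.ones((B.shape[0], 1))])  # augmented matrix [B e]
    ridge = reg * np.eye(E.shape[1])  # assumption: tiny ridge for numerical stability
    # Plane 1: x.w1 + b1 close to 0 on class +1 and close to -1 on class -1.
    u1 = -np.linalg.solve(F.T @ F + (E.T @ E) / c1 + ridge, F.T @ np.ones(F.shape[0]))
    # Plane 2: x.w2 + b2 close to 0 on class -1 and close to +1 on class +1.
    u2 = np.linalg.solve(E.T @ E + (F.T @ F) / c2 + ridge, E.T @ np.ones(E.shape[0]))
    return (u1[:-1], u1[-1]), (u2[:-1], u2[-1])

def predict_lstsvm(X, plane1, plane2):
    """Assign each sample to the class whose hyperplane is nearer."""
    (w1, b1), (w2, b2) = plane1, plane2
    d1 = np.abs(X @ w1 + b1) / np.linalg.norm(w1)
    d2 = np.abs(X @ w2 + b2) / np.linalg.norm(w2)
    return np.where(d1 <= d2, 1, -1)

# Toy data (assumption): two well-separated Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
A = rng.normal(loc=2.0, scale=0.5, size=(50, 2))
B = rng.normal(loc=-2.0, scale=0.5, size=(50, 2))
plane1, plane2 = fit_lstsvm(A, B)
X = np.vstack([A, B])
y_true = np.concatenate([np.ones(50, dtype=int), -np.ones(50, dtype=int)])
accuracy = float((predict_lstsvm(X, plane1, plane2) == y_true).mean())
```

Both systems have only n + 1 unknowns, where n is the feature dimension, so in this linear case the training cost is dominated by forming the two small matrices rather than by an iterative QPP solver.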
The paper is organized as follows: we first review the related work in Section 2. In Section 3, we describe the proposed human action recognition framework and introduce the ELS-TSVM. In Section 4, the experimental results on common datasets are given. Finally, Section 5 contains concluding remarks.
Related work and background
A comprehensive review of human action recognition approaches can be found in survey papers such as [1], [12], [13], [14]. In general, feature representations of video sequences can be divided into two categories: top-down (global) [15], [16] and bottom-up (local) [17], [18], [19] representations. The global strategy first localizes the region of the person in the video by background subtraction, and then represents the region of interest as a whole. In this way, global
Human action recognition framework
In this section, each step of the proposed human action recognition framework is described in detail. The action representation is described in Section 3.1. The proposed ELS-TSVM classification algorithm is presented in Section 3.2. Finally, ELS-TSVM is discussed in Section 3.3. The framework of the proposed action recognition method is illustrated in Fig. 1.
Experimental results
In this section, ELS-TSVM is applied to human action recognition. We compare our ELS-TSVM method with other related methods on the Weizmann, KTH, and Hollywood action datasets. Figs. 2 and 4 provide some sample frames of the action datasets. The authors of [29] reported accuracy differences of up to 10.67% when different validation approaches were applied to the same data. In our experiments, the leave-one-person-out cross-validation approach
Conclusion
In this paper, we have extended the LS-TSVM classifier to an energy-based model called ELS-TSVM for human action recognition. An energy for each hyperplane in ELS-TSVM has been introduced to make the classifier flexible in the face of outliers of each action. Unlike SVM, which uses a single hyperplane, the ELS-TSVM classifier performs classification using two non-parallel hyperplanes. The proposed framework addresses some pitfalls of previous SVM-based action recognition frameworks, such as
Acknowledgment
This research is partially supported by ITRC (Iran Telecommunication Research Center) under contract no. 6979/500.
References (41)
- et al., Human action recognition based on boosted feature selection and naive Bayes nearest-neighbor classification. Signal Process. (2013)
- et al., Human action recognition with salient trajectories. Signal Process. (2013)
- et al., Exploring trace transform for robust human action recognition. Pattern Recognit. (2013)
- et al., Fast multi-view segment graph kernel for object classification. Signal Process. (2013)
- et al., Development of entropy based algorithm for cardiac beat detection in 12-lead electrocardiogram. Signal Process. (2007)
- et al., Sparse coding and classifier ensemble based multi-instance learning for image categorization. Signal Process. (2013)
- et al., Improving verification accuracy by synthesis of locally enhanced biometric images and deformable model. Signal Process. (2007)
- et al., A survey of vision-based methods for action representation, segmentation and recognition. Comput. Vis. Image Underst. (2011)
- A survey on vision-based human action recognition. Image Vis. Comput. (2010)
- et al., Transform based spatio-temporal descriptors for human action recognition. Neurocomputing (2011)