Multiclass support vector classification via coding and regression
Introduction
Over the last decade, there has been great interest in, and successful development of, support vector machines (SVMs) for classification [7], [5], [11]. SVMs were originally designed for binary classification. There are two common multiclass extensions of SVMs. One is the composition-type methods built upon a series of binary classifiers, e.g., one-against-one, one-against-rest and error-correcting output codes [12], [1], [8], [10], [18]; the other is the single-machine-type methods, often huge and solved in one optimization formulation [37], [4], [9], [24], [28]. Comparison studies and discussions on compositions of binary classifiers versus single-machine approaches can be found in [22], [33]. Based on their findings, there is no universally dominant classification rule for multiclass problems, and different methods have their own merits and advantages. This leaves room for the exploration of alternative approaches.
In this article, we propose an alternative approach based on the regression concept. The time-honored Fisher linear discriminant analysis (FLDA) separates data classes by projecting input attributes onto the eigen-space of the between-class covariance (scaled by the within-class covariance). In the binary classification case, FLDA can be solved via a multiresponse regression [14], [30], [2] by encoding the binary class labels into numeric responses n/n1 and -n/n2, where n1 and n2 are the class sizes and n = n1 + n2. Hastie et al. [20] further extended FLDA to nonlinear and multiclass classification via a penalized regression setting, which they named "flexible discriminant analysis (FDA)". FDA encodes the class labels into response scores and then fits the scores by a nonparametric regression technique, such as MARS (multivariate adaptive regression splines) or neural networks. A particular encoding scheme, "optimal scoring", is proposed in FDA. Later, [31], [34] adopted the same FDA approach but replaced the conventional nonparametric regression technique with a kernel trick; their approach is named kernel discriminant analysis (KDA). The KDA of [34] is solved by an EM algorithm.
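The binary FLDA-via-regression connection can be checked numerically. The following sketch (on hypothetical toy data, with illustrative variable names) fits ordinary least squares to the coded responses n/n1 and -n/n2 and verifies that the resulting coefficient vector points in the same direction as the classical Fisher discriminant:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian classes with shared covariance (hypothetical toy data).
n1, n2 = 60, 40
X = np.vstack([rng.normal([0, 0], 1.0, (n1, 2)),
               rng.normal([2, 1], 1.0, (n2, 2))])
n = n1 + n2

# Encode the binary labels as numeric responses n/n1 and -n/n2.
y = np.concatenate([np.full(n1, n / n1), np.full(n2, -n / n2)])

# Ordinary least squares fit (with intercept) of the coded responses.
A = np.hstack([np.ones((n, 1)), X])
beta = np.linalg.lstsq(A, y, rcond=None)[0]
w_reg = beta[1:]                          # regression direction

# Classical Fisher direction from the pooled within-class covariance.
mu1, mu2 = X[:n1].mean(0), X[n1:].mean(0)
Xc = np.vstack([X[:n1] - mu1, X[n1:] - mu2])
Sw = Xc.T @ Xc / n
w_fisher = np.linalg.solve(Sw, mu1 - mu2)

# The two directions coincide up to a positive scalar.
cosine = w_reg @ w_fisher / (np.linalg.norm(w_reg) * np.linalg.norm(w_fisher))
print(round(float(cosine), 4))
```

The agreement is exact in-sample: with any two-valued coding of the responses, the normal equations reduce to a total-covariance system whose solution is proportional to the Fisher direction.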
Inspired by the above-mentioned regression approaches and the successful development of SVMs, we combine the two ideas and adopt multiresponse support vector regression (mSVR) for the multiclass classification problem. Some preceding works on support vector classification can also be interpreted as ridge regression applied to classification problems, e.g., the proximal SVM [16], [17] and the regularized least squares SVM [36]. Our mSVR for classification consists of three major steps. The first is to encode the class labels into multiresponse scores. Next, a regression fit of the scores on kernel-transformed feature inputs is carried out to extract a low-dimensional discriminant feature subspace. The final step is a classification rule that transforms (i.e., decodes) the mapped values in this low-dimensional discriminant subspace back to class labels. Standard kernel Fisher discriminant analysis, and also the SVM, solve classification problems in a high-dimensional feature space. Through the mSVR, we extract the information in the attributes into a low-dimensional feature subspace (of dimension at most J - 1, where J is the number of classes), which accelerates classification training and reduces the influence of noise. We will give a unified view of different coding schemes. The low-dimensional discriminant feature subspaces generated by different coding schemes with long enough code length are identical, which introduces the notion of equivalence of codes. We prove this equivalence theoretically and also confirm it in our numerical experiments. The regression step can be viewed as feature extraction that makes the final classification (decoding) step computationally light and easy. The nonlinear structure among different classes of data patterns is embedded in the extracted features through the kernel transformation. Thus, we can apply any feasible linear learning algorithm to the data images in this low-dimensional discriminant feature subspace.
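The three steps above (encode, regress, decode) can be sketched as follows. This is a minimal illustration on hypothetical toy data, using a plain ridge-regularized least-squares fit in place of the paper's mSVR solver and a simple nearest-class-mean rule as the linear decoder; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy 3-class data (hypothetical): Gaussian blobs in 2-D.
means = np.array([[0, 0], [3, 0], [0, 3]])
X = np.vstack([rng.normal(m, 0.5, (30, 2)) for m in means])
y = np.repeat(np.arange(3), 30)

def rbf_kernel(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Step 1: encode class labels into multiresponse scores (one-of-J coding).
Y = np.eye(3)[y]

# Step 2: regularized least-squares regression of scores on kernel inputs.
K = rbf_kernel(X, X)
lam = 1e-2
C = np.linalg.solve(K + lam * np.eye(len(X)), Y)   # coefficient variates

# Mapped values in the low-dimensional discriminant subspace.
Z = K @ C

# Step 3: decode with a simple linear rule, here nearest class mean in Z-space.
centers = np.array([Z[y == j].mean(0) for j in range(3)])

def predict(Xnew):
    Zn = rbf_kernel(Xnew, X) @ C
    d2 = ((Zn[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return d2.argmin(1)

acc = (predict(X) == y).mean()
print(acc)
```

Because the kernel transformation has already absorbed the nonlinear class structure into Z, any linear classifier (nearest mean, LDA, a linear SVM) can serve as the decoder in step 3.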
Numerical tests and comparisons show that our framework, a regression setup combined with a linear discriminant algorithm, is an easy yet competent alternative for the multiclass classification problem.
The rest of the article is organized as follows. In Section 2, we introduce the framework of mSVR for classification. We describe the major steps of our regression approach, including kernelization, encoding of the class labels, a regularized least-squares-based mSVR algorithm and the principle for decoding. In Section 3, we develop some notions and properties of equivalent codes and scores in the discriminant analysis context. In Section 4, implementation issues including model selection and the choice of base classifiers are discussed. Experimental results are provided to demonstrate the efficiency of our proposal and to illustrate numerical properties of different coding schemes. Concluding remarks are in Section 5. All proofs are in the Appendix.
Section snippets
Classification by multiresponse regression
Consider the problem of multiclass classification with J classes based on d measurements of input attributes x ∈ R^d. Denote the membership set by Y = {1, ..., J} and each individual membership by y ∈ Y. Suppose we have training data {(x_i, y_i), i = 1, ..., n}. Our goal is to construct a classification rule which, given a new input, can correctly predict the associated class label of this new input. Aside from various support vector approaches mentioned in Section 1 originating from the machine learning
Encoding and equivalence class of codes
In this section we introduce several existing coding schemes to encode the class labels. We also unify them under the notions of equivalence of codes and scores in the context of discriminant analysis. We refer the reader to [21] for general theory of discrete codes and to [8], [10] for continuous codes.
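A small numerical illustration of the equivalence notion (the paper's general definition is broader): the one-of-J ("one-hot") code and the one-vs-rest +/-1 code are affine transforms of one another, so after centering their score columns span the same subspace, and a regression fit of either code extracts the same discriminant subspace. The helper name below is illustrative:

```python
import numpy as np

J = 4
# One-of-J ("one-hot") code: row j is the score vector for class j.
S_onehot = np.eye(J)
# One-vs-rest +/-1 code: an affine transform of the one-hot code.
S_ovr = 2 * np.eye(J) - 1

def centered_colspace_proj(S):
    # Orthogonal projection onto the span of the centered score columns.
    Sc = S - S.mean(0)
    U, s, _ = np.linalg.svd(Sc)
    r = int((s > 1e-10).sum())
    U = U[:, :r]
    return U @ U.T

P1 = centered_colspace_proj(S_onehot)
P2 = centered_colspace_proj(S_ovr)
same = bool(np.allclose(P1, P2))
print(same)
```

Here the centered one-vs-rest code is exactly twice the centered one-hot code, so the two projection matrices agree; both codes lead to the same (J - 1)-dimensional discriminant subspace.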
Experimental study
Data sets and experimental setting. The following data sets are used for the experimental study: ionosphere, Iris, wine, glass, segment, image, dna, satimage and pendigits, where ionosphere, Iris, wine, glass, image and pendigits are from the UCI Repository of machine learning databases [3], and segment, dna and satimage are from the UCI statlog collection. Data structures are characterized in Table 1. For data sets without a given training/testing split, we divide them into 10 folds for
Concluding remarks
In this article the mSVR is proposed for the multiclass classification problem. The class labels are encoded into multiresponse scores and then the regression of scores on kernel inputs is used to extract a low-dimensional discriminant feature subspace, which is spanned by the regression coefficient variates. The discriminant feature subspace generated by different coding schemes with long enough code length will be identical, which introduces the notion of equivalence of codes. Data are then
Acknowledgments
The authors thank Chuhsing Kate Hsiao and Chii-Ruey Hwang for helpful comments.
Pei-Chun Chen received her B.S. degree from National Dong-Hwa University in 2001, and the M.S. and Ph.D. degrees from Graduate Institute of Epidemiology, National Taiwan University in 2003 and 2008, respectively. Currently, she is a Post-doctoral Fellow in Bioinformatics and Biostatistics Core, Research Center for Medical Excellence, National Taiwan University. Her research interests are in biostatistics, Bayesian statistics, bioinformatics and machine learning.
References (37)
- et al., Model selection for support vector machines via uniform design, Comput. Stat. Data Anal. (2007)
- et al., A novel and quick SVM-based multi-class classifier, Pattern Recognition (2006)
- et al., Reducing multiclass to binary: a unifying approach for margin classifiers, J. Mach. Learn. Res. (2001)
- An Introduction to Multivariate Statistical Analysis (2003)
- A. Asuncion, D.J. Newman, UCI Machine Learning Repository, University of California, Irvine, School of Information and...
- et al., Multicategory classification by support vector machines, Comput. Optim. Appl. (1999)
- A tutorial on support vector machines for pattern recognition, Data Min. Knowl. Disc. (1998)
- C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, 2001,...
- et al., Support vector networks, Mach. Learn. (1995)
- K. Crammer, Y. Singer, Improved output coding for classification using continuous relaxation, in: Proceedings of the...
- On the algorithmic implementation of multiclass kernel-based vector machines, J. Mach. Learn. Res.
- On the learnability and design of output codes for multiclass problems, Mach. Learn.
- An Introduction to Support Vector Machines
- Solving multiclass learning problems via error-correcting output codes, J. Artif. Intell. Res.
- Number-Theoretic Methods in Statistics
- The use of multiple measurements in taxonomic problems, Ann. Eugen.
- Multivariate adaptive regression splines (with discussion), Ann. Stat.
Kuang-Yao Lee received the B.S. and M.S. degrees in the Mathematics Department of National Taiwan University, in 2002 and 2005. Currently a Ph.D. student in the Department of Statistics in Pennsylvania State University. Research interests include linear and nonlinear dimensionality reduction methods and machine learning.
Tsung-Ju Lee received a B.S. degree in Mathematics from TungHai University, Taiwan in 2000 and the M.S. degree in Applied Mathematics from National Chiao Tung University, Taiwan in 2002. Currently, he is working towards the Ph.D. degree in the Department of Computer Science, National Chiao Tung University, Taiwan. His current research interests include machine learning, data mining and various applications, especially in network security, e-learning and computational biology.
Yuh-Jye Lee received his Master degree in Applied Mathematics from the National Tsing Hua University, Taiwan in 1992 and Ph.D. degree in computer sciences from the University of Wisconsin, Madison in 2001. He is an associate professor in the Department of Computer Science and Information Engineering, National Taiwan University of Science and Technology. His research interests are in machine learning, data mining, optimization, information security and operations research. Dr. Lee developed new algorithms for large data mining problems such as classification and regression problem, abnormal detection and dimension reduction. Using the methodologies such as support vector machines, reduced kernel method, chunking and smoothing techniques allow us to get a very robust solution (prediction) for a large dataset. These methods have been applied to solve many real world problems such as intrusion detection system (IDS), face detection, microarray gene expression analysis and breast cancer diagnosis and prognosis.
Su-Yun Huang received the B.S. and M.S. degrees from Department of Mathematics, National Taiwan University, in 1983 and 1985, respectively, and the Ph.D. degree from Department of Statistics, Purdue University in 1990. She is currently a Research Fellow in the Institute of Statistical Science, Academia Sinica, Taiwan. Her research interests are mainly on mathematical statistics.