Abstract
In Support Vector Machines (SVMs), the solution of the classification problem is characterized by a (convex) quadratic programming (QP) problem. In a modified version of SVMs, called Least Squares SVM classifiers (LS-SVMs), a least squares cost function is used instead, so that the dual problem reduces to a linear set of equations. While the SVM classifier has a large margin interpretation, in this paper the LS-SVM formulation is related to a ridge regression approach for classification with binary targets and to Fisher's linear discriminant analysis in the feature space. Multiclass categorization problems are represented by a set of binary classifiers using different output coding schemes. While regularization is used to control the effective number of parameters of the LS-SVM classifier, the sparseness property of SVMs is lost due to the choice of the 2-norm. Sparseness can be imposed in a second stage by gradually pruning the support value spectrum and optimizing the hyperparameters during the sparse approximation procedure. In this paper, twenty public domain benchmark datasets are used to evaluate the test set performance of LS-SVM classifiers with linear, polynomial and radial basis function (RBF) kernels. Both the SVM and the LS-SVM classifier with RBF kernel, combined with standard cross-validation procedures for hyperparameter selection, achieve comparable test set performances. These SVM and LS-SVM performances are consistently very good when compared to a variety of methods described in the literature, including decision tree based algorithms, statistical algorithms and instance based learning methods. We show on ten UCI datasets that the LS-SVM sparse approximation procedure can be applied successfully.
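The central computational point in the abstract, that replacing the SVM's inequality constraints with equality constraints and a least squares cost turns the dual problem into a linear system rather than a QP, can be made concrete in a few lines of code. The sketch below is a minimal NumPy rendering of the standard LS-SVM classifier with an RBF kernel: training solves an (N+1)-dimensional linear system in the bias b and the support values alpha, and classification takes the sign of the resulting kernel expansion. The function names and the bandwidth parameterization sigma are illustrative choices, not code from the paper.

```python
import numpy as np

def rbf_kernel(X1, X2, sigma):
    """RBF kernel K(x, z) = exp(-||x - z||^2 / sigma^2)."""
    sq_dists = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / sigma ** 2)

def lssvm_train(X, y, gamma, sigma):
    """Solve the LS-SVM dual: a linear system instead of a QP.

    [ 0    y^T             ] [ b     ]   [ 0 ]
    [ y    Omega + I/gamma ] [ alpha ] = [ 1 ]

    with Omega_ij = y_i * y_j * K(x_i, x_j) and labels y_i in {-1, +1}.
    """
    N = len(y)
    Omega = np.outer(y, y) * rbf_kernel(X, X, sigma)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = y
    A[1:, 0] = y
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], np.ones(N)))
    solution = np.linalg.solve(A, rhs)
    return solution[0], solution[1:]  # bias b, support values alpha

def lssvm_predict(X_train, y_train, alpha, b, X_test, sigma):
    """Classify with y_hat(x) = sign(sum_i alpha_i y_i K(x, x_i) + b)."""
    K = rbf_kernel(X_test, X_train, sigma)
    return np.sign(K @ (alpha * y_train) + b)
```

Because essentially every training point receives a nonzero support value alpha_i, the solution is dense; the sparse approximation procedure the abstract describes would sit on top of such a routine, repeatedly dropping the points with the smallest |alpha_i| from the training set and re-solving the smaller system while re-tuning the hyperparameters gamma and sigma.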
Cite this article
van Gestel, T., Suykens, J.A., Baesens, B. et al. Benchmarking Least Squares Support Vector Machine Classifiers. Machine Learning 54, 5–32 (2004). https://doi.org/10.1023/B:MACH.0000008082.80494.e0