
Neurocomputing

Volume 70, Issues 1–3, December 2006, Pages 93-104

Developing parallel sequential minimal optimization for fast training support vector machine

https://doi.org/10.1016/j.neucom.2006.05.007

Abstract

A parallel version of sequential minimal optimization (SMO) is developed in this paper for fast training of support vector machines (SVMs). SMO is one of the most popular algorithms for training SVMs, but it still requires a large amount of computation time for solving large problems. The parallel SMO is developed based on the message passing interface (MPI). Unlike the sequential SMO, which handles all the training data points using one CPU processor, the parallel SMO first partitions the entire training data set into smaller subsets and then simultaneously runs multiple CPU processors to deal with each of the partitioned data sets. Experiments show that there is great speedup on the adult data set, the MNIST data set and the IDEVAL data set when many processors are used. There are also satisfactory results on the Web data set. This work is very useful for research where a machine with multiple CPU processors is available.

Introduction

Recently, a great deal of research has been done on support vector machines (SVMs), mainly due to their impressive generalization performance in solving various machine learning problems [1], [2], [15]. Given a set of data points $\{(X_i, y_i)\}_{i=1}^{l}$ ($X_i \in R^d$ is the input vector of the $i$th training data pattern, $y_i \in \{-1, 1\}$ is its class label, and $l$ is the total number of training data patterns), training an SVM in classification is equivalent to solving the following linearly constrained convex quadratic programming (QP) problem:

maximize: $R(\alpha_i) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(X_i, X_j)$  (1)

subject to: $\sum_{i=1}^{l} \alpha_i y_i = 0$, $0 \leqslant \alpha_i \leqslant C$, $i = 1, \ldots, l$,  (2)

where $K(X_i, X_j)$ is the kernel function. The most widely used kernel is the Gaussian function $e^{-\|X_i - X_j\|^2 / \sigma^2}$, where $\sigma^2$ is the width of the Gaussian kernel. $\alpha_i$ is the Lagrange multiplier to be optimized; one $\alpha_i$ is associated with each training data pattern. $C$ is the regularization constant pre-determined by users. After solving the QP problem (1), the following decision function is used to determine the class label of a new data pattern:

$f(X) = \sum_{i=1}^{l} \alpha_i y_i K(X_i, X) - b,$  (3)

where $b$ is obtained from the solution of (1). A detailed description of Eq. (3) can be found in our previous work [7].
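To make Eq. (3) concrete, the sketch below evaluates the decision function with the Gaussian kernel for a new pattern X. It is only an illustration: the function and variable names (gaussian_kernel, decision_function, Xtrain, etc.) are hypothetical and not taken from the paper's implementation.

```c
/* Illustrative evaluation of the decision function (3) with a Gaussian
 * kernel; not taken from the paper's implementation. */
#include <math.h>

/* Gaussian kernel K(a, b) = exp(-||a - b||^2 / sigma^2) for d-dimensional inputs */
static double gaussian_kernel(const double *a, const double *b, int d, double sigma2)
{
    double dist2 = 0.0;
    for (int k = 0; k < d; k++) {
        double diff = a[k] - b[k];
        dist2 += diff * diff;
    }
    return exp(-dist2 / sigma2);
}

/* f(X) = sum_i alpha_i y_i K(X_i, X) - b; the predicted class label is sign(f(X)). */
double decision_function(const double *X, const double *const *Xtrain,
                         const double *alpha, const double *y,
                         int l, int d, double sigma2, double b)
{
    double f = -b;
    for (int i = 0; i < l; i++)
        if (alpha[i] > 0.0)              /* only support vectors contribute */
            f += alpha[i] * y[i] * gaussian_kernel(Xtrain[i], X, d, sigma2);
    return f;
}
```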

So the main problem in training an SVM reduces to solving the QP problem (1), where the number of variables $\alpha_i$ to be optimized equals the number of training data patterns $l$. For small problems, standard QP techniques such as the projected conjugate gradient can be applied directly. For large problems, however, standard QP techniques are impractical: they require a large amount of computer memory to store the kernel matrix $K$, whose number of elements equals the square of the number of training data patterns.
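To get a feel for this memory requirement, consider the MNIST training set used later in the experiments, which contains 60,000 patterns; assuming 8 bytes (double precision) per kernel entry, an illustrative figure, the full matrix would need roughly:

```latex
% Storage for the full kernel matrix K at l = 60,000 training patterns,
% assuming 8 bytes (double precision) per entry -- an illustrative assumption.
l^2 = 60\,000^2 = 3.6 \times 10^{9} \text{ entries}, \qquad
3.6 \times 10^{9} \times 8 \text{ bytes} \approx 28.8 \text{ GB}.
```

This is far beyond the memory of typical machines, which is why the methods below avoid forming $K$ explicitly.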

To make SVM training practical, special algorithms have been developed, such as Vapnik's chunking [14], Osuna's decomposition [10] and Joachims's SVMlight [8]. They make the training of SVMs possible by breaking the large QP problem (1) into a series of smaller QP problems and optimizing only a subset of training data patterns at each step. The subset of training data patterns optimized at each step is called the working set. Thus, these approaches are categorized as working set methods.

Based on the idea of the working set methods, Platt [12] proposed the sequential minimal optimization (SMO) algorithm, which fixes the size of the working set at two and uses a simple analytical approach to solve the resulting small QP problems. Several heuristics are used for choosing the two αi to optimize at each step. As pointed out by Platt, SMO scales only quadratically in the number of training data patterns, while other algorithms scale cubically or worse. Later, Keerthi et al. [9] identified an inefficiency in Platt's SMO and suggested two modified versions that are much more efficient than Platt's original algorithm. The second modification is particularly effective and is used in popular SVM packages such as LIBSVM [3]. We will refer to this modification as the modified SMO algorithm.
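For illustration, the following sketch shows the standard analytical update for a working set of two multipliers; the clipping bounds and the update formula follow the usual SMO derivation, and all names (smo_pair_update, the kernel entries k11, k12, k22) are hypothetical rather than code from this paper.

```c
/* Hypothetical sketch of SMO's analytical update for a working set of two
 * multipliers (alpha1, alpha2), following the standard derivation. */
#include <math.h>

static double clip(double a, double lo, double hi)
{
    if (a < lo) return lo;
    if (a > hi) return hi;
    return a;
}

/* F1, F2 are the F_i values of the two chosen patterns (see Section 2);
 * y1, y2 their labels; k11, k12, k22 the relevant kernel entries; C the
 * regularization constant.  On return *a1 and *a2 hold the new values. */
void smo_pair_update(double *a1, double *a2,
                     double y1, double y2,
                     double F1, double F2,
                     double k11, double k12, double k22,
                     double C)
{
    double s = y1 * y2;
    double L, H;

    if (s < 0) {                          /* y1 != y2 */
        L = fmax(0.0, *a2 - *a1);
        H = fmin(C,   C + *a2 - *a1);
    } else {                              /* y1 == y2 */
        L = fmax(0.0, *a1 + *a2 - C);
        H = fmin(C,   *a1 + *a2);
    }

    double eta = k11 + k22 - 2.0 * k12;   /* second derivative along the constraint */
    if (eta <= 0.0) return;               /* degenerate case omitted in this sketch */

    double a2_new = clip(*a2 + y2 * (F1 - F2) / eta, L, H);
    *a1 += s * (*a2 - a2_new);            /* keeps sum_i alpha_i y_i unchanged */
    *a2  = a2_new;
}
```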

Recently, there have been a few works on developing parallel implementations of SVM training [4], [6], [16]. In [4], a mixture of SVMs is trained in parallel on subsets of a training data set, and the outputs of the individual SVMs are then combined by training an additional multi-layer perceptron. The experiments show that the proposed parallel algorithm is considerably more efficient than using a single SVM. In the algorithm proposed by Dong et al. [6], multiple SVMs are also trained on subsets of a training data set; the support vectors of each SVM are then collected to train another new SVM. The experiments likewise demonstrate a large gain in efficiency. Zanghirati and Zanni [16] proposed a parallel implementation of SVMlight in which the whole quadratic programming problem is split into smaller subproblems, which are then solved by a variable projection method. Their results show that the approach is comparable on scalar machines with a widely used technique and achieves good efficiency and scalability on a multiprocessor system.

This paper proposes a parallel implementation of the modified SMO on a multiprocessor system for speeding up the training of SVMs, especially with the aim of solving large problems. The parallel SMO is developed using the message passing interface (MPI) [11]. Unlike the sequential SMO, which handles the entire training data set using a single CPU processor, the parallel SMO first partitions the entire training data set into smaller subsets and then simultaneously runs multiple CPU processors, each dealing with one of the partitioned data sets. On the adult data set the parallel SMO using 32 CPU processors is more than 21 times faster than the sequential SMO. On the web data set, the parallel SMO using 30 CPU processors is more than 10 times faster than the sequential SMO. On the MNIST data set, the parallel SMO using 30 CPU processors is, averaged over the "one-against-all" SVM classifiers, more than 21 times faster than the sequential SMO.

This paper is organized as follows. Section 2 gives an overview of the modified SMO. Section 3 describes the parallel SMO developed using MPI. Section 4 presents the experiment indicating the efficiency of the parallel SMO. A short conclusion then follows.

Section snippets

A brief overview of Keerthi's modified SMO

We begin the description of the modified SMO by giving the notation used. Let $I_0 = \{i : y_i = 1,\ 0 < \alpha_i < C\} \cup \{i : y_i = -1,\ 0 < \alpha_i < C\}$, $I_1 = \{i : y_i = 1,\ \alpha_i = 0\}$, $I_2 = \{i : y_i = -1,\ \alpha_i = C\}$, $I_3 = \{i : y_i = 1,\ \alpha_i = C\}$ and $I_4 = \{i : y_i = -1,\ \alpha_i = 0\}$. $I = \bigcup_{i=0}^{4} I_i$ denotes the index set of the training data patterns. Define $F_i = \sum_{j=1}^{l} \alpha_j y_j K(X_j, X_i) - y_i$, $b_{\mathrm{up}} = \min\{F_i : i \in I_0 \cup I_1 \cup I_2\}$ with $I_{\mathrm{up}} = \arg\min_i F_i$, and $b_{\mathrm{low}} = \max\{F_i : i \in I_0 \cup I_3 \cup I_4\}$ with $I_{\mathrm{low}} = \arg\max_i F_i$. The tolerance is $\tau = 10^{-6}$.
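As an illustration of these definitions, the sketch below computes Fi, bup and blow in one pass over the training patterns. The array names and membership tests are assumptions for exposition, and in an actual SMO implementation the Fi array would be maintained incrementally rather than recomputed from scratch.

```c
/* Illustrative sketch (not the authors' code) of computing F_i, b_up and
 * b_low from the index-set definitions above.  alpha[], y[] and a
 * kernel() helper are assumed to be available; names are hypothetical. */
#include <float.h>

double kernel(int i, int j);          /* e.g. Gaussian kernel K(X_i, X_j), assumed given */

void compute_bup_blow(const double *alpha, const double *y, int l, double C,
                      double *F, double *b_up, int *i_up,
                      double *b_low, int *i_low)
{
    *b_up = DBL_MAX;  *b_low = -DBL_MAX;
    *i_up = *i_low = -1;

    for (int i = 0; i < l; i++) {
        /* F_i = sum_j alpha_j y_j K(X_j, X_i) - y_i */
        double f = -y[i];
        for (int j = 0; j < l; j++)
            f += alpha[j] * y[j] * kernel(j, i);
        F[i] = f;

        /* membership tests follow the definitions of I_0 ... I_4 */
        int in_I0  = alpha[i] > 0 && alpha[i] < C;
        int in_up  = in_I0 || (y[i] ==  1 && alpha[i] == 0)    /* I_1 */
                           || (y[i] == -1 && alpha[i] == C);   /* I_2 */
        int in_low = in_I0 || (y[i] ==  1 && alpha[i] == C)    /* I_3 */
                           || (y[i] == -1 && alpha[i] == 0);   /* I_4 */

        if (in_up  && f < *b_up)  { *b_up  = f; *i_up  = i; }
        if (in_low && f > *b_low) { *b_low = f; *i_low = i; }
    }
}
```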

The idea of the modified SMO is to optimize the two αi associated with bup and blow according to (4) and (5) at each step. Their…

The parallel SMO

MPI is not a new programming language, but a library of functions that can be used from C, C++ and FORTRAN [11]. MPI allows one to easily implement an algorithm in parallel by running multiple CPU processors to improve efficiency. The "single program multiple data" (SPMD) mode, where different processors execute the same program on different data, is generally used in MPI for developing parallel programs.
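A minimal SPMD skeleton in C illustrates this mode: every processor runs the identical program, but each works on its own contiguous block of the training patterns. The block partitioning scheme and the variable names are assumptions for illustration, not the authors' program.

```c
/* Minimal SPMD sketch: every processor runs this same program, but each
 * works on its own contiguous block of the l training patterns. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    const int l = 100000;                 /* hypothetical number of training patterns */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* contiguous block [start, end) handled by this processor */
    int chunk = (l + nprocs - 1) / nprocs;
    int start = rank * chunk;
    int end   = (start + chunk < l) ? start + chunk : l;

    printf("processor %d of %d handles patterns %d..%d\n",
           rank, nprocs, start, end - 1);

    /* ... each processor would update F_i only for i in [start, end) ... */

    MPI_Finalize();
    return 0;
}
```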

In the sequential SMO algorithm, most of the computation time is dominated by updating the Fi array…
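Once each processor has updated its own slice of the Fi array, the global bup and blow can be obtained with a single collective reduction. The sketch below uses MPI_Allreduce with the MPI_MINLOC and MPI_MAXLOC operations; this is one plausible way to code the combination step, not a listing from the paper.

```c
/* Sketch: combining per-processor (b_up, i_up) and (b_low, i_low) into
 * global values with MPI collectives.  Illustrative only. */
#include <mpi.h>

struct val_idx { double val; int idx; };   /* layout matching MPI_DOUBLE_INT */

void reduce_bup_blow(double local_bup, int local_iup,
                     double local_blow, int local_ilow,
                     double *global_bup, int *global_iup,
                     double *global_blow, int *global_ilow)
{
    struct val_idx in, out;

    /* global b_up is the minimum of the local b_up values (with its index) */
    in.val = local_bup;  in.idx = local_iup;
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MINLOC, MPI_COMM_WORLD);
    *global_bup = out.val;  *global_iup = out.idx;

    /* global b_low is the maximum of the local b_low values (with its index) */
    in.val = local_blow; in.idx = local_ilow;
    MPI_Allreduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, MPI_COMM_WORLD);
    *global_blow = out.val; *global_ilow = out.idx;
}
```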

Experiment

The parallel SMO is tested against the sequential SMO using three benchmarks: the adult data set, the web data set and the MNIST data set. Both algorithms are written in C. Both algorithms are run on an IBM p690 Regatta supercomputer which has a total of seven nodes, with each node having 32 POWER4 1.3 GHz processors. To ensure the same accuracy in the sequential SMO and the parallel SMO, the stopping criteria used in both algorithms, such as the value of τ, are identical.

Conclusions

This paper proposes a parallel implementation of SMO using MPI. The parallel SMO uses multiple CPU processors to carry out the computation of SMO. By partitioning the entire training data set into smaller subsets and assigning each of the partitioned subsets to one CPU processor, the parallel SMO updates the Fi array and calculates bup, blow and DualityGap at each step in parallel using multiple CPU processors. This parallel mode is called the "SPMD" model in MPI. Experiments on three large…


References (16)

  • G. Zanghirati, L. Zanni, A parallel solver for large quadratic programs in training support vector machines, Parallel Comput. (2003)
  • C.J.C. Burges, A tutorial on support vector machines for pattern recognition, Knowledge Discovery Data Mining (1998)
  • L.J. Cao et al., Support vector machines with adaptive parameters in financial time series forecasting, IEEE Trans. Neural Networks (2003)
  • C.C. Chang, C.J. Lin, LIBSVM: a library for support vector machines, available at...
  • R. Collobert et al., A parallel mixture of SVMs for very large scale problems, Neural Comput. (2002)
  • J.X. Dong, A. Krzyzak, C.Y. Suen, A fast SVM training algorithm, Pattern Recog. Artif. Intell., 2002, in...
  • J.X. Dong, A. Krzyzak, C.Y. Suen, A fast parallel optimization for training support vector machine, In: P. Perner, A....
  • C.W. Hsu, C.C. Chang, C.J. Lin, A Practical Guide to Support Vector Classification, at...


Dr. L. J. Cao received her Ph.D. in Mechanical Engineering from the National University of Singapore in 2002. She is currently an associate professor at the Institute of Financial Studies, Fudan University, Shanghai, P.R. China. Her research focuses on support vector machines for financial applications and financial derivative pricing.

Dr. Keerthi is a Scientist at Yahoo! Research. His research has focused on the development of practical algorithms for a variety of areas, such as machine learning, robotics, computer graphics and optimal control. His current research focuses on kernel methods, in particular on the development of fast kernel methods for large scale data mining problems.

Dr. Chong-Jin Ong received his Ph.D. degree in mechanical and applied mechanics from the University of Michigan, Ann Arbor, in 1993. He joined the National University of Singapore in 1993 and is now an Associate Professor in the Department of Mechanical Engineering. His current research interests are in the areas of machine learning and robust control theory and applications.

Uvaraj Periyathamby obtained his M.S. degree from the Department of Mechanical Engineering at the National University of Singapore in 1994. He is currently a Senior Research Engineer cum Industry Development Manager at the Institute of High Performance Computing under the Agency for Science, Technology and Research in Singapore. His areas of interest include parallel computing, aerospace engineering, computational fluid dynamics and grid computing.

Dr. Xiuju Fu obtained her Ph.D. degree from the Department of Mechanical Engineering at Nanyang Technological University, Singapore, in 2003. She is currently a research engineer at the Institute of High Performance Computing. Her research areas include neural networks, support vector machines, genetic algorithms, data mining, classification, data dimensionality reduction and linguistic rule extraction.

Dr. H. P. Lee is currently the Deputy Executive Director (Research) of the Institute of High Performance Computing (IHPC) as well as an Associate Professor in the Department of Mechanical Engineering, National University of Singapore. He held the position of Sub-Dean for External Relations in the Faculty of Engineering, National University of Singapore, prior to taking up the appointment at IHPC.

Dr. Lee obtained his Ph.D. and M.S. degrees from the Department of Mechanical Engineering, Stanford University, and his B.A. from the University of Cambridge. His research interests are in the areas of vibration and acoustics, simulation techniques, metal forming, MEMS and data mining.

The research work is funded by the National Natural Science Research Fund No. 70501008 and sponsored by the Shanghai Pujiang Program.
