SVM from a Geometric Perspective
SVM Main Properties
- SVM is a sparse technique. Like nonparametric methods, SVM requires that all the training data be available, that is, stored in memory during the training phase, when the parameters of the SVM model are learned. However, once the model parameters are identified, SVM depends only on a subset of these training instances, called support vectors, for future prediction. Support vectors define the margins of the hyperplanes. Support vectors are found after an optimization step involving an objective function regularized by an error term and a constraint, using Lagrangian relaxation.1 The complexity of the classification task with SVM depends on the number of support vectors rather than on the dimensionality of the input space. The number of support vectors that are ultimately retained from the original dataset is data dependent and varies with the complexity of the data, which is captured by the data dimensionality and class separability. The upper bound for the number of support vectors is half the size of the training dataset, but in practice this is rarely the case. The SVM model described mathematically in this chapter is written as a weighted sum of the support vectors (see the sketch after this list), which gives the SVM framework the same advantages as parametric techniques in terms of reduced computational time for testing and storage requirements.
- SVM is a kernel technique. SVM uses the kernel trick to map the data into a higher-dimensional space before solving the machine learning task as a convex optimization problem in which optima are found analytically rather than heuristically, as with other machine learning techniques. Often, real-life data are not linearly separable in the original input space. In other words, instances that have different labels share the input space in a manner that prevents a linear hyperplane from correctly separating the different classes involved in the classification task. Trying to learn a nonlinear separating boundary in the input space increases the computational requirements during the optimization phase, because the separating surface will be of at least the second order. Instead, SVM maps the data, using predefined kernel functions, into a new, higher-dimensional space in which a linear separator is able to discriminate between the different classes. The SVM optimization phase thus entails learning only a linear discriminant surface in the mapped space. Of course, the selection and settings of the kernel function are crucial for SVM optimality.
- SVM is a maximum margin separator. Beyond minimizing the error or a cost function, based on the training datasets (similar to other discriminant machine learning techniques), SVM imposes an additional constraint on the optimization problem: the hyperplane needs to be situated such that it is at a maximum distance from the different classes. Such a term forces the optimization step to find the hyperplane that would eventually generalize better, because it is situated at an equal and maximum distance from the classes. This is essential, because training is done on a sample of the population, whereas prediction is to be performed on yet-to-be-seen instances that may have a distribution that is slightly different from that of the subset trained on.
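To make these properties concrete, here is a minimal sketch, assuming scikit-learn and NumPy (the chapter itself does not prescribe a library) and a toy two-class dataset: it fits a linear SVM, counts the retained support vectors, reproduces the decision value as a weighted sum over the support vectors, and reports the margin width 2/‖w‖.

```python
# A minimal sketch (assumes scikit-learn and NumPy; the data below are toy values).
import numpy as np
from sklearn.svm import SVC

# Two small, linearly separable clusters.
X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5],
              [5.0, 5.0], [5.5, 6.0], [6.0, 5.5]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Sparsity: only the support vectors are kept for prediction.
print("support vectors:", len(clf.support_vectors_), "of", len(X), "training points")

# The decision value is a weighted sum over support vectors:
# f(x) = sum_i (alpha_i * y_i) * K(sv_i, x) + b, with alpha_i * y_i stored in dual_coef_.
x_new = np.array([3.0, 3.0])
f_manual = clf.dual_coef_ @ (clf.support_vectors_ @ x_new) + clf.intercept_
print("manual decision value:", f_manual[0], "sklearn:", clf.decision_function([x_new])[0])

# Maximum margin: for a linear kernel, the margin width is 2 / ||w||.
print("margin width:", 2.0 / np.linalg.norm(clf.coef_.ravel()))
```

With a nonlinear kernel the same weighted-sum form holds, with the dot product replaced by the chosen kernel function.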
Hard-Margin SVM
Soft-Margin SVM
Kernel SVM
- Linear kernel: $K(x_i, x_j) = x_i^T x_j$
- Polynomial function: $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$, where $d$ is the degree of the polynomial
- Hyperbolic tangent (sigmoid): $K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)$
- Gaussian radial basis function (RBF): $K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
- Laplacian radial basis function: $K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|}{\sigma}\right)$
- Randomized blocks analysis of variance (ANOVA RB) kernel: $K(x_i, x_j) = \sum_{k=1}^{n} \exp\left(-\sigma (x_i^k - x_j^k)^2\right)^d$
- Linear spline kernel in 1D: $K(x_i, x_j) = 1 + x_i x_j + x_i x_j \min(x_i, x_j) - \frac{x_i + x_j}{2} \min(x_i, x_j)^2 + \frac{1}{3} \min(x_i, x_j)^3$
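As a rough illustration of how these functions look in code, the sketch below (Python with NumPy and scikit-learn, my own assumption) writes a few of the listed kernels as plain functions and shows that scikit-learn's SVC also accepts a callable that returns the Gram matrix, which is one way to plug in a kernel that is not built in.

```python
import numpy as np
from sklearn.svm import SVC

# A few of the kernels listed above, written for single vectors x and z.
def linear_kernel(x, z):
    return x @ z

def polynomial_kernel(x, z, gamma=1.0, r=1.0, d=3):
    return (gamma * (x @ z) + r) ** d

def rbf_kernel(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2.0 * sigma ** 2))

def laplacian_kernel(x, z, sigma=1.0):
    return np.exp(-np.linalg.norm(x - z) / sigma)

# scikit-learn also accepts a callable that returns the full Gram matrix K(X, Z).
def gram_rbf(X, Z, sigma=1.0):
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0], [4.0, 4.0]])
y = np.array([0, 0, 1, 1])
clf = SVC(kernel=gram_rbf).fit(X, y)   # custom Gram-matrix kernel
print(clf.predict([[0.5, 0.5], [3.5, 3.5]]))
```

The values of γ, r, d, and σ used here are illustrative defaults; in practice they are tuned, since, as noted above, kernel selection and settings are crucial for SVM optimality.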
Multiclass SVM
- One-versus-the-rest (also called one-against-all [OAA]) is probably the earliest SVM multiclass implementation and is one of the most commonly used multiclass SVMs. It constructs c binary SVM classifiers, where c is the number of classes. Each classifier distinguishes one class from all the others, which reduces the case to a two-class problem. There are c decision functions: $f_i(x) = w_i^T \phi(x) + b_i$, $i = 1, \ldots, c$, where $\phi$ denotes the kernel-induced mapping. The initial formulation of the OAA method assigns a data point to a certain class if and only if that class has accepted it, while all other classes have not, which leaves undecided regions in the feature space when more than one class accepts it or when all classes reject it. Vapnik (1998) suggested assigning data points to the class with the highest value, regardless of sign. The final label output is given to the class that has demonstrated the highest output value: $\hat{y} = \arg\max_{i = 1, \ldots, c} f_i(x)$.
- Proposed by Knerr, Personnaz, and Dreyfus (1990), and first adopted in SVM by Friedman (1996) and Kressel (1999), pair-wise classification (also called one-against-one [OAO]) builds c(c – 1)/2 binary SVMs, each of which is used to discriminate two of the c classes only and requires evaluation of (c – 1) SVM classifiers. For training data from the kth and jth classes, the constraints for the (k, j) classifier are

  $(w^{kj})^T \phi(x_t) + b^{kj} \geq 1 - \xi_t^{kj}$, for $y_t = k$,

  $(w^{kj})^T \phi(x_t) + b^{kj} \leq -1 + \xi_t^{kj}$, for $y_t = j$,

  with slack variables $\xi_t^{kj} \geq 0$.
- The multiclassification objective function probably has the most compact form, as it optimizes the problem in a single step. The decision function is the same as that of the OAA technique. The multiclassification objective function constructs c two-class rules, where the mth function separates class m from the remaining classes, and the c decision functions are obtained together by solving a single optimization problem subject to the constraints

  $w_{y_i}^T \phi(x_i) + b_{y_i} \geq w_m^T \phi(x_i) + b_m + 2 - \xi_i^m$, $\xi_i^m \geq 0$, for $m \neq y_i$.
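All three strategies have off-the-shelf counterparts; the sketch below uses scikit-learn (an assumption, as is the use of the Iris data purely as a stand-in) to fit OAA, OAO, and a single-step multiclass linear SVM. Note that LinearSVC's crammer_singer option solves a related single-step formulation rather than exactly the one written above.

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier
from sklearn.svm import SVC, LinearSVC

X, y = load_iris(return_X_y=True)   # 3 classes, used here only as a stand-in dataset

# One-against-all (OAA): c binary classifiers, predict by the largest decision value.
oaa = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

# One-against-one (OAO): c(c - 1)/2 pairwise classifiers combined by voting.
oao = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

# Single-step multiclass objective (Crammer-Singer variant of the all-together idea).
single = LinearSVC(multi_class="crammer_singer", max_iter=10000).fit(X, y)

for name, model in [("OAA", oaa), ("OAO", oao), ("single-step", single)]:
    print(name, "training accuracy:", model.score(X, y))
```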
SVM with Imbalanced Datasets
| Predicted/Actual Class | Positive Class | Negative Class |
|---|---|---|
| Positive Class | TP | FP |
| Negative Class | FN | TN |
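One common way to handle the imbalance, sketched below under the assumption that scikit-learn is available and with synthetic data, is to penalize errors on the minority class more heavily via per-class weights and then read TP, FP, FN, and TN off the resulting confusion matrix.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# Synthetic imbalanced problem: roughly 10% positive class.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" scales the penalty C inversely to class frequency,
# so mistakes on the rare (positive) class cost more during training.
clf = SVC(kernel="rbf", class_weight="balanced").fit(X_tr, y_tr)

# scikit-learn's convention is rows = actual, columns = predicted, i.e. the
# transpose of the table above; labels=[1, 0] lists the positive class first,
# giving [[TP, FN], [FP, TN]].
print(confusion_matrix(y_te, clf.predict(X_te), labels=[1, 0]))
```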
Improving SVM Computational Requirements
Case Study of SVM for Handwriting Recognition
Preprocessing
Feature Extraction
- Writing direction: Defined by

  $\cos \alpha(t) = \frac{\Delta x(t)}{\Delta s(t)}, \quad \sin \alpha(t) = \frac{\Delta y(t)}{\Delta s(t)}$,

  where Δx, Δy, and Δs are defined as

  $\Delta x(t) = x(t + 1) - x(t - 1), \quad \Delta y(t) = y(t + 1) - y(t - 1), \quad \Delta s(t) = \sqrt{\Delta x(t)^2 + \Delta y(t)^2}$

- Curvature: Defined by the sine and cosine of the angle defined by the points (x(t - 2), y(t - 2)); (x(t), y(t)); and (x(t + 2), y(t + 2)). Curvature can be calculated from the writing direction, using the following equations:

  $\cos \beta(t) = \cos \alpha(t - 1) \cos \alpha(t + 1) + \sin \alpha(t - 1) \sin \alpha(t + 1)$

  $\sin \beta(t) = \cos \alpha(t - 1) \sin \alpha(t + 1) - \sin \alpha(t - 1) \cos \alpha(t + 1)$
- Aspect of the trajectory: Computed according to the equation

  $A(t) = \frac{\Delta y(t) - \Delta x(t)}{\Delta y(t) + \Delta x(t)}$

- Curliness: Describes the deviation of the points from the straight line formed by the previous and following points in the sequence, by the equation

  $C(t) = \frac{L(t)}{\max(\Delta x(t), \Delta y(t))} - 2$,

  where L(t) represents the length of the trajectory from point (x(t - 1), y(t - 1)) to point (x(t + 1), y(t + 1)).
- Linearity: Measured by the average distance from each point of the sequence to the straight line joining the first and last points in the sequence.
- Slope of the sequence: Measured by the cosine and sine of the angle formed by the straight line joining the first and last points in the sequence and a horizontal line.
- Ascenders and descenders: Describes the number of points of the sequence below (descenders) or above (ascenders) the baseline (the straight horizontal line on which the letter is written), each weighted by its distance to the baseline.
- Variance of coordinates (for both dimensions): Measures the expansion of the points around the center of mass.
- Ratio of variances: Represents the proportion of the width to the height of the letter.
- Cumulative distance: The sum of the lengths of the line segments joining consecutive points of the sequence.
- Average distance to the center: The mean of the distances from each point of the sequence to the center of mass of the letter.
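A rough NumPy sketch of some of these features follows; the exact windowing and normalization details are my own assumptions rather than the chapter's definitions, but it shows how the local direction and curvature features and a few of the global features could be computed from a sequence of pen coordinates.

```python
import numpy as np

def letter_features(x, y):
    """x, y: 1-D arrays of pen coordinates for one (preprocessed) letter."""
    # Writing direction: cosine/sine of the segment joining each point's neighbors.
    dx = x[2:] - x[:-2]
    dy = y[2:] - y[:-2]
    ds = np.hypot(dx, dy) + 1e-12          # guard against division by zero
    cos_a, sin_a = dx / ds, dy / ds

    # Curvature: angle between the writing directions at the previous and
    # following points, via the angle-difference identities.
    cos_b = cos_a[:-2] * cos_a[2:] + sin_a[:-2] * sin_a[2:]
    sin_b = cos_a[:-2] * sin_a[2:] - sin_a[:-2] * cos_a[2:]

    # A few of the global features.
    ratio_of_variances = np.var(x) / np.var(y)                   # width vs. height
    cumulative_distance = np.hypot(np.diff(x), np.diff(y)).sum()
    cx, cy = x.mean(), y.mean()                                  # center of mass
    avg_dist_to_center = np.hypot(x - cx, y - cy).mean()

    return cos_a, sin_a, cos_b, sin_b, ratio_of_variances, cumulative_distance, avg_dist_to_center

# Tiny example trajectory (half circle).
t = np.linspace(0.0, np.pi, 20)
print(letter_features(np.cos(t), np.sin(t))[-3:])
```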
Hierarchical, Three-Stage SVM
- Using a binary SVM classifier, the first stage classifies the instance as one of two classes: uppercase or lowercase letter.
- Using OAA SVM, the second stage classifies the instance as one of the manually determined clusters shown in Table 3-1.

Table 3-1. Lower- and Uppercase Clusters

| Lowercase Clusters | Uppercase Clusters |
|---|---|
| Cluster 1: a c e o | Cluster 9: A B P R |
| Cluster 2: b d l t | Cluster 10: C D G O Q |
| Cluster 3: f h k | Cluster 11: E F I L |
| Cluster 4: g z j | Cluster 12: J K T |
| Cluster 5: p q | Cluster 13: M N H |
| Cluster 6: i r s | Cluster 14: S Y Z X |
| Cluster 7: u v w x | Cluster 15: U V W |
| Cluster 8: m n | |
- Using OAA SVM, with a simple majority vote, the third stage identifies the letter as one of the 52 classes (or subclusters). Figure 3-9 displays the hierarchy of the three-stage system.
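A hedged sketch of this hierarchy follows; the class names, helper functions, and use of scikit-learn's OneVsRestClassifier for stages 2 and 3 are my own illustration choices and simplify the voting details described above.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

class ThreeStageSVM:
    """Stage 1: uppercase vs. lowercase; stage 2: cluster within the case;
    stage 3: letter within the cluster. Each stage is trained independently."""

    def __init__(self, lower_clusters, upper_clusters):
        # Clusters are given as strings of letters, e.g. ["aceo", "bdlt", ...].
        self.clusters = {"lower": lower_clusters, "upper": upper_clusters}
        self.case_clf = SVC(kernel="rbf")          # stage 1 (binary)
        self.cluster_clf = {}                       # stage 2 (OAA per case)
        self.letter_clf = {}                        # stage 3 (OAA per cluster)

    def fit(self, X, letters):
        X, letters = np.asarray(X), np.asarray(letters)
        cases = np.where([c.isupper() for c in letters], "upper", "lower")
        self.case_clf.fit(X, cases)
        for case, clusters in self.clusters.items():
            in_case = cases == case
            cids = np.array([self._cluster_of(c, clusters) for c in letters[in_case]])
            self.cluster_clf[case] = OneVsRestClassifier(SVC(kernel="rbf")).fit(X[in_case], cids)
            for cid in np.unique(cids):
                in_cluster = in_case.copy()
                in_cluster[in_case] = cids == cid
                self.letter_clf[(case, cid)] = OneVsRestClassifier(
                    SVC(kernel="rbf")).fit(X[in_cluster], letters[in_cluster])
        return self

    def predict(self, X):
        labels = []
        for xi in np.asarray(X):
            xi = xi.reshape(1, -1)
            case = self.case_clf.predict(xi)[0]            # stage 1
            cid = self.cluster_clf[case].predict(xi)[0]    # stage 2
            labels.append(self.letter_clf[(case, cid)].predict(xi)[0])  # stage 3
        return np.array(labels)

    @staticmethod
    def _cluster_of(letter, clusters):
        return next(i for i, group in enumerate(clusters) if letter in group)
```

In use, lower_clusters and upper_clusters would be the groups from Table 3-1, so that each stage only ever has to separate a small number of visually similar classes.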
Experimental Results
| Architecture | Recognition Rate (%) |
|---|---|
| Flat SVM OAA | 65 |
| Flat SVM OAO | 82 |
| 3NN (Prat et al. 2009) | 89.15 |
| Three-Stage SVM | 91.8 |
Complexity Analysis
| Step | Total Operations |
|---|---|
| Representing letter as a sequence of vectors | 8N |
| Computing slant | M + 1 |
| Rotating letter | N |
| Normalizing dimensions | 2N |
| Shifting to center of mass | 4N + 2 |
| Feature | Total Operations |
|---|---|
| Writing direction | 7N |
| Curvature | 6N |
| Aspect | 2N |
| Curliness | 14N |
| Linearity | 6N + 1 |
| Slope | 7 |
| Ascenders and descenders | 6N |
| Variance | 8N + 4 |
| Ratio of variances | 1 |
| Cumulative distance | 5N - 5 |
| Average distance | 4N |
| Classifier | Decision Function | Total Operations |
|---|---|---|
| Three-Stage SVM | | 12 operations per classifier; in total, 168 operations (the class requiring the most classifiers) |
| 3NN (Prat et al. 2009) | | 150 operations per distance measure; in total, 150 * K |
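As a quick arithmetic check on the two tables above, the short snippet below (my own illustration; the values of N and M are placeholders) encodes the per-step counts and sums them, which makes it easy to see that both phases grow linearly in N.

```python
# Operation counts from the tables above, as functions of
# N (points per letter) and M (the count used in the slant-computation step).
preprocessing = {
    "represent as vectors": lambda N, M: 8 * N,
    "compute slant":        lambda N, M: M + 1,
    "rotate letter":        lambda N, M: N,
    "normalize dimensions": lambda N, M: 2 * N,
    "shift to center":      lambda N, M: 4 * N + 2,
}
features = {
    "writing direction":      lambda N: 7 * N,
    "curvature":              lambda N: 6 * N,
    "aspect":                 lambda N: 2 * N,
    "curliness":              lambda N: 14 * N,
    "linearity":              lambda N: 6 * N + 1,
    "slope":                  lambda N: 7,
    "ascenders/descenders":   lambda N: 6 * N,
    "variance":               lambda N: 8 * N + 4,
    "ratio of variances":     lambda N: 1,
    "cumulative distance":    lambda N: 5 * N - 5,
    "average distance":       lambda N: 4 * N,
}

N, M = 50, 50  # example sizes, chosen only for illustration
pre_total = sum(f(N, M) for f in preprocessing.values())
feat_total = sum(f(N) for f in features.values())
print("preprocessing:", pre_total, "feature extraction:", feat_total)
# Both totals grow linearly in N, which keeps the per-letter cost low.
```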
References
www.mathworks.com/matlabcentral/fileexchange/37311-smoteboost
www.ics.uci.edu/~mlearn/MLRepository.html
https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es1999-461.pdf