1 Introduction
1.1 Broader context
1.1.1 Model selection for SVMs
1.1.2 Learning from weakly-labeled, noisy, and poor-quality data
1.2 Motivation and goals
-
We present an extensive review of state-of-the-art methods for selecting SVM training sets. We not only report these methods, but also discuss their strengths, potential weaknesses, and the ideas behind them. This allows for a better understanding of (i) how to cope with massive real-life sets, and (ii) how to select an appropriate method for the problem at hand.
-
We believe that this review will notably help in developing new approaches for selecting SVM training sets. The presented taxonomy should be useful in identifying the potential pitfalls of emerging training set selection algorithms, and in determining which techniques could be successfully combined into hybrid algorithms that exploit all available knowledge concerning the \(\varvec{T}\) vectors (e.g., methods which utilize complementary sources of information in their search for valuable training vectors). The literature in this field is very diverse—we hope that this review clearly highlights the areas which should (or should not) be explored further.
1.3 Structure of the review
2 Theoretical background
2.1 Linear SVMs
2.2 Non-linear SVMs
3 Selecting SVM training sets
3.1 Data geometry analysis methods
3.1.1 Clustering-based methods
3.1.2 Non-clustering methods
3.2 Neighborhood analysis methods
3.3 Evolutionary methods
3.4 Active learning methods
3.5 Random sampling methods
3.6 Summary of the SVM training set reduction methods
-
Type—indicates whether the method is one-pass or iterative. In the iterative approaches, an initial refined training set is gradually improved in order to include better vectors from \(\varvec{T}\). Such methods encompass algorithms which (i) keep enhancing a refined set of a given (constant) size, and those which (ii) decrease or (iii) increase the refined set to boost its quality.
-
Source—the underlying source of information concerning a training set. The knowledge extracted from this source is then used for generating refined sets during the optimization process. We distinguish five possible sources of information which can be utilized for this purpose—they are summarized in Table 1. Note that there exist methods which exploit several sources of information.
-
Randomized—shows whether the algorithm is randomized or deterministic.
-
Dependent on \(t\)—shows whether the algorithm is dependent on the cardinality of \(\varvec{T}\). If so, it may require analyzing the entire training set, which is often not possible for massively large real-life datasets. Hence, the techniques which are independent of \(t\) should be preferred in practical applications.
-
Data—indicates which types of datasets were used to validate the corresponding algorithm (A—artificially generated, B—benchmark, R—real-life datasets).
-
Maximum \(t\)—indicates (roughly) the maximum size of the dataset for which the method was tested in the referenced paper. As mentioned in Sect. 1, the term large dataset is quite ambiguous in the literature (the cardinality of large sets may vary from hundreds to millions of \(\varvec{T}\) vectors).
Source | Description |
---|---|
Local | The local neighborhood of a training vector is analyzed, and based on the outcome of this analysis, this vector may be classified as important (i.e., likely to be selected as an SV), or useless. The latter vectors are usually not appended to the refined sets |
Global | The global layout and characteristics of a training set are investigated in search of important vectors which should be included in the refined set |
Wrapper | In this approach, the SVM classifier is trained using a retrieved refined set (or sets, in the case of population-based algorithms), and the quality of this refined set is quantified using the classification performance of the learned SVM—see Sect. 3.7 for details (commonly, all vectors from \(\varvec{T}\), or a subset of \(\varvec{T}\) vectors are classified to verify the SVM performance—in the latter case, wrapper techniques may be independent from the cardinality of the entire training set). The classifier is often treated as a “black box” in these wrapper techniques |
SVM | As in wrapper techniques, the SVM classifier is learned using the refined training set at first. However, the \(\varvec{T}\) vectors are assessed based on the specific features of the trained SVM (e.g., the vectors selected as SVs in the training process are considered important) |
Theory | The theory behind the SVM classifier or the desired training data characteristics are used to estimate the importance of a given training vector. Contrary to the “SVM” information source, the classifier is not trained before the training set vectors can be assessed in this case |
3.7 Assessing SVM training set selection algorithms
3.7.1 Quantitative measures
-
Classification performance of an SVM trained using a refined set (\(\uparrow \)) The performance of classifiers (including SVMs) is assessed based on ratios derived from the number of (a) correctly classified positive-class vectors—true positives (TP), (b) correctly classified negative-class vectors—true negatives (TN), (c) incorrectly classified negative-class vectors—false positives (FP), and (d) incorrectly classified positive-class vectors—false negatives (FN) obtained for a test set which was not used during the training (see Table 2). Utilizing the unseen dataset allows for verifying the generalization capabilities of a classifier. The derived ratios include, among others, the true positive rate:
$$\begin{aligned} \mathrm{TPR}=\frac{\mathrm{TP}}{\mathrm{TP}+\mathrm{FN}} \end{aligned}$$(45)
and the false positive rate:
$$\begin{aligned} \mathrm{FPR}=\frac{\mathrm{FP}}{\mathrm{FP}+\mathrm{TN}}. \end{aligned}$$(46)
TPR and FPR are often presented in the form of receiver operating characteristic (ROC) curves (Fawcett 2006). Each point in this curve represents the performance of an SVM for a given decision threshold (Yu et al. 2015). Calculating the area under this curve (AUC) reduces a ROC curve to a single scalar value representing the classifier performance (the higher the AUC value, the better, and \(0\le \mathrm{AUC} \le 1\)). The area under the ROC curve and the accuracy (ACC):
$$\begin{aligned} \mathrm{ACC}=\frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{TN}+\mathrm{FP}+\mathrm{FN}} \end{aligned}$$(47)
are the most widely used measures exploited to quantify the performance of training set selection algorithms (the classification performance of an SVM trained using a refined set should be maximized). Other common measures include precision, recall, and the F-measure (Khosravani et al. 2013).
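For concreteness, the sketch below computes Eqs. (45)–(47) and the AUC for an SVM evaluated on a held-out test set. It is a minimal illustration using scikit-learn on synthetic data; all variable names are ours and not tied to any reviewed method.

```python
# Minimal sketch: evaluating an SVM on an unseen test set (Eqs. 45-47).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, roc_auc_score

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf").fit(X_train, y_train)
y_pred = clf.predict(X_test)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
tpr = tp / (tp + fn)                      # Eq. (45)
fpr = fp / (fp + tn)                      # Eq. (46)
acc = (tp + tn) / (tp + tn + fp + fn)     # Eq. (47)
# AUC uses the signed distance from the hyperplane as the decision score.
auc = roc_auc_score(y_test, clf.decision_function(X_test))
print(f"TPR={tpr:.3f} FPR={fpr:.3f} ACC={acc:.3f} AUC={auc:.3f}")
```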
-
Size of the refined training set (\(\downarrow \)) The main objective of training set selection algorithms is to minimize the cardinality of the training set (ideally without degrading the SVM classification performance). Hence, the number of vectors in the refined sets elaborated using such approaches is almost always investigated. To make this measure easier to interpret for datasets of different sizes, it is very often presented as the reduction rate (\(\mathcal {R}\)):
$$\begin{aligned} \mathcal {R}=\frac{t}{t'}, \end{aligned}$$(48)
where \(t'\) is the cardinality of the refined training set, and \(t\) is the size of the original dataset. This reduction rate should be maximized.
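The reduction rate is trivial to compute; a minimal helper (ours, for illustration):

```python
def reduction_rate(t: int, t_prime: int) -> float:
    """Reduction rate R = t / t' (Eq. 48); higher is better."""
    return t / t_prime

# e.g., shrinking 10^5 training vectors to 2 * 10^3 gives R = 50.
print(reduction_rate(100_000, 2_000))  # 50.0
```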
-
The number of support vectors (\(\downarrow \)) As already mentioned, the number of SVs influences (linearly) the SVM classification time. Therefore, it should be minimized to speed up the operation of a trained classifier.
-
The percentage of vectors in a refined training set selected as support vectors (\(\uparrow \)) Determining the desired cardinality of refined sets is often a critical step in training set selection algorithms. Such refined sets should be small and should include important vectors which are likely to be selected as SVs during the SVM training. In several works, the percentage of vectors in refined training sets selected as SVs has been investigated (Nalepa and Kawulok 2016a; Verbiest et al. 2016). This percentage should be maximized to keep the number of “useless” vectors in a refined set as small as possible. However, this measure can easily become misleading—selecting all training set vectors as SVs can be a sign of overfitting and lack of generalization capabilities.
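In scikit-learn, this percentage can be read directly from a trained model through its support vector indices. A minimal sketch, with a synthetic stand-in for a refined set produced by some selection algorithm:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in for a refined training set.
X_refined, y_refined = make_classification(n_samples=500, random_state=0)

clf = SVC(kernel="rbf").fit(X_refined, y_refined)
sv_fraction = len(clf.support_) / len(X_refined)
print(f"{100 * sv_fraction:.1f}% of refined-set vectors were selected as SVs")
# Caveat from the text: a fraction close to 100% may indicate overfitting.
```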
-
Training set selection, SVM training and classification times (\(\downarrow \)) In all state-of-the-art approaches, the execution time of a training set selection algorithm should be minimized. Also, the SVM training and classification times are to be minimized (these times are correlated with the size of a refined training set and the number of determined SVs).
-
Combined quality measure (\(\uparrow \)) Although the above-mentioned measures are usually investigated separately, this approach becomes infeasible in several practical scenarios, e.g., in real-time systems in which a trained SVM should work extremely fast even if it delivers slightly worse results (i.e., minimizing the number of SVs may be more important than maximizing the classification accuracy). In such cases, the problem of selecting SVM training sets can be considered as a two- (or multi-) objective optimization problem: the first objective is to maximize the classification accuracy of an SVM trained with a refined set, and the second is to minimize the number of SVs. Nalepa (2016) transformed these two objectives into a single quality function
$$\begin{aligned} Q(\mathrm{AUC},s)=q \cdot \frac{\mathrm{AUC}}{\mathrm{AUC}^\mathrm{B}} + (1 - q) \cdot \frac{s^\mathrm{B}}{s}, \end{aligned}$$(49)
where \(\mathrm{AUC}^\mathrm{B}\) denotes the best (the largest) AUC obtained for the test set (note that AUC can be replaced by any other classification performance measure in this formula), \(s^\mathrm{B}\) is the best (the lowest) number of determined SVs across the investigated training set selection algorithms, and \(q\) denotes the importance of the first objective (\(0 < q \le 1\)). The largest \(Q\) value is retrieved for the best training set selection algorithm.
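Equation (49) is easy to evaluate once the AUC and SV counts of all compared algorithms are gathered; a sketch with made-up numbers (the algorithm names and results below are purely illustrative):

```python
def combined_quality(auc, s, auc_best, s_best, q=0.7):
    """Q(AUC, s) from Eq. (49); larger is better, 0 < q <= 1."""
    return q * (auc / auc_best) + (1 - q) * (s_best / s)

# Hypothetical results for three selection algorithms: (AUC, #SVs).
results = {"A": (0.95, 420), "B": (0.93, 150), "C": (0.96, 800)}
auc_best = max(auc for auc, _ in results.values())
s_best = min(s for _, s in results.values())
for name, (auc, s) in results.items():
    print(name, round(combined_quality(auc, s, auc_best, s_best), 3))
```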
 | Predicted positive | Predicted negative
---|---|---
Real positive | True positive (TP) | False negative (FN)
Real negative | False positive (FP) | True negative (TN)
Algorithm | Year | References | Type | Source | Rand.? | Dep. \(t\)? | Data | Max. \(t\)
---|---|---|---|---|---|---|---|---
Data geometry analysis methods | | | | | | | |
Clustering-based methods | | | | | | | |
Clustering and 1-nearest neighbor cluster analysis | 1999 | Lyhyaoui et al. (1999) | One-pass | Global | No | Yes | AR | 500
Analysis of k-means clusters in the training set | 2000 | Barros de Almeida et al. (2000) | One-pass | Global | Yes | Yes | A | \(10^3\)
Hierarchical micro-clustering | 2003 | Yu et al. (2003) | Iterative | Global, SVM | No | Yes | AR | \(5\times 10^6\)
Clustering and processing of the crisp clusters | 2004 | Koggalage and Halgamuge (2004) | Iterative | Global | Yes | Yes | B | \(5\times 10^3\)
Analysis of similarities between training vectors | 2004 | Wang and Xu (2004) | One-pass | Global | No | Yes | A | \(10^5\)
Analysis of Hausdorff distances between vectors and convex hulls | 2007 | Wang et al. (2007) | One-pass | Global | No | Yes | B | 615
Minimum enclosing ball clustering | 2008 | Cervantes et al. (2008) | One-pass | Global, SVM | Yes | Yes | ABR | \(5\times 10^5\)
Hierarchical clustering and mutual cluster analysis | 2008 | Wang and Shi (2008) | One-pass | Global | No | Yes | AB | \(3.7\times 10^3\)
Smallest enclosing ball support vector machines | 2008 | Zeng et al. (2008) | One-pass | Global | No | Yes | B | \(5\times 10^5\)
Edge detection and analysis of clusters | 2009 | Li et al. (2009) | One-pass | Local, global | Yes | Yes | ABR | \(5\times 10^3\)
Analysis of convex-concave hulls | 2012 | | One-pass | Global | No | Yes | AB | \(2.5\times 10^5\)
Analysis of convex hulls | 2013 | Wang et al. (2013a) | Iterative | Global | Yes | Yes | ABR | \(1.6\times 10^5\)
Modified analysis of convex hulls | 2013 | Khosravani et al. (2013) | Iterative | Global | No | Yes | A | \(4\times 10^3\)
Analysis of cluster boundaries and cluster independencies | 2016 | Shen et al. (2016) | Iterative | Global, theory | Yes | Yes | ABR | \(5.7\times 10^5\)
Non-clustering methods | | | | | | | |
Extraction of boundary data using the Mahalanobis distance | 2001 | Abe and Inoue (2001) | One-pass | Global | No | Yes | B | 800
\(\beta \)-skeleton analysis of the training set | 2002 | Zhang and King (2002) | One-pass | Global | No | Yes | B | \(4.4\times 10^3\)
Nearest neighbor condensation | 2010 | Angiulli and Astorino (2010) | Iterative | Global | No | Yes | ABR | \(10^6\)
Neighborhood analysis methods | | | | | | | |
Pattern selection using k-nearest neighbors | 2002 | Shin and Cho (2002) | One-pass | Local | No | Yes | A | 600
Analysis of the vector’s neighborhood heterogeneity | 2003 | Shin and Cho (2003) | One-pass | Local | Yes | Yes | AB | \(1.3\times 10^4\)
Sphere-based neighborhood analysis of training vectors | 2005 | Wang et al. (2005) | One-pass | Local | No | Yes | B | 547
Analysis of the vector’s neighborhood properties | 2007 | Shin and Cho (2007) | Iterative | Local | Yes | Yes | ABR | \(1.2\times 10^4\)
Analysis of ensemble margins of training vectors | 2010 | Guo et al. (2010) | One-pass | Theory | No | Yes | AB | \(2\times 10^3\)
Neighborhood-based rough set model | 2011 | He et al. (2011) | One-pass | Local | No | Yes | AB | \(5\times 10^3\)
Extreme vectors selection using the data boundaries | 2011 | Li (2011) | One-pass | Local | No | Yes | AB | \(1.4\times 10^5\)
Border-edge pattern selection | 2011 | Li and Maguire (2011) | One-pass | Local | No | Yes | AB | \(5\times 10^4\)
Improved analysis of ensemble margins of each vector | 2015 | Guo and Boukir (2015) | One-pass | Theory | No | Yes | AB | \(10^4\)
Decision trees | 2015 | Cervantes et al. (2015) | One-pass | Local, SVM | Yes | No | AB | \(10^6\)
Evolutionary methods | | | | | | | |
Initial work on genetic algorithms to select \(\varvec{T'}\)’s | 2007 | Kawulok (2007) | Iterative | Wrapper | Yes | No | R | \(2.9\times 10^5\)
Random sampling enhanced with crossover | 2008 | Nishida and Kurita (2008) | Iterative | Wrapper | Yes | No | AB | \(6\times 10^4\)
Genetic algorithm (GASVM) | 2012 | Kawulok and Nalepa (2012) | Iterative | Wrapper | Yes | No | AR | \(7\times 10^6\)
Adaptive genetic algorithm (AGA) | 2014 | Nalepa and Kawulok (2014a) | Iterative | Wrapper | Yes | No | ABR | \(3\times 10^5\)
Dynamically adaptive genetic algorithm (DAGA) | 2014 | Kawulok and Nalepa (2014a) | Iterative | Wrapper, SVM | Yes | No | ABR | \(9\times 10^4\)
Memetic algorithm (MASVM) | 2014 | Nalepa and Kawulok (2014b) | Iterative | Wrapper, SVM | Yes | No | ABR | \(4\times 10^6\)
Multi-objective evolutionary sampling for imbalanced data | 2015 | Fernandes et al. (2015) | Iterative | Wrapper | Yes | No | B | 1484
Multi-objective evolutionary algorithm | 2015 | Pighetti et al. (2015) | Iterative | Wrapper | Yes | No | R | \(3\times 10^4\)
Memetic algorithm for evolving training sets and labels | 2015 | Kawulok and Nalepa (2015) | Iterative | Wrapper, SVM | Yes | No | ABR | \(4\times 10^6\)
Parameter-less SVMs | 2015 | Nalepa et al. (2015b) | Iterative | Wrapper, SVM | Yes | No | AB | \(4\times 10^6\)
Evolutionary wrapper approaches for training set selection | 2016 | Verbiest et al. (2016) | Iterative | Wrapper | Yes | No | B | 1728
Adaptive memetic algorithm enhanced with geometry analysis (PCA\(^2\)MA) | 2016 | Nalepa and Kawulok (2016a) | Iterative | Wrapper, SVM | Yes | Yes | ABR | \(4\times 10^6\)
An alternating genetic algorithm for selecting SVM model and training set | 2017 | Kawulok et al. (2017) | Iterative | Wrapper | Yes | No | AB | \(2.7\times 10^4\)
Active learning methods | | | | | | | |
Active learning-based heuristic algorithm | 2000 | Schohn and Cohn (2000) | Iterative | SVM | Yes | Yes | BR | \(6\times 10^3\)
Pool-based active learning | 2001 | Tong and Chang (2001) | Iterative | Wrapper, theory | Yes | No | R | \(2\times 10^3\)
Guided active learning | 2002 | Tong and Koller (2002) | Iterative | Wrapper, theory | Yes | No | BR | \(3.3\times 10^3\)
Ensemble learning with active example selection | 2011 | Oh et al. (2011) | Iterative | Wrapper | Yes | No | R | 768
Random sampling methods | | | | | | | |
Random sampling | 2001 | Balcázar et al. (2001) | Iterative | Wrapper | Yes | No | – | –
3.7.2 Standard experimental setup
-
Sensitivity analysis The impact of the most important components of a new algorithm on its overall performance is verified in the sensitivity analysis. Usually, one (or more) components are enabled at a time (with the other components disabled), and the experiments are repeated for each such configuration.
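In practice, such an analysis often reduces to sweeping over on/off combinations of the algorithm's components; a generic sketch (the component names and the evaluation routine below are placeholders, not taken from any reviewed method):

```python
from itertools import product
import random

components = ["geometry_filter", "local_refinement", "adaptive_mutation"]

def evaluate(config):
    # Placeholder for: select a refined set with the enabled components,
    # train an SVM, and measure, e.g., AUC on a held-out test set.
    return random.random()

for flags in product([False, True], repeat=len(components)):
    config = dict(zip(components, flags))
    score = evaluate(config)  # in practice, averaged over several runs
    print(config, round(score, 3))
```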
-
Comparison with other training set selection algorithms and SVMs trained using the entire set The comparison with the state of the art is always crucial for new training set selection algorithms. New techniques are commonly compared, qualitatively and quantitatively (using the measures discussed in the previous section), both with SVMs trained using the entire set, i.e., without any training set selection applied (although this may be impossible due to the cardinality of this set), and with other techniques from the literature (very often from different categories).
3.7.3 Datasets and practical applications
-
Artificially generated datasets Vectors in artificial datasets are usually generated to follow a known distribution (e.g., the Gaussian distribution). Therefore, the underlying data characteristics are known (which is not always achievable in the case of benchmark and real-life sets). Additionally, artificially generated sets are often straightforward to visualize. Such datasets are used to understand the behavior of new training set selection algorithms (e.g., whether the vectors in the refined sets are positioned near the decision hyperplane or whether there are any vectors that could be removed from the refined sets as they are not selected as SVs). Several artificially generated datasets are available at http://sun.aei.polsl.pl/~jnalepa/SVM/ (see example datasets in Fig. 10—white and black pixels visualize vectors from the positive \(\varvec{T}_{+}\) and negative \(\varvec{T}_{-}\) classes; training set vectors are grouped into clusters in the \(\alpha \) versions of these 2D sets).
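For illustration, a two-class 2D set with Gaussian clusters (loosely mimicking the clustered \(\alpha \) variants mentioned above) can be generated in a few lines; the cluster positions and sizes below are arbitrary choices of ours, not those of the linked datasets:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two Gaussian "blobs" per class, arranged in a checkerboard pattern.
pos = np.vstack([rng.normal(loc=c, scale=0.5, size=(500, 2))
                 for c in ([0, 0], [3, 3])])      # class T_plus
neg = np.vstack([rng.normal(loc=c, scale=0.5, size=(500, 2))
                 for c in ([0, 3], [3, 0])])      # class T_minus
X = np.vstack([pos, neg])
y = np.hstack([np.ones(len(pos)), -np.ones(len(neg))])
# Being 2D, the set is straightforward to plot, e.g., with matplotlib.
```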
-
Benchmark datasets Such datasets (of different characteristics) are exploited to compare the performance of training set selection algorithms (benchmark sets were used in more than \(70\%\) of the papers presented in this review). These datasets can be downloaded from the following repositories:
-
Knowledge Extraction based on Evolutionary Learning (KEEL) repository: http://www.keel.es/.
-
LibSVM repository: https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.
-
University of California, Irvine (UCI) Machine Learning Repository: http://archive.ics.uci.edu/ml/.

In Table 4, we gather the characteristics of the ten benchmark sets most frequently used in the analyzed papers, alongside the repository name (the same dataset can often be downloaded from more than one repository). For multi-class sets (e.g., Yeast), pairwise coupling is performed—the multi-class classification problem is decomposed into two-class problems, and the majority voting principle is used (a number of binary SVMs vote for the final class label of an incoming vector). Although the sizes of these benchmark datasets are not very large, they are widely used in the literature to compare training set selection algorithms (also thanks to well-defined experimental protocols, often presented on the repository websites, which make comparisons much easier).
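Pairwise coupling corresponds to a one-vs-one decomposition; a minimal sketch with scikit-learn, using synthetic data shaped like the Yeast set (10 classes, 8 features) rather than the actual dataset:

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier
from sklearn.svm import SVC

# Synthetic stand-in for a 10-class set such as Yeast (8 features).
X, y = make_classification(n_samples=1484, n_features=8, n_informative=6,
                           n_classes=10, random_state=0)

# n * (n - 1) / 2 = 45 binary SVMs; majority voting picks the label.
ovo = OneVsOneClassifier(SVC(kernel="rbf")).fit(X, y)
print(len(ovo.estimators_))  # 45
```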
-
Practical applications and real-life datasets Although the amount of generated data is steadily growing and the size of training sets has become a real obstacle to exploiting SVMs in practice, fewer than \(45\%\) of all investigated training set selection algorithms were tested using real-life datasets. The most interesting practical scenarios in which training set selection algorithms have been tested and utilized include:
-
Hand-written digits classification In several works (Shin and Cho 2003, 2007; Nishida and Kurita 2008), the authors tackled the multi-class hand-written digits classification problem and exploited the MNIST dataset (http://yann.lecun.com/exdb/mnist/). They applied SVM training set selection algorithms to retrieve useful data from \(6\times 10^4\) training images (hand-written digits belonging to 10 classes, see Fig. 11). Important applications of the automated analysis of digitized hand-written text include bank check processing, postal address identification, analysis of historical documents, and biometric authentication.
-
Skin detection and segmentation Detecting pixels representing human skin in color images (which is a preliminary step of the skin region segmentation process, whose aim is to determine the boundaries of skin regions) is a difficult and important pattern recognition task. Its applications include content filtering, hand and face detection and tracking, human-computer interaction, and many more (Kawulok et al. 2014). Kawulok and Nalepa (2012) generated the Skin dataset of skin and non-skin pixels (in the YC\(_{b}\)C\(_{r}\) color space; \(4\times 10^6\) pixels in total)—they exploited images from the ECU face and skin detection database elaborated by Phung et al. (2005) (see example images in Fig. 12—note that skin pixels exhibit different color and intensity characteristics), and used this set to test several of their SVM training set selection algorithms (Nalepa and Kawulok 2014a, b, 2016a; Kawulok and Nalepa 2014a; Nalepa 2016).
-
Hand pose estimation Kawulok and Nalepa (2014b) applied SVMs to recognize hand poses based on the shape context descriptors (Belongie et al. 2002). In their approach, vectors of differences between two hand shapes are classified to determine whether they represent the same pose (hence, the class decision is indirect). The authors showed that training sets can become very large even for a relatively small number of gestures (i.e., for \(n\) gestures, \(\frac{n!}{2\cdot (n-2)!}=\frac{n(n-1)}{2}\) feature vectors are obtained; e.g., 1225 vectors for \(n=50\)). To make SVMs applicable in this scenario, a genetic technique was utilized for selecting refined SVM training sets (Kawulok and Nalepa 2012).
-
Face detection Kawulok (2007) and Wang et al. (2013a) verified their SVM training set selection algorithms in the face detection problem—Wang et al. (2013a) exploited a dataset with almost 3500 images, whereas Kawulok (2007) used 1000 images from the famous Feret database presented by Phillips et al. (1998). Face detection is a pattern recognition task aimed at determining whether or not an input image contains a human face. Face detection algorithms are being exploited in surveillance systems, human–computer interaction and entertainment applications, human gait characterization, gender classification and many more (Paul and Haque 2013).
-
Detection of deceptive facial expressions Facial image analysis is an active topic—new research directions focus on facial dynamics recognition and understanding for deception detection, behavioral analysis and diagnosis of psychological disorders. Kawulok et al. (2016) used fast smile intensity detectors to elaborate textural facial features that are fed into the SVM classification pipeline to distinguish between posed and spontaneous expressions in video sequences from the UvA-NEMO database containing 1240 sequences, including 643 posed and 597 spontaneous smiles (Dibeklioğlu et al. 2012)—see examples in Fig. 13. Since these features are extracted for each frame (also those which are neutral, without any features exposing the smile characteristics), SVM training sets may become very large and often contain “useless” vectors. To deal with these issues, the authors utilized their memetic training set selection algorithm (Nalepa and Kawulok 2016a).
-
Image retrieval Tong and Chang (2001) showed that their SVM active learning training set selection algorithm can be successfully applied for image retrieval. It selects the most informative images to effectively query a user and quickly learn the decision hyperplane which should separate unlabeled \(\varvec{T}\) images to satisfy the user’s query. With the use of real-life datasets (encompassing up to 2000 images collected from the Internet), the authors showed that their technique outperforms other state-of-the-art image retrieval approaches. Such image retrieval techniques are commonly applied in the textiles industry, nudity-detection filtering engines, picture and art archives, and even medical diagnosis (Trojacanec et al. 2009).
-
Biomedical applications Selecting appropriate training sets is an important problem in biomedical applications since the data quality and volume are big issues in this field. The following points summarize the most interesting biomedical applications in which SVM training set selection algorithms have been tested and utilized.
-
RNA classification SVMs have been successfully applied to detect non-coding RNAs (ncRNAs) in sequenced genomes (Uzilov et al. 2006). However, RNA datasets are very large, which affects the SVM training. Cervantes et al. (2008) exploited their clustering-based training set selection algorithm for two RNA datasets (the first one included almost \(5\times 10^5\) vectors with 8 features, and the second—\(2\times 10^3\) vectors with 84 features) and showed that it is quite competitive with the state of the art for such large-scale data. Wang et al. (2013a) tested their training set selection algorithm on an interesting problem of deciding whether an incoming vector represents RNA of cod fish (the entire training set encompassed more than \(3\times 10^5\) vectors with 8 features).
-
Diseases classification (e.g., leukemia, diabetes, Parkinson’s disease, hepatitis) A number of approaches have been tested on various disease classification tasks. In a standard medical image analysis scenario, the cardinality of a training set is not very large, but such datasets are highly imbalanced (usually, there are many more healthy examples than pathological ones). Therefore, applying an appropriate approach for selecting the desired training sets is indispensable. Oh et al. (2011) investigated their SVM training set selection using such imbalanced sets for various diseases (leukemia, diabetes, Parkinson’s disease, hepatitis, breast cancer, and cardiac diseases). These datasets included up to 800 vectors (Diabetes dataset) and up to almost 7200 features (Leukemia dataset).
-
Network intrusion detection Yu et al. (2003) focused on the important network intrusion detection problem. Its aim is to build a classifier which is able to distinguish between “bad” connections (intrusions and/or attacks) and normal connections. To test their approaches, the authors exploited a dataset containing a variety of intrusions simulated in a military network environment (42 features, \(4\times 10^6\) vectors). This dataset is available at http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html, and it was used as a benchmark at the Third International Knowledge Discovery and Data Mining Tools Competition, held in conjunction with The Fifth International Conference on Knowledge Discovery and Data Mining (KDD-99). Interestingly, the test data does not follow the same probability distribution as the training data (it includes 14 specific attack types that are not present in \(\varvec{T}\), which contains 24 training attack types). Yu et al. (2003) showed that their clustering-based training set selection technique can easily outperform random sampling in this scenario.
-
Text classification Text classification, i.e., the problem of determining to which topic a given text document belongs (a document may fall into one, multiple, or no category because of the overlaps across categories), is an important research topic which has been accelerated by the rapid growth of online information. Its applications include spam filtering, language identification, e-mail routing, readability assessment, and more. Schohn and Cohn (2000) and Tong and Koller (2002) tackled this problem to verify the capabilities of their active-learning SVM approaches. They exploited the Reuters-21578 dataset (http://www.daviddlewis.com/resources/testcollections/reuters21578/) in the ModApte data split configuration (there are several pre-defined training-test splits provided by the authors of this dataset) with almost \(1.3\times 10^4\) articles (about \(10^4\) features each) and considered the 10 most frequently occurring categories. Another commonly used text-classification dataset is Newsweeder (Lang 1995), which was also investigated in these papers.
-
Credit screening Lyhyaoui et al. (1999) tested their SVM training set selection using a dataset of 690 examples (15 features, 2 classes) reflecting customer creditworthiness. Although this dataset is known to be noisy (Quinlan 1999), the authors were able to surpass \(90\%\) classification accuracy using their clustering-based technique.
-
Dataset | \(t\) | #features | Types of features | #classes | Repository
---|---|---|---|---|---
Adult | 15,082 | 14 | Categorical, integer | 2 | UCI
Breast | 277 | 9 | Categorical, integer, real | 2 | KEEL
Bupa | 345 | 6 | Categorical, integer, real | 2 | KEEL
German | 1000 | 20 | Categorical, integer | 2 | UCI
Ionosphere | 351 | 34 | Integer, real | 2 | UCI
Iris | 150 | 4 | Real | 2 | UCI
Mushroom | 8124 | 22 | Categorical | 2 | UCI
Sonar | 208 | 60 | Categorical, integer, real | 2 | KEEL
Wisconsin | 683 | 9 | Integer | 2 | KEEL
Yeast | 1484 | 8 | Real | 10 | UCI