1 Introduction
- virtual concept drift means that changes do not impact the posterior probabilities but affect the conditional probability density functions (Widmer and Kubat 1993).
- slow changes, i.e., gradual or incremental drift.
- abrupt changes, i.e., sudden drift.
2 Related works
2.1 Data stream classification in the presence of concept drift
- Training a new classifier every time new data become available. Such an approach is impractical and very expensive, especially if drift occurs rapidly.
- Detecting concept drift in new data and, if the changes are significant enough, training the classifier on the basis of data gathered after the drift occurred.
- Adopting an incremental learning algorithm so that the classification model smoothly adapts to the changing nature of incoming objects.
- Each object must be processed only once in the course of training.
- Memory and computing time are limited.
- The classifier training may be interrupted several times, and its quality should not be lower than that of a classifier trained in batch mode (a minimal processing loop respecting these constraints is sketched below).
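To make these constraints concrete, the following minimal sketch processes a stream chunk by chunk under a bounded memory budget. `IncrementalClassifier` is a hypothetical stand-in for any model with a single-pass update (the paper's adaptive WOCSVM is introduced in Sect. 3), not an API defined by the paper.

```python
from collections import deque

class IncrementalClassifier:
    """Hypothetical stand-in: any model exposing a single-pass update."""

    def partial_fit(self, chunk):
        # Each object in `chunk` is visited exactly once (single-pass
        # constraint); the update must run in bounded time per object.
        for x in chunk:
            pass  # model-specific update goes here

def process_stream(chunks, max_buffer=1000):
    """Chunk-by-chunk processing under bounded memory."""
    model = IncrementalClassifier()
    buffer = deque(maxlen=max_buffer)  # bounded memory: old objects drop out
    for chunk in chunks:
        model.partial_fit(chunk)       # single pass over the new data
        buffer.extend(chunk)           # only a bounded window is retained
        yield model                    # anytime access to the current model
```

Iterating over `process_stream` yields the up-to-date model after every chunk, so evaluation can be interleaved with training, which matches the interruptibility requirement above.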
2.2 One-class classification
2.3 Weighted one-class support vector machine
3 Adaptive WOCSVM for incremental data stream
3.1 Incremental learning
- assigning weights to objects from \({\mathcal {DS}}_{k}\) according to Eq. (11). This is motivated by the fact that not all objects in the incoming data chunk should have the same impact on the shape of the new decision boundary;
- assigning the highest weights to objects coming from the new data chunk:
$$\begin{aligned} \forall _{x_i \in {\mathcal {DS}}_k} \;\; w_i = 1. \end{aligned}$$ (12)
This is motivated by the fact that, in the presence of concept drift, objects from the new chunk best reflect the current state of the stream and should therefore have top priority in forming the new decision boundary (a code sketch of both weighting schemes follows this list).
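A minimal sketch of the two weighting schemes applied to a new chunk. Since Eq. (11) is not reproduced in this excerpt, `weight_fn` below is a hypothetical placeholder for it; Eq. (12) is implemented directly.

```python
import numpy as np

def weights_new_chunk_eq12(chunk):
    """Eq. (12): every object in the new chunk receives the highest
    weight, w_i = 1."""
    return np.ones(len(chunk))

def weights_new_chunk_eq11(chunk, weight_fn):
    """Eq. (11) assigns per-object weights; the equation itself is not
    shown in this excerpt, so `weight_fn` stands in for it."""
    return np.asarray([weight_fn(x) for x in chunk], dtype=float)
```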
3.2 Incremental forgetting
- gradual decrease of weights with respect to their initial importance. Here, we introduce a denomination factor \(\tau \), a user-specified value used to decrease the weights in each iteration:
$$\begin{aligned} w_i^{k+1} = w_i^k - \tau . \end{aligned}$$ (13)
This is motivated by the fact that if an object was initially assigned a higher weight, it had a greater importance for the classifier. Such objects can remain valuable for a longer period of time than objects with low initial weights, whose weights approach 0 sooner and which are therefore removed after fewer iterations.
- aligned decrease of weights without considering their initial importance. Here, we introduce a time factor \(\kappa \), a user-specified value standing for the number of iterations after which an object should be removed:
$$\begin{aligned} w_i^{k+1} = w_i^k - (w_i^a / \kappa ), \end{aligned}$$ (14)
where \(w_i^a\) stands for the initial value of the weight assigned to the \(i\)th object. As we can see, the weights of objects are reduced with each iteration until they are equal to 0 (and the objects are removed from \({\mathcal {DS}}\)). The main difference is that this method does not consider the initial importance of the data: all objects from the \(k\)th data chunk are removed at the same moment, after \(\kappa \) iterations. This is motivated by the fact that changes in a dynamic environment can be unpredictable and may quickly depart from the original distribution; therefore, data from previous steps may quickly lose their importance.
- decrease of weights according to the following sigmoidal function:
$$\begin{aligned} w_i^{k+1} = w_i^a \, \frac{2\exp (-\beta (k+1))}{1+\exp (-\beta (k+1))} , \end{aligned}$$ (15)
where \(\beta \) is responsible for the forgetting rapidity and its value should be determined experimentally. This method allows for a smooth forgetting of previous data, with the rapidity of changes controlled by the user (all three forgetting rules are sketched in code after this list).
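A minimal sketch of the three forgetting rules, assuming weights are kept in NumPy arrays and that objects are dropped once their weight reaches 0. The default parameter values follow Sect. 4.2 where stated; everything else is an assumption.

```python
import numpy as np

def forget_gradual(w, tau=0.15):
    """Eq. (13): subtract a fixed denomination factor tau per iteration,
    so objects with higher current weights survive longer."""
    return np.maximum(w - tau, 0.0)

def forget_aligned(w, w_init, kappa):
    """Eq. (14): subtract w_init / kappa per iteration, so every object
    of a chunk reaches weight 0 after exactly kappa iterations."""
    return np.maximum(w - w_init / kappa, 0.0)

def forget_sigmoidal(w_init, k, beta=9.0):
    """Eq. (15): smooth sigmoidal decay of the initial weights, with
    beta controlling the forgetting rapidity."""
    decay = 2.0 * np.exp(-beta * (k + 1)) / (1.0 + np.exp(-beta * (k + 1)))
    return w_init * decay

def prune(X, w, eps=1e-12):
    """Removes objects whose weight has reached 0 from the training set."""
    keep = w > eps
    return X[keep], w[keep]
```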
4 Experimental investigations
- to establish whether applying the principles of incremental learning and forgetting in one-class classification allows for efficient handling of data streams in the presence of concept drift;
- to examine the effectiveness of the incremental learning and forgetting schemes introduced in this paper.
4.1 Data sets
Data set | Objects | Features | Classes | Drift type
---|---|---|---|---
RBF | 1,000,000 | 20 | 2 | Gradual
LED | 1,000,000 | 24 | 10 | Gradual
COV | 581,012 | 53 | 7 | Unknown
ELEC | 45,312 | 7 | 2 | Unknown
AIR | 539,383 | 7 | 2 | Unknown
- RBF: the radial basis function generator outputs a user-defined number of drifting centroids. Each centroid is described by a class label, position, weight, and standard deviation. The created data set is defined as a two-class problem with 1,000,000 objects in each class. Four gradual recurring concept drifts were defined. To turn this data set into a one-class problem, we use the positive class as the target class and the negative class as the outlier class.
- LED: an artificial data stream generator whose output is described by 24 features representing a seven-segment LED display. We generate a data set consisting of 1,000,000 objects with gradual concept drift. We use digit 0 as the target class and the remaining digits as outliers.
- COV: forest cover type is a real-life data set that deals with different cover types in four wilderness areas. Each example is described by 53 cartographic features that allow categorizing the cover type into one of seven classes. There are 581,012 examples in this data set. We use the Lodgepole Pine class (cover type 2) as the target class (it is the most numerous, with 283,301 examples) and the remaining six classes as outliers.
- ELEC: electricity is a real-life data set describing fluctuations in energy prices on the electricity market. This benchmark consists of 45,312 objects, each described by seven features. We use the positive class as the target concept and the negative class as outliers.
- AIR: airlines is a real-life data set dealing with the problem of predicting flight delays. This data stream consists of 539,383 objects, each described by seven features. We use the positive class (flight on time) as the target concept and the negative class (flight delayed) as outliers (the target/outlier relabeling shared by all five benchmarks is sketched below).
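Across all five benchmarks, the conversion to a one-class problem is the same relabeling: one class becomes the target and every other class an outlier. A minimal sketch (the +1/-1 encoding is our assumption; the paper does not fix one):

```python
import numpy as np

def to_one_class(y, target_label):
    """Maps a multi-class label vector to a one-class view: the chosen
    class becomes the target (+1), all other classes outliers (-1)."""
    y = np.asarray(y)
    return np.where(y == target_label, 1, -1)

# e.g., for COV the target is Lodgepole Pine (cover type 2):
# y_oc = to_one_class(y, target_label=2)
```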
4.2 Set-up
Model | Description
---|---
\(L_1 F_0\) | Standard WOCSVM without any adaptation mechanism; trained on a single incoming chunk, without any previous data
\(L_1 F_1\) | Incremental learning: assigning weights to new objects according to Eq. (11). Forgetting: gradual decrease of weights according to Eq. (13) with \(\tau = 0.15\)
\(L_2 F_1\) | Incremental learning: assigning the highest weights to new objects according to Eq. (12). Forgetting: gradual decrease of weights according to Eq. (13) with \(\tau = 0.15\)
\(L_1 F_2\) | Incremental learning: assigning weights to new objects according to Eq. (11). Forgetting: aligned decrease of weights without considering their initial importance according to Eq. (14) with \(\kappa = 0.1\)
\(L_2 F_2\) | Incremental learning: assigning the highest weights to new objects according to Eq. (12). Forgetting: aligned decrease of weights without considering their initial importance according to Eq. (14) with \(\kappa = 0.1\)
\(L_1 F_3\) | Incremental learning: assigning weights to new objects according to Eq. (11). Forgetting: decrease of weights according to the proposed sigmoidal function Eq. (15) with \(\beta = 9\)
\(L_2 F_3\) | Incremental learning: assigning the highest weights to new objects according to Eq. (12). Forgetting: decrease of weights according to the proposed sigmoidal function Eq. (15) with \(\beta = 9\)
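The naming convention combines a learning scheme \(L_i\) with a forgetting scheme \(F_j\). A hypothetical configuration grid mirroring the table above (the names and structure are ours, not the paper's); the learning and forgetting identifiers refer to the sketches in Sects. 3.1 and 3.2:

```python
# Learning: "eq11" = per-object weights from Eq. (11),
#           "eq12" = all new weights set to 1 (Eq. 12).
# Forgetting: "gradual" = Eq. (13), "aligned" = Eq. (14),
#             "sigmoidal" = Eq. (15); None = no adaptation (L1F0).
CONFIGS = {
    "L1F0": {"learning": "eq11", "forgetting": None},
    "L1F1": {"learning": "eq11", "forgetting": "gradual",   "tau": 0.15},
    "L2F1": {"learning": "eq12", "forgetting": "gradual",   "tau": 0.15},
    "L1F2": {"learning": "eq11", "forgetting": "aligned",   "kappa": 0.1},
    "L2F2": {"learning": "eq12", "forgetting": "aligned",   "kappa": 0.1},
    "L1F3": {"learning": "eq11", "forgetting": "sigmoidal", "beta": 9.0},
    "L2F3": {"learning": "eq12", "forgetting": "sigmoidal", "beta": 9.0},
}
```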
4.3 Results and discussion
Data set | OcVFDT | IOCSVM | \(L_1 F_0\) | \(L_1 F_1\) | \(L_2 F_1\) | \(L_1 F_2\) | \(L_2 F_2\) | \(L_1 F_3\) | \(L_2 F_3\)
---|---|---|---|---|---|---|---|---|---
RBF | 62.82 | 71.06 | 59.67 | 71.34 | 75.69 | 73.47 | 74.12 | 73.82 | **77.12**
LED | 56.94 | 64.48 | 51.98 | 64.24 | 68.49 | 62.21 | 65.06 | 64.98 | **70.36**
COV | 55.90 | 71.08 | 57.32 | 70.35 | **75.73** | 68.23 | 72.38 | 73.11 | **75.76**
ELEC | 68.39 | 70.42 | 62.08 | 70.15 | 73.49 | 70.03 | 72.94 | 72.67 | **74.04**
AIR | 62.72 | 64.51 | 61.12 | 63.84 | **66.13** | 61.98 | 64.22 | 63.87 | **66.14**
Data set | OcVFDT | IOCSVM | \(L_1 F_0\) | \(L_1 F_1\) | \(L_2 F_1\) | \(L_1 F_2\) | \(L_2 F_2\) | \(L_1 F_3\) | \(L_2 F_3\)
---|---|---|---|---|---|---|---|---|---
RBF | 1.87 | 4.63 | 6.12 | 6.43 | 5.14 | 6.32 | 5.24 | 6.33 | 5.13 |
LED | 1.92 | 4.62 | 5.98 | 6.19 | 5.02 | 6.23 | 6.10 | 6.28 | 5.04 |
COV | 2.38 | 4.06 | 6.01 | 6.12 | 5.00 | 6.14 | 5.02 | 6.14 | 4.99 |
ELEC | 1.32 | 3.74 | 4.56 | 4.73 | 4.26 | 4.68 | 4.21 | 4.65 | 4.22 |
AIR | 0.97 | 2.10 | 4.43 | 4.61 | 3.19 | 4.51 | 3.16 | 4.38 | 3.17 |
Data set | OcVFDT | IOCSVM | \(L_1 F_0\) | \(L_1 F_1\) | \(L_2 F_1\) | \(L_1 F_2\) | \(L_2 F_2\) | \(L_1 F_3\) | \(L_2 F_3\)
---|---|---|---|---|---|---|---|---|---
RBF | 1.12 | 4.62 | 3.54 | 4.04 | 2.78 | 3.97 | 2.65 | 3.82 | 2.34 |
LED | 0.23 | 3.17 | 2.54 | 2.33 | 1.25 | 2.04 | 1.19 | 2.12 | 1.08 |
COV | 0.11 | 2.03 | 1.37 | 2.02 | 1.09 | 1.97 | 1.00 | 1.92 | 0.93 |
ELEC | 0.04 | 1.08 | 0.53 | 0.92 | 0.51 | 0.91 | 0.50 | 0.89 | 0.46 |
AIR | 0.15 | 1.76 | 2.63 | 1.21 | 1.54 | 2.04 | 1.59 | 1.97 | 1.45 |
5 Conclusions
- implementing and testing new forgetting mechanisms,
- testing our approach on different types of concept drift, especially sudden drift, which requires a drift detector and adaptation strategies with a shorter restoration time,
- implementing active learning mechanisms that do not require a label for every example in a chunk,
- implementing a classifier ensemble based on the proposed incremental WOCSVM.