1 Introduction
-
BPF is robust to the presence of outliers and clusters of different sizes and densities.
-
BPF can be used with LOF for detecting boundary points and outliers as BPF shares the k-nearest neighbors and LOF computation.
-
BPF has one tune-able parameter \(k\) (number of nearest neighbors) where the value of k tuned for boundary points detection with BPF can also be used for outlier detection with LOF.
-
Quantitative evaluation of StaticBPF on 2- and high-dimensional synthetic and real data. In our previous work, we demonstrated the accuracy by showing the detected boundary points on 2-d synthetic data and real data. In this paper, in addition to the previous results, the accuracy results are shown quantitatively w.r.t. precision, recall, F1 score, area under precision-recall curve (AUC PR) and area under ROC curve (AUC ROC) on all datasets.
-
Proposal of a boundary points detection method over data stream named \({\textit{StreamBPF}}\). Our previous contribution is suitable for static datasets, and it is computationally expensive to use it for streaming data. Therefore, in this paper, we propose StreamBPF that:1.uses a grid structure to improve the k-nearest neighbors computation, and
-
Runtime performance evaluation of StreamBPF on synthetic and real data, and comparison with StaticBPF and other methods.
2 Related work
Symbols | Description |
---|---|
\(k\) | # Nearest neighbors |
\(n\) | # Points in dataset or window |
\(m\) | # Boundary points |
\(d\) | Dimensionality of data |
\(D\) | Dataset of d-dimensional data points |
\(W_t\) | Window at time step \(t\) |
\(w\) | Slide size of a count-based window |
\(N_k(p)\) | Set of k-nearest neighbors of point \(p\) |
\({\textit{RN}}_k(p)\) | Set of reverse k-neighbors of point \(p\) |
\({\textit{kdist}}(p)\) | Distance of \(p\) with its kth nearest neighbor |
\({\textit{dist}}(p,q)\) | Distance between point \(p\) and \(q\) |
\(G(p)\) | Gravity value of point \(p\) |
\({\textit{LOF}}(p)\) | LOF score of point \(p\) |
\(l\) | Grid cell length |
\({\mathbb {G}}_t\) | Grid at time step \(t\) |
\(C_i\) | \(C_i=(K_i,{\mathbb {S}}_i)\) represents a cell in \({\mathbb {G}}_t\) |
\(K_i\) | Cell key of cell \(C_i\) |
\({\mathbb {S}}_i\) | Subset of points in \(W_t\) belonging to cell \(C_i\) |
3 Preliminaries
4 Boundary point factor (BPF)
5 StaticBPF
5.1 Algorithm
5.2 Runtime complexity
5.3 Evaluation of StaticBPF
5.3.1 Experimental setup
Method | Range |
---|---|
StaticBPF | \( k=[50,100,150,200,250]\) |
BorderShift | \(k=[5,10,50,100]\) |
BRIM | \({\textit{eps}}\) = [\({\textit{min}}\_ {\textit{distance}}\), \({\textit{avg}}\_ {\textit{distance}}\)] |
BORDER | \(k=[50,100,150,200,250]\) |
5.3.2 2-Dimensional synthetic data
Dataset name | Dataset size \((n)\) | #outliers | #boundary points \((m)\) |
---|---|---|---|
Diamonds | 3300 | 220 | 528 |
Rings | 4200 | 172 | 567 |
Mix1 | 3800 | 262 | 838 |
Mix2 | 1710 | 47 | 479 |
Mix3 | 1800 | 127 | 484 |
Mix4 | 2400 | 200 | 298 |
10d | 3300 | 372 | 939 |
20d | 3300 | 401 | 997 |
50d | 3300 | 400 | 1140 |
Method | Parameter | Prec. | Rec. | F1 | ROC | PR |
---|---|---|---|---|---|---|
StaticBPF | \(k=100\) | 0.75 | 0.75 | 0.75 | 0.96 | 0.75 |
BPDAD | #boundaries = 362 | 0.33 | 0.23 | 0.27 | – | – |
BRIM | \(eps=0.96\) | 0.71 | 0.71 | 0.71 | 0.95 | 0.75 |
BORDER | \(k=100\) | 0.55 | 0.55 | 0.55 | 0.86 | 0.41 |
BorderShift | \(k=50\) | 0.83 | 0.83 | 0.83 | 0.97 | 0.85 |
\(\lambda _1=2552,\lambda _2=3080\) |
Method | Parameter | Prec. | Rec. | F1 | ROC | PR |
---|---|---|---|---|---|---|
StaticBPF | \(k=200\) | 0.79 | 0.79 | 0.79 | 0.97 | 0.71 |
BPDAD | #boundaries = 915 | 0.33 | 0.53 | 0.4 | – | – |
BRIM | \(eps=0.094\) | 0.53 | 0.53 | 0.53 | 0.78 | 0.59 |
BORDER | \(k=100\) | 0.69 | 0.69 | 0.69 | 0.95 | 0.55 |
BorderShift | \(k=100\) | 0.66 | 0.66 | 0.66 | 0.93 | 0.69 |
\(\lambda _1=3461,\lambda _2=4028\) |
Method | Parameter | Prec. | Rec. | F1 | ROC | PR |
---|---|---|---|---|---|---|
StaticBPF | \(k=100\) | 0.72 | 0.72 | 0.72 | 0.91 | 0.73 |
BPDAD | \(\#\textrm{boundaries}=716\) | 0.49 | 0.42 | 0.45 | – | – |
BRIM | \({\textit{eps}}=1.13\) | 0.39 | 0.39 | 0.39 | 0.69 | 0.45 |
BORDER | \(k=50\) | 0.65 | 0.65 | 0.65 | 0.86 | 0.51 |
BorderShift | \(k=100\) | 0.69 | 0.69 | 0.69 | 0.88 | 0.69 |
\(\lambda _1=2700,\lambda _2=3538\) |
Method | Parameter | Prec. | Rec. | F1 | ROC | PR |
---|---|---|---|---|---|---|
StaticBPF | \(k=100\) | 0.85 | 0.85 | 0.85 | 0.96 | 0.81 |
BPDAD | \(\#\textrm{boundaries}=689\) | 0.32 | 0.46 | 0.38 | – | – |
BRIM | \({\textit{eps}}=1.19\) | 0.63 | 0.63 | 0.63 | 0.75 | 0.69 |
BORDER | \(k=100\) | 0.9 | 0.9 | 0.9 | 0.96 | 0.79 |
BorderShift | \(k=50\) | 0.71 | 0.71 | 0.71 | 0.89 | 0.76 |
\(\lambda _1=1184,\lambda _2=1663\) |
Method | Parameter | Prec. | Rec. | F1 | ROC | PR |
---|---|---|---|---|---|---|
StaticBPF | \(k=100\) | 0.84 | 0.84 | 0.84 | 0.96 | 0.82 |
BPDAD | #boundaries = 453 | 0.35 | 0.33 | 0.34 | – | – |
BRIM | \(eps=1.2\) | 0.65 | 0.65 | 0.65 | 0.74 | 0.65 |
BORDER | \(k=100\) | 0.71 | 0.71 | 0.71 | 0.89 | 0.58 |
BorderShift | \(k=50\) | 0.72 | 0.72 | 0.72 | 0.9 | 0.79 |
\(\lambda _1=1189,\lambda _2=1673\) |
Methods | Parameter | Prec. | Rec. | F1 | ROC | PR |
---|---|---|---|---|---|---|
StaticBPF | \(k=50\) | 0.74 | 0.74 | 0.74 | 0.97 | 0.78 |
\(k=100\) | 0.72 | 0.72 | 0.72 | 0.97 | 0.76 | |
\(k=150\) | 0.67 | 0.67 | 0.67 | 0.97 | 0.72 | |
BorderShift | \(k=50\), \(\lambda _{1}=2102\), \(\lambda _{2}=2400\) | 0.33 | 0.33 | 0.33 | 0.86 | 0.31 |
\(k=50\), \(\lambda _{1}=1902\), \(\lambda _{2}=2200\) | 0.6 | 0.6 | 0.6 | 0.91 | 0.67 | |
\(k=50\), \(\lambda _{1}=1702\), \( \lambda _{2}=2000\) | 0.42 | 0.42 | 0.42 | 0.55 | 0.25 | |
BRIM | \(eps=0.58\) | 0.52 | 0.52 | 0.52 | 0.82 | 0.57 |
\(eps=1.15\) | 0.56 | 0.56 | 0.56 | 0.76 | 0.53 | |
\(eps=1.73\) | 0.39 | 0.39 | 0.39 | 0.76 | 0.5 | |
BORDER | \(k=100\) | 0.32 | 0.32 | 0.32 | 0.82 | 0.29 |
\(k=150\) | 0.35 | 0.35 | 0.35 | 0.86 | 0.32 | |
\(k=200\) | 0.37 | 0.37 | 0.37 | 0.87 | 0.34 | |
BPDAD | #boundary pts = 740 | 0.15 | 0.37 | 0.21 | – | – |
Methods | Parameter | Prec. | Rec. | F1 | ROC | PR |
---|---|---|---|---|---|---|
StaticBPF | \(k=50\) | 0.75 | 0.75 | 0.75 | 0.97 | 0.8 |
\(k=100\) | 0.75 | 0.75 | 0.75 | 0.97 | 0.8 | |
\(k=150\) | 0.7 | 0.7 | 0.7 | 0.97 | 0.76 | |
BorderShift | \(k=50\), \(\lambda _{1}=2010\), \(\lambda _{2}=2300\) | 0.49 | 0.49 | 0.49 | 0.92 | 0.44 |
\(k=50\), \(\lambda _{1}=1810\), \(\lambda _{2}=2100\) | 0.48 | 0.48 | 0.48 | 0.63 | 0.31 | |
\(k=50\), \(\lambda _{1}=1610\), \( \lambda _{2}=1900\) | 0.25 | 0.25 | 0.25 | 0.3 | 0.12 | |
BRIM | \(eps=0.74\) | 0.46 | 0.46 | 0.46 | 0.76 | 0.53 |
\(eps=1.4\) | 0.44 | 0.44 | 0.44 | 0.74 | 0.38 | |
\(eps=2.16\) | 0.25 | 0.25 | 0.25 | 0.68 | 0.26 | |
BORDER | \(k=50\) | 0.33 | 0.33 | 0.33 | 0.74 | 0.24 |
\(k=100\) | 0.31 | 0.31 | 0.31 | 0.81 | 0.29 | |
\(k=150\) | 0.33 | 0.33 | 0.33 | 0.85 | 0.31 | |
BPDAD | #boundary pts = 740 | 0.16 | 0.42 | 0.23 | – | – |
5.3.3 High-dimensional synthetic data
Methods | Parameter | Prec. | Rec. | F1 | ROC | PR |
---|---|---|---|---|---|---|
StaticBPF | \(k=150\) | 0.78 | 0.78 | 0.78 | 0.94 | 0.78 |
\(k=200\) | 0.79 | 0.79 | 0.79 | 0.94 | 0.79 | |
\(k=250\) | 0.8 | 0.8 | 0.8 | 0.94 | 0.8 | |
BorderShift | \(k=100\), \(\lambda _{1}=2189\), \(\lambda _{2}=3128\) | 0.54 | 0.54 | 0.54 | 0.74 | 0.61 |
\(k=100\), \(\lambda _{1}=1989\), \(\lambda _{2}=2928\) | 0.53 | 0.53 | 0.53 | 0.62 | 0.43 | |
\(k=100\), \(\lambda _{1}=1789\), \( \lambda _{2}=2728\) | 0.36 | 0.36 | 0.36 | 0.48 | 0.21 | |
BRIM | \(eps=0.38\) | 0.5 | 0.5 | 0.5 | 0.69 | 0.54 |
\({\textit{eps}}=0.58\) | 0.62 | 0.62 | 0.62 | 0.79 | 0.62 | |
\({\textit{eps}}=0.77\) | 0.43 | 0.43 | 0.43 | 0.59 | 0.4 | |
BORDER | \(k=100\) | 0.58 | 0.58 | 0.58 | 0.83 | 0.49 |
\(k=150\) | 0.59 | 0.59 | 0.59 | 0.83 | 0.49 | |
\(k=200\) | 0.6 | 0.6 | 0.6 | 0.83 | 0.49 | |
BPDAD | #boundary pts = 872 | 0.23 | 0.21 | 0.22 | – | – |
Methods | Parameter | Prec. | Rec. | F1 | ROC | PR |
---|---|---|---|---|---|---|
StaticBPF | \(k=100\) | 0.7 | 0.7 | 0.7 | 0.82 | 0.69 |
\(k=200\) | 0.72 | 0.72 | 0.72 | 0.83 | 0.71 | |
\(k=250\) | 0.74 | 0.74 | 0.74 | 0.83 | 0.83 | |
BorderShift | \(k=100\), \(\lambda _{1}=1902\), \(\lambda _{2}=2899\) | 0.47 | 0.47 | 0.47 | 0.64 | 0.37 |
\(k=100\), \(\lambda _{1}=1702\), \(\lambda _{2}=2699\) | 0.37 | 0.37 | 0.37 | 0.69 | 0.45 | |
\(k=100\), \(\lambda _{1}=1502\), \( \lambda _{2}=2499\) | 0.36 | 0.36 | 0.36 | 0.67 | 0.51 | |
BRIM | \({\textit{eps}}=0.55\) | 0.59 | 0.59 | 0.59 | 0.76 | 0.63 |
\({\textit{eps}}=0.82\) | 0.62 | 0.62 | 0.62 | 0.78 | 0.66 | |
\({\textit{eps}}=1.1\) | 0.41 | 0.41 | 0.41 | 0.63 | 0.47 | |
BORDER | \(k=100\) | 0.57 | 0.57 | 0.57 | 0.72 | 0.44 |
\(k=150\) | 0.59 | 0.59 | 0.59 | 0.72 | 0.45 | |
\(k=200\) | 0.61 | 0.61 | 0.61 | 0.75 | 0.42 | |
BPDAD | #boundary pts = 1033 | 0.22 | 0.23 | 0.22 | – | – |
Methods | Parameter | Prec. | Rec. | F1 | ROC | PR |
---|---|---|---|---|---|---|
StaticBPF | \(k=150\) | 0.81 | 0.81 | 0.81 | 0.93 | 0.78 |
\(k=200\) | 0.83 | 0.83 | 0.83 | 0.94 | 0.78 | |
\(k=250\) | 0.84 | 0.84 | 0.84 | 0.94 | 0.78 | |
BorderShift | \(k=100\), \(\lambda _{1}=1760\), \(\lambda _{2}=2900\) | 0.69 | 0.69 | 0.69 | 0.71 | 0.49 |
\(k=100\), \(\lambda _{1}=1560\), \(\lambda _{2}=2700\) | 0.74 | 0.74 | 0.74 | 0.78 | 0.65 | |
\(k=100\), \(\lambda _{1}=1360\), \( \lambda _{2}=2500\) | 0.68 | 0.68 | 0.68 | 0.78 | 0.75 | |
BRIM | \({\textit{eps}}=1.2\) | 0.55 | 0.55 | 0.55 | 0.79 | 0.72 |
\({\textit{eps}}=1.72\) | 0.48 | 0.48 | 0.48 | 0.71 | 0.53 | |
\({\textit{eps}}=2.15\) | 0.36 | 0.36 | 0.36 | 0.53 | 0.37 | |
BORDER | \(k=100\) | 0.65 | 0.65 | 0.65 | 0.81 | 0.53 |
\(k=150\) | 0.65 | 0.65 | 0.65 | 0.81 | 0.52 | |
\(k=200\) | 0.65 | 0.65 | 0.65 | 0.81 | 0.53 | |
BPDAD | #boundary pts = 1100 | 0.23 | 0.22 | 0.22 | – | – |
5.3.4 Real data
Dataset name | Dataset size \((n)\) | Dimensions (d) | #Outliers | #Boundaries \((m)\) |
---|---|---|---|---|
Biomed [10] | 194 | 4 | 14 | 42 |
Cancer [11] | 449 | 9 | 88 | 129 |
Methods | Parameter | Prec. | Rec. | F1 | ROC | PR |
---|---|---|---|---|---|---|
StaticBPF | \(k=150\) | 0.64 | 0.64 | 0.64 | 0.92 | 0.73 |
\(k=160\) | 0.69 | 0.69 | 0.69 | 0.95 | 0.79 | |
\(k=170\) | 0.86 | 0.86 | 0.86 | 0.97 | 0.95 | |
BorderShift | \(k=100\), \(\lambda _{1}=138\), \(\lambda _{2}=180\) | 0.83 | 0.83 | 0.83 | 0.96 | 0.88 |
\(k=100\), \(\lambda _{1}=118\), \(\lambda _{2}=160\) | 0.52 | 0.52 | 0.52 | 0.59 | 0.42 | |
\(k=100\), \(\lambda _{1}=98\), \( \lambda _{2}=140\) | 0.14 | 0.14 | 0.14 | 0.18 | 0.034 | |
BRIM | \({\textit{eps}}=0.54\) | 0.64 | 0.64 | 0.64 | 0.91 | 0.69 |
\({\textit{eps}}=0.6\) | 0.69 | 0.69 | 0.60 | 0.91 | 0.76 | |
\({\textit{eps}}=0.64\) | 0.67 | 0.67 | 0.67 | 0.91 | 0.75 | |
BORDER | \(k=150\) | 0.67 | 0.67 | 0.67 | 0.91 | 0.56 |
\(k=160\) | 0.69 | 0.69 | 0.69 | 0.91 | 0.55 | |
\(k=170\) | 0.71 | 0.71 | 0.71 | 0.82 | 0.51 | |
BPDAD | #boundary pts = 49 | 0.53 | 0.62 | 0.57 | – | – |
Methods | Parameter | Prec. | Rec. | F1 | ROC | PR |
---|---|---|---|---|---|---|
StaticBPF | \(k=190\) | 0.53 | 0.53 | 0.53 | 0.75 | 0.69 |
\(k=200\) | 0.59 | 0.59 | 0.59 | 0.79 | 0.72 | |
\(k=210\) | 0.59 | 0.59 | 0.59 | 0.83 | 0.73 | |
BorderShift | \(k=50\), \(\lambda _{1}=232\), \(\lambda _{2}=361\) | 0.26 | 0.26 | 0.26 | 0.76 | 0.59 |
\(k=50\), \(\lambda _{1}=212\), \(\lambda _{2}=341\) | 0.3 | 0.3 | 0.3 | 0.7 | 0.56 | |
\(k=50\), \(\lambda _{1}=192\), \( \lambda _{2}=321\) | 0.21 | 0.21 | 0.21 | 0.59 | 0.41 | |
BRIM | \({\textit{eps}}=0.75\) | 0.49 | 0.49 | 0.49 | 0.66 | 0.5 |
\({\textit{eps}}=1\) | 0.51 | 0.51 | 0.51 | 0.75 | 0.52 | |
\({\textit{eps}}=1.25\) | 0.46 | 0.46 | 0.46 | 0.64 | 0.49 | |
BORDER | \(k=90\) | 0.54 | 0.54 | 0.54 | 0.81 | 0.56 |
\(k=100\) | 0.53 | 0.53 | 0.53 | 0.82 | 0.56 | |
\(k=110\) | 0.52 | 0.52 | 0.52 | 0.82 | 0.56 | |
BPDAD | #boundary pts = 244 | 0.29 | 0.54 | 0.37 | – | – |
6 StreamBPF
6.1 Problem definition
6.2 Grid structure
6.2.1 Definitions
6.2.2 Grid-based k-nearest neighbor computation algorithm
6.3 Incremental computation
-
Given that \(p\) arrives and \(r\) expires from \(W_t\), the k-neighborhood of the points affected by the arrival of p and expiration of r have to be updated. The update of k-neighborhood will result in the update of their \({\textit{kdist}}\). Consequently, the update of \({\textit{kdist}}\) will require the update of \({\textit{reach-dist}}_k\), \({\textit{lrd}}_k\) and \({\textit{LOF}}\) according to Definitions 1, 2 and 3, respectively. Let \({\mathbb {U}}\) represent all points whose k-neighborhood have changed, then LOF scores of all points in \({\mathbb {U}}\) should be updated.
-
Given \({\mathbb {U}}\) the set of points whose k-neighborhood have changed. Let \(q \in {\mathbb {U}}\) and \(o\in N_k(q)\), according to Definition 1, if \(q\in N_k(o)\), then it will affect the reachability distance of o w.r.t. q (\({\textit{reach-dist}}(o,q)\)). Subsequently, the \({\textit{reach-dist}}(o,q)\) will affect the \({\textit{lrd}}(o)\) and \({\textit{LOF}}(o)\). Therefore, o is also considered as an affected point.
-
According to Definition 3, LOF score of a point q should be updated if \({\textit{lrd}}(q)\) or any of its k-nearest neighbors \(o\in N_k(q)\) changes. If \({\textit{lrd}}(o)\) have changed then, all points containing o in their k-nearest neighborhood should be considered affected and their LOF scores should be updated.
6.4 Algorithm
6.5 Evaluation of StreamBPF
Method name | Description |
---|---|
StaticBPF | Apply StaticBPF (Algorithm 1) on each window |
IncBPF | Incrementally computes BPF scores of the affected points in each window |
GridBPF | Uses grid for k-neighborhood computation and computes BPF scores of all points in each window |
StreamBPF | Apply StreamBPF (Algorithm 3) on each window |
6.5.1 Synthetic data streams
Parameter | Values |
---|---|
#Nearest neighbors (\(k\)) | 50, 100, 150, 200, 250 |
Slide size (\(w\)) | 1, 10, 20, 30, 40, 50 |
Window size (\(n\)) | 3000, 6000, 9000, 12,000, 15,000 |
Dimensions (\(d\)) | 2, 10, 20, 50, 100 |