Published in: Vietnam Journal of Computer Science 1/2017

Open Access 12.05.2016 | Regular Paper

Fast support vector clustering

Authors: Tung Pham, Hang Dang, Trung Le, Thai Hoang Le

Abstract

Support-based clustering has recently attracted considerable attention because of its applications to difficult and diverse clustering and outlier-detection problems. Support-based clustering methods proceed in two phases: finding the domain of novelty and performing the clustering assignment. To find the domain of novelty, the training time required by current solvers is typically over-quadratic in the training-set size, which impedes the application of support-based clustering methods to large-scale datasets. In this paper, we propose applying the stochastic gradient descent framework to the first phase of support-based clustering to find the domain of novelty in the form of a half-space, together with a new strategy for performing the clustering assignment. We validate our proposed method on several well-known clustering datasets and show that it attains clustering quality comparable to the baselines while being faster than them.

1 Introduction

Cluster analysis is a fundamental problem in pattern recognition in which objects are categorized into groups or clusters based on pairwise similarities between those objects such that two criteria, homogeneity and separation, are achieved [21]. Two challenges in cluster analysis are (1) dealing with complicated data containing nested or hierarchical structures; and (2) automatically detecting the number of clusters. Recently, support-based clustering, e.g., support vector clustering (SVC) [1], has attracted significant research attention because of its applications to difficult and diverse clustering and outlier-detection problems [1, 2, 8, 10, 11, 15, 23]. These clustering methods have two main advantages compared with other clustering methods: (1) the ability to generate clustering boundaries of arbitrary shape and to automatically discover the number of clusters; and (2) the capability to handle outliers well.
Support-based clustering methods proceed in two phases. In the first phase, the domain of novelty, e.g., an optimal hypersphere [1, 9, 22] or hyperplane [18], is found in the feature space. When mapped back into the input space, the domain of novelty becomes a set of contours tightly enclosing the data, which can be interpreted as cluster boundaries. However, this set of contours does not specify how to assign a data sample to its cluster. In addition, the computational complexity of the current solvers [3, 7] for finding the domain of novelty is often over-quadratic [4]. Such a computational complexity impedes the use of support-based clustering methods on real-world datasets. In the second phase, namely clustering assignment, data samples are assigned to their clusters based on the geometric information carried by the set of contours obtained from the first phase. Several works have been proposed to improve the cluster assignment procedure [2, 8, 11, 15, 23].
Recently, stochastic gradient descent (SGD) frameworks [6, 19, 20] have emerged as building blocks for developing learning methods that efficiently handle large-scale datasets. SGD-based algorithms have the following advantages: (1) they are very fast; (2) they can run in online mode; and (3) they do not require loading the entire dataset into main memory during training. In this paper, we conjoin the advantages of SGD with support-based clustering. In particular, we propose using the optimal hyperplane as the domain of novelty. The margin, i.e., the distance from the origin to the optimal hyperplane, is maximized to make the contours enclose the data as tightly as possible. We subsequently apply the stochastic gradient descent framework proposed in [19] to the first phase of support-based clustering to obtain the domain of novelty. Finally, we propose a new strategy for clustering assignment in which each data sample in the extended decision boundary follows its own trajectory to an equilibrium point, and clustering assignment is then reduced to the same task for those equilibrium points. Our clustering assignment strategy differs from the existing works [8, 11–13] in how the trajectory and its starting point are determined and in the initial set of data samples for which a trajectory must be traced to find the corresponding equilibrium point. Experiments on real-world datasets show that our proposed method produces clustering quality comparable to other support-based clustering methods while achieving a computational speedup.
To summarize, the contributions of this paper are the following:
  • Different from the works of [1, 2, 11, 15, 23], which employ a hypersphere to characterize the domain of novelty, we propose using a hyperplane. This allows us to introduce an SGD-based solution for finding the domain of novelty.
  • We propose an SGD-based solution for finding the domain of novelty and perform a rigorous convergence analysis for it. We note that the works of [1, 2, 11, 15, 23] utilized the Sequential-Minimal-Optimization-based approach [17] to find the domain of novelty, whose computational complexity is over-quadratic and which requires loading the entire Gram matrix into main memory.
  • We propose a new clustering assignment strategy that reduces the clustering assignment for the N samples in the training set to the same task for M equilibrium points, where M is usually much smaller than N.
  • Compared with the conference version [16], this paper presents a more rigorous convergence analysis with full proofs and explanations. In addition, it introduces a new strategy for clustering assignment. Regarding the experiments, this paper compares against more baselines and reports more results.

2 Stochastic gradient descent large margin one-class support vector machine

2.1 Large margin one-class support vector machine

Given the dataset \(\mathcal {D}=\left\{ x_{1},x_{2},\ldots ,x_{N}\right\} \), to define the domain of novelty we construct an optimal hyperplane that separates the data samples from the origin such that the margin, i.e., the distance from the origin to the hyperplane, is maximized. The optimization problem is formulated as
$$\begin{aligned} \underset{\mathbf {w},\rho }{\max }\left( \frac{\left| \rho \right| }{\Vert \mathbf {w}\Vert }\right) \end{aligned}$$
subject to
$$\begin{aligned}&\mathbf {w}^{\mathsf {T}}\phi (x_{i})-\rho \ge 0, \quad i=1,\ldots ,N \\&\mathbf {w}^{\mathsf {T}}\mathbf {0}-\rho =-\rho <0 \end{aligned}$$
where \(\phi \) is a transformation from the input space to the feature space and \(\mathbf {w}^{\mathsf {T}}\phi \left( x\right) -\rho =0\) is the equation of the hyperplane. The margin is invariant if we scale \((\mathbf {w},\rho )\) by a positive factor k; hence, without loss of generality, we can assume that \(\rho =1\) and arrive at the following optimization problem
$$\begin{aligned} \underset{\mathbf {w}}{\text {min}}\left( \frac{1}{2}\Vert \mathbf {w}\Vert ^{2}\right) \end{aligned}$$
subject to
$$\begin{aligned} \mathbf {w}^{\mathsf {T}}\phi (x_{i})-1\ge 0, \quad i=1,\ldots ,N \end{aligned}$$
Using slack variables, we can extend the above optimization problem to obtain the soft-margin model of the large margin one-class support vector machine (LMOCSVM)
$$\begin{aligned} \underset{\mathbf {w}}{\text {min}}\left( \frac{1}{2}\Vert \mathbf {w}\Vert ^{2}+\frac{C}{N}\sum _{i=1}^{N}\xi _{i}\right) \end{aligned}$$
subject to
$$\begin{aligned}&\mathbf {w}^{\mathsf {T}}\phi (x_{i})-1\ge -\xi _{i}, \quad i=1,\ldots ,N \\&\xi _{i}\ge 0, \quad i=1,\ldots ,N \end{aligned}$$
where \(C>0\) is the trade-off parameter and \(\varvec{\xi }=\left[ \xi _{1},\ldots ,\xi _{N}\right] \) is the vector of slack variables.
We can rewrite the above optimization problem in the primal form as follows
$$\begin{aligned} \underset{\mathbf {w}}{\text {min}}\left( \mathcal {J}(\mathbf {w})\triangleq \frac{1}{2}\left\| \mathbf {w}\right\| ^{2}+\frac{C}{N}\sum _{i=1}^{N}\max \left\{ 0,1-\mathbf {w}^{\mathsf {T}}\phi \left( x_{i}\right) \right\} \right) \end{aligned}$$
(1)

2.2 SGD-based solution in the primal form

To efficiently solve the optimization problem in Eq. (1), we use the stochastic gradient descent method. We name the resulting method stochastic gradient descent large margin one-class support vector machine (SGD-LMSVC).
At the \(t\)th round, we sample a data point \(x_{n_{t}}\) from the dataset \(\mathcal {D}\). Let us define the instantaneous function \(g_{t}\left( \mathbf {w}\right) \triangleq \frac{1}{2}\left\| \mathbf {w}\right\| ^{2}+C\,\text {max}\left\{ 0,1-\mathbf {w}^{\mathsf {T}}\phi \left( x_{n_{t}}\right) \right\} \). It is obvious that \(g_{t}(\mathbf {w})\) is 1-strongly convex w.r.t. the norm \(\left\| \cdot \right\| _{2}\) over the feature space.
The learning rate is \(\eta _{t}=\frac{1}{t}\) and the sub-gradient is \(\lambda _{t}=\mathbf {w}_{t}-C\mathbb {\mathbf {I}}_{[\mathbf {w}_{t}^{\mathsf {T}}\phi \left( x_{n_{t}}\right) <1]}\phi \left( x_{n_{t}}\right) \in \partial g_{t}\left( \mathbf {w}_{t}\right) \), where \(\mathbb {\mathbf {I}}_{[A]}\) denotes the indicator function of the event A. Therefore, the update rule is
$$\begin{aligned} \mathbf {w}_{t+1}=\mathbf {w}_{t}-\eta _{t}\lambda _{t}=\left( 1-\frac{1}{t}\right) \mathbf {w}_{t}+\frac{C}{t}\mathbb {\mathbf {I}}_{[\mathbf {w}_{t}^{\mathsf {T}}\phi \left( x_{n_{t}}\right) <1]}\phi \left( x_{n_{t}}\right) \end{aligned}$$
(2)
Algorithm 1 is proposed to find the optimal hyperplane that defines the domain of novelty. At each round, one data sample is uniformly sampled from the training set and the update rule in Eq. (2) is applied to determine the next hyperplane, i.e., \(\mathbf {w}_{t+1}\). Finally, the last hyperplane, i.e., \(\mathbf {w}_{T+1}\), is output as the optimal hyperplane. According to the theory presented in the next section, we could output any randomly chosen intermediate hyperplane and an approximately accurate solution would still be guaranteed after sufficiently long training. Nonetheless, in Algorithm 1, we use the last hyperplane as the output to exploit as much as possible the information accumulated through the iterations. It is worthwhile to note that in Algorithm 1, we store \(\mathbf {w}_{t}\) as \(\mathbf {w}_{t}=\sum _{i}\alpha _{i}\phi \left( x_{i}\right) \).
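As a concrete illustration, the kernelized form of the update rule in Eq. (2) can be implemented as in the minimal Python sketch below, which stores \(\mathbf {w}_{t}\) through its coefficient vector \(\alpha \). The function names and the choice of the Gaussian kernel here are our own assumptions for illustration, not prescribed by Algorithm 1.

```python
import numpy as np

def rbf_kernel(x, y, gamma):
    """Gaussian kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    return np.exp(-gamma * np.sum((x - y) ** 2))

def sgd_lmsvc_train(X, C=1.0, gamma=1.0, T=10000, seed=0):
    """Kernelized SGD training sketch (cf. Algorithm 1 and Eq. (2)).

    The weight vector is kept implicitly as w = sum_i alpha[i] * phi(X[i]),
    so only the coefficient vector alpha is stored and updated.
    """
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    alpha = np.zeros(N)                      # w_1 = 0
    for t in range(1, T + 1):
        n_t = rng.integers(N)                # uniformly sample one data point
        # w_t^T phi(x_{n_t}) = sum_i alpha_i K(x_i, x_{n_t})
        score = sum(alpha[i] * rbf_kernel(X[i], X[n_t], gamma)
                    for i in range(N) if alpha[i] != 0.0)
        alpha *= (1.0 - 1.0 / t)             # shrinkage part of Eq. (2)
        if score < 1.0:                      # indicator I[w_t^T phi(x) < 1]
            alpha[n_t] += C / t              # step on the hinge-loss term
    return alpha

def decision_function(x, X, alpha, gamma):
    """f(x) = sum_i alpha_i K(x_i, x) - 1 (the hyperplane with rho = 1)."""
    return sum(a * rbf_kernel(xi, x, gamma)
               for xi, a in zip(X, alpha) if a != 0.0) - 1.0
```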

2.3 Convergence analysis

In this section, we show the convergence analysis of Algorithm 1. We assume that the data are bounded in the feature space, that is, \(\left\| \phi \left( x\right) \right\| \le R,\;\forall x\,\in \mathcal {X}\). We denote the optimal solution by \(\mathbf {w}^{*}\), that is, \(\mathbf {w}^{*}=\underset{\mathbf {w}}{\text {argmin}}\,\mathcal {J}\left( \mathbf {w}\right) \). We proceed as follows.
Lemma 1 establishes a bound on \(\left\| \mathbf {w}_{T}\right\| \), followed by Lemma 2 which establishes a bound on \(\left\| \lambda _{T}\right\| \).
Lemma 1
The following statement holds
$$\begin{aligned} \left\| \mathbf {w}_{T}\right\| \le CR, \quad \forall \; T \end{aligned}$$
Proof
We have
$$\begin{aligned}&t\mathbf {w}_{t+1} =\left( t-1\right) \mathbf {w}_{t}+C\mathbb {\mathbf {I}}_{[\mathbf {w}_{t}^{\mathsf {T}}\phi \left( x_{n_{t}}\right) <1]}\phi \left( x_{n_{t}}\right) \\&t\left\| \mathbf {w}_{t+1}\right\| \le \left( t-1\right) \left\| \mathbf {w}_{t}\right\| +CR \end{aligned}$$
Summing the above over \(t=1,2,\ldots ,T-1\) and noting that the left-hand side telescopes, we obtain
$$\begin{aligned}&\left( T-1\right) \left\| \mathbf {w}_{T}\right\| \le \left( T-1\right) CR\\&\left\| \mathbf {w}_{T}\right\| \le CR \end{aligned}$$
Lemma 2
The following statement holds
$$\begin{aligned} \left\| \lambda _{T}\right\| =\left\| \mathbf {w}_{T}-C\mathbb {\mathbf {I}}_{[\mathbf {w}_{T}^{\mathsf {T}}\phi \left( x_{n_{T}}\right) <1]}\phi \left( x_{n_{T}}\right) \right\| \le 2CR, \quad \forall \; T \end{aligned}$$
Proof
We have
$$\begin{aligned} \left\| \lambda _{T}\right\|&\le \left\| \mathbf {w}_{T}\right\| +CR\le 2CR \end{aligned}$$
Theorem 1 establishes a bound on regret and shows that Algorithm 1 has the convergence rate \(\text {O}\left( \frac{\log \,T}{T}\right) \).
Theorem 1
Let us consider the running of Algorithm 1. The following statement holds
$$\begin{aligned} \mathbb {E}\left[ \mathcal {J}\left( \overline{\mathbf {w}}_{T}\right) \right] -\mathcal {J}\left( \mathbf {w}^{*}\right) \le \frac{2C^{2}R^{2}\left( \log \,T+1\right) }{T} \end{aligned}$$
where \(\overline{\mathbf {w}}_{T}=\frac{1}{T}\sum _{t=1}^{T}\mathbf {w}_{t}\).
Proof
Since \(n_{t}\) is drawn uniformly from \(\left\{ 1,\ldots ,N\right\} \), it is apparent that
$$\begin{aligned} \mathbb {E}\left[ \lambda _{t}|\mathbf {w}_{t}\right] =\mathbf {w}_{t}-\frac{C}{N}\sum _{i=1}^{N}\mathbb {\mathbf {I}}_{[\mathbf {w}_{t}^{\mathsf {T}}\phi \left( x_{i}\right) <1]}\phi \left( x_{i}\right) =\mathcal {J}^{'}\left( \mathbf {w}_{t}\right) \end{aligned}$$
We have the following
$$\begin{aligned} \left\| \mathbf {w}_{t+1}-\mathbf {w}^{*}\right\| ^{2}&=\left\| \mathbf {w}_{t}-\eta _{t}\lambda _{t}-\mathbf {w}^{*}\right\| ^{2}\\&\le \left\| \mathbf {w}_{t}-\mathbf {w}^{*}\right\| ^{2}-2\eta _{t}\lambda _{t}^{\mathsf {T}}\left( \mathbf {w}_{t}-\mathbf {w}^{*}\right) \\&\quad +\eta _{t}^{2}\left\| \lambda _{t}\right\| ^{2} \end{aligned}$$
Taking the conditional expectation of the above w.r.t. \(\mathbf {w}_{t}\) and using the 1-strong convexity of \(\mathcal {J}\), we obtain
$$\begin{aligned}&\mathcal {J}^{'}\left( \mathbf {w}_{t}\right) \left( \mathbf {w}_{t}-\mathbf {w}^{*}\right) \\&\quad \le \frac{\mathbb {E}\left[ \left\| \mathbf {w}_{t} -\mathbf {w}^{*}\right\| ^{2}\right] -\mathbb {E}\left[ \left\| \mathbf {w}_{t+1} -\mathbf {w}^{*}\right\| ^{2}\right] }{2\eta _{t}}+\frac{\eta _{t}}{2}\mathbb {E}\left[ \left\| \lambda _{t}\right\| ^{2}\right] \\&\mathcal {J}\left( \mathbf {w}_{t}\right) -\mathcal {J}\left( \mathbf {w}^{*}\right) +\frac{1}{2}\left\| \mathbf {w}_{t}-\mathbf {w}^{*}\right\| ^{2} \\&\quad \le \frac{\mathbb {E}\left[ \left\| \mathbf {w}_{t} -\mathbf {w}^{*}\right\| ^{2}\right] -\mathbb {E}\left[ \left\| \mathbf {w}_{t+1} -\mathbf {w}^{*}\right\| ^{2}\right] }{2\eta _{t}}+\frac{\eta _{t}}{2}\mathbb {E} \left[ \left\| \lambda _{t}\right\| ^{2}\right] \\&\mathcal {J}\left( \mathbf {w}_{t}\right) -\mathcal {J}\left( \mathbf {w}^{*}\right) \le \frac{t-1}{2}\mathbb {E}\left[ \left\| \mathbf {w}_{t}-\mathbf {w}^{*}\right\| ^{2}\right] \\&\quad -\,\frac{t}{2}\mathbb {E}\left[ \left\| \mathbf {w}_{t+1}-\mathbf {w}^{*}\right\| ^{2}\right] +\frac{2C^{2}R^{2}}{t} \end{aligned}$$
Taking the expectation again, we obtain
$$\begin{aligned} \mathbb {E}\left[ \mathcal {J}\left( \mathbf {w}_{t}\right) \right] -\mathcal {J}\left( \mathbf {w}^{*}\right)&\le \frac{t-1}{2}\mathbb {E}\left[ \left\| \mathbf {w}_{t}-\mathbf {w}^{*}\right\| ^{2}\right] \\&\quad -\frac{t}{2}\mathbb {E}\left[ \left\| \mathbf {w}_{t+1}-\mathbf {w}^{*}\right\| ^{2}\right] +\frac{2C^{2}R^{2}}{t} \end{aligned}$$
Summing the above inequality over \(t=1,\ldots ,T\), dividing by T, and applying Jensen's inequality to the convex function \(\mathcal {J}\) in the last step, we obtain
$$\begin{aligned}&\frac{1}{T}\sum _{t=1}^{T}\mathbb {E}\left[ \mathcal {J}\left( \mathbf {w}_{t}\right) \right] -\mathcal {J}\left( \mathbf {w}^{*}\right) \le \frac{2C^{2}R^{2}}{T}\sum _{t=1}^{T}\frac{1}{t}\le \frac{2C^{2}R^{2}\left( \log \,T+1\right) }{T} \\&\mathbb {E}\left[ \mathcal {J}\left( \overline{\mathbf {w}}_{T}\right) \right] -\mathcal {J}\left( \mathbf {w}^{*}\right) \le \frac{2C^{2}R^{2}\left( \log \,T+1\right) }{T} \end{aligned}$$
(3)
\(\square \)
Theorem 1 bounds the suboptimality of the average solution in expectation. In the following theorem, we prove that if we output a single randomly chosen intermediate solution, then with high probability the inequality holds for the realized solution itself rather than only in expectation.
Theorem 2
Let us consider the running of Algorithm 1. Let r be an integer randomly picked from \(\left\{ 1,2,\ldots ,T\right\} \). Given \(\delta \in \left( 0,1\right) \), with probability greater than \(1-\delta \) the following inequality holds
$$\begin{aligned} \mathcal {J}\left( \mathbf {w}_{r}\right) <\mathcal {J}\left( \mathbf {w}^{*}\right) +\frac{2R^{2}C^{2}\left( 1+\text {log}\,T\right) }{\delta T} \end{aligned}$$
Proof
Let us denote \(X=\mathcal {J}\left( \mathbf {w}_{r}\right) -\mathcal {J}\left( \mathbf {w}^{*}\right) \ge 0\). By definition of r, we have
$$\begin{aligned}&\mathbb {E}_{r}\left[ X\right] =\frac{1}{T}\sum _{t=1}^{T}\mathcal {J}\left( \mathbf {w}_{t}\right) -\mathcal {J}\left( \mathbf {w}^{*}\right) \\&\mathbb {E}\left[ X\right] =\mathbb {E}_{\left( n_{t}\right) _{t=1}^{T}}\left[ \mathbb {E}_{r}\left[ X\right] \right] =\frac{1}{T}\sum _{t=1}^{T}\mathbb {E}\left[ \mathcal {J}\left( \mathbf {w}_{t}\right) \right] -\mathcal {J}\left( \mathbf {w}^{*}\right) \le \frac{2C^{2}R^{2}\left( \log \,T+1\right) }{T} \end{aligned}$$
Using Markov's inequality, we obtain
$$\begin{aligned}&\mathbb {P}\left( X\ge \varepsilon \right) \le \frac{\mathbb {E}\left[ X\right] }{\varepsilon }\le \frac{2C^{2}R^{2}\left( \log \,T+1\right) }{\varepsilon T}\\&\mathbb {P}\left( X<\varepsilon \right) >1-\frac{2C^{2}R^{2}\left( \log \,T+1\right) }{\varepsilon T} \end{aligned}$$
Choosing \(\delta =\frac{2C^{2}R^{2}\left( \log \,T+1\right) }{\varepsilon T}\), i.e., \(\varepsilon =\frac{2C^{2}R^{2}\left( \log \,T+1\right) }{\delta T}\), we obtain the conclusion. \(\square \)
We now investigate the number of iterations required to obtain an \(\varepsilon \)-precision solution with probability at least \(1-\delta \). According to Theorem 2, the number of iterations T must be greater than \(T_{0}\), where \(T_{0}\) is the smallest integer such that
$$\begin{aligned}&\frac{2R^{2}C^{2}\left( 1+\text {log}\,T_{0}\right) }{\delta T_{0}} \le \varepsilon \\&\frac{1+\text {log}\,T_{0}}{T_{0}} \le \frac{\varepsilon \delta }{2R^{2}C^{2}} \end{aligned}$$
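As a concrete illustration (our own helper, not part of the paper), the smallest such \(T_{0}\) can be found numerically by scanning T; note that for the Gaussian kernel \(K\left( x,x\right) =1\), so \(\left\| \phi \left( x\right) \right\| =1\) and one may take \(R=1\).

```python
import math

def smallest_T0(eps, delta, R, C):
    """Smallest T with (1 + log T) / T <= eps * delta / (2 * R**2 * C**2)."""
    threshold = eps * delta / (2.0 * R ** 2 * C ** 2)
    T = 1
    while (1.0 + math.log(T)) / T > threshold:
        T += 1
    return T

# Example (hypothetical values): eps = 0.1, delta = 0.1, R = 1.0, C = 1.0
# smallest_T0(0.1, 0.1, 1.0, 1.0) scans T until the bound above is satisfied.
```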

3 Clustering assignment

After solving the optimization problem, we obtain the decision function
$$\begin{aligned} f\left( x\right) =\sum _{i=1}^{N}\alpha _{i}K\left( x_{i},x\right) -1 \end{aligned}$$
To find the equilibrium points, we need to solve the equation \(\nabla f\left( x\right) =0\). To this end, we use the fixed-point technique and assume that the Gaussian kernel \(K\left( x,x'\right) =e^{-\gamma \Vert x-x'\Vert ^{2}}\) is used. We then have
$$\begin{aligned} \frac{1}{2\gamma }\nabla f\left( x\right) =\sum _{i=1}^{N}\alpha _{i}\left( x_{i}-x\right) e^{-\gamma \Vert x-x_{i}\Vert ^{2}}=0\;\Rightarrow \;x=\frac{\sum _{i=1}^{N}\alpha _{i}e^{-\gamma \Vert x-x_{i}\Vert ^{2}}x_{i}}{\sum _{i=1}^{N}\alpha _{i}e^{-\gamma \Vert x-x_{i}\Vert ^{2}}}=P\left( x\right) \end{aligned}$$
To find an equilibrium point, we start from an initial point \(x^{(0)}\in \mathbb {R}^{d}\) and iterate \(x^{(j+1)}=P\left( x^{(j)}\right) \). By the fixed-point theorem, the sequence \(x^{(j)}\), which can be viewed as a trajectory with start \(x^{(0)}\), converges to a point \(x_{*}^{(0)}\) satisfying \(P(x_{*}^{(0)})=x_{*}^{(0)}\), or equivalently \(\nabla f(x_{*}^{(0)})=0\), i.e., \(x_{*}^{(0)}\) is an equilibrium point.
Let us denote \(B_{\epsilon }=\left\{ x_{i}:1\le i\le N\,\wedge \,\vert f\left( x_{i}\right) \vert \le \epsilon \right\} \), namely the extended boundary for a tolerance \(\epsilon >0\). The set \(B_{\epsilon }\) forms a strip enclosing the decision boundary \(f\left( x\right) =0\). Algorithm 2 is proposed to perform the clustering assignment. In Algorithm 2, the task of clustering assignment for the training samples is reduced to the same task for M equilibrium points. To fulfill cluster assignment for the M equilibrium points, we run the \(m=20\) sample-point test proposed in [1].
Our proposed clustering assignment procedure differs from the existing procedure proposed in [1]. The procedure in [1] requires running the \(m=20\) sample-point test for every edge connecting \(x_{i}\) and \(x_{j}\) (\(i\ne j\)) in the training set. Consequently, the computational cost incurred is \(\text {O}\left( N\left( N-1\right) ms\right) \), where s is the sparsity level of the decision function (i.e., the number of vectors in the model). Our proposed procedure needs to perform the \(m=20\) sample-point test only for a reduced set of M points (i.e., the set of equilibrium points \(\left\{ e_{1},e_{2},\ldots ,e_{M}\right\} \)), where M is usually much smaller than N. The reason is that many data points in the training set converge to a common equilibrium point, which significantly reduces the number of points from N to M. The computational cost incurred is therefore \(\text {O}\left( M\left( M-1\right) ms\right) \). A sketch of this procedure is given below.
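The minimal Python sketch below follows one possible reading of the above description: every training sample (restricting the start points to \(B_{\epsilon }\) is a straightforward variation) is driven to its equilibrium point by the fixed-point iteration, numerically coincident equilibria are merged into M representatives, the representatives are linked by the m sample-point test, and each sample inherits the cluster label of its equilibrium point. The function names and the merging tolerance merge_tol are our own assumptions.

```python
import numpy as np
from itertools import combinations

def f(x, X, alpha, gamma):
    """Decision function f(x) = sum_i alpha_i K(x_i, x) - 1."""
    return alpha @ np.exp(-gamma * np.sum((X - x) ** 2, axis=1)) - 1.0

def fixed_point(x0, X, alpha, gamma, n_iter=100, tol=1e-6):
    """Iterate x <- P(x) until convergence; returns an equilibrium point."""
    x = x0.astype(float).copy()
    for _ in range(n_iter):
        w = alpha * np.exp(-gamma * np.sum((X - x) ** 2, axis=1))
        x_new = w @ X / w.sum()
        if np.linalg.norm(x_new - x) < tol:
            break
        x = x_new
    return x

def connected(e1, e2, X, alpha, gamma, m=20):
    """m sample-point test: the whole segment stays inside the domain."""
    return all(f(e1 + t * (e2 - e1), X, alpha, gamma) >= 0.0
               for t in np.linspace(0.0, 1.0, m))

def cluster_assign(X, alpha, gamma, merge_tol=1e-3, m=20):
    """Sketch of the proposed assignment: reduce N samples to M equilibria."""
    eq = [fixed_point(x, X, alpha, gamma) for x in X]
    # Group samples that converged to (numerically) the same equilibrium point.
    reps, owner = [], np.empty(len(X), dtype=int)
    for i, e in enumerate(eq):
        for k, r in enumerate(reps):
            if np.linalg.norm(e - r) < merge_tol:
                owner[i] = k
                break
        else:
            owner[i] = len(reps)
            reps.append(e)
    # Label the M equilibrium points via the sample-point test (union-find).
    M = len(reps)
    parent = list(range(M))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    for a, b in combinations(range(M), 2):
        if connected(reps[a], reps[b], X, alpha, gamma, m):
            parent[find(a)] = find(b)
    cluster_of_root = {r: c for c, r in
                       enumerate(sorted({find(k) for k in range(M)}))}
    return np.array([cluster_of_root[find(owner[i])] for i in range(len(X))])
```

For example, with the coefficients returned by the training sketch in Sect. 2.2, `labels = cluster_assign(X, alpha, gamma)` yields one cluster label per training sample.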

4 Experiments

4.1 Visual experiment

To visually demonstrate the clustering quality produced by our proposed SGD-LMSVC, we conduct experiments on three synthesized datasets and visually compare SGD-LMSVC with C-Means and Fuzzy C-Means. In the first experiment, the data samples form a nested structure with two outer rings and one Gaussian distribution at the center. As shown in Fig. 1, SGD-LMSVC can perfectly detect the three clusters without any prior information, while both C-Means and Fuzzy C-Means, with the number of clusters set to 3 beforehand, fail to discover the nested clusters. The second experiment is carried out on a two-moon dataset. As observed from Fig. 2, SGD-LMSVC without any prior knowledge can flawlessly discover the two moon-shaped clusters, whereas C-Means and Fuzzy C-Means cannot detect the clusters correctly. In the last visual experiment, we generate data from a mixture of 4 Gaussian distributions. As shown in Fig. 3, SGD-LMSVC can perfectly detect the 4 clusters corresponding to the individual Gaussian distributions. These visual experiments demonstrate that SGD-LMSVC is able to generate cluster boundaries of arbitrary shape and to automatically detect the appropriate number of clusters present in the data.

4.2 Experiment on real datasets

To further demonstrate the performance of the proposed algorithm, we conduct experiments on real datasets. Clustering is inherently an unsupervised learning task and, therefore, there is no single perfect measure for comparing two clustering algorithms. We examine five typical clustering validity indices (CVI): compactness, purity, rand index, Davies–Bouldin index (DB index), and normalized mutual information (NMI). A good clustering algorithm should produce a solution with high purity, rand index, and NMI and with low compactness and DB index.

4.2.1 Clustering validity index

Compactness measures the average pairwise distances of points in the same cluster [5] and is given as follows
$$\begin{aligned} \mathrm{Compactness}\triangleq \frac{1}{N}\sum _{k=1}^{m}N_{k}\frac{\sum _{x,x'\in C_{k}}d\left( x,x'\right) }{N_{k}\left( N_{k}-1\right) /2} \end{aligned}$$
where the cluster solution consists of m clusters \(C_{1},\,C_{2},\ldots ,C_{m}\) whose cardinalities are \(N_{1},\,N_{2},\ldots ,N_{m}\), respectively.
A clustering with small compactness is preferred: a small compactness means that the average intra-cluster distance is small and homogeneity is therefore good, i.e., two objects in the same cluster are highly similar to each other.
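For reference, a small numpy sketch of this compactness measure (assuming the inner sum runs over unordered pairs with Euclidean distance, and that labels are given as an integer array) could look as follows; the function name is our own.

```python
import numpy as np

def compactness(X, labels):
    """Average pairwise intra-cluster distance, weighted by cluster size."""
    N = len(X)
    total = 0.0
    for k in np.unique(labels):
        Ck = X[labels == k]
        Nk = len(Ck)
        if Nk < 2:
            continue
        # Sum of pairwise Euclidean distances within cluster k (each pair once).
        d = np.sqrt(((Ck[:, None, :] - Ck[None, :, :]) ** 2).sum(-1))
        total += Nk * (d.sum() / 2.0) / (Nk * (Nk - 1) / 2.0)
    return total / N
```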
The second CVI in use is purity, which measures the purity of the clustering solution with respect to the natural classes of the data [14]. Purity is, of course, only applicable to data that possess ground-truth labels. Let \(N_{ij}\) be the number of objects in cluster i that belong to class j, and let \(N_{i}\triangleq \sum _{j}N_{ij}\) be the total number of objects in cluster i. Let us define \(p_{ij}\triangleq \frac{N_{ij}}{N_{i}}\), i.e., the empirical distribution over class labels for cluster i. We define the purity of a cluster as \(p_{i}\triangleq \max _{j}\,p_{ij}\) and the overall purity of a clustering solution as
$$\begin{aligned} \mathrm{Purity}\triangleq \sum _{i}\frac{N_{i}}{N}\times p_{i} \end{aligned}$$
The purity ranges between 0 (bad) and 1 (good). This CVI reflects the classification ability of a clustering algorithm: an algorithm that achieves a high purity can also be used appropriately for classification purposes.
The third CVI is the rand index [14]. To calculate this CVI for a clustering solution, we construct a \(2\times 2\) contingency table containing the following counts: (1) TP (true positive), the number of pairs that are in the same cluster and belong to the same class; (2) TN (true negative), the number of pairs that are in different clusters and belong to different classes; (3) FP (false positive), the number of pairs that are in the same cluster but belong to different classes; and (4) FN (false negative), the number of pairs that are in different clusters but belong to the same class. The rand index is defined as follows
$$\begin{aligned} \mathrm{Rand}\triangleq \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{TP}+\mathrm{FP}+\mathrm{TN}+\mathrm{FN}} \end{aligned}$$
This can be interpreted as the fraction of clustering decisions that are correct. Obviously, rand index ranges between 0 and 1.
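A straightforward (if not the most efficient) way to compute purity and the rand index from integer-encoded cluster and class labels is sketched below; the O(N^2) pair enumeration is kept for clarity and the helper names are our own.

```python
import numpy as np
from itertools import combinations

def purity(clusters, classes):
    """Overall purity: size-weighted fraction of the majority class per cluster."""
    total = 0
    for k in np.unique(clusters):
        members = classes[clusters == k]
        total += np.bincount(members).max()          # N_i * p_i
    return total / len(classes)

def rand_index(clusters, classes):
    """Fraction of pairwise decisions (same/different cluster) that are correct."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(classes)), 2):
        same_cluster = clusters[i] == clusters[j]
        same_class = classes[i] == classes[j]
        if same_cluster and same_class:
            tp += 1
        elif not same_cluster and not same_class:
            tn += 1
        elif same_cluster:
            fp += 1
        else:
            fn += 1
    return (tp + tn) / (tp + tn + fp + fn)
```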
Davies–Bouldin validity index is a function of the ratio of the sum of intra-distances to inter-distances [5] and is formulated as follows
$$\begin{aligned} \mathrm{DBI}\triangleq \frac{1}{m}\sum _{i=1}^{m}\underset{j\ne i}{\text {max}}\left\{ \frac{\Delta \left( C_{i}\right) +\Delta \left( C_{j}\right) }{d\left( C_{i},C_{j}\right) }\right\} \end{aligned}$$
where \(\Delta \left( C_{i}\right) \) denotes the intra-cluster distance of cluster \(C_{i}\) (typically the average distance of its members to the cluster centroid) and \(d\left( C_{i},C_{j}\right) \) denotes the inter-cluster distance between \(C_{i}\) and \(C_{j}\) (typically the distance between their centroids). A good clustering algorithm should produce a solution whose DBI is as small as possible.
Table 1 The statistics of the experimental datasets

Datasets        Size    Dimension   #Classes
Aggregation     788     2           7
Breast cancer   699     9           2
Compound        399     2           6
D31             3100    2           31
Flame           240     2           2
Glass           214     9           7
Iris            150     4           3
Jain            373     2           2
Pathbased       300     2           3
R15             600     2           15
Spiral          312     2           3
Abalone         4177    8           28
Car             1728    6           4
Musk            6598    198         2
Shuttle         43,500  9           5
Table 2 The purity, rand index, and NMI of the clustering methods on the experimental datasets

Datasets        Purity              Rand index          NMI
                SVC   SGD   FSVC    SVC   SGD   FSVC    SVC   SGD   FSVC
Aggregation     1.00  1.00  0.22    1.00  1.00  0.22    0.69  0.75  0.60
Breast cancer   0.98  0.99  0.99    0.82  0.85  0.81    0.22  0.55  0.45
Compound        0.66  0.62  0.13    0.92  0.88  0.25    0.51  0.81  0.45
Flame           0.86  0.87  0.03    0.75  0.76  0.03    0.55  0.51  0.05
Glass           0.50  0.71  0.65    0.77  0.91  0.54    0.60  0.44  0.53
Iris            1.00  1.00  0.68    0.97  0.96  0.69    0.63  0.75  0.71
Jain            0.37  0.46  0.69    0.70  0.71  0.77    0.53  0.31  1.00
Pathbased       0.60  0.50  1.00    0.81  0.94  1.00    0.48  0.43  0.12
R15             0.88  0.90  0.37    0.74  0.71  0.37    0.67  0.77  0.77
Spiral          0.09  0.33  0.53    0.15  0.94  0.75    0.52  0.34  0.16
D31             0.94  0.99  0.42    0.88  0.81  0.54    0.45  0.50  0.38
Abalone         0.22  0.44  0.03    0.43  0.86  0.12    0.22  0.34  0.07
Car             0.94  0.95  0.70    0.46  0.46  0.54    0.32  0.32  0.24
Musk            0.87  0.68  0.88    0.26  0.28  0.26    0.21  0.16  0.23
Shuttle         0.06  0.05  0.06    0.84  0.83  0.75    0.26  0.41  0.50
The last considered CVI is normalized mutual information (NMI) [14]. This measure allows us to trade off the quality of the clustering against the number of clusters.
$$\begin{aligned} \mathrm{NMI}\triangleq \frac{I\left( \Omega ,C\right) }{\left[ H\left( C\right) +H\left( \Omega \right) \right] /2} \end{aligned}$$
where \(C=\left\{ c_{1},\ldots ,c_{J}\right\} \) is the set of classes and \(\Omega =\left\{ \omega _{1},\ldots ,\omega _{K}\right\} \) is the set of clusters. \(I\left( \Omega ,C\right) \) is the mutual information and is defined as
$$\begin{aligned} I\left( \Omega ,C\right) \triangleq \sum _{k}\sum _{j}P\left( c_{j}\bigcap \omega _{k}\right) \log \frac{P\left( c_{j}\bigcap \omega _{k}\right) }{P\left( c_{j}\right) P\left( \omega _{k}\right) } \end{aligned}$$
and \(H\left( .\right) \) is the entropy and is defined as
$$\begin{aligned} H\left( \Omega \right) \triangleq -\sum _{k}P\left( \omega _{k}\right) \log P\left( \omega _{k}\right) \end{aligned}$$
The NMI ranges between 0 and 1, and a good clustering algorithm should produce an NMI as high as possible.
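A minimal numpy sketch of this NMI estimate from empirical counts is given below; the helper name and the small smoothing constant are our own choices.

```python
import numpy as np

def nmi(clusters, classes):
    """NMI = I(Omega, C) / ((H(C) + H(Omega)) / 2), estimated from counts."""
    eps = 1e-12                                   # avoids log(0) for empty cells
    omegas, cs = np.unique(clusters), np.unique(classes)
    # Joint empirical probabilities P(c_j, omega_k) and their marginals.
    p_joint = np.array([[np.mean((clusters == w) & (classes == c)) for c in cs]
                        for w in omegas])
    p_omega = p_joint.sum(axis=1)
    p_class = p_joint.sum(axis=0)
    mutual_info = np.sum(p_joint * np.log((p_joint + eps) /
                                          (np.outer(p_omega, p_class) + eps)))
    h_omega = -np.sum(p_omega * np.log(p_omega + eps))
    h_class = -np.sum(p_class * np.log(p_class + eps))
    return mutual_info / ((h_omega + h_class) / 2.0)
```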
We perform experiments on 15 well-known datasets for the clustering task. The statistics of the experimental datasets are given in Table 1. These datasets are fully labeled and, consequently, the CVIs such as purity, rand index, and NMI can be fully estimated. We compare our proposed SGD-LMSVC with the following baselines.

4.2.2 Baselines

  • Support vector clustering (SVC) [1], which uses LIBSVM [3] to find the domain of novelty and a fully connected graph for clustering assignment.
  • Fast support vector clustering (FSVC) [8], which uses an equilibrium-based approach for clustering assignment.
It is noteworthy that the first phase of our proposed SGD-LMSVC is the SGD-based solution for LMOCSVM (cf. Algorithm 1) and the second phase is the procedure proposed in Algorithm 2. All competing methods are run on a Windows computer with a dual-core 2.6 GHz CPU and 4 GB RAM.
Table 3 The compactness and DB index of the clustering methods on the experimental datasets

Datasets        Compactness           DB index
                SVC    SGD    FSVC    SVC    SGD    FSVC
Aggregation     0.29   0.29   2.84    0.68   0.67   0.63
Breast cancer   1.26   0.68   0.71    1.58   1.38   0.53
Compound        0.50   0.21   2.43    2.45   0.86   0.67
Flame           0.58   0.44   2.28    1.30   0.65   3.56
Glass           0.72   0.68   1.85    0.53   0.56   0.93
Iris            0.98   0.25   0.99    1.95   1.17   0.77
Jain            0.96   0.36   1.16    1.23   1.08   0.71
Pathbased       0.18   0.30   1.04    0.36   0.73   1.07
R15             0.61   0.13   1.84    2.96   1.42   1.37
Spiral          2.00   0.17   0.18    1.41   0.98   0.36
D31             1.41   0.26   1.78    2.33   1.35   1.21
Abalone         3.88   0.40   4.97    3.78   3.91   1.29
Car             0.75   0.74   14.68   1.76   1.76   1.57
Musk            9.89   30.05  20.00   2.27   2.83   0.01
Shuttle         0.50   0.46   0.26    1.86   1.84   1.32
Table 4 Training time in seconds (i.e., the time for finding the domain of novelty) and clustering time in seconds (i.e., the time for clustering assignment) of the clustering methods on the experimental datasets

Datasets        Training time            Clustering time
                SVC    SGD    FSVC       SVC       SGD     FSVC
Aggregation     0.05   0.03   0.05       31.42     2.83    7.51
Breast cancer   0.18   0.02   0.05       19.80     2.14    22.86
Compound        0.03   0.02   0.10       6.82      1.17    7.24
Flame           0.02   0.02   15.16      1.81      0.67    4.31
Glass           0.03   0.03   0.02       2.30      0.53    10.67
Iris            0.02   0.02   0.04       1.03      0.34    4.33
Jain            0.02   0.02   0.03       5.80      0.81    4.59
Pathbased       0.02   0.02   0.05       4.02      0.54    4.22
R15             0.02   0.02   0.02       4.14      3.68    10.43
Spiral          0.02   0.03   0.02       1.60      0.99    7.78
D31             0.17   0.09   0.09       467.72    6.56    33.08
Abalone         2.26   0.81   10.94      653.65    26.58   242.97
Car             5.62   0.64   8.15       67.66     7.05    84.47
Musk            55.93  5.79   58.49      602.09    432.58  510.25
Shuttle         10.03  0.46   68.43      1,972.61  925.00  1,125.46

4.2.3 Hyperparameter setting

The RBF kernel, given by \(K\left( x,x'\right) =e^{-\gamma \left\| x-x'\right\| ^{2}}\), is employed. The kernel width \(\gamma \) is searched on the grid \(\left\{ 2^{-5},\,2^{-3},\,\ldots ,\,2^{3},\,2^{5}\right\} \). The trade-off parameter C is searched on the same grid. In addition, the parameters p and \(\varepsilon \) in FSVC are searched on the grid \(\left\{ 0.1,0.2,\ldots ,0.9,1\right\} \), the same as in [8]. Determining the number of iterations in Algorithm 1 is challenging. To resolve this, we use the stopping criterion \(\left\| \mathbf {w}_{t+1}-\mathbf {w}_{t}\right\| \le \theta =0.01\), i.e., we stop when the hyperplane changes only slightly between consecutive iterations.
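Since \(\mathbf {w}_{t+1}-\mathbf {w}_{t}=-\eta _{t}\lambda _{t}\), this stopping criterion can be evaluated with kernel values only, using \(\left\| \lambda _{t}\right\| ^{2}=\left\| \mathbf {w}_{t}\right\| ^{2}-2Cb_{t}\,\mathbf {w}_{t}^{\mathsf {T}}\phi \left( x_{n_{t}}\right) +C^{2}b_{t}K\left( x_{n_{t}},x_{n_{t}}\right) \) with \(b_{t}=\mathbb {\mathbf {I}}_{[\mathbf {w}_{t}^{\mathsf {T}}\phi \left( x_{n_{t}}\right) <1]}\). The helper below is our own illustration and, for simplicity, assumes a precomputed Gram matrix, which is only practical for small datasets.

```python
import numpy as np

def hyperplane_change(K, alpha, n_t, C, t):
    """||w_{t+1} - w_t|| = (1/t) * ||lambda_t||, from kernel values only.

    K is the Gram matrix, alpha the coefficients of w_t = sum_i alpha_i phi(x_i),
    and n_t the index sampled at round t.
    """
    score = alpha @ K[:, n_t]                     # w_t^T phi(x_{n_t})
    w_norm_sq = alpha @ K @ alpha                 # ||w_t||^2
    b = 1.0 if score < 1.0 else 0.0               # indicator in the sub-gradient
    lam_norm_sq = w_norm_sq - 2.0 * C * b * score + (C ** 2) * b * K[n_t, n_t]
    return np.sqrt(max(lam_norm_sq, 0.0)) / t
```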
We report the experimental results of purity, rand index, and NMI in Table 2, compactness and DB index in Table 3, and the training time (i.e., the time for finding the domain of novelty) and clustering time (i.e., the time for clustering assignment) in Table 4. For each CVI, a better outcome means a higher value for purity, rand index, and NMI and a lower value for compactness and DB index. As shown in Tables 2 and 3, our proposed SGD-LMSVC is generally comparable with the baselines on the CVIs. In particular, SGD-LMSVC is slightly better than the others on purity, rand index, and NMI, and it clearly surpasses the others on compactness, while being slightly worse than FSVC on the DB index. Regarding the time taken for training and clustering assignment, our proposed SGD-LMSVC is clearly superior to the others. For training time, the speedup is significant on the medium- and large-scale datasets, including Shuttle, Musk, and Abalone. The speedup in clustering time is even more pronounced.

5 Conclusion

In this paper, we have proposed a fast support-based clustering method that combines the advantages of SGD-based and kernel-based methods. Furthermore, we have proposed a new strategy for clustering assignment. We validated the proposed method on 15 well-known clustering datasets. The experiments show that our proposed method achieves clustering quality comparable to the baselines while being significantly faster than them.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
References
1. Ben-Hur, A., Horn, D., Siegelmann, H.T., Vapnik, V.: Support vector clustering. J. Mach. Learn. Res. 2, 125–137 (2001)
2. Camastra, F., Verri, A.: A novel kernel method for clustering. IEEE Trans. Pattern Anal. Mach. Intell. 27(5), 801–804 (2005)
3. Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Technol. 2(3), 27:1–27:27 (2011)
4. Chu, C.S., Tsang, I.W., Kwok, J.T.: Scaling up support vector data description by using core-sets. In: Proceedings of the 2004 IEEE International Joint Conference on Neural Networks, vol. 1. IEEE (2004)
5. Halkidi, M., Batistakis, Y., Vazirgiannis, M.: Clustering validity checking methods: Part II. SIGMOD Rec. 31(3), 19–27 (2002)
6. Hazan, E., Kale, S.: Beyond the regret minimization barrier: optimal algorithms for stochastic strongly-convex optimization. J. Mach. Learn. Res. 15(1), 2489–2512 (2014)
7. Joachims, T.: Making large-scale support vector machine learning practical. In: Schölkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods, pp. 169–184. The MIT Press, Cambridge (1999)
8. Jung, K.-H., Lee, D., Lee, J.: Fast support-based clustering method for large-scale problems. Pattern Recognit. 43(5), 1975–1983 (2010)
9. Le, T., Tran, D., Ma, W., Sharma, D.: An optimal sphere and two large margins approach for novelty detection. In: The 2010 International Joint Conference on Neural Networks (IJCNN), pp. 1–6. IEEE (2010)
10. Le, T., Tran, D., Nguyen, P., Ma, W., Sharma, D.: Proximity multisphere support vector clustering. Neural Comput. Appl. 22(7–8), 1309–1319 (2013)
11. Lee, J., Lee, D.: An improved cluster labeling method for support vector clustering. IEEE Trans. Pattern Anal. Mach. Intell. 27(3), 461–464 (2005)
12. Lee, J., Lee, D.: Dynamic characterization of cluster structures for robust and inductive support vector clustering. IEEE Trans. Pattern Anal. Mach. Intell. 28(11), 1869–1874 (2006)
13. Li, H.: A fast and stable cluster labeling method for support vector clustering. J. Comput. 8(12), 3251–3256 (2013)
14. Murphy, K.P.: Machine Learning: A Probabilistic Perspective. The MIT Press, Cambridge (2012)
15. Park, J.H., Ji, X., Zha, H., Kasturi, R.: Support vector clustering combined with spectral graph partitioning. In: Proceedings of the 17th International Conference on Pattern Recognition (ICPR 2004), vol. 4, pp. 581–584. IEEE (2004)
16. Pham, T., Dang, H., Le, T., Le, H.-T.: Stochastic gradient descent support vector clustering. In: 2015 2nd National Foundation for Science and Technology Development Conference on Information and Computer Science (NICS), pp. 88–93 (2015)
17. Platt, J.C.: Fast training of support vector machines using sequential minimal optimization. In: Schölkopf, B., Burges, C., Smola, A. (eds.) Advances in Kernel Methods, pp. 185–208. The MIT Press, Cambridge (1999)
18. Schölkopf, B., Platt, J.C., Shawe-Taylor, J.C., Smola, A.J., Williamson, R.C.: Estimating the support of a high-dimensional distribution. Neural Comput. 13(7), 1443–1471 (2001)
19. Shalev-Shwartz, S., Singer, Y.: Logarithmic regret algorithms for strongly convex repeated games. Technical report, The Hebrew University, Jerusalem (2007)
20. Shalev-Shwartz, S., Singer, Y., Srebro, N.: Pegasos: primal estimated sub-gradient solver for SVM. In: Ghahramani, Z. (ed.) ICML, pp. 807–814 (2007)
21. Shamir, R., Sharan, R.: Algorithmic approaches to clustering gene expression data. In: Current Topics in Computational Biology, pp. 269–300. MIT Press, Cambridge (2001)
22. Tax, D.M.J., Duin, R.P.W.: Support vector data description. Mach. Learn. 54(1), 45–66 (2004)
23. Yang, J., Estivill-Castro, V., Chalup, S.K.: Support vector clustering through proximity graph modelling. In: Neural Information Processing (ICONIP'02), vol. 2, pp. 898–903 (2002)
Metadata
Title: Fast support vector clustering
Authors: Tung Pham, Hang Dang, Trung Le, Thai Hoang Le
Publication date: 12.05.2016
Publisher: Springer Berlin Heidelberg
Published in: Vietnam Journal of Computer Science, Issue 1/2017
Print ISSN: 2196-8888
Electronic ISSN: 2196-8896
DOI: https://doi.org/10.1007/s40595-016-0068-y
