
Open Access 11.05.2022 | Regular Paper

RASCL: a randomised approach to subspace clusters

Authors: Sandy Moens, Boris Cule, Bart Goethals

Published in: International Journal of Data Science and Analytics | Issue 3/2022


Abstract

Subspace clustering aims to discover clusters in projections of high-dimensional numerical data. In this paper, we focus on discovering small collections of highly interesting subspace clusters that do not try to cluster all data points, leaving noisy data points unclustered. To this end, we propose a randomised method that first converts the high-dimensional database to a binarised one using projected samples of the original database. Subsequently, this database is mined for frequent itemsets, which we show can be translated back to subspace clusters. In this way, we are able to explore multiple subspaces of different sizes at the same time. In our extensive experimental analysis, we show on synthetic as well as real-world data that our method is capable of discovering highly interesting subspace clusters efficiently.
Notes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1 Introduction

Clustering is an important field within data mining research. The main task of clustering is to group similar objects together, while keeping sufficiently different objects apart [1]. However, due to the well-known curse of dimensionality [2], traditional clustering methods struggle when encountering high-dimensional data. In short, with high-dimensional data, the distances between pairs of objects, measured over all dimensions, become increasingly similar. As a result, no proper clusters can be formed, as all objects end up almost equally distant from each other. In other words, while clustering attempts to find localised neighbourhoods of objects, there can be no talk of such neighbourhoods in high-dimensional space. In a time when companies and organisations collect more and more data about their customers, suppliers, operational processes, etc., avoiding the curse of dimensionality is becoming increasingly important.
For example, a company may keep track of over a hundred attributes to describe its customers, and, while many customers may share some of these attributes, there will always be enough attributes on which they differ to keep them from being considered similar. However, it is crucial that the company groups its customers into segments in order to, for example, apply targeted marketing campaigns, or conduct meaningful surveys.
Subspace clustering attempts to solve this problem by trying to discover clusters of objects that are similar in a limited number of dimensions [3]. However, given the exponential complexity of the search space, identifying the relevant set of dimensions is computationally demanding: with m-dimensional data, there are \(2^m-1\) possible sets of dimensions within which clusters could be found, which is why existing subspace clustering methods suffer from long run-times [4]. Furthermore, some existing approaches produce full clusterings, thereby ensuring that each data object is assigned to exactly one cluster. This is not always desirable, for two reasons. First, the data may contain a lot of noise that should ideally not be assigned to any cluster. Second, there is no reason why a particular object should not be assigned to multiple clusters, especially if the sets of dimensions that define these clusters are entirely different.
For example, a customer could be part of a group targeted by one advertising campaign based on gender, income and marital status, and of another group targeted by an entirely different campaign based on location, being a parent, and having an interest in sports.
In this paper, we take a similar approach to the problem as the \(\textsc {CartiClus}\) [5] method: we first convert a numeric database to a transactional one and subsequently use frequent pattern mining to extract subspace clusters. Our method can efficiently produce highly interesting subspace clusters, along with the dimensions that define them. Moreover, we allow objects to be part of multiple clusters, and we leave objects that are not similar to any other objects in any set of dimensions unclustered. We avoid the computational complexity of existing subspace clustering methods by deploying a randomised algorithm. In the first step, we take a large number of samples from the original data, such that each sample consists of a number of objects in a fixed (random) set of dimensions. (Other dimensions are discarded.) In each sample, we then cluster the objects and subsequently assign all objects in the original data to the nearest cluster centroid. This produces a set of objects per centroid, which we interpret as a transaction. By merging the transactions produced for all different samples, we obtain a transaction database. This process is called binarisation of the data. Note that by using a large number of samples, this method will produce relatively many highly similar transactions. We then randomly sample maximal frequent itemsets from this database to obtain potential subspace clusters. Finally, we identify the relevant dimensions for each discovered cluster.
We perform an extensive experimental evaluation on a variety of datasets. First of all, we compare two variants of our algorithm to see whether additional computational effort results in better clustering performance. We evaluate the performance on two aspects: first, object quality, which measures whether objects that belong together are, in fact, placed into the same clusters, and, second, dimension quality, which measures to what extent the relevant dimensions are correctly identified for the discovered clusters. Furthermore, we perform experiments to evaluate how various parameter settings affect the performance of our algorithm, and provide the reader with guidance on how to set sensible parameter values without any knowledge of clusters that may or may not be present in the data. Finally, we compare our algorithm to \(\textsc {CartiClus}\) and \(\textsc {ProClus}\), two state-of-the-art techniques for subspace clustering. We conclude that, while these methods produce comparable results with optimal parameter settings, our algorithm is much less susceptible to producing poor results when the parameter values are changed. In an unsupervised learning setting, where no ground truth is available to the end user, this can be crucial. Without accidentally stumbling upon the correct parameter settings, \(\textsc {CartiClus}\) and \(\textsc {ProClus}\) will produce dramatically poorer results, while our method remains relatively robust under a variety of settings.
A preliminary version of this paper was published as A Sampling-based Approach for Discovering Subspace Clusters at the 2019 Discovery Science Conference [6]. Here, we extend our work with new material, both in terms of the theoretical analysis and the experimental evaluation. We now elaborate on the description of our own method by including the pseudocode, as well as by adding examples to illustrate the discussed concepts. Additionally, Sect. 3.3, discussing how our method could be generalised to using any clustering and classification algorithm, is entirely new. Similarly, the experimental section of our paper has been significantly expanded and now contains more than 20 new sets of experiments, comparing our algorithm to existing methods on a variety of evaluation metrics, as well as a thorough discussion of the achieved results.
The main contributions of this paper can be summarised as follows:
  • we propose a randomised sampling algorithm that efficiently identifies localised clusters and their relevant dimensions,
  • we simultaneously explore multiple subspaces, thus significantly reducing the required effort,
  • we allow data objects to be part of multiple clusters, and we leave noise objects unclustered,
  • we perform a theoretical evaluation to show the efficiency of our method and an extensive experimental evaluation to show the quality of the output.
The remainder of the paper is organised as follows. In Sect. 2, we introduce the necessary notations and describe the problem. Section 3 provides a thorough description of our algorithm, which we experimentally evaluate in Sect. 4. We discuss the related work in Sect. 5, before concluding the paper in Sect. 6.

2 Background

In this section, we introduce the necessary definitions and notations from the fields of subspace clustering and frequent itemset mining. We also discuss and formally define the quality measures used to evaluate our experimental results.

2.1 Subspace clustering

Let \(\varvec{\mathcal {D}}= \{\textit{D}_1, \ldots , \textit{D}_m\}\) be a set of m dimensions. Each dimension \(\textit{D}_i\) comes equipped with a domain \(dom(\textit{D}_i)\). An m-dimensional data point \(\textit{p}= (\textit{d}_1, \ldots , \textit{d}_m)\) is a tuple of values over \(\varvec{\mathcal {D}}\), such that \(\textit{d}_i \in dom(\textit{D}_i)\) for each \(i \in \{1, \ldots , m\}\). The input database \(\varvec{\mathcal {P}}=(\textit{p}_1, \ldots , \textit{p}_q)\) is a collection of q such m-dimensional data points. Furthermore, each dimension \(\textit{D}_i\) comes equipped with a distance function \(\delta _{\textit{D}_i}: dom(\textit{D}_i) \times dom(\textit{D}_i) \rightarrow \mathbb {R}^+\). Additionally, we assume that for any subset of dimensions \(\mathtt {D}= \{\mathtt {D}_1, \ldots , \mathtt {D}_l\}\), with \(1 \le l \le m\) and \(\mathtt {D}\subseteq \varvec{\mathcal {D}}\), there exists a distance function \(\delta _{\mathtt {D}}: (dom(\mathtt {D}_1) \times \ldots \times dom(\mathtt {D}_l)) \times (dom(\mathtt {D}_1) \times \ldots \times dom(\mathtt {D}_l)) \rightarrow \mathbb {R}^+\). All used distance functions must satisfy the usual conditions (non-negativity, identity, symmetry, and the triangle inequality). In our examples, we use \(\mathbb {R}\) as the domain for all dimensions and the Euclidean distance as the distance function. Given a subset of dimensions \(\mathtt {D}\subseteq \varvec{\mathcal {D}}\), we denote by \(\textit{p}^\mathtt {D}\) a data point, and by \(\mathtt {P}^\mathtt {D}\) a set of data points, projected onto the given dimensions. We define a subspace cluster as follows.
Definition 1
(Subspace cluster) A subspace cluster is a tuple containing a subset of data points and a subset of dimensions, formally
$$\begin{aligned} \textit{S}= (\mathtt {P}, \mathtt {D}), \text { with } \mathtt {P}\subseteq \varvec{\mathcal {P}}\text { and } \mathtt {D}\subseteq \varvec{\mathcal {D}}. \end{aligned}$$
This definition forms a logical extension of a traditional cluster by incorporating the relevant dimensions in which the data points form a cluster. Moreover, this definition lends itself well to our itemset-based algorithm described in the next section. However, for some of the measures described below, the following alternative, set-based definition is more natural and easier to use [7, 8]. We denote this set-based representation as:
$$\begin{aligned} \textit{S}^{\mathfrak {T}}=\{\left( \textit{p}_i, \textit{D}_j\right) \mid \textit{p}_i \in \mathtt {P}\wedge \textit{D}_j \in \mathtt {D}\}, \text { with } \textit{S}=(\mathtt {P}, \mathtt {D}). \end{aligned}$$
Furthermore, let \(\mathfrak {T}\) be a mapping from a tuple-based representation of a subspace cluster to a set-based representation, i.e., \(\mathfrak {T}(\textit{S})=\textit{S}^{\mathfrak {T}}\).
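To make the two representations concrete, the following minimal Python sketch (with hypothetical point and dimension indices) maps a tuple-based cluster to its set-based counterpart:

```python
# A subspace cluster as a tuple (P, D) of data point indices and dimension indices.
S = ({1, 2, 3}, {0, 4})

def to_set_representation(S):
    """Map a tuple-based cluster (P, D) to its set-based form {(p, d) | p in P, d in D}."""
    P, D = S
    return {(p, d) for p in P for d in D}

print(to_set_representation(S))
# e.g. {(1, 0), (1, 4), (2, 0), (2, 4), (3, 0), (3, 4)}
```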
In this work, we are interested in clusters for which the data points \(\mathtt {P}\) lie in close proximity to one another in \(\mathtt {D}\). Given the definition of a subspace cluster, the problem we tackle is to discover a concise collection of subspace clusters that have a high precision both on the discovered data points and on the set of dimensions.

2.2 Subspace clustering measures

For the sake of completeness, we include the definitions of the quality measures used throughout the experiments. We refer to the original paper [8] for an in-depth study of these measures.
There are three flavours of quality measures, depending on the type of information of a subspace cluster they operate on: measures taking into account only object (data point) information are subscripted with \({\textsc {Obj}}\), measures taking into account only dimension information are subscripted with \({\textsc {Dim}}\), and measures taking into account information on both objects and dimensions are subscripted with \({\textsc {SC}}\).
Given a discovered subspace cluster \(\textit{S}_R=(\mathtt {P}_R, \mathtt {D}_R)\) and a ground truth subspace cluster \(\textit{S}_G=(\mathtt {P}_G, \mathtt {D}_G)\), the following measures are defined over the objects:
$$\begin{aligned} \begin{aligned} {\textsc {recall}}_{\textsc {Obj}}(\textit{S}_R, \textit{S}_G)&= \frac{|\mathtt {P}_R \cap \mathtt {P}_G|}{|\mathtt {P}_G|}\\&={\textsc {precision}}_{\textsc {Obj}}(\textit{S}_G, \textit{S}_R). \end{aligned} \end{aligned}$$
Given the same two subspace clusters the following measures are defined on the dimensions:
$$\begin{aligned} \begin{aligned} {\textsc {recall}}_{\textsc {Dim}}(\textit{S}_R, \textit{S}_G)&= \frac{|\mathtt {D}_R \cap \mathtt {D}_G|}{|\mathtt {D}_G|}\\&={\textsc {precision}}_{\textsc {Dim}}(\textit{S}_G, \textit{S}_R). \end{aligned} \end{aligned}$$
Finally, given the same two subspace clusters, the following measures are more easily defined on the set-based representation of the subspace cluster \(\textit{S}^{\mathfrak {T}}\). We use the mapping \(\mathfrak {T}\) to switch between the representations:
$$\begin{aligned} \begin{aligned} {\textsc {recall}}_{\textsc {SC}}(\textit{S}_R, \textit{S}_G)&=\frac{|\mathfrak {T}(\textit{S}_R) \cap \mathfrak {T}(\textit{S}_G)|}{|\mathfrak {T}(\textit{S}_G)|}\\&={\textsc {precision}}_{\textsc {SC}}(\textit{S}_G, \textit{S}_R). \end{aligned} \end{aligned}$$
The \({\textsc {F1}}\) measure is the harmonic mean between precision and recall and is defined as:
$$\begin{aligned}&{\textsc {F1}}_{*}(\textit{S}_R, \textit{S}_G)\\&=\frac{2\times {\textsc {recall}}_{*}(\textit{S}_R, \textit{S}_G)\times {\textsc {precision}}_{*}(\textit{S}_R, \textit{S}_G)}{{\textsc {recall}}_{*}(\textit{S}_R, \textit{S}_G)+{\textsc {precision}}_{*}(\textit{S}_R, \textit{S}_G)} \end{aligned}$$
with \(*\in \{{\textsc {Obj}}, {\textsc {Dim}}, {\textsc {SC}}\}\).
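To illustrate, the cluster-level measures above can be sketched in a few lines of Python; the sets may contain objects, dimensions, or object-dimension pairs, and the function and variable names are ours:

```python
def recall(S_R, S_G):
    """recall(S_R, S_G) = |S_R ∩ S_G| / |S_G|, for object, dimension or pair sets."""
    return len(S_R & S_G) / len(S_G)

def precision(S_R, S_G):
    # precision(S_R, S_G) equals recall with the arguments swapped.
    return recall(S_G, S_R)

def f1(S_R, S_G):
    r, p = recall(S_R, S_G), precision(S_R, S_G)
    return 0.0 if r + p == 0 else 2 * r * p / (r + p)

# Example on the object sets of a discovered and a ground truth cluster.
P_R, P_G = {1, 2, 3, 4}, {2, 3, 4, 5, 6}
print(recall(P_R, P_G), precision(P_R, P_G), f1(P_R, P_G))
```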
Given a set of clusters \(\varvec{\mathcal {S}}_R\) and a set of ground truth clusters \(\varvec{\mathcal {S}}_G\), the measures \({\textsc {precision}}_{*}\), \({\textsc {recall}}_{*}\) and \({\textsc {F1}}_{*}\) can be defined on set level by assigning each cluster from \(\varvec{\mathcal {S}}_R\) to the closest cluster in \(\varvec{\mathcal {S}}_G\) according to the quality measure. A set quality measure \(\textsc {Q}^{S}\in \{{\textsc {precision}}_{*}^{S}, {\textsc {recall}}_{*}^{S}, {\textsc {F1}}_{*}^{S}\}\) on set level is formally defined as
$$\begin{aligned} \begin{aligned} \textsc {Q}^{S}(\varvec{\mathcal {S}}_R, \varvec{\mathcal {S}}_G)=\frac{1}{|\varvec{\mathcal {S}}_G|}\sum _{\textit{S}_R\in \varvec{\mathcal {S}}_R}\max \limits _{\textit{S}_G\in \varvec{\mathcal {S}}_G}\textsc {Q}(\textit{S}_R, \textit{S}_G) \end{aligned} \end{aligned}$$
with \(\textsc {Q}\) the non-set counterpart.
Finally, the subspace clustering measure \(\textsc {M}^{E4SC}\) [8] on set level is formally defined as
$$\begin{aligned}&\textsc {M}^{E4SC}(\varvec{\mathcal {S}}_R, \varvec{\mathcal {S}}_G)=\frac{2\times {\textsc {F1}}_{{\textsc {SC}}}^{S}(\varvec{\mathcal {S}}_R, \varvec{\mathcal {S}}_G)\times {\textsc {F1}}_{{\textsc {SC}}}^{S}(\varvec{\mathcal {S}}_G, \varvec{\mathcal {S}}_R)}{{\textsc {F1}}_{{\textsc {SC}}}^{S}(\varvec{\mathcal {S}}_R, \varvec{\mathcal {S}}_G)+{\textsc {F1}}_{{\textsc {SC}}}^{S}(\varvec{\mathcal {S}}_G, \varvec{\mathcal {S}}_R)}. \end{aligned}$$
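A minimal sketch of the set-level aggregation and of \(\textsc {M}^{E4SC}\), restricted here to the pair-based \({\textsc {F1}}_{\textsc {SC}}\) and using our own naming, could look as follows:

```python
def f1_pairs(S_R, S_G):
    """F1 on the set-based (object, dimension) pair representation of two clusters."""
    inter = len(S_R & S_G)
    r, p = inter / len(S_G), inter / len(S_R)
    return 0.0 if r + p == 0 else 2 * r * p / (r + p)

def set_level(Q, clusters_R, clusters_G):
    """Q^S: match every cluster in clusters_R to its best counterpart in clusters_G,
    normalised by the number of ground truth clusters."""
    return sum(max(Q(S_R, S_G) for S_G in clusters_G) for S_R in clusters_R) / len(clusters_G)

def e4sc(clusters_R, clusters_G):
    """M^E4SC: harmonic mean of the two directed set-level F1_SC scores."""
    a = set_level(f1_pairs, clusters_R, clusters_G)
    b = set_level(f1_pairs, clusters_G, clusters_R)
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)

# Clusters given as sets of (point index, dimension index) pairs.
R = [{(1, 0), (1, 4), (2, 0), (2, 4)}]
G = [{(1, 0), (1, 4), (2, 0), (2, 4), (3, 0), (3, 4)}]
print(e4sc(R, G))
```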

2.3 Frequent itemset mining

Let \(\varvec{\mathcal {I}}=\{\textit{i}_1, \ldots , \textit{i}_n\}\) be a finite set of n items. A transaction t is a subset of items. We denote by \(\varvec{\mathcal {T}}=(\textit{t}_1, \ldots , \textit{t}_o)\) a database of o transactions. An itemset \(\mathtt {I}\) is also a subset of items. A transaction t is said to support an itemset \(\mathtt {I}\) if \(\mathtt {I}\subseteq \textit{t}\). The set of all transactions in \(\varvec{\mathcal {T}}\) that support an itemset is called the cover of that itemset, i.e., \(cov(\mathtt {I})=\{\textit{t}\mid \textit{t}\in \varvec{\mathcal {T}}\wedge \mathtt {I}\subseteq \textit{t}\}\). The support of an itemset is the size of its cover, i.e., \(sup(\mathtt {I})=|cov(\mathtt {I})|\). Frequent itemsets and maximal frequent itemsets are defined as follows.
Definition 2
(Frequent itemset) Given a minimal support threshold \(\sigma \ge 0\), an itemset \(\mathtt {I}\) is frequent if its support is larger than or equal to \(\sigma \), i.e., \(sup(\mathtt {I}) \ge \sigma \).
Definition 3
(Maximal frequent itemset) A frequent itemset \(\mathtt {I}\) is called maximal if there exists no strict superset of \(\mathtt {I}\) that is also frequent with respect to \(\sigma \).
The anti-monotonic property of the support of itemsets guarantees that all subsets of a frequent itemset are also frequent.
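These notions translate directly into code. The following sketch, over a toy transaction database of our own, computes covers and supports and checks maximality by trying single-item extensions, which suffices because of the anti-monotonicity of support:

```python
# Toy transaction database; items are data point indices.
T = [{1, 2, 3}, {1, 2, 3, 4}, {2, 3, 4}, {1, 2}]

def cover(I, T):
    """All transactions that contain every item of itemset I."""
    return [t for t in T if I <= t]

def support(I, T):
    return len(cover(I, T))

def is_frequent(I, T, sigma):
    return support(I, T) >= sigma

def is_maximal(I, T, sigma, items):
    """Frequent, and no single-item extension is still frequent."""
    if not is_frequent(I, T, sigma):
        return False
    return all(not is_frequent(I | {i}, T, sigma) for i in items - I)

items = set().union(*T)
print(support({2, 3}, T))                  # 3
print(is_maximal({1, 2, 3}, T, 2, items))  # True: adding item 4 drops the support to 1
```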

3 Randomised subspace clusters

Existing methods for discovering subspace clusters from numeric data often focus on the complete raw dataset to compute subspace clusters using a bottom-up [4, 9, 10] or a top-down approach [11]. In this paper we introduce \(\textsc {Rascl}\), which takes a different approach to the problem by using randomised subsets of the data (both in the data points and in the dimensions) as a starting point for detecting subspace clusters. The discovered clusters are then checked for occurrence in multiple subsamples of the data. If a cluster occurs frequently enough in the set of samples we output it as a subspace cluster. Our algorithm relies on two simple premises: 1) higher-dimensional subspace clusters also form subspace clusters in lower dimensions, which is also the basis for other bottom-up approaches to subspace clustering [9]; 2) if we take enough random samples and use them to detect clusters, a lot of similar subclusters of the same true cluster will be found in different projections. Moreover, by repeating such a randomised procedure many times we end up with a stable solution.
In a nutshell, our proposed algorithm consists of three consecutive steps:
1) convert a numerical database to a transactional one using \({\textit{K}}\text {-means}\) clustering (Section 3.1),
2) extract interesting subspace clusters from the transactional database using frequent itemset mining (Section 3.2),
3) heuristically select the most interesting subspace clusters (Section 3.4).

3.1 Randomised data transformation

3.1.1 Data binarisation

To binarise a numeric database \(\varvec{\mathcal {P}}\) into a transaction database \(\varvec{\mathcal {T}}\) we use the indices of data points as the items for \(\varvec{\mathcal {T}}\), resulting in \(|\varvec{\mathcal {P}}|\) items. In addition, we obtain a mapping between data points and items. Ideally, a transaction contains data points that are close to each other in some set of dimensions. In that case, an itemset (essentially a set of data points) that occurs in a large fraction of transactions can be seen as a subspace cluster over some set of dimensions.
We define a randomised process for constructing a single transaction database. We repeat this process \({\textit{n}}\) times and concatenate all transactions into a single database \(\varvec{\mathcal {T}}^{*}\). We first sample a small subset of data points \(\mathtt {P}\) and a small subset of dimensions \(\mathtt {D}\). (The sampling strategy is explained below.) The data points are projected onto the subset of dimensions and used as input for the \({\textit{K}}\text {-means}\) clustering algorithm. The resulting cluster centroids are used to partition the original data points, assigning each data point to the closest centroid. As such, each centroid represents one transaction in the transactional database and its items are the data points assigned to it. Formally, for a set of centroids \(\mathtt {C}=\{\textit{c}_1, \ldots , \textit{c}_{{\textit{K}}}\}\) found by \({\textit{K}}\text {-means}\), the closest centroid for a projected data point \(\textit{p}^{\mathtt {D}}\) is given by
$$\begin{aligned} \mathop {\mathrm{argmin}}\limits _{\textit{c}\in \mathtt {C}} \delta _{\mathtt {D}}(\textit{p}^{\mathtt {D}}, \textit{c}). \end{aligned}$$
In the unlikely event that multiple centroids are the closest, we take the first one that was encountered.
Example 1
An example conversion from a numerical database to a binarised one is shown in Fig. 1. The figure shows a toy database of 11 data points in a 2D space. A red circle represents a synthetic cluster centroid and the surrounding square visually shows data points closest to that centroid. When constructing the binarised database, the index of a data point is added to the transaction of the nearest cluster centroid. The resulting transaction database is shown in Fig. 1b. For example, for \(\textit{p}_1\), \(\textit{p}_2\) and \(\textit{p}_3\) the closest centroid is \(\textit{c}_1\) resulting in the first transaction.

3.1.2 Generating data samples

As mentioned previously, our binarisation strategy requires a sample of data points and a sample of dimensions. The main question now is how we can bias the sampling procedure to obtain samples that will have a higher potential to contain cluster structures.
For the data points, we can sample \({\textit{k}}\) data points uniformly at random. By repeating this a large number of times, we expect each cluster to be represented by a sufficient number of data points in a high enough number of samples.
For the dimensions, a naive solution would be to sample a subset of dimensions of size x uniformly at random, with \(1 \le x \le |\varvec{\mathcal {D}}|\). However, since the number of dimension subsets grows exponentially, such a uniformly sampled subset will likely be too large to contain a meaningful cluster (e.g., if there are 100 dimensions, there are many more subsets of size 50 than of size 2, while a subspace cluster is more likely to be found in a low-dimensional space). The probability that a uniformly sampled set contains a subspace cluster therefore decreases rapidly as the size of \(\varvec{\mathcal {D}}\) increases. On the other hand, sampling just one dimension may result in discovering cluster structures that do not span multiple dimensions. Therefore, at this stage, we sample a set of two dimensions, and, if a cluster spans more than two dimensions, the relevant set of dimensions will be identified at a later stage (see Sect. 3.2). We apply weighted sampling to boost the probability of sampling dimensions that contain cluster structures. Similar to Moise et al. [4], we assume that uniformly distributed dimensions do not contain any cluster structure. To detect non-uniformity of a dimension, we create a histogram using the Freedman–Diaconis rule [12] to compute an appropriate number of bins for the data. This rule is robust to outliers and does not assume the data to be normally distributed. Let us denote by \(B^\textit{D}\) the bins for a given dimension using the Freedman–Diaconis rule and let |b| denote the number of data points falling in bin b. The unnormalised sampling potential \(\mathcal {W}\) of a dimension is given by
$$\begin{aligned} \mathcal {W}(\textit{D}) = \sqrt{\frac{|\{b\bigm | b \in B^\textit{D}\wedge |b| \le \frac{|\varvec{\mathcal {P}}|}{|B^\textit{D}|}\}|}{|B^\textit{D}|}}. \end{aligned}$$
(1)
The logic behind this formula is that we wish to prioritise those dimensions in which the points are not uniformly distributed, as they are more likely to contain clusters. Note that if the points were uniformly distributed, the expected number of points in each bin would be \(\frac{|\varvec{\mathcal {P}}|}{|B^\textit{D}|}\). However, if a large number of points were clustered in a single bin, with the remaining points uniformly distributed across the remaining bins, then all those remaining bins would contain fewer than \(\frac{|\varvec{\mathcal {P}}|}{|B^\textit{D}|}\) points. We therefore count how many bins contain at most the number of data points expected under a uniform distribution. The larger the number of such bins in a given dimension \(\textit{D}\), the higher the value of \(\mathcal {W}(\textit{D})\). The resulting distribution thus favours dimensions with a larger clustering potential.
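Assuming that numpy's built-in 'fd' histogram binning matches the Freedman–Diaconis rule used here, the sampling potential of Eq. (1) can be sketched as follows:

```python
import numpy as np

def sampling_potential(column):
    """Unnormalised sampling potential W(D) of a single dimension (Eq. 1): the square root
    of the fraction of bins holding at most the expected number of points under uniformity."""
    counts, _ = np.histogram(column, bins="fd")   # Freedman-Diaconis binning
    expected = len(column) / len(counts)
    return np.sqrt(np.sum(counts <= expected) / len(counts))

rng = np.random.default_rng(0)
uniform_dim = rng.uniform(size=1000)
clustered_dim = np.concatenate([rng.normal(0.0, 0.05, 800), rng.uniform(size=200)])
# The clustered dimension typically receives a higher potential and is sampled more often.
print(sampling_potential(uniform_dim), sampling_potential(clustered_dim))
```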
Note that randomly sampling sets of dimensions also makes our algorithm robust to missing values. Since we explore various sets of dimensions, missing values have very limited impact on our method. Concretely, a missing value in a particular dimension would only prevent a point from being included in a cluster that is defined by that dimension. Such a missing value would play no part at all in evaluating whether the point belongs to a cluster that is not defined by that dimension.

3.1.3 Pseudocode

The procedure for binarisation is given in Algorithm 1. It takes as input a database of data points, a number \({\textit{k}}\) of data points to sample and parameter \({\textit{K}}\) used for clustering. Line 1 initialises the list of transactions, line 2 samples the data points and line 3 samples the dimensions. Line 4 runs K-means on the projected data points and stores the projected centroids. Lines 5 to 7 create single transactions per cluster centroid, by adding all data point indices to whichever cluster centroid is closest. The binarised database is returned on line 8.
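A possible Python rendering of a single binarisation round is sketched below, using scikit-learn's KMeans; the weighted dimension sampling of Sect. 3.1.2 is passed in as an optional probability vector, and all names are ours rather than those of the reference implementation:

```python
import numpy as np
from sklearn.cluster import KMeans

def binarise_once(P, k, K, rng, dim_weights=None):
    """One randomised binarisation round: sample k points and two dimensions, cluster the
    projected sample with K-means, then assign every point in P to its closest centroid.
    Returns one transaction (a set of point indices) per centroid, plus the linked dimensions."""
    n_points, n_dims = P.shape
    sample_idx = rng.choice(n_points, size=k, replace=False)
    dims = rng.choice(n_dims, size=2, replace=False, p=dim_weights)

    projected_sample = P[np.ix_(sample_idx, dims)]
    km = KMeans(n_clusters=K, n_init=10, random_state=0).fit(projected_sample)

    # Assign *all* points, projected onto the sampled dimensions, to the closest centroid.
    labels = km.predict(P[:, dims])
    transactions = [set(np.where(labels == c)[0]) for c in range(K)]
    return transactions, set(dims)

# Repeating this n times and concatenating the transactions yields the database T*.
rng = np.random.default_rng(0)
P = rng.normal(size=(500, 20))
database = [binarise_once(P, k=100, K=20, rng=rng) for _ in range(10)]
```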

3.1.4 Time complexity

The worst case complexity of our binarisation method is mostly dependent on K-means. However, we use only a small subset of data points, typically \(|\mathtt {P}| \ll |\varvec{\mathcal {P}}|\), to compute cluster centroids. For this small subset the complexity for clustering is \(\mathcal {O}({\textit{n}}\times (|\mathtt {P}|\times |\mathtt {D}| \times {\textit{K}}\times i))\) with \({\textit{n}}\) the number of database samples and i the number of iterations. The generation of samples for data points can be done in \(\mathcal {O}(|\varvec{\mathcal {P}}|)\) and for dimensions can be done in \(\mathcal {O}(|\varvec{\mathcal {D}}|)\). The assignment of data points to cluster centroids is done in a single sweep, i.e., \(\mathcal {O}({\textit{K}}\times |\varvec{\mathcal {P}}|)\). The total time complexity for generating samples and binarising the database is \(\mathcal {O}({\textit{K}}\times |\varvec{\mathcal {P}}|+|\varvec{\mathcal {D}}|+ {\textit{n}}\times (|\mathtt {P}|\times |\mathtt {D}| \times {\textit{K}}\times i))\).

3.2 Extracting subspace clusters

We previously constructed a binarised database \(\varvec{\mathcal {T}}^{*}\) by concatenating \({\textit{n}}\) binarised databases built using random samples of data points and dimensions. The premise for each database is that its transactions represent cluster centroids and that their items are indices of data points in close proximity to those centroids in the sampled set of dimensions. Since we generated \({\textit{n}}\) samples, we know that each index occurs \({\textit{n}}\) times within \(\varvec{\mathcal {T}}^{*}\). If a set of items occurs together often in the database, i.e., it is a frequent itemset with high support, then we know that in many sets of dimensions the same set of data points occurs in close proximity, which is exactly the objective of a subspace cluster. This essentially means that every frequent itemset \(\mathtt {I}\) for which \(sup(\mathtt {I}) \ge \sigma \), with \(0 < \sigma \le {\textit{n}}\), represents a subspace cluster in a currently unknown set of dimensions. However, the number of frequent itemsets is typically huge, largely because all subsets of frequent itemsets are themselves frequent. To alleviate this problem, we use maximal itemsets and, more particularly, our algorithm samples \(\mu \) maximal frequent itemsets from the binarised database. The resulting itemsets provide the data points for subspace clusters.
An effective method for sampling maximal frequent itemsets was introduced by Moens and Goethals [13]. It iteratively extends an itemset with new items until the set is found to be maximal given a threshold \(\tau \) and a monotonic quality measure (e.g., support). In each step, a probability distribution is computed over the remaining items (i.e., augmentations that result in a score \(\ge \tau \)) given a quality function (e.g., the support of the itemset augmented with the item). After sampling a single item from the distribution, the itemset is updated and remaining items whose addition would result in a score below \(\tau \) are discarded. Next, the distribution is recomputed and the process is repeated until the list of remaining items is empty. The resulting itemset is maximal by construction.
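The following sketch captures the spirit of this procedure for the special case where the quality measure is the support itself; it is our own simplification of the sampler of [13], not its actual implementation:

```python
import random

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t)

def sample_maximal_itemset(transactions, sigma, rng=random):
    """Randomly grow an itemset: draw the next item proportionally to the support of the
    augmented set, keeping only candidates whose addition stays frequent (support >= sigma)."""
    items = set().union(*transactions)
    current = set()
    candidates = {i for i in items if support({i}, transactions) >= sigma}
    while candidates:
        weights = [support(current | {i}, transactions) for i in candidates]
        chosen = rng.choices(list(candidates), weights=weights, k=1)[0]
        current.add(chosen)
        candidates = {i for i in candidates - {chosen}
                      if support(current | {i}, transactions) >= sigma}
    return current  # maximal by construction: no remaining item keeps it frequent

T = [{1, 2, 3}, {1, 2, 3, 4}, {2, 3, 4}, {1, 2}]
print(sample_maximal_itemset(T, sigma=2))
```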
After extracting a collection of data points, the next step is to discover the dimensions in which the data points form a cluster. In contrast to some existing methods [4, 11], we do not need to go back to the data itself to check each dimension individually, since our binarisation process preserved some essential information that can guide us here. That is, our algorithm previously sampled collections of dimensions which can be reused to determine a valid subset of dimensions. We denote by \(dims(\textit{t})\) a map that returns, for a transaction, its linked dimensions, i.e., the dimensions that were used for its construction in the binarisation process. For a maximal itemset \(\mathtt {I}\), we can use the transactions in its cover to determine its relevant dimensions, i.e., the set containing all linked dimensions of transactions in \(cov(\mathtt {I})\). Formally, \(dims(\mathtt {I}) = \{d \in \varvec{\mathcal {D}}\mid \exists \textit{t}\in cov(\mathtt {I}): d \in dims(\textit{t})\}\). An itemset \(\mathtt {I}\), mapped to the data points \(\mathtt {P}\), forms together with its relevant dimensions the subspace cluster \(\textit{S}=(\mathtt {P}, dims(\mathtt {I})).\)
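Assuming that each transaction is stored together with the pair of dimensions it was built from, the linked-dimension lookup can be sketched as follows (again with our own names):

```python
def linked_dimensions(itemset, transactions):
    """dims(I): the union of the dimension sets linked to the transactions in cov(I)."""
    dims = set()
    for t, t_dims in transactions:
        if itemset <= t:          # t is in the cover of the itemset
            dims |= t_dims
    return dims

# Each entry is (transaction, dimensions used to build it during binarisation).
T = [({1, 2, 3}, {0, 4}), ({1, 2, 3, 5}, {0, 7}), ({2, 4}, {3, 4})]
print(linked_dimensions({1, 2, 3}, T))   # {0, 4, 7}
```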
Table 1
An overview of the main characteristics of the synthetic datasets that have been used throughout the experiments
 
Dataset             #rows   #dimensions   #clusters   #objects/cluster (avg)   #dimensions/cluster (avg)
dbsizescale_s1500   1,595   20            10          166.3                    14.0
dbsizescale_s2500   2,658   20            10          276.5                    14.0
dbsizescale_s3500   3,722   20            10          385.8                    14.0
dbsizescale_s4500   4,785   20            10          496.2                    14.0
dbsizescale_s5500   5,848   20            10          608.5                    14.0
dimscale_d05        1,595    5            10          182.6                     3.5
dimscale_d10        1,595   10            10          181.5                     6.7
dimscale_d25        1,595   25            10          180.9                    16.9
dimscale_d50        1,595   50            10          181.6                    33.5
dimscale_d75        1,595   75            10          181.9                    50.4
noisescale_n10      1,611   20            10          166.5                    14.6
noisescale_n30      2,071   20            10          166.1                    14.6
noisescale_n50      2,900   20            10          166.3                    14.6
noisescale_n70      4,833   20            10          166.8                    14.6

3.2.1 Pseudocode

The pseudocode for extracting subspace clusters is given in Algorithm 2. Line 1 initialises the list of subspace clusters. Lines 2 to 7 extract \(\mu \) subspace clusters by first sampling a maximal itemset from \(\varvec{\mathcal {T}}\) (line 3) and mapping the itemset to the real data points (line 4). Then the relevant dimensions are computed in line 5 and the subspace cluster is added to the list of clusters in line 6. Finally, line 8 returns all extracted subspace clusters.

3.3 General approach to binarisation

Above, we described the binarisation stage of our algorithm. In short, we transform the multi-dimensional dataset into a transaction database by first selecting a number of random points from the data, then using \({\textit{K}}\text {-means}\) to cluster those points, and, finally, assigning all the other original points to the nearest cluster centroid. This final step could also be seen as a 1-nearest neighbour classification of those remaining points. More generally, our algorithm is an instance of a procedure where clustering is performed on a small subset of points, and those clusters are then used as class labels to classify the remaining points.
In our case, we chose \({\textit{K}}\text {-means}\) clustering and 1-nearest neighbour classification as they are among the simplest and most efficient clustering and classification algorithms, respectively. However, in principle, any clustering algorithm could be used for the first step and any classification algorithm for the second. Obviously, an advantage of \({\textit{K}}\text {-means}\) is that it produces natural centroids, with which the coordinates of the remaining points can easily be compared. Clusters produced by other clustering algorithms may require different characterisation techniques, just as other classification algorithms may rely on different ways to assign class labels to the remaining points. It is beyond the scope of this paper to analyse how combinations of various clustering and classification algorithms would perform in our problem setting, but this does remain a potential avenue for future research.
Additionally, in the interest of efficiency, it is even possible to skip the clustering step altogether. In the extreme case, the chosen values for k (the number of randomly selected points) and K (the desired number of clusters) could be the same. In this case, each of the points would represent an initial centroid for \({\textit{K}}\text {-means}\), and the clustering algorithm would end after one iteration, resulting in each cluster containing exactly one point—namely, its original centroid. Naturally, in this case, it would make no sense to even call the clustering algorithm. In short, this method boils down to simply selecting K random points and then building a transaction database by assigning all other points to the nearest point. In Sect. 4, we experimentally compare this approach to using \({\textit{K}}\text {-means}\) on \({\textit{k}}\) random points and refer to this version of the algorithm as \(\textsc {Rascl}_\textsc {R}\).
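A sketch of this \(\textsc {Rascl}_\textsc {R}\) variant, under the same assumptions as the earlier binarisation sketch (randomly chosen points act as centroids and every point is assigned to the nearest one, with no clustering step), is given below:

```python
import numpy as np

def binarise_once_random(P, K, rng):
    """One RASCL_R binarisation round: pick K random points as centroids in a random
    pair of dimensions and assign every point to its nearest centroid (Voronoi allocation)."""
    n_points, n_dims = P.shape
    dims = rng.choice(n_dims, size=2, replace=False)
    centroids = P[np.ix_(rng.choice(n_points, size=K, replace=False), dims)]

    distances = np.linalg.norm(P[:, dims][:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    transactions = [set(np.where(labels == c)[0]) for c in range(K)]
    return transactions, set(dims)

rng = np.random.default_rng(0)
P = rng.normal(size=(500, 20))
transactions, dims = binarise_once_random(P, K=20, rng=rng)
```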

3.4 Selecting the best subspace clusters

After discovering a large number of subspace clusters (depending on parameter \(\mu \)), we finally select a small collection of \({\textit{r}}\) clusters that can be deemed the most interesting subspace clusters. A large number of data points in a subspace cluster indicates that the same set of data points is often related, even in different subsets of dimensions. That is, even though the different samples use different dimension sets, the data points in the resulting binarised database occur in transactions of similar cluster centroids. In our method, we employ this heuristic (i.e., the larger the cluster, the better) for sorting the discovered subspace clusters. In Sect. 4.3, we experimentally demonstrate that this sorting strategy works well. To reduce redundancy in the cluster results, we then sequentially evaluate each cluster and select those clusters that have less than 25% overlap with previously selected ones. Note that, when sorting the clusters by the number of objects, this results in smaller clusters as \({\textit{r}}\) increases. Finally, we exclude very small clusters with fewer than 10 data points.
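A minimal sketch of this selection step is given below; each candidate is a (points, dimensions) pair as produced by the extraction step, and since the exact overlap measure is not fixed by the description above, the sketch assumes the Jaccard index of the object sets:

```python
def select_clusters(clusters, r, max_overlap=0.25, min_size=10):
    """Greedily keep at most r clusters: sort candidates by object count, drop clusters
    with fewer than min_size points, and accept a candidate only if its overlap with
    every previously selected cluster stays below max_overlap."""
    def jaccard(a, b):
        return len(a & b) / len(a | b)

    selected = []
    for points, dims in sorted(clusters, key=lambda c: len(c[0]), reverse=True):
        if len(points) < min_size:
            continue
        if all(jaccard(points, p) < max_overlap for p, _ in selected):
            selected.append((points, dims))
        if len(selected) == r:
            break
    return selected
```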

4 Experiments

We implemented \(\textsc {Rascl}\) in Python, but depend on a Java library for sampling maximal frequent itemsets. Our source code and all accompanying experiments have been made available online1. In our experiments, we use synthetic benchmark data provided by Günnemann et al. in their overview paper on subspace clustering [8]. This data is used to evaluate the performance of subspace clustering techniques with respect to their invariance to size, dimensionality and noise. The characteristics of the datasets are shown in Table 1. To evaluate the performance of the various techniques we measure \({\textsc {precision}}_{\textsc {Obj}}\) and \({\textsc {recall}}_{\textsc {Obj}}\) scores on object (data point) level, as well as their harmonic mean \({\textsc {F1}}_{\textsc {Obj}}\). These measures evaluate the correctness of the objects defined by the subspace clusters with respect to some ground truth. Additionally we use their dimensionality aware counterparts which are indicated by the subscripts \({\textsc {Dim}}\) (for scores about the dimensions) and \({\textsc {SC}}\) (for scores about the combination of objects and dimensions) [8]. Finally, we also use \(\textsc {M}^{E4SC}\) [8], a measure to assess the quality of clusterings. Note that unless stated otherwise, we assign just one discovered cluster to each ground truth cluster.
We compare two variants of our algorithm: \(\textsc {Rascl}\) sets \({\textit{k}}> {\textit{K}}\), while \(\textsc {Rascl}_\textsc {R}\) sets \({\textit{k}}={\textit{K}}\), which essentially skips the clustering step (i.e., we randomly pick \({\textit{k}}\) points and allocate the remaining points to the nearest point). For each dataset, we use the ground truth and select the ground truth cluster with the largest overlap (highest \({\textsc {precision}}_{\textsc {SC}}\)) with the cluster being evaluated to compute its quality. We run each experiment 10 times and report the average results for the first \({\textit{r}}\) subspace clusters. Note that, based on the overlap parameter, fewer than \({\textit{r}}\) clusters may be reported. Finally, unless stated otherwise, we fix the following parameters: \({\textit{n}}=1000\), \({\textit{k}}=100\), \({\textit{K}}=20\), \(\sigma =200\), \(\mu =100\) and \({\textit{r}}=10\). As a general rule, we provide the following guidance: \({\textit{n}}\) should be set high enough to obtain a representative sample, \({\textit{k}}\) should be sufficiently larger than \({\textit{K}}\) for the clustering to make sense, \({\textit{r}}\) should be set to the desired number of discovered clusters, and \(\mu \) should be high enough that no information is lost due to randomisation. \({\textit{K}}\) and \(\sigma \) are more difficult to set, but we show that the performance of \(\textsc {Rascl}\) is not overly sensitive to changes in their values (as long as \(\sigma \) is low enough to find some itemsets, otherwise no clusters at all can be found).

4.1 Cluster quality

We first compare our methods to \(\textsc {CartiClus}\) [5] and \(\textsc {ProClus}\) [11]. We used different setups of \(\textsc {Rascl}\) and \(\textsc {Rascl}_\textsc {R}\) by varying \({\textit{K}}\) and \(\sigma \) (indicated by the superscript). For \(\textsc {CartiClus}\) we use the parameter settings as selected by the authors of the original paper [5] as basis for this experiment. For \(\textsc {ProClus}\) we set parameters following the ground truth.
Table 2
Detailed runtime breakdown for \(\textsc {Rascl}^{{\textit{K}}20, \sigma 200}\)
 
Dataset             Read data   Generate samples   Generate transactions   Sample clusters   Post-process   Total time
dbsizescale_s1500   0.02        0.19               11.71                   2.89              0.02           14.83
dbsizescale_s2500   0.02        0.18               12.93                   4.03              0.03           17.19
dbsizescale_s3500   0.02        0.17               14.30                   5.82              0.02           20.33
dbsizescale_s4500   0.03        0.18               15.47                   7.32              0.03           23.02
dbsizescale_s5500   0.03        0.17               16.42                   8.39              0.02           25.04
dimscale_d05        0.01        0.17               12.47                   2.80              0.01           15.46
dimscale_d10        0.04        0.17               11.79                   2.70              0.01           14.71
dimscale_d25        0.02        0.20               11.51                   2.93              0.02           14.68
dimscale_d50        0.03        0.22               11.90                   2.91              0.02           15.08
dimscale_d75        0.04        0.25               11.63                   2.83              0.02           14.76
noisescale_n10      0.02        0.19               12.40                   2.80              0.03           15.43
noisescale_n30      0.02        0.19               11.71                   2.97              0.03           14.92
noisescale_n50      0.02        0.17               13.79                   3.58              0.02           17.58
noisescale_n70      0.03        0.19               15.75                   5.02              0.02           21.01
The object quality results in terms of \({\textsc {precision}}_{\textsc {Obj}}\) and \({\textsc {recall}}_{\textsc {Obj}}\) are shown in Fig. 2. We can observe that all algorithms perform very well with respect to \({\textsc {precision}}_{\textsc {Obj}}\) except for \(\textsc {ProClus}\). Moreover, using \({\textit{K}}=20\) and \(\sigma =200\) our algorithms slightly outperform \(\textsc {CartiClus}\). Turning to \({\textsc {recall}}_{\textsc {Obj}}\), we see that \(\textsc {Rascl}_\textsc {R}^{{\textit{K}}10\sigma 100}\), \(\textsc {Rascl}^{{\textit{K}}10\sigma 100}\) and \(\textsc {Rascl}^{{\textit{K}}20\sigma 200}\) outperform the competitors on \({\textsc {recall}}_{\textsc {Obj}}\), while \(\textsc {Rascl}_\textsc {R}^{{\textit{K}}20\sigma 200}\) often fails to deliver good results. This is due to the randomness introduced by our algorithm: using random centroids leads to fewer completely similar transactions and more partially similar transactions. Combined with a high support this results in small subclusters of the true ground truth clusters, leading to very high \({\textsc {precision}}_{\textsc {Obj}}\) with a low \({\textsc {recall}}_{\textsc {Obj}}\).
The results for the dimension quality are shown in Fig. 3. We see that our algorithms generally outperform the competitors by quite a margin and we see that our simple solution of using linked dimensions (Sect. 3.2) works really well. This holds when the sampled cluster truly defines a subspace cluster. For smaller subspace clusters with lower \({\textsc {precision}}_{\textsc {Obj}}\), the dimension quality decreases as a result. Comparing (\({\textit{K}}=10\), \(\sigma =100\)) to (\({\textit{K}}=20\), \(\sigma =200\)), the latter produces better results, mostly because there are more linked dimensions tied to the cover of the maximal itemset, substantially boosting \({\textsc {recall}}_{\textsc {Dim}}\).
Table 3
Detailed runtime breakdown for \(\textsc {Rascl}_\textsc {R}^{{\textit{K}}20, \sigma 200}\)
 
Dataset             Read data   Generate samples   Generate transactions   Sample clusters   Post-process   Total time
dbsizescale_s1500   0.02        0.15               2.40                    2.36              0.01           4.94
dbsizescale_s2500   0.02        0.13               3.10                    3.52              0.02           6.79
dbsizescale_s3500   0.02        0.12               3.91                    4.79              0.01           8.85
dbsizescale_s4500   0.03        0.12               4.69                    6.04              0.02           10.90
dbsizescale_s5500   0.03        0.12               5.61                    7.15              0.02           12.93
dimscale_d05        0.01        0.12               2.33                    2.28              0.01           4.76
dimscale_d10        0.02        0.13               2.32                    2.25              0.01           4.73
dimscale_d25        0.02        0.16               2.35                    2.21              0.01           4.75
dimscale_d50        0.03        0.19               2.35                    2.28              0.02           4.86
dimscale_d75        0.04        0.21               2.30                    2.18              0.02           4.75
noisescale_n10      0.02        0.13               2.30                    2.51              0.01           4.97
noisescale_n30      0.02        0.13               2.67                    2.69              0.03           5.54
noisescale_n50      0.02        0.12               3.38                    3.36              0.02           6.89
noisescale_n70      0.03        0.12               5.03                    5.17              0.02           10.38

4.2 Runtime experiment

We tested the time required to run the various algorithms on each of the synthetic datasets. Our aim here is not to show that our algorithm is faster than existing methods such as \(\textsc {ProClus}\), but rather that our method scales well as the characteristics of the data (size, dimensionality and noise) grow.
We used the optimal parameter settings for all methods tested and performed 10 trial runs to account for fluctuations in runtimes. The averaged timing results are shown in Fig. 4. In Fig. 5, we show the same runtimes for growing dataset characteristics, but relative to the runtime of the baseline setting, i.e., for sizescale compared to the dataset with 1500 data points, for dimscale compared to the dataset with 5 dimensions, and for noisescale compared to the dataset with 10 noise dimensions.
A first thing to notice is that \(\textsc {ProClus}\) runs very fast for each of these datasets. \(\textsc {Rascl}_\textsc {R}\) and \(\textsc {Rascl}\) behave very similarly in general, but \(\textsc {Rascl}_\textsc {R}\) requires significantly less time because it skips the clustering step to create cluster centroids. \(\textsc {CartiClus}\) runs fast for smaller sized datasets, but tends to blow up quickly once the characteristics of the data increase.
In addition to the full runtimes, we also show the breakdown into the individual steps of the \(\textsc {Rascl}\) and \(\textsc {Rascl}_\textsc {R}\) algorithms in Tables 2 and 3, respectively. First, reading the data is a small cost. The generation of samples is fast, and we see no impact of the increasing database size for the dbsizescale and noisescale datasets. For the dimscale datasets, we see the time increase because the number of dimensions increases and we compute the sampling potential using the density-based approach of Eq. 1. For \(\textsc {Rascl}\), the generation of transactions depends largely on the number of records in the dataset, as we use \({\textit{K}}\)-means to select centroids. In contrast, \(\textsc {Rascl}_\textsc {R}\) does not use this step, but instead uses Voronoi allocation with randomly selected centroids, which substantially reduces the runtime of transaction generation. Sampling of clusters is tied to the number of items in the transaction dataset, which ultimately depends on the number of data objects in the original dataset. Therefore, we see the time increase with the size of the data for dbsizescale and noisescale. In addition, comparing \(\textsc {Rascl}\) and \(\textsc {Rascl}_\textsc {R}\), we see that the former runs longer, which can be attributed to the fact that it finds longer and better itemsets, because the formed clusters are inherently better than in the fully randomised approach. Finally, the post-processing (final cluster selection) is a minor part of the algorithms.

4.3 Sorting strategy

In Sect. 3.4, we introduced a heuristic for sorting the sampled clusters. To evaluate it, we sampled a collection of clusters using the default parameters and setting \(\mu =10000\) for \(\textsc {Rascl}\). The results are shown in Fig. 6: the upper chart shows a histogram of the length of the sampled clusters, while the lower chart shows the relationship between the object length and the \({\textsc {F1}}_{\textsc {SC}}\)-score for \(\hbox {dbsizescale}_\text {s5500}\). (Other datasets produce similar figures.) The colours indicate the cluster to which the subspace cluster is assigned when computing \({\textsc {F1}}_{\textsc {SC}}\). On the left, we show the results using all sampled clusters and, on the right, the results when retaining only non-overlapping cluster allocations. Our algorithm mostly produces large clusters, i.e., our randomised process generates many highly similar transactions, resulting in large maximal itemsets with high \({\textsc {F1}}_{\textsc {SC}}\)-scores: good object quality results in good dimension quality. An interesting observation is the kink at object size 572 in Fig. 6a, which is an artefact of the data creation process that includes overlapping clusters: subclusters of overlapping clusters are assigned to the wrong ground truth cluster, resulting in lower cluster scores. More precisely, the data contains a cluster \(C_1\) of object size 572 that is fully contained in another cluster \(C_2\) of size 1101, but with a slightly modified dimension set. For sampled clusters with an object size larger than 572, our algorithm finds a set of objects that is a subset of \(C_2\). However, our dimension strategy finds the dimension set for cluster \(C_1\), because there is not enough evidence that some dimensions are irrelevant for a \(C_2\) subspace cluster. Consequently, the \({\textsc {precision}}_{\textsc {Obj}}\) of the cluster goes down, which is reflected in the \({\textsc {F1}}_{\textsc {SC}}\)-score. Different quality scores show similar effects. Indeed, when retaining only non-overlapping cluster allocations, our algorithm mostly produces full clusters. Note that the low score for the orange cluster is due to randomisation in the transaction dataset creation, and other runs could produce much better results for this cluster (see also Fig. 6d).

4.4 Parameter sensitivity

We test the influence of our most important parameters, \({\textit{K}}\) and \(\sigma \), on the \(\hbox {dimscale}_\text {d05}\), \(\hbox {dimscale}_\text {d25}\) and \(\hbox {dimscale}_\text {d75}\) datasets and show the \(\textsc {M}^{E4SC}\) scores. Note that even though our algorithm is not meant for producing full clusterings, it produces very high-quality results on this metric. Figure 7 shows the scores for the first 10 subspace clusters using various parameter settings for the algorithms. For our algorithm, we use a window around the default parameters. For \(\textsc {CartiClus}\), we use a grid around the optimal parameters, and for \(\textsc {ProClus}\), we define sensible grids. We see that \(\textsc {Rascl}\) is not overly susceptible to parameter changes and that, in general, the default parameters produce good and stable results. In contrast, \(\textsc {Rascl}_\textsc {R}\) can still produce very good results, but the quality diminishes quickly as the parameters move away from the optimal values. Increasing \({\textit{K}}\) or \(\sigma \) results in subclusters of the true clusters, thus decreasing the overall score. We see that finding good settings for \(\textsc {CartiClus}\) and \(\textsc {ProClus}\) is much harder. For \(\textsc {ProClus}\), the parameter l cannot exceed the number of dimensions in the data, resulting in many 0 scores in the figures. The experiments on other datasets produced similar results.
We also show the impact of varying \({\textit{n}}\) and \({\textit{k}}\), whilst keeping the other parameters fixed at the default settings (\({\textit{K}}=20\), \(\sigma =200\) and \(\mu =100\)). For each setting, we create 10 sets of clusters and compute the \(\textsc {M}^{E4SC}\) score. We then average the quality scores for each parameter setting and plot the resulting scores using a boxplot. The results on the synthetic datasets are shown in Fig. 8. To show the stabilisation of the quality score, we show three different setups. First, in Fig. 8a, we vary \({\textit{n}}\) from 800 up to 2000 (incl.) in steps of 100, while varying \({\textit{k}}\) between 80 and 400 (incl.) in steps of 20. Second, in Fig. 8b, we summarise the scores when \({\textit{n}}\) varies between 800 and 1200 (incl.) and \({\textit{k}}\) varies between 80 and 120 (incl.). Finally, in Fig. 8c, we show the quality when \({\textit{n}}\) is fixed to 1000 and \({\textit{k}}\) is fixed to 100. These results show the stability of the scores while varying the two parameters. In addition, this shows how the fixed setting (\({\textit{n}}=1000, {\textit{k}}=100\)) consistently achieves one of the highest scores for each dataset.

4.5 Real-world data

We tested our method on the pendigits dataset, a classification dataset found in the UCI machine learning repository2. Using \(\textsc {Rascl}\) with \({\textit{n}}=1000\), \({\textit{k}}=100\), \({\textit{K}}=10\), \(\sigma =100\), \(\mu =100\) and \({\textit{r}}=10\) we discover multiple subspace clusters for each class. A general trend we found was that the discovered clusters have a very high \({\textsc {precision}}_{\textsc {Obj}}\) of approx. 91%, but they have rather low \({\textsc {recall}}_{\textsc {Obj}}\) averaging around 20%. We evaluate the largest subspace cluster in Fig. 9. The silhouette plot shows high similarity for points in the cluster (red) and a much lower score for points outside the cluster (blue). A similar trend is found in the scatter plots, which are obtained using t-SNE transformation [14] based on the relevant dimensions. The left scatter plot shows all data (blue) together with the subspace cluster (red), while the right scatter plot shows only data points not in the cluster. This shows that using our method we do not miss many data points that are within the region of the subspace cluster according to this transformation. The plot on the right is the Andrews plot [15], which can be seen as a smoothed parallel coordinates plot, showing cluster structures more clearly. It shows that the data points in the cluster are closely related in all dimensions. Similar plots are found for the remaining clusters.

5 Related work

Traditional clustering methods are known to struggle in high-dimensional space. Clustering techniques that measure distances between data objects over all dimensions (i.e., the full data space) are hampered by irrelevant dimensions and produce meaningless results. There exist two well-known techniques for projecting a high-dimensional database to lower-dimensional projections, namely dimensionality reduction [16] and unsupervised feature selection [17]. However, such projections inevitably suffer from information loss. In addition, these methods produce single views of the data, allowing for cluster detection only within a single projection, while other projections, that may yet reveal interesting clusters, are left unexplored.
Subspace clustering, on the other hand, attempts to find clusters in subsets of dimensions. Several traditional clustering models, although not originally suited to this setting, have been adapted for this purpose. For example, the well-known \({\textit{K}}\text {-means}\) algorithm [18] has been adapted for subspace clustering. The resulting \(\textsc {ProClus}\) [11] algorithm was one of the first methods for finding projected clusters and clusterings. The analogy to our work lies in the initialisation phase, where a two-step randomised procedure is used to obtain an approximation of a piercing set, i.e., a set of points each from a different cluster, which are then refined to full clusters.
\(\textsc {DOC}\) [19] is an algorithm that finds subspace clusters using a Monte Carlo method to sample a random point from a cluster as well as a discriminating set of points. It then extends the random point to a full subspace cluster using a bounding box around that point. Finally, it selects the subspace cluster with the highest quality. The latter is a trade-off between the number of data points and the number of dimensions. Its extension \(\textsc {MineClus}\) [20] uses the same medoid points for expanding the cluster, but it drops the randomised procedure. Similar to our approach, it also converts the data to a binarised dataset. As such, finding the best projected cluster for random medoid p is transformed to the problem of finding the best itemset given a goodness function \(\mu \). Other clustering algorithms, such as DBSCAN [21], have also been adapted for the subspace clustering task [22]. Since then, more general techniques have been proposed for searching the subspace [23, 24], where the discovery of clusters is left to specialised algorithms, followed by further efforts to reduce the time complexity of such methods [25]. However, as mentioned in Sect. 1, all of the above methods are computationally very expensive as they search in an exponential set of subspaces.
\(\textsc {FIRES}\) [10] is a generic framework for finding subspace clusters, employing existing clustering techniques to compute a set of base clusters in single dimensions. These base clusters are then merged based on their similarity, and the resulting clusters are then pruned and refined to optimise accuracy. The \(\textsc {CartiClus}\) algorithm [5], like our method, creates a binarised dataset. However, in \(\textsc {CartiClus}\), the dimensions are defined during the construction of transactions (or carts), such that all carts rely on the same dimension sets. Finally, the carts are mined for frequent itemsets which are then translated back to subspace clusters. Bi-clustering [26] also simultaneously clusters rows and columns of numeric matrices. However, bi-clusters allow for more general clusters as they, for instance, group rows with constant values for a set of columns or group columns that decrease similarly over a set of rows. Typically, such methods are used for analysis of biological data such as gene expression data.
Recently, as in many other fields, there has been increased usage of deep learning methods for discovering subspace clusters. Examples of deep learning subspace clustering algorithms include StructAE, which uses deep neural networks [27], SSC, a graph structured autoencoder [28], and SDEC that performs semi-supervised deep embedded clustering [29]. Kelkar et al. provide a more complete overview of the wider subspace clustering field in a recent survey [30].
Our method addresses important problems that existing methods suffer from. First of all, by binarising the data in a random fashion, we can efficiently explore many different subspaces, of varying dimensionality, in one go. We are therefore neither limited to a single projection nor, like \(\textsc {CartiClus}\), to fixed sets of dimensions defined during the binarisation step. This flexibility increases the likelihood of quickly discovering the correct set of dimensions for each cluster. Moreover, by deploying a randomised itemset search, we also avoid the computational complexity of traditional frequent itemset mining algorithms [31]. Finally, unlike the deep learning algorithms, our method is entirely transparent in terms of how points are allocated to clusters. If a point belongs to a cluster, it is fairly straightforward to explain why: points form a cluster together because they have repeatedly been found to be similar to each other in a particular set of dimensions during the random sampling procedure used to build the binarised transaction database.
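As a minimal illustration of this transparency (not part of our implementation), assuming the binarised database is available as a list of transactions, i.e., sets of object identifiers, one can simply count how often two objects were grouped together across the random projections:

from collections import Counter
from itertools import combinations

def cooccurrence_counts(transactions):
    # Count, for every pair of object ids, how often the two objects
    # appear in the same transaction of the binarised database.
    counts = Counter()
    for t in transactions:
        for a, b in combinations(sorted(t), 2):
            counts[(a, b)] += 1
    return counts

# Why are objects 3 and 7 in the same cluster?  Because they were grouped
# together in many random projections:
# counts = cooccurrence_counts(transactions)
# print(counts[(3, 7)])

The object identifiers and the pair (3, 7) are, of course, purely hypothetical; the point is that cluster membership can be traced back to concrete co-occurrence counts.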

6 Conclusion

In this paper, we present a novel method for discovering interesting clusters in high-dimensional data. In the first stage of our technique, we convert the original data into a transaction database by selecting a relatively small number of random data objects, projecting them onto a small number of random dimensions, clustering those objects, and, finally, building transactions by assigning all data objects to their closest cluster centroids. We repeat this procedure a large number of times and merge all resulting transactions into a single large transaction database. The main idea behind this approach is that objects that form subspace clusters will appear together in many transactions. In the second stage, we randomly sample maximal itemsets from the transaction database and consider each such itemset to be a potentially interesting cluster of objects. Finally, in the third stage, for each discovered cluster, we identify a relevant set of dimensions, thus defining a subspace cluster.
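A minimal sketch of this first stage, assuming the data are available as a NumPy matrix X and using scikit-learn's KMeans (the function name, parameter names and default values below are illustrative only, not the exact settings of our implementation), could look as follows:

import numpy as np
from sklearn.cluster import KMeans

def sample_transactions(X, n_rounds=100, n_objects=50, n_dims=3, k=10, seed=0):
    # Stage 1 (sketch): build a binarised transaction database.
    # Each round samples a few objects and a few dimensions, clusters the
    # sampled objects in that random subspace, and assigns *all* objects
    # to their closest centroid; each group of object ids becomes a transaction.
    rng = np.random.default_rng(seed)
    n, m = X.shape
    transactions = []
    for _ in range(n_rounds):
        dims = rng.choice(m, size=n_dims, replace=False)        # random subspace
        sample = rng.choice(n, size=n_objects, replace=False)   # random objects
        km = KMeans(n_clusters=k, n_init=10).fit(X[np.ix_(sample, dims)])
        labels = km.predict(X[:, dims])                         # assign every object
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if len(members) > 1:
                transactions.append(set(members.tolist()))      # one transaction per centroid
    return transactions

Objects that lie close together in a relevant subspace will repeatedly end up in the same transaction, which is exactly the signal exploited by the itemset sampling in the second stage.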
A major advantage of our method is that, by using the two randomised procedures, we avoid both the combinatorial explosion of possible dimension sets and the computational cost of frequent itemset mining. In addition, we do not attempt to produce full clusterings: we allow data objects to be part of multiple clusters (typically in different sets of dimensions), while noise objects are not assigned to any cluster at all. Experimentally, we evaluate two variants of our algorithm and compare our technique to existing state-of-the-art methods. We demonstrate that our method produces high-quality clusters and is not overly sensitive to changes in the parameter settings, which is crucial for an unsupervised learning task, where getting the parameters right can be very challenging.
In Sect. 3.3, we described how the first stage of our method can be generalised to a clustering step followed by a classification step, and how the clustering step can even be skipped altogether. In our implementation, we have chosen the K-means algorithm for the clustering step and the 1-nearest-neighbour algorithm for the classification step. In future work, we intend to analyse how combinations of different clustering and classification algorithms would perform in this setting. In particular, we are interested in examining whether investing additional computational effort in these two tasks would ultimately lead to better overall results, or whether keeping it simple (as we do now) is sufficient to obtain quality output.
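As an illustration only, one such generalised round could be organised as follows, where the clusterer is assumed to expose KMeans-style fit and cluster_centers_, the classifier follows the scikit-learn fit/predict convention, and passing clusterer=None corresponds to skipping the clustering step and using the sampled objects directly as representatives:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def one_round(X, sample_idx, dims, clusterer=None, classifier=None):
    # One generalised round of the first stage: optionally cluster the
    # sampled objects in the random subspace, then classify all objects
    # to the nearest cluster representative.
    S = X[np.ix_(sample_idx, dims)]
    if clusterer is None:
        # Clustering skipped: every sampled object is its own representative.
        reps = S
        rep_labels = np.arange(len(sample_idx))
    else:
        clusterer.fit(S)
        reps = clusterer.cluster_centers_
        rep_labels = np.arange(reps.shape[0])
    classifier = classifier or KNeighborsClassifier(n_neighbors=1)
    classifier.fit(reps, rep_labels)
    return classifier.predict(X[:, dims])   # a cluster label for every object

# e.g. one_round(X, sample_idx, dims, clusterer=KMeans(n_clusters=10, n_init=10))

Swapping in a different clusterer or classifier only requires passing another estimator, which is precisely the kind of experiment we envisage for future work.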

Declarations

Conflict of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Jain, A.K., Murty, M.N., Flynn, P.J.: Data clustering: a review. ACM Comput. Surv. 31(3), 264–323 (1999)
2. Bellman, R.E.: Adaptive Control Processes: A Guided Tour. Princeton University Press, Princeton (2015)
3. Parsons, L., Haque, E., Liu, H.: Subspace clustering for high dimensional data: a review. ACM SIGKDD Explor. Newslett. 6(1), 90–105 (2004)
4. Moise, G., Sander, J., Ester, M.: P3C: a robust projected clustering algorithm. In: Sixth International Conference on Data Mining (ICDM'06), IEEE, pp. 414–425 (2006)
5. Aksehirli, E., Goethals, B., Muller, E., Vreeken, J.: Cartification: a neighborhood preserving transformation for mining high dimensional data. In: 2013 IEEE 13th International Conference on Data Mining (ICDM), IEEE, pp. 937–942 (2013)
6. Moens, S., Cule, B., Goethals, B.: A sampling-based approach for discovering subspace clusters. In: Discovery Science - 22nd International Conference, DS 2019, Springer, pp. 61–71 (2019)
7. Patrikainen, A., Meila, M.: Comparing subspace clusterings. IEEE Trans. Knowl. Data Eng. 18(7), 902–916 (2006)
8. Günnemann, S., Färber, I., Müller, E., Assent, I., Seidl, T.: External evaluation measures for subspace clustering. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management, ACM, pp. 1363–1372 (2011)
9. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. ACM SIGMOD Record 27(2) (1998)
10. Kriegel, H.-P., Kroger, P., Renz, M., Wurst, S.: A generic framework for efficient subspace clustering of high-dimensional data. In: Fifth IEEE International Conference on Data Mining (ICDM'05), IEEE (2005)
11. Aggarwal, C.C., Wolf, J.L., Yu, P.S., Procopiuc, C., Park, J.S.: Fast algorithms for projected clustering. In: ACM SIGMOD Record, vol. 28, no. 2, ACM, pp. 61–72 (1999)
12. Freedman, D., Diaconis, P.: On the histogram as a density estimator: L2 theory. Probab. Theory Relat. Fields 57(4), 453–476 (1981)
13. Moens, S., Goethals, B.: Randomly sampling maximal itemsets. In: KDD Workshop on Interactive Data Exploration and Analytics, ACM, pp. 79–86 (2013)
14. Maaten, L.V.D., Hinton, G.: Visualizing data using t-SNE. J. Mach. Learn. Res. 9, 2579–2605 (2008)
15. Andrews, D.F.: Plots of high-dimensional data. Biometrics, pp. 125–136 (1972)
16. Lee, J.A., Verleysen, M.: Nonlinear Dimensionality Reduction. Springer, Berlin (2007)
17. Dy, J.G., Brodley, C.E.: Feature selection for unsupervised learning. J. Mach. Learn. Res. 5, 845–889 (2004)
18. MacQueen, J., et al.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1, no. 14, pp. 281–297. Oakland, CA, USA (1967)
19. Procopiuc, C.M., Jones, M., Agarwal, P.K., Murali, T.: A Monte Carlo algorithm for fast projective clustering. In: Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, ACM, pp. 418–427 (2002)
20. Yiu, M.L., Mamoulis, N.: Frequent-pattern based iterative projected clustering. In: Third IEEE International Conference on Data Mining (ICDM 2003), IEEE, pp. 689–692 (2003)
21. Ester, M., Kriegel, H.-P., Sander, J., Xu, X., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD, pp. 226–231 (1996)
22. Kailing, K., Kriegel, H.-P., Kröger, P.: Density-connected subspace clustering for high-dimensional data. In: Proceedings of the 2004 SIAM International Conference on Data Mining, SIAM, pp. 246–256 (2004)
23. Cheng, C.-H., Fu, A.W., Zhang, Y.: Entropy-based subspace clustering for mining numerical data. In: Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM, pp. 84–93 (1999)
24. Nguyen, H.V., Müller, E., Vreeken, J., Keller, F., Böhm, K.: CMI: an information-theoretic contrast measure for enhancing subspace cluster and outlier detection. In: SIAM International Conference on Data Mining, SIAM, pp. 198–206 (2013)
25. Pourkamali-Anaraki, F., Folberth, J., Becker, S.: Efficient solvers for sparse subspace clustering. Signal Process. 172, 107548 (2020)
26. Madeira, S.C., Oliveira, A.L.: Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans. Comput. Biol. Bioinf. 1(1), 24–45 (2004)
27. Peng, X., Feng, J., Xiao, S., Yau, W.-Y., Zhou, J.T., Yang, S.: Structured autoencoders for subspace clustering. IEEE Trans. Image Process. 27(10), 5076–5086 (2018)
28. Majumdar, A.: Graph structured autoencoder. Neural Netw. 106, 271–280 (2018)
29. Ren, Y., Hu, K., Dai, X., Pan, L., Hoi, S.C., Xu, Z.: Semi-supervised deep embedded clustering. Neurocomputing 325, 121–130 (2019)
30. Kelkar, B.A., Rodd, S.F.: Subspace clustering - a survey. In: Data Management, Analytics and Innovation, Springer, pp. 209–220 (2019)
31. Agrawal, R., Srikant, R., et al.: Fast algorithms for mining association rules. In: Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), vol. 1215, pp. 487–499 (1994)